Dedupe 25 PB using dedupe and lsf: possible?

What is the problem you are having with rclone?

Too much data to dedupe. I'm not having an issue per se with rclone dedupe; I'm wondering if there might be a better way by somehow combining dedupe with lsf results.

Run the command 'rclone version' and share the full output of the command.

- os/version: Microsoft Windows Server 2019 Standard 1809 (64 bit)
- os/kernel: 10.0.17763.5576 (x86_64)
- os/type: windows
- os/arch: amd64
- go/version: go1.22.1
- go/linking: static
- go/tags: cmount

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone dedupe REMOTE:/ --buffer-size 2048M --fast-list --by-hash --dedupe-mode newest --checkers 200 --drive-chunk-size 1024M --drive-use-trash=false --config=drives2.conf

Please run 'rclone config redacted' and share the full output.

[REMOTE]
type = drive
scope = drive
service_account_file = sa\3457.json
team_drive = XXX
server_side_across_configs = true

So I am trying to dedupe 25 PB worth of data spread across a great many Team Drives.

Running lsf like this:

rclone lsf --use-json-log --recursive REMOTE: > .\MD5-5\REMOTE.txt --format psthi --separator " ||| " --config=drives2.conf

This gives me a list of everything on each drive, including the MD5 values and the Google Drive object IDs.

Just running lsf on all the drives through a batch process that handles 320 drives at the same time takes about 6 hours before I get a full list of everything on all drives.

So I assume that trying to do dedupe across the drives would take forever.

I am wondering, though, if there might be a trick to get dedupe to use, say, a single lsf file providing the info required for dedupe to do its thing.

So an example might be feeding dedupe a file which only has the MD5, date, and Google Drive ID.

Then telling rclone to use that file to purge all matching MD5 values, leaving only the newest file.

No such functionality exists at the moment.

The size of the data is irrelevant here, as nothing is transferred. What matters is the number of files. A rule of thumb is 1 kB of RAM per file, so you need 1 GB of RAM for every 1 million files. lsf is slow on Google Drive, so maybe you do not have a huge number of files anyway.

Not necessarily. Assuming you have enough RAM, you should try to run it with the -vv flag. I guess it will take 6h+ before deduplication starts, but then you will see how fast it progresses. You can always use --dry-run if you do not want to delete anything when experimenting.

According to lsf there are over 30 million objects.

So then let's try it this way.

Since I have the lsf output for every object: is there any tool/script etc. out there where I could give it the full list of MD5, ObjectID, and date and have it output a list of duplicates, leaving the newest file out of the list?

I've been trying to wrap my head around finding a way to even get a list of just the dupes using the lsf data.

If that were possible, I have a tool where I can simply provide the list of ObjectIDs and it would feed them through the Google Delete API.

Obviously it is easy to take the lsf info and only export the lines where there is a matching MD5 value.

But with that list I still have to find a way to create another file listing only the duplicates, leaving the newest file out of that list.

Hope that makes sense.

When you have a list, then indeed it can be done programmatically.

Maybe somebody has already done it and will be so nice as to share it :)
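If nobody has one handy, here is a rough sketch of the idea in Python. It assumes the lsf output from the command earlier in the thread (format psthi, i.e. path, size, modtime, hash, ID, separated by " ||| ") and is only a starting point, not a tested tool: it groups the lines by MD5, sorts each group by modification time, keeps the newest entry and prints the ObjectIDs of the rest so they can be handed to a delete tool. The file name, timestamp format and encoding are assumptions you may need to adjust.

import sys
from collections import defaultdict
from datetime import datetime

SEP = " ||| "  # separator used in the lsf command above

def parse_time(value):
    # lsf usually prints modtimes like "2024-03-28 14:07:12" (possibly with fractional seconds)
    try:
        return datetime.strptime(value.split(".")[0], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return datetime.min  # unparseable times sort as oldest

groups = defaultdict(list)  # md5 -> list of (modtime, object_id, path)

# argv[1] is the lsf listing, e.g. MD5-5\REMOTE.txt; adjust the encoding if the
# file was redirected through PowerShell, which may have written it as UTF-16
with open(sys.argv[1], encoding="utf-8", errors="replace") as listing:
    for line in listing:
        parts = line.rstrip("\r\n").split(SEP)
        if len(parts) != 5:
            continue  # skip malformed lines
        path, size, modtime, md5, object_id = parts
        if not md5:
            continue  # some objects (e.g. Google Docs) have no MD5
        # path is kept only so you can also print it when spot-checking
        groups[md5].append((parse_time(modtime), object_id, path))

# For every MD5 that appears more than once, keep the newest copy and print the rest
for md5, entries in groups.items():
    if len(entries) < 2:
        continue
    entries.sort(reverse=True)  # newest first
    for modtime, object_id, path in entries[1:]:
        print(object_id)

Run it as something like python find_dupes.py MD5-5\REMOTE.txt > to_delete.txt, spot-check a few of the IDs, and then feed to_delete.txt to your delete tool. Note that it holds one entry per file in memory, so with 30 million objects expect it to use several GB of RAM.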

I found this thread looking for dedupe advice ... see Question about Google Drive usage & duplicates for a script that might be helpful in organizing the results of lsf to find dupes.

I also cannot run dedupe on all my files, but with lsf I can go folder by folder and then combine my results.
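In case it saves someone a step, combining the per-folder (or per-drive) listings before running a grouping script like the one above is just concatenation; a small sketch, assuming all the listings sit in one folder and use the same format and separator (the folder and file names here are made up):

import glob

# Merge every individual lsf listing into one combined file
with open("combined.txt", "w", encoding="utf-8") as out:
    for name in sorted(glob.glob(r"MD5-5\*.txt")):  # assumed location of the per-drive listings
        with open(name, encoding="utf-8", errors="replace") as listing:
            for line in listing:
                if line.strip():
                    out.write(line.rstrip("\r\n") + "\n")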
