What is the problem you are having with rclone?
Too much data to dedupe. I'm not having an issue per se with rclone dedupe; I'm wondering if there might be a better way, by somehow combining dedupe with lsf results.
Run the command 'rclone version' and share the full output of the command.
- os/version: Microsoft Windows Server 2019 Standard 1809 (64 bit)
- os/kernel: 10.0.17763.5576 (x86_64)
- os/type: windows
- os/arch: amd64
- go/version: go1.22.1
- go/linking: static
- go/tags: cmount
Which cloud storage system are you using? (eg Google Drive)
Google Drive
The command you were trying to run (eg rclone copy /tmp remote:tmp)
rclone dedupe REMOTE:/ --buffer-size 2048M --fast-list --by-hash --dedupe-mode newest --checkers 200 --drive-chunk-size 1024M --drive-use-trash=false --config=drives2.conf
Please run 'rclone config redacted' and share the full output.
[REMOTE]
type = drive
scope = drive
service_account_file = sa\3457.json
team_drive = XXX
server_side_across_configs = true
So I am trying to dedupe 25 PB worth of data spread across a great many Team Drives.
Running lsf like this
rclone lsf --use-json-log --recursive REMOTE: > .\MD5-5\REMOTE.txt --format psthi --separator " ||| " --config=drives2.conf
Gives me a list of everything on each drive, including the MD5 hash and the Google Drive object IDs.
Just running lsf on all the drives through a batch process that does 320 drives at a time takes about 6 hours before I have a full list of everything on all drives.
So I assume that trying to do dedupe across the drives would take forever.
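For what it's worth, those listing files already contain everything needed to see how much duplication there is, without touching the drives again. Something like this rough Python sketch (assuming the psthi format and " ||| " separator from the lsf command above, and that no path contains that separator string) would tally it per listing file:

```python
# Rough sketch: tally duplicate MD5s in one lsf listing file.
# Assumes the "psthi" format and " ||| " separator from the lsf command above,
# and that no path contains the literal separator string.
from collections import defaultdict

SEP = " ||| "
by_md5 = defaultdict(list)  # md5 -> list of (path, size, modtime, drive_id)

with open(r".\MD5-5\REMOTE.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split(SEP)
        if len(fields) != 5:
            continue
        path, size, modtime, md5, drive_id = fields
        if not md5:          # directories / entries without a hash
            continue
        by_md5[md5].append((path, int(size), modtime, drive_id))

dupes = {h: files for h, files in by_md5.items() if len(files) > 1}
wasted = sum((len(files) - 1) * files[0][1] for files in dupes.values())
print(f"{len(dupes)} hashes with duplicates, ~{wasted / 1e12:.2f} TB reclaimable")
```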
I am wondering, though, if there might be a trick to get dedupe to use, say, a single lsf file that provides the info it needs to do its thing?
So an example might be feeding dedupe a file which only has the MD5, date, and Google Drive ID, then telling rclone to use that file to purge all matching MD5 values, leaving only the newest file.
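Concretely, what I'm imagining is something along these lines (just a sketch built on the same psthi listing as above; the REMOTE-dupes.txt output name is made up, and it assumes the lsf timestamp strings compare correctly as plain text): group by MD5, keep the newest copy, and write every other path out to a list.

```python
# Rough sketch of the idea: from one lsf listing, keep the newest file per MD5
# and write all older copies to a list that can be fed back to rclone.
# Assumes the "psthi" format / " ||| " separator above; the lsf timestamps
# ("YYYY-MM-DD HH:MM:SS") sort correctly as plain strings.
from collections import defaultdict

SEP = " ||| "
by_md5 = defaultdict(list)  # md5 -> list of (modtime, path)

with open(r".\MD5-5\REMOTE.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split(SEP)
        if len(fields) != 5:
            continue
        path, _size, modtime, md5, _drive_id = fields
        if md5:
            by_md5[md5].append((modtime, path))

# REMOTE-dupes.txt is a made-up name for the list of older copies to delete.
with open(r".\MD5-5\REMOTE-dupes.txt", "w", encoding="utf-8") as out:
    for files in by_md5.values():
        if len(files) < 2:
            continue
        files.sort()                       # oldest first, newest last
        for _modtime, path in files[:-1]:  # everything except the newest copy
            out.write(path + "\n")
```

And then I think something like

rclone delete REMOTE: --files-from .\MD5-5\REMOTE-dupes.txt --drive-use-trash=false --config=drives2.conf

would remove just the older copies. That only covers duplicates within a single drive, though; matching across Team Drives would mean merging the listing files first and deciding which drive keeps the surviving copy.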