Big syncs with millions of files

After running into a RAM problem while syncing several tens of millions of objects, I'd like to share a workaround with you.

The problem

Rclone syncs on a directory-by-directory basis. If you have 10,000,000 directories with 1,000 files in each, it will sync fine, but if you have a directory with 100,000,000 files in it, you will need a lot of RAM to process it.

Until the OOM killer kills the process, the log fills with entries like this:

2023/07/06 15:30:35 INFO  :
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:       1m0.0s

Although HTTP requests are made and 200 responses come back (visible with the --dump-headers option), no copies are made.

This problem still exists as of rclone v1.64.0-beta.7132.f1a842081.

Workaround

We can get around the problem as follows.

  • First read file or object names
rclone lsf --files-only -R src:bucket | sort > src
rclone lsf --files-only -R dst:bucket | sort > dst
  • Now use comm to find what files/objects need to be transferred
comm -23 src dst > need-to-transfer
comm -13 src dst > need-to-delete

You now have a list of files you need to transfer from src to dst, and another list of files which are in dst but not in src and so should likely be deleted.
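
As a quick illustration of what those comm flags do (the file names here are made up for the example): comm -23 keeps lines that appear only in src, and comm -13 keeps lines that appear only in dst.

$ cat src
photos/a.jpg
photos/b.jpg
$ cat dst
photos/a.jpg
photos/c.jpg

$ comm -23 src dst    # only in src -> need-to-transfer
photos/b.jpg
$ comm -13 src dst    # only in dst -> need-to-delete
photos/c.jpg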

Then break the need-to-transfer file up into chunks of (say) 10,000 lines with something like split -l 10000 need-to-transfer and run the commands below on each chunk to transfer 10,000 files at a time. The --files-from and --no-traverse flags mean that this won't list the source or the destination:

rclone copy src:bucket dst:bucket --files-from need-to-transfer --no-traverse
rclone delete src:bucket dst:bucket --files-from need-to-delete --no-traverse
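
Put together, a minimal sketch of the chunked copy in bash (the chunk_ prefix and the 10,000-line chunk size are arbitrary choices):

split -l 10000 need-to-transfer chunk_
for f in chunk_*; do
    # copy only the files listed in this chunk, without listing src or dst
    rclone copy src:bucket dst:bucket --files-from "$f" --no-traverse
done

The same pattern works for the delete list if it is also very large.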

If you need to sync changes, you can include hash and/or size in the listing. For example, with hashes:

rclone lsf --files-only --format "ph" -R src:bucket | sort -t';' -k1 > src
rclone lsf --files-only --format "ph" -R dst:bucket | sort -t';' -k1 > dst

The comm tool then compares the path and hash fields together as a single line, so a file whose hash has changed shows up in both lists.
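
For example (made-up paths and truncated hashes), if photos/b.jpg has changed on the source, its line differs between the two listings, so it lands in both output files:

$ cat src
photos/a.jpg;0cc175b9c0f1b6a8
photos/b.jpg;92eb5ffee6ae2fec
$ cat dst
photos/a.jpg;0cc175b9c0f1b6a8
photos/b.jpg;ab56b4d92b40713a

$ comm -23 src dst    # in src but not dst -> need-to-transfer
photos/b.jpg;92eb5ffee6ae2fec
$ comm -13 src dst    # in dst but not src -> need-to-delete
photos/b.jpg;ab56b4d92b40713a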


This is fantastic. Thank you for sharing!

A few questions/comments:

  1. So the last command, with --format "ph", will include the file hash for syncing. Some backends don't support "updating" a file and will instead upload a second copy. How can we find files that are on both sides but where the dst hash does not match the src hash?
  2. Why break the need-to-transfer file into chunks? Won't rclone handle that?
  3. rclone delete only takes one bucket. The command should be rclone delete dst:bucket --files-from need-to-delete --no-traverse.

If you use the listing with --format "ph" then files whose hashes differ will appear in the comm output in both need-to-transfer and need-to-delete.

Note also that you can't pass these files straight to rclone any more; you need to cut the hashes off with something like:

cut -d';' -f1 need-to-delete-with-hash > need-to-delete-without

(this assumes you don't have file names with ; in them)

It can, yes. But if the chunks are too big, say more than 10,000,000 files, then you'll be back into running-out-of-memory territory again.

Yes, that is correct.
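
Putting those answers together, a minimal sketch of the hash-based flow, where need-to-transfer-with-hash and need-to-delete-with-hash are the comm outputs from the hashed listings (note the delete command takes only the destination remote):

cut -d';' -f1 need-to-transfer-with-hash > need-to-transfer
cut -d';' -f1 need-to-delete-with-hash > need-to-delete
rclone copy src:bucket dst:bucket --files-from need-to-transfer --no-traverse
rclone delete dst:bucket --files-from need-to-delete --no-traverse

For very large lists, split need-to-transfer into chunks first, as described above.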

PS I wrote the original version of these instructions!


Thank you so much! I am working on a script I will use for my backups.

Google Photos does not support hashes. I need to figure something else out. Will keep researching.