Rclone sync - large dataset and change rate

I’ve been working on syncing over 5 TB of data to S3 every 4 hours in a cost-effective manner. I found that --fast-list and --checksum were the most accurate and the cheapest in API calls; however, calculating the hash of millions of files was significantly more resource intensive and was taking over 14 hours! Switching to --size-only with --fast-list has reduced sync times to 1-2 hours, but I’ve noticed that some changed files are being missed because their size does not change. I’m looking for thoughts and ideas other than increasing the sync frequency.

You could try an intermediate sync to copy just the files which have changed recently:

rclone copy --max-age 4h --no-traverse /path/to/src remote:

You’ll need the latest beta for --no-traverse. This will only consider files whose modification time falls within the last 4 hours. You can use --size-only or --checksum with that, but if you aren’t copying many files each time you might find that the default modtime check works best, even though it costs another transaction.
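For illustration, the --size-only and --checksum variants would look something like this (same placeholder paths as above):

rclone copy --max-age 4h --no-traverse --size-only /path/to/src remote:
rclone copy --max-age 4h --no-traverse --checksum /path/to/src remote: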

You’ll probably want to run a full sync every day or week to make sure that files deleted locally also get deleted on the remote, since copy never deletes anything.
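That full pass is just an ordinary sync over the whole tree, something like this (again with placeholder paths); unlike copy, sync removes files on the destination which no longer exist in the source:

rclone sync --fast-list --checksum /path/to/src remote: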

It would be useful to know how many files change in 4 hours, and how many files there are in total.

~14 million files total, with a 4-hour change rate of ~20k files and ~20 GB. I will try out your suggestion, thanks.

Can you see a use case with this setup to utilize rclone cache and/or mount? The original data is stored on a network SAN and mounted as an NFS vol on the “sync” system for processing.

The --no-traverse solution should get through these very quickly.

You could try rclone cache - that should help with the metadata problem. If you can get the --no-traverse solution going I think that will be lighter weight.
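If you do experiment with the cache backend, you could wrap the NFS mount in a cache remote and sync from that instead, so directory listings are served from the cache’s database until they expire. Something like this (the remote name, path and info_age are only examples):

rclone config create src-cache cache remote /path/to/nfs/mount info_age 4h
rclone sync --fast-list src-cache: remote: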

Can you run rclone on the SAN directly?

Doesn’t --max-age filter on mod time, and therefore cause a HEAD request? I want to limit the number of HEAD and GET requests for billing purposes.

i.e. without --max-age:
rclone sync --stats-log-level DEBUG --stats 1s --checkers 16 --transfers 8 --exclude-from rclone_exclude.txt /mnt/1 dest-s3:bucket1 --fast-list --checksum -vv --dump headers --log-file=sync_test

grep -o 'HEAD /' sync_test | wc -l
50

vs.

rclone copy --stats-log-level DEBUG --stats 1s --max-age 6h --no-traverse --checksum --fast-list --exclude-from rclone_exclude.txt /mnt/1 dest-s3:bucket1 --dump headers -vv --log-file=copy_test

grep -o 'HEAD /' copy_test | wc -l
9414

How could I limit HEAD and GET requests in an accurate, timely fashion?

I could run rclone against the network path directly rather than mounting the share. If you have ideas on how to set up caching for this, could you point me to some documentation? I’m struggling to set it up properly.

It only reads the mod time on the source.

You can still use --size-only and --checksum
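i.e. with your paths from above, something like:

rclone copy --max-age 6h --no-traverse --size-only --exclude-from rclone_exclude.txt /mnt/1 dest-s3:bucket1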

Are you using the latest beta, BTW? --no-traverse doesn’t do anything in v1.45.

rclone version
rclone v1.45-031-ge7684b7e-beta

How many files got copied that run? I’d expect at least one HEAD request per file copied.
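You can count them from the log; with -vv each upload should produce a "Copied (new)" (or "Copied (replaced existing)") line, so something like this should give the number, though the exact message text may vary between versions:

grep -o 'Copied (' copy_test | wc -l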

Copy transferred 4706 files (half the HEAD requests)
Sync transferred 50

Why would that be?

Additional info/update:
By adding --update and --use-server-modtime, sync times are now under 1 hour! However, given this sync vs. copy discrepancy, I am concerned I may be missing files.
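For reference, the command is now roughly this (same paths and exclude file as the tests above):

rclone sync --fast-list --checksum --update --use-server-modtime --exclude-from rclone_exclude.txt /mnt/1 dest-s3:bucket1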

Firstly, the number of HEAD requests looks correct: one before the copy and one after.

I’m not sure why copy and sync should be so different. Did the copy upload most of the files and then the sync just fill in a few more?

This works quite well but be aware that if the time on your computer doesn’t match that of the server then it can miss files.

That’s possible; I don’t recall which I ran first now, but that would make sense. Sync seems to make only 1 HEAD request per file copied, vs. copy which makes double, checking before and after as you stated. Is there anything I can do to limit copy to 1 HEAD request per file copied?

I am happy with the sync times using the --checksum --fast-list --update --use-server-modtime flags. Does it make sense to use --checksum together with --update --use-server-modtime, or does one of them win out?

Lastly, would you suggest running a full --checksum sync without --update --use-server-modtime flags once a day or at some interval to ensure accuracy? Would you suggest a different method for file validation?

Sync will only have 1 request (if used with --checksum/--size-only) because it learns of the object’s existence from the directory listings (which are extra requests).

A copy --no-traverse does two HEAD requests.

  1. The first is to see whether the object exists or not
  2. After the object is transferred we read the MD5SUM to see if it is OK

I think we need the first request - that will save a copy if the object already matches. The second is arguable though, since S3 checks the md5sum on upload. I recently removed the check for the swift backend. I made an issue about this - please subscribe to that for updates!

I think --checksum is being ignored with --update

Yes, I would run a full --checksum sync at some interval.

If you want to check that everything is present and correct, you can use rclone check.
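For example (the --one-way flag only checks that everything in the source is present and correct in the destination; drop it to compare in both directions):

rclone check --fast-list --one-way /mnt/1 dest-s3:bucket1

By default check compares sizes and MD5 hashes; you can add --size-only if hashing 14 million local files is too expensive.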