Fastest way to check for changes in 2.5+ Million files?

Hello!

I am using the latest rclone version (linux/amd64). I want to sync a directory on my NAS, which consists of over 2.5 million files, to GDrive, and I was wondering what the fastest way is to check all of them for changes.

/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --backup-dir "gcrypt:shared/Backups/Username/"`date -I`/`date +%HH%MM` --log-file /share/CACHEDEV1_DATA/rclone/sync.log -v --tpslimit 10 --transfers 10 --checkers 10 --max-backlog 3000000 --size-only --skip-links --use-mmap --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf

Is there a smarter/better/faster way than the one I am currently using?

hi,

one thing to think about is whether or not to use --fast-list
https://rclone.org/docs/#fast-list
as gdrive has lots of throttling, making a lot of api calls will be slow.
but with so many files, --fast-list might run out of memory.
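for example, a quick dry-run check with it added might look like this (just a sketch reusing the paths and config from your command above, untested):

/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --fast-list --size-only --dry-run -v --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf

keep an eye on memory usage while it runs to see whether --fast-list is workable with 2.5M+ files.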

and it depends on your use-case and the nature of the source folder; it might be possible to use a filter such as
https://rclone.org/filtering/#max-age-don-t-transfer-any-file-older-than-this

what is the total size of all the 2.5M+ files?

Here are some additional ideas to speed up the checking:

Make sure you have your own Client ID

Remove --tpslimit=10 (unless a test has shown that it is needed; if so, then ignore the other ideas too)

Add --drive-pacer-min-sleep=10ms (to use the latest limits from Google)

Change --checkers=16 (to increase concurrency)

I don’t know how these will play with all the other parameters. I would perform some tests to find the 2-3 parameters that make the most difference and then leave the rest at defaults (remember --dry-run).

I have assumed default settings for Google Drive in your config.
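As a sketch combining these ideas (reusing the paths and config from your original command, and assuming your own Client ID is already set in the config), a timed test could look like:

/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --drive-pacer-min-sleep=10ms --checkers=16 --size-only --skip-links --dry-run -v --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf

Compare the elapsed time in the final stats against your current command.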


The size is 2.6 TB

PS: It may also be faster to execute rclone directly on the NAS (if possible). It depends on your LAN speed, the specs of your NAS and your current client, network protocols etc.

@asdffdsa Sorry to interrupt, over to you


This command is directly run on the NAS.
Currently I am testing the following --dry-run

/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --backup-dir "gcrypt:shared/Backups/Username/"`date -I`/`date +%HH%MM` --log-file /share/CACHEDEV1_DATA/rclone/syncdry.log -v --exclude "@Recycle/**" --dry-run --drive-pacer-min-sleep=10ms --drive-pacer-burst 200 --checkers 16 --max-backlog 3000000 --size-only --skip-links --use-mmap --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf

Rclone checks the latest modified time of the file, right?
So if an old file, say one that is 2 months old, was modified, --max-age would pick it up as a file that is not older than x days?

Also, is there something better than --size-only to use in my command?

  • i do not use gdrive and it has many quirks, so no idea how to optimize for it, perhaps @Ole knows...
  • is this a one-time sync or to be run on a schedule?

not sure your use-case, how critical the data is and if this sync is the only backup.

yes, --max-age uses mod-time.
you can run a daily sync using --max-age=24h and once a week a full sync without --max-age; see the sketch after this list.
this can work great if new files are added to the source.
tho rclone might not notice some files until a full sync is performed:

  • if a source file is moved.
  • if a source file is deleted.
  • perhaps other situations.
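a sketch of that daily/weekly split, reusing the paths and config from your command above (untested, add back your --backup-dir and --log-file flags as needed):

daily, changed files only:
/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --max-age=24h --size-only --skip-links --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf

weekly, full pass that catches moves and deletes:
/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --size-only --skip-links --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf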

Thank you for all the info!

@Animosity022 you are the Google Drive Expert, any recommendations?

have you done the first sync or are you testing the best command for the first sync?


for what it is worth,

  • using wasabi, an s3 clone, known for hot storage
  • for 1,000,000 files

this command took 33 seconds.
rclone sync d:\folder wasabi01:folder --size-only --transfers=64 --checkers=64 --dry-run --progress --stats-one-line --log-level=DEBUG --log-file=log.txt


I think all the recommendations are spot on.

Best bet is to start with some big numbers, check the logs and go from there. You want to push the API hard, but not create too many pacer issues. It's a balance.
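One rough way to judge that balance is to count the pacer retries after each test run and tune until the count stays small. A sketch, assuming you write a --log-level=DEBUG (or -vv) log to the file from your earlier command:

grep -c "pacer: Rate limited" /share/CACHEDEV1_DATA/rclone/syncdry.log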


I just tried the following command

/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --backup-dir "gcrypt:shared/Backups/Username/"`date -I`/`date +%HH%MM` --log-file /share/CACHEDEV1_DATA/rclone/syncdry.log --exclude "@Recycle/**" --dry-run --drive-pacer-min-sleep=10ms --transfers=10 --checkers=10 --size-only --skip-links --log-level=DEBUG --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf &

and the log is full of this:

2021/08/15 14:55:55 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=5061111557159, userRateLimitExceeded)
2021/08/15 14:55:55 DEBUG : pacer: Rate limited, increasing sleep to 1.246962361s
2021/08/15 14:55:55 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=5061111557159, userRateLimitExceeded)
2021/08/15 14:55:55 DEBUG : pacer: Rate limited, increasing sleep to 2.951614579s
2021/08/15 14:55:55 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=5061111557159, userRateLimitExceeded)
2021/08/15 14:55:55 DEBUG : pacer: Rate limited, increasing sleep to 4.732838921s
2021/08/15 14:55:55 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=5061111557159, userRateLimitExceeded)
2021/08/15 14:55:55 DEBUG : pacer: Rate limited, increasing sleep to 8.396977648s
2021/08/15 14:55:55 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=5061111557159, userRateLimitExceeded)
2021/08/15 14:55:55 DEBUG : pacer: Rate limited, increasing sleep to 16.613072419s
2021/08/15 14:55:55 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=5061111557159, userRateLimitExceeded)
2021/08/15 14:55:55 DEBUG : pacer: Rate limited, increasing sleep to 16.694013281s
2021/08/15 14:55:56 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=5061111557159, userRateLimitExceeded)
2021/08/15 14:55:56 DEBUG : pacer: Rate limited, increasing sleep to 16.550344014s

These are the limits Google shows:

Queries per day - 1,000,000,000
Queries per 100 seconds per user - 20,000
Queries per 100 seconds - 20,000

How/Why am I getting rate limited?

Lower your transfers/checkers. You get rate limited when you hit the API too hard.
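For example, something like this (the lower numbers are just a starting point to test, not a recommendation):

/opt/bin/rclone sync "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --dry-run --drive-pacer-min-sleep=10ms --transfers=4 --checkers=8 --size-only --skip-links --exclude "@Recycle/**" --log-level=DEBUG --log-file /share/CACHEDEV1_DATA/rclone/syncdry.log --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf

If the 403s disappear, step the numbers back up until they start to reappear.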

I felt that my transfers/checkers were already quite low at 10, but I will try lowering them now and report back.

What you can do is do a top-up sync (note the copy below - that is important)

So use

rclone copy --max-age 1d --no-traverse /path/to/source dest:

To copy all files that were modified within the last day.

This is very quick (there is an example in the rclone copy docs)

You'd then do a full sync every now and again to delete any files which need deleting and check that nothing got missed.
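Adapted to your paths, the top-up might look something like this (a sketch, untested; add your --backup-dir and --log-file flags if you want them here too):

/opt/bin/rclone copy --max-age 1d --no-traverse "/share/CACHEDEV1_DATA/Username/" "gcrypt:shared/Username" --config /share/CACHEDEV1_DATA/rclone/rclonegdrivebackup.conf

Then run your full sync command, say, once a week.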

Do you use --size-only because of S3's additional API call for modtime? Does it speed anything up if both remotes support it more efficiently?

Just a thought out of left field: would it be better to run rclone directly on the NAS? I know when I use rclone on an SMB mount on my mac, it is just miserable; I don't think going over SMB is as fast as running rclone locally. You would need to be more careful about memory, but if SMB is the bottleneck, it could help.

hi,

this is the command; just ran it again now, took 26 seconds.
rclone sync D:\files\source wasabi01:rclonelotsoffiles --size-only --transfers=128 --checkers=128 --dry-run --progress --stats-one-line --log-level=INFO --log-file=log.fast.nolist.txt

as for --size-only, on s3 it does not require extra api calls, so that does speed things up.
as for --checksum, as per the docs:
https://rclone.org/s3/#avoiding-head-requests-to-read-the-modification-time
"If the source and destination are both S3 this is the recommended flag to use for maximum efficiency."
