Idea for improving performance of file checks

cowwoc2020 · February 22, 2019, 6:40pm

When debugging my backup process, I spent 90% of the time waiting for rclone to skip over (check) existing files. I am syncing millions of small files onto some remote destination. Sometimes online (B2). Other times on a network share.

In order to speed this up, we can create a cache file containing key-value pairs where:

key = hash of the command-line that triggered checking of the directory
value = timestamp when the check completed

Then, when the user runs a command we’d skip over any directories that we already processed in the past X milliseconds (with some reasonable default value). When running incremental backups, I would happily skip over any directories that have been processed in the past hour as I only run backups once a day in production (shorter time periods indicate I am debugging).
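To make the idea concrete, here is a minimal sketch of such a cache in Python. It is not part of rclone. The function names, the JSON-style dictionary, and the choice to mix the directory path into the key (so that one command line can track many directories) are all assumptions made for illustration.

```python
import hashlib
import json
import time
from pathlib import Path


def cache_key(argv, directory):
    """Hash the command line together with the directory being checked.

    Including the directory in the key is an assumption of this sketch.
    """
    payload = json.dumps({"argv": list(argv), "dir": str(directory)}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_cache(path):
    """Read the key -> timestamp mapping, or start empty if no file exists."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_cache(path, cache):
    """Persist the key -> timestamp mapping as JSON."""
    Path(path).write_text(json.dumps(cache), encoding="utf-8")


def should_skip(cache, key, max_age_ms, now=None):
    """Return True if the directory was checked less than max_age_ms ago."""
    now = time.time() if now is None else now
    checked_at = cache.get(key)
    return checked_at is not None and (now - checked_at) * 1000 < max_age_ms


def mark_checked(cache, key, now=None):
    """Record the time (seconds since the epoch) at which the check completed."""
    cache[key] = time.time() if now is None else now
```

A caller would compute the key before checking a directory, skip the directory when `should_skip` returns True, and otherwise call `mark_checked` once the check completes and write the cache back with `save_cache`.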

Thoughts?

Gili

calisro · February 22, 2019, 9:13pm

The file checks are pretty quick depending on what you’re syncing. Provide your command and which remote type you are having issues with. You can also use --fast-list if you have plenty of RAM as that will make the directory listings more efficient.

Alternatively, you can work around your desire to sync only files changed in the past hour by generating a list of files with a command-line tool that looks at modification times, then passing that list via the --files-from parameter to sync only those files.
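As a rough sketch of that workaround, the Python script below walks a source directory and writes the paths of recently modified files, relative to the source root, to a list file. The function names and the one-hour window are assumptions for illustration and are not part of rclone.

```python
import os
import time


def recently_modified(root, max_age_seconds, now=None):
    """Return paths under root (relative, '/'-separated) modified within the window."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                mtime = os.stat(full).st_mtime
            except OSError:
                # The file vanished or is unreadable; leave it out of the list.
                continue
            if mtime >= cutoff:
                rel = os.path.relpath(full, root)
                paths.append(rel.replace(os.sep, "/"))
    return sorted(paths)


def write_files_from(paths, list_path):
    """Write one path per line, suitable for use as a file list."""
    with open(list_path, "w", encoding="utf-8", newline="\n") as fh:
        for p in paths:
            fh.write(p + "\n")
```

The resulting file can then be passed to rclone with `--files-from`. Note that a list built from modification times only covers new and changed files; it will not pick up deletions on the source side.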

calisro · February 22, 2019, 9:14pm

Also, you could implement a local cache if that fits your needs to locally store metadata to help.

cowwoc2020 · February 23, 2019, 3:50am

Hi Calisro,

I am backing up 53.4GB made up of 20,425 files. Running a check (even if there are no modifications) takes over an hour.

My command line is:

rclone sync --stats-log-level NOTICE --fast-list --delete-excluded --transfers=10 -v --exclude "/AppData/Local/" --exclude "/AppData/LocalLow/" --exclude "/My Documents/" --exclude "/NetHood/" --exclude "/Start Menu/" --exclude "/SendTo/" --exclude "/Templates/" --exclude "/Application Data/" --exclude "/PrintHood/" --exclude "/Cookies/" --exclude "/Recent/" --exclude "/Local Settings/" --exclude "/Documents/My Music/" --exclude "/Documents/My Videos/" --exclude "/Favorites/" --exclude "/Documents/My Pictures/" --exclude "/NTUSER.DAT*" --exclude "/ntuser.dat*" --exclude "/AppData/Roaming/Microsoft/Windows/Recent/" --exclude "/node_modules" [source] [target]

where [source] is an entry point I wish to back up recursively and [target] is configured as follows:

type = local
nounc = false

The machines are linked over an 802.11ac Wi-Fi connection, using a Windows share backed by an external drive connected over USB3. When backing up large files I get high speeds (upwards of 60Mbps). When running file checks I get only about 3Mbps.

Any ideas?

Gili

calisro · February 23, 2019, 4:22am

Do you have your own API key? That would be the first thing to do to improve performance.

cowwoc2020 · February 23, 2019, 4:34am

"Do you have your own API key? That would be the first thing to do to improve performance."

I don't understand. What API key are you referring to?

Animosity022 · February 23, 2019, 11:56am

He’s asking if you made your own client ID.

https://rclone.org/drive/#making-your-own-client-id

cowwoc2020 · February 23, 2019, 10:22pm

Hmm, seeing as I’m not using Google Drive (I am backing up to a network drive) I fail to see how this is relevant.

Animosity022 · February 23, 2019, 10:43pm

Have you tried the cache backend as suggested?

calisro · February 24, 2019, 12:33am

I misunderstood. You still should try cache. It'll cache the metadata of what is on the remote you are syncing to.

cowwoc2020 · February 24, 2019, 2:56am

I didn't know about this feature before now. Yes, it does what I'm looking for with one big caveat: you cannot hit CTRL+C or break out of a recursive command. The entire reason I am trying to cache is because I am debugging my backup script. I need to abort it as soon as I see something wrong.

Is this a bug or a "feature"?

calisro · February 24, 2019, 3:08am

Not sure what you mean. You can abort it with ctrl-c or whatever.

cowwoc2020 · February 24, 2019, 4:19am

Doesn't work for me under Windows 10. Running "rclone lsd -R" against a "local" target is abortable using CTRL+C, but doing the same against a "cache" target ignores CTRL+C. Not sure why.

Which platform are you testing against?

calisro · February 24, 2019, 1:45pm

Found this

I wonder if it wasn't fully fixed or if it crept back in. You might want to open an issue.

cowwoc2020 · February 24, 2019, 5:28pm

Good catch. I opened issue #2997, "Cache backend: Commands cannot be stopped with Ctrl-C", in rclone/rclone on GitHub.