Slow checksum validation

RyanH · October 13, 2020, 6:25am

#### What is the problem you are having with rclone?
When validating checksums for files that have been copied across to a remote (Microsoft OneDrive), the command is running well over 3 hours for a small number of files and directories.

The file sizes range between 200MB and 400MB. There are approximately 2,566 Files and 96 directories. The process also appears to result in increased CPU & memory utilization.

As it renders the device unusable, the process is stopped prior to completing the validation.

#### What is your rclone version (output from rclone version)
rclone v1.53.1

#### Which OS you are using and how many bits (eg Windows 7, 64 bit)
Microsoft Windows 10 Professional version 1909 Build 18636.1110

#### Which cloud storage system are you using? (eg Google Drive)
Microsoft OneDrive

#### The command you were trying to run (eg rclone copy /tmp remote:tmp)
rclone.exe check --stats 10s --progress --log-file=rclone_log.txt --log-level DEBUG /path/to/files/on/local/hard/drive remote:/path

#### The rclone config contents with secrets removed.
The configuration is standard i.e. it follows the Microsoft OneDrive rclone guide - https://rclone.org/onedrive/. There is no encryption configured.

#### A log from the command with the -vv flag
There is no additional information in the log files apart from a successful matching hash e.g.

2020/10/13 14:22:20 DEBUG : [REDACTED]: SHA-1 = 251939df1fb2ef1ae0372abbe480e8be51308780 OK

or a missing file such as;

2020/10/13 17:10:29 ERROR : [REDACTED]: File not in One drive root '[REDACTED]

Animosity022 · October 13, 2020, 10:45am

You could possibly use more checkers. Are you on a slow disk? There isn't much to do with calculating checksums other than wait.

ncw · October 13, 2020, 1:54pm

If you are on a slow HDD then reducing --checkers to --checkers 1 might help so rclone reads each file sequentially.

While rclone is running if you open task manager - take a look at CPU and Disk usage. Doing checksums is both CPU and Disk intensive - but one might be worse than the other.
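As a rough way to tell which side is the bottleneck outside of rclone, the sketch below compares SHA-1 hashing of an in-memory buffer (CPU only) with SHA-1 hashing of a file read from disk. It is not how rclone measures anything; the function names and the 1 MiB chunk size are assumptions made for illustration. SHA-1 is used because that is the hash shown in the log above.

```python
import hashlib
import os
import tempfile
import time


def sha1_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, read in chunks (1 MiB assumed)."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_throughput_mb_s(path):
    """Hash one file; return (digest, MB per second including disk reads)."""
    size_mb = os.path.getsize(path) / 1_000_000
    start = time.perf_counter()
    digest = sha1_file(path)
    elapsed = time.perf_counter() - start
    return digest, size_mb / max(elapsed, 1e-9)


def cpu_only_throughput_mb_s(size_mb=64):
    """Hash a zero-filled in-memory buffer, so no disk access is involved."""
    data = bytes(size_mb * 1_000_000)
    start = time.perf_counter()
    hashlib.sha1(data).hexdigest()
    elapsed = time.perf_counter() - start
    return size_mb / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Demo on a throwaway 8 MB file; use one of the real media files instead.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8_000_000))
    try:
        _, disk_rate = hash_throughput_mb_s(tmp.name)
        print(f"file hashing: {disk_rate:.0f} MB/s")
        print(f"CPU only:     {cpu_only_throughput_mb_s():.0f} MB/s")
    finally:
        os.remove(tmp.name)
```

If the CPU-only figure is far higher than the figure for a real file, the disk is the likelier limit. Note that a file read recently may be served from the operating system's cache, which would inflate the disk figure.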

RyanH · October 13, 2020, 6:15pm

@Animosity022, it's a standard SATA (7200 RPM) disk. What is the time for each validation e.g. 1 millisecond?

RyanH · October 13, 2020, 6:16pm

@ncw, thanks for the suggestion. Is there an option to include CPU & memory utilization metrics in the logs for each operation, so as to identify the bottleneck?

Animosity022 · October 13, 2020, 6:26pm

It depends on the size of the file. A 7200 RPM disk would take longer; I'd say we're talking a second or two per file.

RyanH · October 13, 2020, 6:30pm

@Animosity022, the file size ranges between 200MB - 400MB and there are approximately 2,000 files which would equate to approximately 35 minutes. In this instance, after 3 hours, the validation hadn't been completed and since the device was unusable, the process had to be stopped.

Animosity022 · October 13, 2020, 6:32pm

The math is not quite that simple: if you are running multiple checkers, as @ncw noted, you can over-utilize your hard drive and make it take longer.

Depending on what the system is doing, you'd have to make adjustments and you may want to go lower.

Bigger isn't always better unfortunately.
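To put rough numbers on this, a back-of-the-envelope estimate of the time needed just to read every file once is sketched below. The file count comes from the original post; the 300 MB average (midpoint of the stated range) and the sustained read rates of 50 to 150 MB/s are assumptions, not measurements from this system.

```python
def estimate_hours(num_files, avg_size_mb, read_mb_per_s):
    """Hours needed just to read every file once at a sustained read rate."""
    total_mb = num_files * avg_size_mb
    return total_mb / read_mb_per_s / 3600


if __name__ == "__main__":
    NUM_FILES = 2566     # from the original post
    AVG_SIZE_MB = 300    # assumed midpoint of the 200MB to 400MB range
    for rate in (50, 100, 150):  # assumed sustained read rates in MB/s
        hours = estimate_hours(NUM_FILES, AVG_SIZE_MB, rate)
        print(f"{rate:>3} MB/s -> {hours:.1f} hours")
```

Under these assumptions the data set is roughly 770 GB, so even an uncontended 100 MB/s read takes a little over two hours. A run of about three hours is therefore in the expected range for a single spinning disk, and seek contention from several parallel checkers can only push it higher.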

RyanH · October 13, 2020, 6:35pm

Thanks @Animosity022. So that I am clear, reducing the checkers from the default 8 to 1 or 2 would reduce the IO on the disk?

Is there a way to output the IO for each operation in the logs so it's easier to identify the goldilocks zone?

Animosity022 · October 13, 2020, 6:36pm

Not a clue how to do that on Windows. It's not hard on Linux.

You'd have to test to find the sweet spot, as it's very system dependent.
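One way to run that kind of test outside of rclone is sketched below: hash the same set of files with different numbers of parallel workers and compare the wall-clock times. This is only a stand-in for what --checkers does, not rclone's implementation; the helper names and worker counts are assumptions for illustration.

```python
import hashlib
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def sha1_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def time_hashing(paths, workers):
    """Hash all paths with a pool of `workers` threads; return (seconds, digests)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(sha1_file, paths))
    return time.perf_counter() - start, digests


def sweep(paths, worker_counts=(1, 2, 4, 8)):
    """Return {worker_count: seconds} so the fastest setting can be picked."""
    return {count: time_hashing(paths, count)[0] for count in worker_counts}


if __name__ == "__main__":
    # Demo with four throwaway 2 MB files; use a sample of real files instead.
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for index in range(4):
            path = os.path.join(folder, f"sample{index}.bin")
            with open(path, "wb") as handle:
                handle.write(os.urandom(2_000_000))
            paths.append(path)
        for count, seconds in sweep(paths).items():
            print(f"{count} workers: {seconds:.3f}s")
```

Repeated runs over the same files are likely to be served from the operating system's cache, so each worker count is best tested on a different sample of files of similar size.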

RyanH · October 13, 2020, 6:39pm

Would the --buffer-size and --use-mmap flags help? If yes, what is the recommended buffer size based on the physical RAM available?

Animosity022 · October 13, 2020, 6:46pm

--use-mmap is for reducing the memory footprint on low end systems.

buffer-size is how rclone requests and reads ahead. I can't imagine it would impact much.

I'd test with transfers and go from there.

RyanH · October 13, 2020, 7:04pm

I would imagine that if CPU & memory are affected by checksum validation, --use-mmap would assist in constraining the available memory allocation, for instance, as would --buffer-size.

Have I understood it incorrectly?

Animosity022 · October 13, 2020, 7:17pm

Checksums are primarily CPU, as it is calculating a value for the file.
--use-mmap is how rclone handles cleaning up memory, and it does well on low memory systems.
--buffer-size is what is kept in memory and read ahead when a file is requested sequentially, before it's closed.

Unless you need it, I would not use mmap.

RyanH · October 14, 2020, 4:34am

@ncw, the suggestion to reduce the checkers to 1 helped reduce the CPU and memory utilization. It still took a considerable time to validate the checksums though. It took approximately 3 hours to validate 2500 files with 54 errors, which I am unsure how best to address. Does running the command again limit it to the errors only, or does it attempt to process all of them?

I'll continue tweaking the option to find a suitable value.

RyanH · October 14, 2020, 4:35am

Silly question: how do I know whether I'll need --buffer-size and --use-mmap? For example, in what scenarios should they be used if they don't affect checksum validation?

Animosity022 · October 14, 2020, 10:51am

The only reason to change either would be a very low-powered server, a Pi, or something along those lines.

ncw · October 14, 2020, 3:17pm

Great

That is better! You can try increasing --checkers until the performance drops off.

Running the sync again will address the errors. However, it would be worthwhile working out why you got the errors first, as checksum errors are quite unusual.

What kind of files are they? Are they office type files, or images? Onedrive has had problems with both of those in the past.

If not, and if I had this on my computer, I would first run memtest86 to check that the RAM on my computer was OK.

It processes all of them, alas.

RyanH · October 14, 2020, 6:14pm

Thanks @ncw. I'll give increasing --checkers a go.

Having had a look at the logs, it seems that if a file is missing it is considered an ERROR, e.g. 2020/10/14 17:09:33 ERROR : Overview.mp4: File not in One drive root 'Media'

The files are media e.g. MP4.

2020/10/14 17:22:56 NOTICE: One drive root 'Media: 53 files missing
2020/10/14 17:22:56 NOTICE: One drive root 'Media': 53 differences found
2020/10/14 17:22:56 NOTICE: One drive root 'Media': 53 errors while checking

It isn't clear, though, why it's reporting 54 errors, i.e. 2020/10/14 17:22:56 Failed to check with 54 errors: last error was: 53 differences found and Errors: 54 (retrying may help). How do I find the 54th error, as filtering the log only returns 53?
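For tallying what the log actually contains, a small script along these lines may help. It is a sketch that assumes the line layout seen in the excerpts above (timestamp, level, colon, message); the regular expression, the function name and the "UNLEVELLED" bucket for lines without a level are all illustrative choices, not part of rclone.

```python
import re
from collections import Counter

# Matches lines such as "2020/10/14 17:09:33 ERROR : name: message".
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (DEBUG|INFO|NOTICE|ERROR)\s*: (.*)$"
)


def summarize(log_text):
    """Return (counts per level, list of file names reported as missing)."""
    counts = Counter()
    missing = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            if line.strip():
                counts["UNLEVELLED"] += 1  # e.g. the final "Failed to check" line
            continue
        _, level, message = match.groups()
        counts[level] += 1
        if level == "ERROR" and ": File not in " in message:
            missing.append(message.split(": File not in ", 1)[0])
    return counts, missing


if __name__ == "__main__":
    sample = "\n".join([
        "2020/10/14 17:09:33 ERROR : Overview.mp4: File not in One drive root 'Media'",
        "2020/10/14 17:22:56 NOTICE: One drive root 'Media': 53 differences found",
        "2020/10/14 17:22:56 Failed to check with 54 errors: last error was: 53 differences found",
    ])
    counts, missing = summarize(sample)
    print(dict(counts))
    print(missing)
```

Comparing the ERROR count with the number of missing names shows whether any ERROR line is something other than a missing file.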

Is it possible to generate a list of files so that if commands such as copy, sync, checksum, etc. are run, they skip files that have been successfully processed?

ncw · October 14, 2020, 7:11pm

RyanH:

2020/10/14 17:22:56 NOTICE: One drive root 'Media: 53 files missing
2020/10/14 17:22:56 NOTICE: One drive root 'Media': 53 differences found
2020/10/14 17:22:56 NOTICE: One drive root 'Media': 53 errors while checking

Files missing can be easily fixed

You can just run sync or copy again and they will skip files that are ok.

You can generate lists of files which have errors - check the help for rclone check. You can feed these back into rclone with --files-from
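As an illustration of the last step, the sketch below writes a list of file names to a text file, one per line, which is the shape --files-from reads. The function name and the file name `missing.txt` are hypothetical; the names themselves would come from the check output or the log.

```python
import os
import tempfile
from pathlib import Path


def write_files_from(names, out_path):
    """Write unique, non-empty names to out_path, one per line; return them."""
    unique = sorted({name.strip() for name in names if name.strip()})
    Path(out_path).write_text(
        "".join(f"{name}\n" for name in unique), encoding="utf-8"
    )
    return unique


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "missing.txt")  # hypothetical file name
        written = write_files_from(["Overview.mp4", "Overview.mp4", ""], target)
        print(written)
        print(Path(target).read_text(encoding="utf-8"), end="")
```

The resulting file could then be passed to a copy, for example `rclone copy --files-from missing.txt /path/to/files/on/local/hard/drive remote:/path`, with the paths in the list taken relative to the source directory. It is worth confirming the exact behaviour against the rclone documentation for the installed version.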