Skip checksum on identical files?

enoch85 · January 5, 2025, 3:01pm

What is the problem you are having with rclone?

I've been using rclone sync for around 2 years now to sync backups to an offsite location. All is working great, but lately I've been looking into optimizing it.

I move around 10 TB of data per backup session over a 600/600 Mbit/s WAN link. I use checksums (see the full command below) to be 100% sure there are no unnecessary transfers, since the files are quite large. During my testing it seems that the --checksum flag makes rclone recompute the checksums on every sync, and with large files that takes time!

I've been looking for some time for a way to store the checksums in a file and then just read that file on later runs. That would mean that whenever the script runs and a file's mod-time and size match the stored record, the file isn't checksummed again. If the mod-time or size differ, a new checksum is computed and the file is transferred if the checksum differs.

Maybe I'm bad at explaining, but my main goal is to skip checksums if the file is identical. Maybe there's a smarter way?
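
To make the idea concrete, here is a rough shell sketch of a checksum cache keyed by path, size and mod-time. It only illustrates the concept and is not how rclone works internally. `cached_sha1` and `CACHE_FILE` are hypothetical names, and the sketch assumes GNU `stat` and `sha1sum` and paths that contain no tabs or newlines.

```shell
# Hypothetical sketch: reuse a stored SHA-1 while size and mod-time are unchanged.
# CACHE_FILE and cached_sha1 are illustrative names, not part of rclone.
CACHE_FILE="${CACHE_FILE:-$HOME/.cache/sha1-cache.tsv}"

cached_sha1() {
    local file="$1" size mtime key hit sum
    # GNU stat: %s = size in bytes, %Y = mod-time in seconds since the epoch
    size=$(stat -c '%s' -- "$file") || return 1
    mtime=$(stat -c '%Y' -- "$file") || return 1
    key="${file}|${size}|${mtime}"

    mkdir -p "$(dirname "$CACHE_FILE")"
    touch "$CACHE_FILE"

    # Cache hit: the first tab-separated column matches the key exactly
    hit=$(KEY="$key" awk -F'\t' '$1 == ENVIRON["KEY"] { print $2; exit }' "$CACHE_FILE")
    if [ -n "$hit" ]; then
        printf '%s\n' "$hit"
        return 0
    fi

    # Cache miss: hash the file once and append the record
    sum=$(sha1sum -- "$file" | cut -d' ' -f1) || return 1
    printf '%s\t%s\n' "$key" "$sum" >> "$CACHE_FILE"
    printf '%s\n' "$sum"
}
```

Calling `cached_sha1 /path/to/file` twice hashes the file only on the first call. Stale records are never pruned in this sketch, so a changed file simply gets a new line.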

I've been running rsync as well (with zstd) and it's super nice, but it only uses a single CPU core, so my Xeon L-CPU caps out and I can't use the full network speed, even with compress-level=1 set.

Run the command 'rclone version' and share the full output of the command.

rclone v1.68.2

os/version: debian 12.8 (64 bit)
os/kernel: 6.8.12-5-pve (x86_64)
os/type: linux
os/arch: amd64
go/version: go1.23.3
go/linking: static
go/tags: none

Which cloud storage system are you using? (eg Google Drive)

NONE. Only SSH (SFTP)

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

rclone sync \
--transfers=16 \
--checksum \
--progress \
--delete-during \
--refresh-times \
--metadata \
--fast-list \
source dest

Please run 'rclone config redacted' and share the full output. If you get command not found, please make sure to update rclone.

[PBS-MAIN]
type = sftp
host = XXX
user = XXX
key_file = ~/.ssh/id_ecdsa
known_hosts_file = ~/.ssh/known_hosts
shell_type = unix
md5sum_command = md5sum
sha1sum_command = sha1sum
use_fstat = true
concurrency = 1024

A log from the command that you were trying to run with the `-vv` flag

Not relevant

kapitainsky · January 5, 2025, 3:08pm

Maybe the hasher overlay remote could help in your case?

It is effectively a database of hashes for another rclone remote.

enoch85 · January 5, 2025, 4:25pm

Thanks for the input!

So now I have created hasher remotes for the two backup directories on the source. Or should the hasher be on the destination instead?

[hasher-VZDump-LARGE]
type = hasher
remote = PBS-MAIN:/backupstorage/VZDump-LARGE/

[hasher-VZDump-SMALL]
type = hasher
remote = PBS-MAIN:/backupstorage/VZDump-SMALL

The sync command would now look like:

do_the_backup() {
NAMEPATH="$1"
# LOG
if [ ! -f "$LOGFILE"-"$NAMEPATH" ]
then
    touch "$LOGFILE"-"$NAMEPATH"
fi
rclone sync \
--transfers=16 \
--checksum \
--progress \
--delete-during \
--refresh-times \
--metadata \
--fast-list \
"hasher-$NAMEPATH:" /offsitestorage/"$NAMEPATH" > "$LOGFILE"-"$NAMEPATH" 2>&1
unset NAMEPATH
}

# Do the backup
do_the_backup VZDump-SMALL
do_the_backup VZDump-LARGE

Can you please confirm?

enoch85 · January 5, 2025, 5:12pm

Actually, I think I got it going now.

Next issue, how do I force SHA1 sums?

2025/01/05 18:10:18 DEBUG : dump/vzdump-qemu-105-2024_11_02-09_11_08.vma.zst.notes: Parsed hash: 873[redacted]
2025/01/05 18:10:18 DEBUG : dump/vzdump-qemu-105-2024_11_02-09_11_08.vma.zst.notes: md5 = 873[redacted] OK
2025/01/05 18:10:18 DEBUG : dump/vzdump-qemu-105-2024_11_02-09_11_08.vma.zst.notes: Size and md5 of src and dst objects identical
2025/01/05 18:10:18 DEBUG : dump/vzdump-qemu-105-2024_11_02-09_11_08.vma.zst.notes: Unchanged skipping
2025/01/05 18:10:18 DEBUG : dump/vzdump-qemu-105-2024_12_07-09_11_15.log: getHash: database empty
2025/01/05 18:10:18 DEBUG : dump/vzdump-qemu-105-2024_12_07-09_11_15.log: slow md5
2025/01/05 18:10:18 DEBUG : sftp://[redacted]:22//backupstorage/VZDump-SMALL: Shell path "/backupstorage/VZDump-SMALL/dump/vzdump-qemu-105-2024_12_07-09_11_15.log"
2025/01/05 18:10:18 DEBUG : sftp://[redacted]:22//backupstorage/VZDump-SMALL: Running remote command: md5sum /backupstorage/VZDump-SMALL/dump/vzdump-qemu-105-2024_12_07-09_11_15.log
2025/01/05 18:10:18 DEBUG : sftp://[redacted]:22//backupstorage/VZDump-SMALL: Remote command result: 37c[redacted]  /backupstorage/VZDump-SMALL/dump/vzdump-qemu-105-2024_12_07-09_11_15.log

kapitainsky · January 5, 2025, 5:39pm

--hasher-hashes sha1

enoch85 · January 5, 2025, 5:57pm

Actually, I put this in the config already:

[PBS-MAIN]
type = sftp
host = XXX
user = XXX
key_file = ~/.ssh/id_ecdsa
known_hosts_file = ~/.ssh/known_hosts
shell_type = unix
md5sum_command = md5sum
sha1sum_command = sha1sum
use_fstat = true
chunk_size = 64Ki
concurrency = 1024

[hasher-VZDump-LARGE]
type = hasher
hashes = sha1
max_age = off
remote = PBS-MAIN:/backupstorage/VZDump-LARGE

[hasher-VZDump-SMALL]
type = hasher
hashes = sha1
max_age = off
remote = PBS-MAIN:/backupstorage/VZDump-SMALL

It still uses md5 though.

enoch85 · January 5, 2025, 6:00pm

Also, it doesn't produce a DB file, so I'm wondering whether it's caching at all.

root@pbs-offsite:~# ls -la  ~/.cache/rclone/kv/
total 1
drwx------ 2 root root 2 Jan  5 18:51 .
drwx------ 3 root root 3 Jan  5 17:29 ..

enoch85 · January 5, 2025, 6:25pm

Mission failed. Running the same script again after the first run doesn't seem to use cached checksums. It takes the same amount of time, and it computes the checksums again.

Does the caching only work if I do a pre-run with something like this?

rclone hashsum MD5 --download Hasher:path/to/subtree > /dev/null

rclone backend dump Hasher:path/to/subtree

Currently running with this:

NAMEPATH="$1"
# LOG
if [ ! -f "$LOGFILE"-"$NAMEPATH" ]
then
    touch "$LOGFILE"-"$NAMEPATH"
fi
rclone sync \
--transfers=16 \
--checkers=8 \
--checksum \
--hash=sha1 \
--progress \
--delete-during \
--refresh-times \
--metadata \
--fast-list \
hasher-"$NAMEPATH":/ /offsitestorage/"$NAMEPATH" > "$LOGFILE"-"$NAMEPATH" 2>&1
unset NAMEPATH

...and the config from above.

kapitainsky · January 6, 2025, 5:43am

I have never used hasher myself, but since the docs state that it helps to "Cache checksums to help with slow hashing of large local or (S)FTP files", it sounded like a perfect fit for your case.

I think that since you have "slow" hashing on both ends of your sync, it requires hasher on both the local and the SFTP side. You'll have to experiment a bit here.

Also, I noticed that you sync all data every time. You could speed things up by only syncing recently changed files (--max-age). If you run it daily, then only sync files changed in the last 25h, for example. You could still run a full sync from time to time to make sure everything is fine.
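
A minimal sketch of that suggestion, assuming a daily run: rclone's `--max-age` flag restricts a run to files modified within the given window. The wrapper name `incremental_copy` and the `RCLONE` override are hypothetical, and the other flags are just taken from the command earlier in this thread.

```shell
# Hypothetical wrapper: copy only files modified within a recent window.
# RCLONE can be overridden (for example with "echo") to preview the command.
incremental_copy() {
    src="$1"
    dst="$2"
    window="${3:-25h}"
    # --max-age: only transfer files younger than the given age.
    # copy never deletes at the destination; leave deletions to the full sync.
    "${RCLONE:-rclone}" copy --max-age "$window" --checksum --transfers=16 "$src" "$dst"
}
```

For example, `incremental_copy hasher-VZDump-SMALL: /offsitestorage/VZDump-SMALL` could run daily, with the full `rclone sync` kept as a weekly job.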

enoch85 · January 6, 2025, 11:20am

Yeah, it did to me too. But either I'm doing it wrong or it's bugging out on me. The strange thing is that it says it's writing to bolt, but there's no file. So I'm guessing it's a bug.

2025/01/06 12:07:41 DEBUG : PBS-MAIN~hasher.bolt: Opened for writing in 74.394µs
2025/01/06 12:07:42 DEBUG : PBS-MAIN~hasher.bolt: released

root@pbs-main:~# ls -la .cache/rclone/
total 1
drwxr-xr-x 2 root root 2 Jan  6 12:08 .
drwxr-xr-x 5 root root 5 Sep 15 17:59 ..

root@pbs-offsite:~# ls -la .cache/rclone/kv/
total 34
drwx------ 2 root root      4 Jan  6 12:05 .
drwx------ 3 root root      3 Jan  5 17:29 ..
-rw------- 1 root root 262144 Jan  4 22:09 PBS-MAIN~hasher.bolt

The PBS-MAIN~hasher.bolt is from an effort last night trying to pre-check everything. Running the script again doesn't update the file, and the time is the same (15 minutes on my test files which are already synced).
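
A rough way to check whether a run touches the cache file at all is to compare its size and mod-time before and after the command. The helper name `changed_during` is hypothetical, and the sketch assumes GNU `stat`. It is only a coarse check, since a rewrite that keeps the same size within the same second would go unnoticed.

```shell
# Hypothetical helper: report whether a file's size or mod-time changed
# while a command ran. Usage: changed_during FILE COMMAND [ARGS...]
changed_during() {
    file="$1"
    shift
    # "missing" stands in when the file does not exist (yet)
    before=$(stat -c '%s %Y' -- "$file" 2>/dev/null || echo missing)
    "$@"
    after=$(stat -c '%s %Y' -- "$file" 2>/dev/null || echo missing)
    if [ "$before" = "$after" ]; then
        echo unchanged
    else
        echo changed
    fi
}
```

It could be wrapped around the sync, for example `changed_during "$HOME/.cache/rclone/kv/PBS-MAIN~hasher.bolt" rclone sync ...` with the usual arguments.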

This is what I want to avoid:

2025/01/06 12:14:28 DEBUG : dump/vzdump-qemu-105-2025_01_04-09_10_34.vma.zst: getHash: no record
2025/01/06 12:14:28 DEBUG : dump/vzdump-qemu-105-2025_01_04-09_10_34.vma.zst: slow md5
2025/01/06 12:14:28 DEBUG : dump/vzdump-qemu-105-2025_01_04-09_10_34.vma.zst: slow md5
2025/01/06 12:14:28 DEBUG : sftp://[redacted]:22//backupstorage/VZDump-SMALL: Shell path "/backupstorage/VZDump-SMALL/dump/vzdump-qemu-105-2025_01_04-09_10_34.vma.zst"
2025/01/06 12:14:28 DEBUG : sftp://[redacted]:22//backupstorage/VZDump-SMALL: Running remote command: md5sum /backupstorage/VZDump-SMALL/dump/vzdump-qemu-105-2025_01_04-09_10_34.vma.zst

Don't re-check files that have already been checksummed.
Force it to SHA1 (it's actually faster on newer CPUs due to its design).

Trying now with hasher on both sides. I really hope it makes a difference.

Maybe @ncw has some input here on how it should work?

enoch85 · January 6, 2025, 11:26am

Well, to my surprise - cache is now working!

These are the settings:

NAMEPATH="$1"
rclone sync -vvv \
--transfers=16 \
--checksum \
--checkers=8 \
--hash=sha1 \
--progress \
--delete-during \
--refresh-times \
--metadata \
--fast-list \
hasher-"$NAMEPATH":/ hasher-OFFSITE-"$NAMEPATH":/
unset NAMEPATH

[hasher-VZDump-LARGE]
type = hasher
hashes = md5,sha1
max_age = off
remote = PBS-MAIN:/backupstorage/VZDump-LARGE

[hasher-VZDump-SMALL]
type = hasher
hashes = md5,sha1
max_age = off
remote = PBS-MAIN:/backupstorage/VZDump-SMALL

[hasher-OFFSITE-VZDump-LARGE]
type = hasher
hashes = md5,sha1
max_age = off
remote = /offsitestorage/VZDump-LARGE

[hasher-OFFSITE-VZDump-SMALL]
type = hasher
hashes = md5,sha1
max_age = off
remote = /offsitestorage/VZDump-SMALL

I'm still missing a flag for SHA1 though. Checksums still seem to be cached as MD5.

2025/01/06 12:23:04 DEBUG : dump/vzdump-qemu-998-2024_09_07-09_14_54.vma.zst.notes: md5 = 94c86[redacted] OK
2025/01/06 12:23:04 DEBUG : dump/vzdump-qemu-998-2024_09_07-09_14_54.vma.zst.notes: Size and md5 of src and dst objects identical
2025/01/06 12:23:04 DEBUG : dump/vzdump-qemu-998-2024_10_05-09_16_12.vma.zst.notes: cached md5 = "94c86[redacted]"
