Syncing many files (5M+)

What are good ways to sync many files (5M+) that are all within the same folder? Unfortunately I cannot change the folder structure.

I am using a Hetzner Storage Box, but unfortunately it is taking over an hour with little activity. It works great if I wait longer, or when transferring 1-3M files (same folder).

Command that I use:

taskset -c 20-31 timeout -v -k 30s "${timeout}" \
    rclone sync "server/${world}/" "bak:${world}/" \
      --retries "${retries}" \
      --retries-sleep 5s \
      --quiet \
      --stats-log-level ERROR \
      --stats=1h \
      --transfers=9 --checkers=1 --ignore-checksum

rclone config redacted

[bak]
type = sftp
host = XXX
user = XXX
port = 23
pass = XXX
use_insecure_cipher = true
shell_type = unix
md5sum_command = md5 -r
sha1sum_command = sha1 -r

rclone version

rclone v1.65.2
- os/version: debian 12.5 (64 bit)
- os/kernel: 6.5.0-0.deb12.4-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.21.6
- go/linking: static
- go/tags: none

hi,
when you posted, there was a template of questions?

what info are you missing?

when you posted, there was a template of questions for you to answer.
please provide all the answers.

you are being super passive aggressive :slight_smile: - just curious, what info are you missing?

sorry you think that, i am just a volunteer.

i see that you are updating your first post.
the remote is using sftp, a very old, slow protocol.

to reduce the number of checks, you can use filters. for example,
--max-age=24h

with storagebox, you can use webdav. that might be quicker, but there are downsides:
sftp supports checksums, webdav does not.

yet another option, which is what i do, is to run rclone inside a hetzner vm, in the same datacenter as the storagebox.
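
for example, a webdav remote for the same storagebox could look something like this. just a sketch - the remote name and url are placeholders, check the hetzner docs for the hostname of your box:

[bakdav]
type = webdav
url = https://XXX
vendor = other
user = XXX
pass = XXX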

thanks, --max-age of 3d is worth a shot since I am doing daily syncs - might even do them more often.

switching to webdav I am not so sure about, since I am already maxing out the NIC, but I might give it a try as well.

--ignore-checksum and use_insecure_cipher = true
just curious, why use those?

for the initial transfer, you could use rclone copy and some flags to greatly reduce the number of checks,
as the main issue with sftp is latency.
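
for example, just a sketch - --no-check-dest skips listing and checking the destination entirely, so only use it when the destination is empty or you do not mind re-uploading files that already exist; the flag values are illustrative:

rclone copy "server/${world}/" "bak:${world}/" --no-check-dest --transfers=8 -P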

I run the command twice with slightly different params, also because there is a max connection limit with Hetzner.

The first run is without the checksum check, the second run is with it and with more checker threads.

I enable the insecure algo for speed improvements during transfer. The issue is not bandwidth or data transfer rates, but that there is no activity for a long time.
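
For reference, the two runs look roughly like this (simplified; the exact flag values here are only illustrative):

# bulk pass: move the data, skip checksums
rclone sync "server/${world}/" "bak:${world}/" --ignore-checksum --transfers=9 --checkers=1

# integrity pass: compare by checksum, with more checker threads
rclone sync "server/${world}/" "bak:${world}/" --checksum --transfers=4 --checkers=8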

This is rclone listing that directory with 5 million files in it. It is probably taking that time to list the directory.

It might be that different protocols are faster - the storage box supports lots (webdav/ftp/smb), so it would be worth trying those. However, I suspect it might be the storage box itself.
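
If you want to confirm that, you could time just the listing on its own, for example (the path follows your sync command):

time rclone ls "bak:${world}/" | wc -l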

Is the max-age filter working correctly? It seems strange that a data transfer of ~36 GiB is slower than running md5 against 3.5M files:

Wed Mar 27 07:51:43 AM UTC 2024 - [world_nether] starting bulk backup...
2024/03/27 08:43:51 ERROR :
Transferred: 36.466 GiB / 36.466 GiB, 100%, 58.777 KiB/s, ETA 0s
Checks: 89219 / 89219, 100%
Deleted: 10 (files), 0 (dirs)
Transferred: 17160 / 17160, 100%
Elapsed time: 52m8.6s

Wed Mar 27 08:43:51 AM UTC 2024 - [world_nether] starting integrity backup...
2024-03-27 08:46:21 [INFO]: 2m30s silence achieved after 2m30.000155168s
2024/03/27 09:04:55 ERROR :
Transferred: 2.083 GiB / 2.083 GiB, 100%, 1.140 KiB/s, ETA 0s
Checks: 3529532 / 3529532, 100%
Transferred: 1131 / 1131, 100%
Elapsed time: 18m33.7s

hmm, maybe it is indeed worthwhile to switch protocols, or IO is limited (not throughput)

Rclone still has to read the entire directory before filtering it by age. Of all the backends, only Google Drive can filter the listing by age on the server.

it should work well, assuming the source is local, "server/${world}/"

an easy way to test filters is to list the files affected by the filter:
rclone ls "server/${world}/" --max-age=3d -vv

and a safe way to test sync commands is with --dry-run:
rclone sync "server/${world}/" "bak:${world}/" --max-age=3d --dry-run -vv

Adding --no-traverse will stop rclone trying to list the 5M files on the destination with --max-age=3d.
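
Putting that together, an incremental top-up run might look something like this (using copy rather than sync here; the flag values are just an example):

rclone copy "server/${world}/" "bak:${world}/" --max-age=3d --no-traverse --transfers=9 -v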

--no-traverse does not work with sync? And 99% of my transfers are sync-type transfers.

Switching to webdav immensely lowered processing time:

Thu Mar 28 07:49:18 AM UTC 2024 - [world_nether] starting bulk backup...
2024/03/28 08:17:25 ERROR :
Transferred: 62.621 GiB / 62.621 GiB, 100%, 11 B/s, ETA 0s
Checks: 74228 / 74228, 100%
Transferred: 27636 / 27636, 100%
Elapsed time: 28m6.8s

Thu Mar 28 08:17:25 AM UTC 2024 - [world_nether] starting integrity backup...
2024-03-28 08:19:55 [INFO]: 2m30s silence achieved after 2m30.000140809s
timeout: sending signal TERM to command ‘rclone’
Thu Mar 28 08:39:55 AM UTC 2024 - [world_nether] integrity backup failed
Thu Mar 28 08:39:55 AM UTC 2024 - [world_nether] all backups done

However, now the secondary checksum-based run failed on a timeout. My guess is it's really the “ls”, which may or may not be cached on the host, that causes the longer times.

Unfortunately I need to do a checksum check, so I will look for alternatives.

--no-traverse should work with sync, but I thought you were using copy. It doesn't make sense to use --max-age with sync normally.

If you want to check integrity you could use rclone check --download, and that will give you 100% assurance on any backend. Since you aren't going over the Internet to the storage box, that is probably quite a reasonable thing to do.
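
For example, something along these lines (paths as in your sync command):

rclone check "server/${world}/" "bak:${world}/" --download -v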

You could also try the smb backend - I think that works with the storage box too.

hi, the OP needs checksums, which the smb backend lacks.
indeed, the storagebox does support smb, tho i am not sure it is a good idea to expose samba over the internet.

@Netherwhal, here is my how-to guide about using smb with rclone
https://forum.rclone.org/t/how-to-access-smb-samba-with-rclone/42754

I'm actually thinking I might have outgrown Hetzner. The fact that md5 hashes are calculated on demand is not good for performance, nor is the fact that a file listing (with either webdav or sftp) is done on demand each time.

SeaweedFS is the only solution that stores metadata, checksums and directory listings in a DB, making remote ls/lookups fast.

So I will try that out for now.

Or maybe pay just a little more per TB and use B2 or Wasabi?

You could also set up MinIO, which works very well.
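
For example, an rclone remote pointing at a self-hosted MinIO instance could look roughly like this (the remote name, endpoint and keys are placeholders):

[minio]
type = s3
provider = Minio
access_key_id = XXX
secret_access_key = XXX
endpoint = https://XXX:9000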