Syncing many files (5M+)

What are good ways to sync many files (5M+) that are all within the same folder? Unfortunately I cannot change the folder structure.

I am using a Hetzner Storage Box, but unfortunately it is taking over an hour with little activity. It works great if I wait longer, or when transferring 1-3M files (same folder).

Command that I use:

taskset -c 20-31 timeout -v -k 30s "${timeout}" \
    rclone sync "server/${world}/" "bak:${world}/" \
      --retries "${retries}" \
      --retries-sleep 5s \
      --quiet \
      --stats-log-level ERROR \
      --stats=1h \
      --transfers=9 --checkers=1 --ignore-checksum

rclone config redacted

[bak]
type = sftp
host = XXX
user = XXX
port = 23
pass = XXX
use_insecure_cipher = true
shell_type = unix
md5sum_command = md5 -r
sha1sum_command = sha1 -r

rclone version

rclone v1.65.2
- os/version: debian 12.5 (64 bit)
- os/kernel: 6.5.0-0.deb12.4-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.21.6
- go/linking: static
- go/tags: none

hi,
when you posted, there was a template of questions?

what info are you missing?

when you posted, there was a template of questions for you to answer.
please provide all the answers.

you are being super passive aggressive :slight_smile: - just curious, what info are you missing?

sorry you think that, i am just a volunteer.

i see that you are updating your first post.
the remote is using sftp, a very old, slow protocol.

to reduce the number of checks, you can use filters. for example,
--max-age=24h

with storagebox, you can use webdav. that might be quicker, but there are downsides:
sftp supports checksums, webdav does not.

yet another option, which is what i do, is to run rclone inside a hetzner vm, in the same datacenter as the storagebox.
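
for example, a webdav remote for the same storagebox could look something like this. just a sketch - the remote name and url are placeholders, check the hetzner docs for the hostname of your box:

[bakdav]
type = webdav
url = https://XXX
vendor = other
user = XXX
pass = XXX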

thanks, --max-age of 3d is worth a shot since I am doing daily syncs - might even do them more often.

switching to webdav I am not so sure about, since I am already maxing out the NIC, but I might give it a try as well.

--ignore-checksum and use_insecure_cipher = true
just curious, why use those?

for the initial transfer, you could use rclone copy and some flags to greatly reduce the number of checks,
as the main issue with sftp is latency.
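
for example, just a sketch - --no-check-dest skips listing and checking the destination entirely, so only use it when the destination is empty or you do not mind re-uploading files that already exist; the flag values are illustrative:

rclone copy "server/${world}/" "bak:${world}/" --no-check-dest --transfers=8 -P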

I run the command twice with slightly different params, also because there is a max connection limit with Hetzner.

The first run is without the checksum check, the second run is with it and with more checker threads.

I enable the insecure algo for speed improvements during transfer. The issue is not bandwidth or data transfer rates, but that there is no activity for a long time.
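
For reference, the two runs look roughly like this (simplified; the exact flag values here are only illustrative):

# bulk pass: move the data, skip checksums
rclone sync "server/${world}/" "bak:${world}/" --ignore-checksum --transfers=9 --checkers=1

# integrity pass: compare by checksum, with more checker threads
rclone sync "server/${world}/" "bak:${world}/" --checksum --transfers=4 --checkers=8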

This is rclone listing that directory with 5 million files in it. It is probably taking that time to list the directory.

It might be that different protocols are faster - the storage box supports lots (webdav/ftp/smb), so it would be worth trying those. However, I suspect it might be the storage box itself.
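
If you want to confirm that, you could time just the listing on its own, for example (the path follows your sync command):

time rclone ls "bak:${world}/" | wc -l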

Is the max-age filter working correctly? It seems strange that a data transfer of ~36 GiB is slower than running md5 against 3.5M files:

Wed Mar 27 07:51:43 AM UTC 2024 - [world_nether] starting bulk backup...
2024/03/27 08:43:51 ERROR :
Transferred: 36.466 GiB / 36.466 GiB, 100%, 58.777 KiB/s, ETA 0s
Checks: 89219 / 89219, 100%
Deleted: 10 (files), 0 (dirs)
Transferred: 17160 / 17160, 100%
Elapsed time: 52m8.6s

Wed Mar 27 08:43:51 AM UTC 2024 - [world_nether] starting integrity backup...
2024-03-27 08:46:21 [INFO]: 2m30s silence achieved after 2m30.000155168s
2024/03/27 09:04:55 ERROR :
Transferred: 2.083 GiB / 2.083 GiB, 100%, 1.140 KiB/s, ETA 0s
Checks: 3529532 / 3529532, 100%
Transferred: 1131 / 1131, 100%
Elapsed time: 18m33.7s

hmm, maybe it is indeed worthwhile to switch protocols, or IO is limited (not throughput)

Rclone still has to read the entire directory before filtering it by age. Of all the backends, only Google Drive can filter the listing by age on the server.

it should work well, assuming the source is local, "server/${world}/"

an easy way to test filters is to list the files affected by the filter:
rclone ls "server/${world}/" --max-age=3d -vv

and a safe way to test sync commands is with --dry-run:
rclone sync "server/${world}/" "bak:${world}/" --max-age=3d --dry-run -vv

Adding --no-traverse will stop rclone trying to list the 5M files on the destination with --max-age=3d.
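
Putting that together, an incremental top-up run might look something like this (using copy rather than sync here; the flag values are just an example):

rclone copy "server/${world}/" "bak:${world}/" --max-age=3d --no-traverse --transfers=9 -v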

--no-traverse does not work with sync? And 99% of my transfers are sync-type transfers.

Switching to webdav immensely lowered processing time:

Thu Mar 28 07:49:18 AM UTC 2024 - [world_nether] starting bulk backup...
2024/03/28 08:17:25 ERROR :
Transferred: 62.621 GiB / 62.621 GiB, 100%, 11 B/s, ETA 0s
Checks: 74228 / 74228, 100%
Transferred: 27636 / 27636, 100%
Elapsed time: 28m6.8s

Thu Mar 28 08:17:25 AM UTC 2024 - [world_nether] starting integrity backup...
2024-03-28 08:19:55 [INFO]: 2m30s silence achieved after 2m30.000140809s
timeout: sending signal TERM to command ‘rclone’
Thu Mar 28 08:39:55 AM UTC 2024 - [world_nether] integrity backup failed
Thu Mar 28 08:39:55 AM UTC 2024 - [world_nether] all backups done

However, now the secondary checksum-based run failed on a timeout. My guess is it's really the “ls”, which may or may not be cached on the host, that causes the longer times.

Unfortunately I need to do a checksum check, so I will look for alternatives.

--no-traverse should work with sync, but I thought you were using copy. It doesn't make sense to use --max-age with sync normally.

If you want to check integrity you could use rclone check --download, and that will give you 100% assurance on any backend. Since you aren't going over the Internet to the storage box, that is probably quite a reasonable thing to do.
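
For example, something along these lines (paths as in your sync command):

rclone check "server/${world}/" "bak:${world}/" --download -v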

You could also try the smb backend - I think that works with the storage box too.

hi, the OP needs checksums, which the smb backend lacks.
indeed, the storagebox does support smb, tho i am not sure it is a good idea to expose samba over the internet.

@Netherwhal, here is my how-to guide about using smb with rclone
https://forum.rclone.org/t/how-to-access-smb-samba-with-rclone/42754

I'm actually thinking I might have outgrown Hetzner. The fact that md5 hashes are calculated on demand is not good for performance, nor is the fact that a file listing (with either webdav or sftp) is done on demand each time.

SeaweedFS is the only solution that stores metadata, checksums and directory listings in a DB, making remote ls/lookups fast.

So I will try that out for now.

Or maybe pay just a little more per TB and use B2 or Wasabi?

You could also set up MinIO, which works very well.
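
For example, an rclone remote pointing at a self-hosted MinIO instance could look roughly like this (the remote name, endpoint and keys are placeholders):

[minio]
type = s3
provider = Minio
access_key_id = XXX
secret_access_key = XXX
endpoint = https://XXX:9000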