Most efficient way to sync many clients with a central server

What is the problem you are having with rclone?

I've been experimenting with using rclone on automated test "worker" machines to pull down the test "workspace" (basically just a folder with a bunch of subdirs / files in it) from the main test server via sftp. It works fine, but I'm trying to tune the performance to get it running as fast as possible since this sync step is one of the main bottlenecks of our automated test system and I'm trying to improve upon the previous syncing method (rsync).

Details:

  • The workspace folder is about 7.5 GB in size, with a total of 96k files and 9k folders
  • There are 35 worker machines, and they all run the rclone sync command roughly simultaneously
  • 8 of the 35 workers are running Windows
  • Many of the files in the workspace are tiny Java .class files. The sync is actually significantly slower without the --sftp-disable-hashcheck flag, presumably because computing the hash ends up taking more time than just pushing the small file across the fast LAN
  • I rolled out rclone gradually to the worker machines, and noticed that syncing on the machines still using rsync became slower and slower as I migrated other workers to rclone, presumably because rclone is able to hog more of the network and/or server CPU resources
  • In the "slow" case where the previous sync was from a different branch (and thus a large number of files actually need to be synced), the fastest node finishes in about 5.5 minutes, and the slowest finishes in 6.5 minutes
  • Under the previous solution (rsync), non-Windows workers would finish the sync in 2-3 minutes, while Windows workers would finish in around 4-7 minutes.

So in terms of total cumulative time spent syncing across all nodes, rclone is currently a bit slower than rsync, but I'm not really sure why. Any tips / thoughts on other flags / approaches I could use to improve performance? Perhaps some way to compute / reuse a manifest of the directory contents up-front so all the worker nodes don't have to run a bunch of redundant dir listings?

Run the command 'rclone version' and share the full output of the command.

Windows-based workers:
rclone v1.68.1

  • os/version: Microsoft Windows Server 2016 Datacenter 1607 (64 bit)
  • os/kernel: 10.0.14393.7428 (x86_64)
  • os/type: windows
  • os/arch: amd64
  • go/version: go1.23.1
  • go/linking: static
  • go/tags: cmount

Ubuntu-based workers:
rclone v1.68.1

  • os/version: ubuntu 20.04 (64 bit)
  • os/kernel: 5.4.0-193-generic (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.23.1
  • go/linking: static
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

ssh/sftp

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync --transfers 2 --sftp-disable-hashcheck --inplace --fast-list --delete-excluded --ignore-checksum --exclude /.git/** jenkins-vm:${REMOTE_REPO_PATH} .

Please run 'rclone config redacted' and share the full output. If you get command not found, please make sure to update rclone.

[jenkins-vm]
type = sftp
host = XXX
user = XXX
key_file = ~/.ssh/id_rsa
shell_type = unix
md5sum_command = md5sum
sha1sum_command = sha1sum

A log from the command that you were trying to run with the -vv flag

(Log output is massive and contains sensitive info, but looks pretty normal / expected to me; it's copying over files that were modified and there's no evidence that it's attempting to compute any hashes)

welcome to the forum,

that is a small amount of data and a lot of small files.
maybe switch from stone-age sftp to a modern protocol such as s3 with a minio server, which will be much faster and does support --fast-list
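for example, a minio remote in the rclone config might look something like this (endpoint, port, bucket name and keys are placeholders, and the workspace would first need to be uploaded into a bucket):

[minio]
type = s3
provider = Minio
access_key_id = XXX
secret_access_key = XXX
endpoint = http://jenkins-vm:9000

then each worker could run something like
rclone sync minio:workspace . --fast-list --transfers 8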


note: --fast-list does nothing on sftp, since the backend does not support recursive listing (ListR), and that is part of what makes the sync slow

rclone backend features sftp: | grep "ListR"
                "ListR": false,

not sure there is a great solution, but quickly, off the top of my head:

to create a list of files, do something like
rclone lsf jenkins-vm:${REMOTE_REPO_PATH} --files-only --absolute --recursive > file.lst

on each worker machine, download file.lst and then sync the files with something like
rclone sync jenkins-vm:${REMOTE_REPO_PATH} . --files-from=file.lst
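putting it together with the flags from your original command, the worker side might look roughly like this (paths and transfer counts are placeholders/guesses to tune, not something i have tested):

# one-time step, run on or near the server whenever the workspace changes:
rclone lsf jenkins-vm:${REMOTE_REPO_PATH} --files-only --absolute --recursive --exclude /.git/** > file.lst

# on each worker: fetch file.lst (scp, http, shared drive, whatever), then
# sync only the files named in the manifest; this should avoid most of the
# redundant per-worker directory listings on the server
rclone sync jenkins-vm:${REMOTE_REPO_PATH} . --files-from=file.lst --transfers 4 --sftp-disable-hashcheck --ignore-checksum --inplace

i left out --exclude and --delete-excluded from the sync step since the .git filtering is already baked into file.lst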


maybe, to work around sftp being slow with sync, can try
https://forum.rclone.org/t/big-syncs-with-millions-of-files/40182


Hi BonusLord,

Have you considered using rclone lsf to create a manifest of files first? This might help reduce redundant directory listings across your worker nodes. Also, switching to a faster protocol like S3 could significantly improve sync speed.


Using lsf to pre-generate a sync manifest is an interesting idea; I'll definitely give that a try next time I have some time to dig deeper into this.

I'm a little skeptical that an S3-style backend would be much faster in practice, since that would require an additional up-front step of archiving / "uploading" all of the files into the S3 backend before they could start being fetched, which seems like it'd introduce a non-trivial amount of overhead.

I was able to identify one actual problem with my setup that was causing significant slowdowns. It turns out that sftp on the main server was configured to log at the INFO level, which results in 2 or 3 log messages per file transferred via sftp. So, in my use case where millions of small files were being transferred, this caused millions of log messages to be emitted, which bogged down the systemd-journal process and hogged limited CPU resources from other processes.
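In case it helps anyone else, the fix was roughly this change on the central server (assuming OpenSSH's internal-sftp / sftp-server; the exact Subsystem line, config path, and service name vary by distro):

# /etc/ssh/sshd_config on the central server
Subsystem sftp internal-sftp -l ERROR

# then reload sshd so the new log level takes effect
sudo systemctl reload ssh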

After reducing the sftp log level to ERROR, rclone's performance is much closer to what I'd expect, although it still ends up being a bit slower than rsync because it causes higher CPU usage on the central server for some reason (maybe it's negotiating a more CPU-intensive SSH cipher than rsync's transport...?).
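If the cipher theory turns out to be right, my understanding is that the sftp backend lets you pin which ciphers rclone offers, so something like this might be worth testing (the cipher choice here is a guess, and the server has to support it):

rclone sync jenkins-vm:${REMOTE_REPO_PATH} . --sftp-ciphers "aes128-gcm@openssh.com aes128-ctr" --sftp-disable-hashcheck --ignore-checksum --inplace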
