SSH/SFTP high CPU usage

Hello,

I have two servers, each with 2 x 64-core AMD 7763 CPUs and 2TB RAM, interconnected via 200Gbit fiber optics (~25GB/s). Both have an MDADM RAID10 array with a maximum read throughput of ~70GB/s and a write throughput of ~35GB/s.

I have set up the source as an SSH/SFTP remote with the following settings:
type = sftp
host = internalIP
user = sshUser
port = 22
key_file = ssh-key
use_insecure_cipher = true
md5sum_command = md5sum
sha1sum_command = sha1sum

When trying to copy a whole folder using the copy command (rclone --transfers 16 copy SOURCE:/source-folder target), I see 60-80 cores in use for a total bandwidth of ~1 to 1.2GB/s, whereas a single copy via rsync or scp achieves about 200MB/s per core. On the other side, the source server only has 6 to 8 cores in use. When pushed with --transfers 128, it reaches a throughput of ~2.5GB/s at a cost of over 150 cores, almost half of which is in kernel mode. The machine is capable of reading, transferring and writing over 25GB/s over the network, yet rclone is achieving only 10% of the theoretical throughput, all at huge CPU cost. File sizes are between 16MB and 5GB, and over half of the files are in the range of hundreds of MB to 2GB. I would be more than happy to try any fine-tuning; however, based on the load, it looks like there might be some fundamental issue preventing efficient linear scaling.
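For reference, the back-of-the-envelope per-core math I am using (just a quick sketch of the arithmetic; the inputs are the observed figures above):

    // percore.go - back-of-the-envelope per-core throughput from the figures above
    // (a sketch of the arithmetic only; the inputs are observed values, not measured here).
    package main

    import "fmt"

    func perCore(label string, throughputMBs, cores float64) {
        fmt.Printf("%-22s %6.1f MB/s per core\n", label, throughputMBs/cores)
    }

    func main() {
        perCore("rclone, 16 transfers", 1200, 70)   // ~1.2 GB/s spread over ~70 cores
        perCore("rclone, 128 transfers", 2500, 150) // ~2.5 GB/s over ~150 cores
        perCore("single rsync/scp", 200, 1)         // ~200 MB/s on one core
    }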

Rclone version:
rclone v1.56.0

  • os/version: ubuntu 20.04 (64 bit)
  • os/kernel: 5.14.0-051400rc6-generic (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.16.5
  • go/linking: static
  • go/tags: none

The operating system is Ubuntu 20.04 LTS, with kernel 5.14rc6 on the target (which runs rclone) and 5.11 on the source.

You can't really compare rsync and rclone as they operate very differently.

With 16 transfers, the CPU/disk is checksumming 16 files at a time. You should start smaller and figure out what your best bet is.

rsync/scp don't checksum beforehand, so it's not a great comparison.

Checksumming runs on the source server, and there I see only 6 to 8 cores in use. Checksumming does not account for the 50+ extra cores. If it did, we would be talking about 50 cores for 1GB/s, i.e. a checksum that runs at only 20MB/s per core. Last time I checked, MD5 and SHA1 were known to run at hundreds of MB/s or even GB/s per core, orders of magnitude faster. It's physically impossible for checksumming to be the bottleneck, unless the implementation is buggy.
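To put a number on that, here is a minimal single-core hash throughput sketch (my own test program, not rclone code; as far as I understand, rclone uses these same Go standard-library hashes):

    // hashbench.go - minimal single-core MD5/SHA1 throughput check (a sketch, not
    // rclone code). Exact numbers depend on CPU and Go version, but on a modern
    // x86 core both hashes should land in the hundreds of MB/s or above.
    package main

    import (
        "crypto/md5"
        "crypto/sha1"
        "fmt"
        "hash"
        "time"
    )

    func throughput(name string, h hash.Hash) {
        buf := make([]byte, 1<<20) // 1 MiB of zeroes is fine for a speed test
        const totalBytes = 2 << 30 // hash 2 GiB in total
        start := time.Now()
        for done := 0; done < totalBytes; done += len(buf) {
            h.Write(buf)
        }
        h.Sum(nil)
        secs := time.Since(start).Seconds()
        fmt.Printf("%s: %.0f MB/s on a single core\n", name, float64(totalBytes)/secs/1e6)
    }

    func main() {
        throughput("md5", md5.New())
        throughput("sha1", sha1.New())
    }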

Checks run on both sides: the checksum is computed on the source, and it is validated on the other side once the transfer has completed.

It can peg CPU or disk. If you can share stats from the system, we can see what's going on.

The point is that checksumming also runs on the source, where the operation involves reading from disk at 1GB/s, computing the MD5, then pushing the data over the network. The source now shows an average usage of ~3% (~7.5 cores) over the last few hours.

Right now the target has 23.4% usage (~60 cores) hogged by rclone. The disk array is NVMe-based; read and write await stay under 1ms even at peak. It is definitely a CPU problem, not a disk one, since I can read at 70GB/s with less than 10% machine load, and writing is the same. If it scaled linearly with the source's load, I should get 10GB/s at 30% usage, but the bottleneck appears to be on the target, on rclone's side. Are there any known performance figures for how much CPU is expected per 100MB/s of traffic?

Hi sergiuhlihor, welcome to the forum

What is the purpose of this post?

To display your skills and knowledge, or to obtain an easily searchable URL (https://forum.rclone.org/t/ssh-sftp-high-cpu-usage/26146/5) to discredit rclone, or something else?

I guess not. The main benefit of rclone is its capability to connect to a lot of backends - the most used ones are typically also the best optimized. It therefore wouldn't be fair to compare rclone's SFTP to specialized tools like rsync or scp.

Feel free to create a pull request with SFTP optimizations if you fancy.

Hi Ole,

The scope is to raise what looks like a very big scalability issue and to learn whether the behavior I'm seeing is due to a lack of special optimizations that may not be documented, or whether rclone is indeed not scalable beyond a certain limit. I'm not here to teach people about performance metrics.

I have some very fast servers, as you can see in the specs, and I am more than capable of writing simple software that does data replication at 25GB/s for my company's internal needs. However, since rclone already exists and one of its features is parallel transfer, I am more than happy to use it and report issues like this in the hope that someone takes a look and optimizes. Everyone benefits. If this turns out to be indeed a scalability problem and not some obscure setting that needs to be changed, I'd gladly test any patched versions and report the results.

Thanks for clarifying!

I use rclone SFTP myself against a tiny server (a 4-core Realtek Semiconductor RTD1296 at 1.4 GHz), and my immediate impression is that I am limited by disk speed (I haven't really checked, because the speed is fine for my usage).

Just as an experiment, what happens if you disable checksums at the target? That is, set:

md5sum_command = none
sha1sum_command = none
sftp_disable_hashcheck = true

Just tested: no significant difference, similar throughput per core. It looks like it's able to ramp up to higher speeds with 16 transfers, but it uses proportionately more CPU. Before it was using ~60 cores for ~1GB/s; now it uses ~110 cores for about 1.8GB/s. However, the first files are bigger, so I assume that may affect the numbers.
What numbers do you get for your machine per core? If I compute it now, I get an average of about 16MB/s per core.

I have no issues, so I suggest you measure against another SFTP server, e.g. using rclone serve sftp on your own client.

Thank you, Ole, for replying, but whether or not you have issues is not what I asked.

No, you asked me to make an effort that you are perfectly able to make yourself.

Why should I do that?

Because you could bring value by posting results obtained with an ARM-based CPU, which may or may not have some special instruction set that may or may not significantly speed up this tool. I only have x86 to test on.

I can bring value in many ways and only have limited time, so I have decided that this isn't a priority for me.

If it has priority to you, then I am sure you can find a way to test it using your own time and resources.

Then an honest "I do not have time to test" would have spared a few minutes of both your precious time and mine. Thank you for your help!

You are welcome :grin:

That sounds low to me.

The CPU overhead you are seeing is likely the SSH encryption. This is performed in software using the Go standard library, not OpenSSL, so it can vary in speed depending on which cipher you are using.

On 32- and 64-bit x86 the Go stdlib will use hardware acceleration for AES-based encryption. The other very fast cipher is ChaCha20, which was designed to run fast in software on CPUs.
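If you want to sanity-check the raw speed of the Go stdlib AES code on your CPU, a rough sketch like this (not rclone code; exact numbers depend on the Go version and CPU) will show what a single core can do:

    // aesbench.go - rough single-core throughput of the Go stdlib AES modes
    // (an illustrative sketch, not rclone code). With AES-NI, aes128-gcm should
    // reach several GB/s per core; aes128-ctr goes through the generic CTR code
    // so it is usually slower, but still well into the hundreds of MB/s.
    package main

    import (
        "crypto/aes"
        "crypto/cipher"
        "fmt"
        "time"
    )

    func main() {
        key := make([]byte, 16) // dummy key, throughput test only
        block, err := aes.NewCipher(key)
        if err != nil {
            panic(err)
        }

        buf := make([]byte, 1<<20)
        const totalBytes = 2 << 30

        // aes128-ctr (one of the ciphers SSH may negotiate)
        ctr := cipher.NewCTR(block, make([]byte, aes.BlockSize))
        start := time.Now()
        for done := 0; done < totalBytes; done += len(buf) {
            ctr.XORKeyStream(buf, buf)
        }
        fmt.Printf("aes128-ctr: %.0f MB/s\n", float64(totalBytes)/time.Since(start).Seconds()/1e6)

        // aes128-gcm (encrypt + authenticate in one pass)
        gcm, err := cipher.NewGCM(block)
        if err != nil {
            panic(err)
        }
        nonce := make([]byte, gcm.NonceSize()) // fixed nonce is fine for a speed test only
        out := make([]byte, 0, len(buf)+gcm.Overhead())
        start = time.Now()
        for done := 0; done < totalBytes; done += len(buf) {
            gcm.Seal(out[:0], nonce, buf, nil)
        }
        fmt.Printf("aes128-gcm: %.0f MB/s\n", float64(totalBytes)/time.Since(start).Seconds()/1e6)
    }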

Can you find out which cipher has been negotiated for the SSH connection? This will most likely mean looking at the SSH server logs, as I don't think rclone logs that info.

I notice you set use_insecure_cipher = true - this might mean that you are negotiating an old, slow cipher. Does it work without that setting?

	"aes128-gcm@openssh.com",
	chacha20Poly1305ID,
	"aes128-ctr", "aes192-ctr", "aes256-ctr",

This is the SSH library's preferred cipher list - any of those will run very fast.
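For completeness, here is a rough sketch (not rclone code; host, user and key path are placeholders) of how you could force one of those fast ciphers using golang.org/x/crypto/ssh, the same library rclone uses, if you want to test cipher negotiation outside rclone:

    // forcecipher.go - a sketch (not rclone code) that opens an SSH session with
    // an explicitly restricted cipher list via golang.org/x/crypto/ssh, the same
    // library rclone uses. Host, user and key path below are placeholders.
    package main

    import (
        "fmt"
        "log"
        "os"

        "golang.org/x/crypto/ssh"
    )

    func main() {
        keyBytes, err := os.ReadFile("/path/to/ssh-key") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        signer, err := ssh.ParsePrivateKey(keyBytes)
        if err != nil {
            log.Fatal(err)
        }

        cfg := &ssh.ClientConfig{
            User: "sshUser", // placeholder user
            Auth: []ssh.AuthMethod{ssh.PublicKeys(signer)},
            // Only offer a fast AEAD cipher so nothing slower can be negotiated.
            Config:          ssh.Config{Ciphers: []string{"aes128-gcm@openssh.com"}},
            HostKeyCallback: ssh.InsecureIgnoreHostKey(), // test only
        }

        client, err := ssh.Dial("tcp", "internalIP:22", cfg) // placeholder host
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        session, err := client.NewSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        out, err := session.Output("echo cipher negotiated ok")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Print(string(out))
    }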

Hi Nick,
Thanks for the hint! I changed to use_insecure_cipher = false, but there is no visible benefit. I was not yet able to confirm that it is using AES; I'm looking into it more deeply now. I also ran it with --cpuprofile for one minute. The first 3 entries when running go tool pprof with dot are runtime.systemstack, runtime.mallocgc and runtime.futex, though since Go is not my native language, I have no idea whether this is expected for the application. I can share the profile if you wish.

I am not a Go expert either, but a quick Googling suggests the entries you are seeing at the top are related to memory allocation and runtime scheduling.

A quick search for "memory" in the rclone docs reveals these flags that may influence the above:
https://rclone.org/docs/#buffer-size-size
https://rclone.org/docs/#use-mmap

Have you tried them?

How do they influence your speed and CPU usage?

Hi Nick,
I have the following logged on source:
kex: client->server cipher: aes128-ctr MAC: hmac-sha2-256-etm@openssh.com compression: none [preauth]
kex: server->client cipher: aes128-ctr MAC: hmac-sha2-256-etm@openssh.com compression: none [preauth]

I am now doing some tests to see where this CPU usage is coming from. I see about 12 cores in use with: rclone --progress --verbose --checkers=1 --transfers 1 --multi-thread-streams=1 --multi-thread-cutoff=64M copy Source:/folder folder, and the speed is about 90MB/s. The CPU is running at full speed on both source and target (which are the same kind of servers). When run with multi-thread-streams=0, the speed increases to about 130MB/s, but the load increases linearly. Something is definitely off. One note: I have transparent huge pages disabled on the server, and of course the CPUs are quite new (kernel support for them is still fairly fresh). Is there any chance the binary is somehow using a de-optimized instruction set in some of the libraries?

Edit: I think I am narrowing it down. I have two identical servers, and a copy with rclone running from the source (the other server) leads to very low CPU usage. The only difference so far is that the target (the server which was running rclone previously) is running kernel 5.14rc6.