500 VMs rclone copy to 1 VM, but the files are incomplete

I'm using rclone to copy one 50MB file from each of 500 VMs to a single VM, so 500 files in total are copied to the one target VM. At first I ran into SSH session problems, so I tuned the sshd parameters on the target VM as below:

MaxSessions 2000
ClientAliveInterval 1
ClientAliveCountMax 5
MaxStartups 1000

But the files on the target VM are still incomplete. I got this rclone error message on one of the source VMs.

2020/03/12 17:16:32 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "-vv" "/tmp/192-168-1-64_2020-03-12-17-15-01_testfile" "target:/myramdisk/"]

2020/03/12 17:16:32 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"

2020/03/12 17:16:33 DEBUG : sftp://root@192.168.1.170:22//myramdisk/: New connection 192.168.1.64:42586->192.168.1.170:22 to "SSH-2.0-OpenSSH_7.4"

2020/03/12 17:16:34 DEBUG : 192-168-1-64_2020-03-12-17-15-01_testfile: Need to transfer - File not found at Destination

2020/03/12 17:16:47 DEBUG : 192-168-1-64_2020-03-12-17-15-01_testfile: Removed after failed upload: sftp: "Failure" (SSH_FX_FAILURE)

2020/03/12 17:16:47 ERROR : 192-168-1-64_2020-03-12-17-15-01_testfile: Failed to copy: Update ReadFrom failed: sftp: "Failure" (SSH_FX_FAILURE)

2020/03/12 17:16:47 ERROR : Attempt 1/3 failed with 1 errors and: Update ReadFrom failed: sftp: "Failure" (SSH_FX_FAILURE)

2020/03/12 17:16:47 DEBUG : 192-168-1-64_2020-03-12-17-15-01_testfile: Need to transfer - File not found at Destination

2020/03/12 17:16:49 DEBUG : 192-168-1-64_2020-03-12-17-15-01_testfile: Removed after failed upload: sftp: "Failure" (SSH_FX_FAILURE)

2020/03/12 17:16:51 ERROR : 192-168-1-64_2020-03-12-17-15-01_testfile: Failed to copy: Update ReadFrom failed: sftp: "Failure" (SSH_FX_FAILURE)

2020/03/12 17:16:51 ERROR : Attempt 2/3 failed with 1 errors and: Update ReadFrom failed: sftp: "Failure" (SSH_FX_FAILURE)

2020/03/12 17:16:51 DEBUG : 192-168-1-64_2020-03-12-17-15-01_testfile: Need to transfer - File not found at Destination

Below is my rclone config file. Thanks!

[target]
type = sftp
host = 192.168.1.170
user = root
port = 22
pass = ***
use_insecure_cipher = false
md5sum_command = md5sum
sha1sum_command = sha1sum

SSH isn't particularly helpful with its error message!

I would have thought that if you are doing 500 connections at once you are likely running out of CPU, RAM or file handles on the server, so I'd check the server logs.

I got some sftp server errors on the target VM. It has a 96-core CPU and 96GB of memory, but I gave 50GB to the RAM disk used as the copy destination, so maybe it is short of memory. Let me try with more memory.

The file handles have been tuned already.

Mar 12 17:16:50 192-168-1-170 sftp-server[77782]: error: process_write: write failed
Mar 12 17:16:50 192-168-1-170 sftp-server[77782]: error: process_write: write failed
Mar 12 17:16:50 192-168-1-170 sftp-server[77782]: error: process_write: write failed

The error message from rclone would be consistent with that. :crossed_fingers: it works!

I increased the memory to 128GB, but still got the same issue: the files are incomplete.

This time, I didn't see any errors in /var/log/messages between 12:30 and 12:35.

During those 5 minutes, I did find 1767 sleeping threads. I'm not sure whether that's the problem or how to debug it.

Can you try a subset of the 500 and see if that works? I think something is being overloaded on the server so it would be useful to test if it works for 5 and 50, say.
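A minimal sketch of how such a staged test could be planned. The host names (`vm-001` and so on) and the stage sizes beyond 5 and 50 are assumptions for illustration, and how the copy is actually launched on each selected VM is left to whatever tooling is already in use.

```python
def staged_subsets(hosts, sizes):
    """Yield the first `size` hosts for each stage, smallest stage first."""
    for size in sorted(sizes):
        # Never ask for more hosts than exist.
        yield hosts[:min(size, len(hosts))]

# Hypothetical inventory of the 500 source VMs.
hosts = [f"vm-{i:03d}" for i in range(1, 501)]

# Stage sizes are an assumption: 5 and 50 as suggested, then larger steps.
stages = list(staged_subsets(hosts, [5, 50, 300, 500]))

for stage in stages:
    print(f"stage with {len(stage)} VMs: {stage[0]} .. {stage[-1]}")
```

Running the copy only from the VMs in each stage, and checking the target after each one, shows roughly where between "works" and "fails" the server tips over.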

Sleeping threads could be waiting on disk or network I guess... 500 * 50MB files might be maxing the input network and causing networking delays?

300 VMs worked fine.

I used the ramdisk as the target folder, and with fio a sequential write with 1M IO size reaches 2.7GB/s, so I don't think the disk is the bottleneck. Networking, probably. The bandwidth between VMs is 20Gbps, but they all share the same 20Gbps physical ports.
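As a rough back-of-the-envelope check of the numbers above (a sketch only, assuming decimal units and that all 500 transfers share one 20 Gbit/s link with no protocol overhead):

```python
FILE_MB = 50          # size of each file, from the post
VMS = 500             # number of source VMs
LINK_GBIT_S = 20      # shared physical link, from the post
RAMDISK_GB_S = 2.7    # fio sequential write result, from the post

total_gb = FILE_MB * VMS / 1000            # total data landing on the target
link_gb_s = LINK_GBIT_S / 8                # Gbit/s -> GByte/s
network_seconds = total_gb / link_gb_s     # best case if the link is the limit
ramdisk_seconds = total_gb / RAMDISK_GB_S  # best case if the ramdisk is the limit

print(f"total data: {total_gb} GB")
print(f"link: {link_gb_s} GB/s, at least {network_seconds:.1f} s")
print(f"ramdisk: {RAMDISK_GB_S} GB/s, at least {ramdisk_seconds:.1f} s")
```

On these assumptions the link (2.5 GB/s) is slightly slower than the measured ramdisk write rate, which is consistent with the network being the first thing to saturate.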

You could slow down the transfers with --bwlimit. 20 Gbit/s is 2.5 GByte/s, which split over 500 would be 5 MByte/s each. So if you used a --bwlimit a bit less than that, say --bwlimit 3M, it should keep the link from filling up.
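The arithmetic behind that suggestion can be sketched as follows. The function names and the 0.6 headroom factor are assumptions chosen to land near the 3M example, and decimal units are used throughout.

```python
def fair_share_mbyte_s(link_gbit_s: float, clients: int) -> float:
    """Even split of a shared link across clients, in MByte/s."""
    link_mbyte_s = link_gbit_s * 1000 / 8  # Gbit/s -> MByte/s
    return link_mbyte_s / clients

def seconds_per_file(file_mb: float, limit_mbyte_s: float) -> float:
    """Time for one client to send its file at the given rate limit."""
    return file_mb / limit_mbyte_s

share = fair_share_mbyte_s(20, 500)   # even split of 20 Gbit/s over 500 VMs
limit = share * 0.6                   # assumed headroom so the link is not full
duration = seconds_per_file(50, limit)

print(f"fair share: {share:.1f} MByte/s per VM")
print(f"suggested limit: about {limit:.0f} MByte/s per VM")
print(f"one 50MB file at that limit: about {duration:.0f} s")
```

Since every VM sends in parallel, the whole job takes roughly as long as one file at the chosen limit, so raising the limit toward the fair share shortens the run until the link starts to saturate.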

Thanks for the suggestion, and I'll try it.

BTW, I want to copy the data ASAP. Hope this parameter won't slow down the transfer :slight_smile:

Great

It will slow it down a bit, hopefully enough so as not to overwhelm the network. You can tweak it up and down to find the optimum point.