I've been seeing a huge amount of reverse traffic every time I download data from Box to Google Cloud using rclone copy. The outbound traffic is about 1% of the size of the incoming traffic (for ~500 GB of data, the egress from GCP to Box is ~5 GB, which is quite costly!).
My data consists mainly of large files (usually 10-20 GB each). I tried enlarging the buffer size, enabling --fast-list, and even --ignore-checksum --ignore-size, but none of them help. Why is the reverse traffic so high, and is there any way to reduce this egress?
There will be several things happening here causing data to flow from GCP -> Box:

1) HTTP requests going to Box to read listings
2) HTTP requests going to Box to download files
3) TCP acknowledgements going to Box for the file data flowing from Box -> GCP
4) HTTP keepalives / TLS keepalives
The overhead for 2) should be pretty small for just large files.
When you are doing a copy, is it just new files in mybox:/data, or are there also files being skipped because they have already been copied? In other words, are you running the copy command repeatedly to pick up new data? If so, that would increase the overhead for 1).
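If you are re-running the copy to pick up new data, one way to trim the listing overhead in 1) is to restrict each run to recently changed files. A sketch (the destination path /mnt/ssd/data is an assumption for illustration, not from your setup):

```shell
# Only consider source files modified in the last 24h, and skip listing
# the whole destination for each run (--no-traverse checks files
# individually, which suits picking up a few new files at a time).
rclone copy mybox:/data /mnt/ssd/data --max-age 24h --no-traverse
```

Whether this helps depends on how much of your reverse traffic is listing requests versus ACKs.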
As for 3) I think the overhead is probably about 1000:1 but I could be wrong about this!
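As a rough sanity check on 3), here is a back-of-envelope calculation assuming classic delayed ACKs: one ~54-byte empty ACK packet for every two ~1460-byte data segments (these sizes are textbook assumptions, not measurements from this transfer):

```shell
# Ratio of data bytes received to ACK bytes sent, assuming one 54-byte
# pure ACK acknowledges two 1460-byte data segments (delayed ACK).
awk 'BEGIN { printf "%.0f:1\n", (2 * 1460) / 54 }'
# prints 54:1
```

That would be roughly 2% reverse traffic; stacks that acknowledge less often (stretch ACKs, receive offload) push the ratio much higher, which is how it can approach 1000:1 in practice.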
I've no idea about the overhead for 4)
How many files are you copying and how long does the transfer take?
Why do you think it's high? You have a large number of transfers and a large number of checkers, so you are seeing all that TCP overhead multiplied across the number of transfers going on.
Thank you very much! I was trying to use local SSDs on GCP as a temporary cache for my instance, with Box as the permanent storage, so I need to copy all the data every time (the download speed is amazing though, up to 1 GB/s).
I did more testing on this, and yes, the more precise overhead ratio is 0.1%-0.3%, as you suggested. I now understand that a lot of HTTP requests are needed to initiate the download, but is there any rclone config that would fetch more data per request and help reduce this ratio further?
Even if you limit those things, I doubt the ratio changes much, since the overhead scales with the length of the operations.
If you have to transfer 100 GB, you can get some minimal changes by tweaking, but you aren't going to change it at a macro level. To move 100 GB over the Internet, there is roughly 10% of TCP overhead that you can't really tune around.
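For completeness, the sort of tweaks in question look like the following; paths and values are assumptions for illustration, and per the above you should expect only marginal gains, since most of the reverse traffic is TCP acknowledgements rather than HTTP requests:

```shell
# Knobs that trim the HTTP request count and concurrency a little.
rclone copy mybox:/data /mnt/ssd/data \
  --fast-list \
  --checkers 4 \
  --transfers 4 \
  --buffer-size 256M
```

Comparing the GCP egress metrics before and after a run with these flags is the only way to see whether they move the needle for your workload.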