Large inverse traffic (egress) when downloading from Box to Google Cloud

I'm seeing a huge amount of inverse traffic every time I download data from Box to Google Cloud using rclone copy. The outbound traffic is about 1% of the size of the incoming traffic (for ~500 GB of data, the egress from GCP to Box is ~5 GB, which is quite costly!).

My data consists mainly of large files (usually 10-20 GB each). I tried enlarging the buffer size, enabling --fast-list, and even --ignore-checksum --ignore-size, but none of these help. Why is the inverse traffic so high? Is there any way to reduce this egress traffic?

The command I'm trying to run is:

rclone copy -Pv mybox:/data /data --transfers 32 --checkers 64 --buffer-size 8192M --ignore-checksum --ignore-size --fast-list

Any input would be appreciated!

There will be several things happening here that cause data to flow from GCP -> Box:

  1. HTTP requests going to Box to read listings
  2. HTTP requests going to Box to download files
  3. TCP acknowledgements sent back to Box for the file data flowing from Box -> GCP
  4. HTTP keepalives / TLS keepalives

The overhead for 2) should be pretty small since your files are mostly large.

When you run the copy, is it just new files being copied from mybox:/data, or are there other files being skipped because they have already been copied? In other words, are you running the copy command repeatedly to pick up new data? If so, that would increase the overhead for 1).
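If you want to see exactly which requests rclone sends to Box, here's a quick sketch using rclone's debugging flags (-vv, --dump headers and --log-file are all standard rclone options):

rclone copy mybox:/data /data -vv --dump headers --log-file requests.log

Each logged request is a round trip to Box, so counting the listing requests vs the download requests in the log will tell you how much of the egress comes from 1).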

As for 3) I think the overhead is probably about 1000:1 but I could be wrong about this!
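As a rough back-of-the-envelope sketch (assuming classic delayed ACKs, i.e. one ~52 byte ACK for every two ~1500 byte segments; these packet sizes are assumptions, not measurements):

  52 / (2 x 1500) ≈ 1.7%

Modern stacks with stretch ACKs / GRO acknowledge much less often, e.g. one ACK per ~20 segments:

  52 / (20 x 1500) ≈ 0.17%

So anywhere from a few tenths of a percent up to a couple of percent is plausible, depending on the receiver's ACK behaviour.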

I've no idea about the overhead for 4)

How many files are you copying and how long does the transfer take?

Why do you think it's high? You have a large number of transfers and a large number of checkers, so you are seeing all that TCP overhead multiplied across the number of connections in flight.
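As an experiment, it could be worth running a much less aggressive variant and comparing the egress between runs (a sketch, not a recommendation; note that rclone's --buffer-size is per transfer, so 8192M with --transfers 32 can ask for a very large amount of memory):

rclone copy -Pv mybox:/data /data --transfers 8 --checkers 16 --buffer-size 256M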

Thank you very much! I was trying to use local SSDs on GCP as a temporary cache for my instance, with Box as the permanent storage, so I need to copy all the data every time (the download speed is amazing though, up to 1 GB/s :slightly_smiling_face:).

I did more tests on this and yes, the more precise overhead ratio is 0.1%-0.3%, as you suggested. I now understand that a lot of HTTP requests are needed to drive the download process, but is there any rclone config that would fetch more data per request and so reduce this ratio further?

Thank you! I also tried smaller numbers of checkers and transfers, but I didn't see a significant change in the overhead ratio.

By limiting those things, I doubt the ratio changes much; only the length of the operation changes.

If you have to transfer 100 GB, you can get some minimal changes by tweaking, but you aren't going to change things at a macro level. To move 100 GB, there is a fixed amount of TCP protocol overhead that you can't really tune around when going over the Internet.

Some backends let you control the chunk size, but not the gcs or box backends - that would be the big win if it were possible.
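For comparison, on a backend that does expose it, the knob looks like this (a sketch; --drive-chunk-size is a real flag on the drive backend, but it governs upload chunking, and gdrive: is a hypothetical remote name here):

rclone copy /data gdrive:/data --drive-chunk-size 256M

There is no equivalent setting on the gcs or box backends.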

I can't think of anything else you could tweak, so you are probably stuck with that overhead.
