Sanity check large dropbox -> box copy

Hello all

just wanted to triple check regarding dropbox -> box copy - lots of data (many many TB) including tens of thousands of small files, which is where in my testing it's getting hung up

rclone v1.62.2

  • os/version: darwin 13.2.1 (64 bit)
  • os/kernel: 22.3.0 (arm64)
  • os/type: darwin
  • os/arch: arm64 (ARMv8 compatible)
  • go/version: go1.20.2
  • go/linking: dynamic
  • go/tags: cmount

rclone copy dropbox:/ "boxapi:/User Folders/"" --dropbox-impersonate " "--transfers 25 --no-update-modtime --ignore-checksum --size-only --checkers 30 --max-backlog 300000 --no-traverse --dropbox-batch-mode sync --exclude-from exclude.txt -vP

shouldn't this usually be the best combo for a large number of small files? (the larger files seem to do ok)
not seeing memory / CPU get maxed out

exclude list has about 25 folders

thanks

Try running with -vv and check to see if you are getting retries.

Dropbox is very fussy about transactions per second - most people recommend --tpslimit 12 which will limit you to 12 transactions per second - that might be 12 files per second. If you exceed the limits with dropbox it sends punitively long timeouts to rclone which slow things down much more than --tpslimit 12. Note --tpslimit will apply to source and destination so this might not be exactly the right number (you could try double).

Note that all the clouds are bad at lots of small files as each one will need an HTTP roundtrip which can take 1s or more.

You can try upping transfers which should help - dropbox doesn't mind lots of open connections. Not sure about box.
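As a starting point, something like this is roughly what I'd try first (the numbers are only illustrative, not tested against your accounts - keep your --dropbox-impersonate and --exclude-from flags as before):

rclone copy dropbox:/ "boxapi:/User Folders/" --tpslimit 12 --transfers 32 --checkers 16 --dropbox-batch-mode sync --size-only -vP

Then nudge --tpslimit and --transfers up or down depending on whether you still see rate limit errors in the -vv output.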


doing some testing I think in this case it's box.com being fussy - I'm going to reduce tps to 12 and see how it goes, thanks!


I changed TPS and upped transfers - I was being rate limited by the API, but even with -vvvv I can't tell which

pacer: low level retry 1/10 (error Error "rate_limit_exceeded" (429): Request rate limit exceeded, please try again later)

might be a useful error enhancement to spit out which remote is giving me that? :slight_smile:

I'm going to play with TPS/Transfers to see if it helps, thanks

That's a google one. If you share what you've changed, I feel pretty confident I can tell you what broke too.

this is dropbox -> box so it's not google :slight_smile:

it was rate limiting before I set --tpslimit, will see if it works better with it set

Sorry meant to say Box, Dropbox has a different message.

yes that's what I was afraid of, will see if it works better with TPS limit, testing now thanks

now I am pretty sure this is against best practices, but I decided to rclone mount dropbox and box (--tpslimit 10 on box) and just rclone copy "local" to "local" and will see how it goes

That is the only way to have different tpslimits for source and destination so maybe not so bad!
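A rough sketch of what that could look like (the mount points and cache settings here are just illustrative - adjust to your setup, and watch local disk space since the VFS cache can get big on a copy this size; --vfs-cache-max-size can bound it):

rclone mount dropbox:/ ~/mnt/dropbox --read-only --vfs-cache-mode full --daemon
rclone mount "boxapi:/User Folders/" ~/mnt/box --tpslimit 10 --vfs-cache-mode writes --daemon
rclone copy ~/mnt/dropbox ~/mnt/box --transfers 8 -vP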

I've been testing with having either box or dropbox or both mounted - it doesn't seem to help much (except managing to build up a local cache, which makes sense) - will continue testing before I go down too many rabbit holes

I tried mounting both and started seeing file I/O errors and hanging - and no throttling errors in the mount log. But I ran it without debug so I am not sure yet what the issue is. Will do more testing. Too many factors

I'm testing with the lower-end Box business account also and I suspect they throttle that more than usual in some ways.

more debugging / testing soon hopefully :slight_smile:

going back to basics, just re the original sanity check - is there anything I'm missing here in terms of comparing a massive number of files? In most cases (I'm doing many individual users) I know that the initial sync isn't complete. I am basically replicating the settings I'm using on a different box account for a local (smb) to box copy, and that seems to handle checking hundreds of thousands of files much faster - and DOWNLOADING from dropbox shouldn't really throttle me - I know when I use mount it certainly fills up the dropbox vfs cache quick.

I don't want to tie anybody down chasing ghosts, I'm just wondering if it comes down to having a lower-end box account.

rclone copy dropbox:/ box:/ --check-first --tpslimit 10 --tpslimit-burst 10 --max-size 4.9G --dropbox-impersonate XXXXXX --transfers 20 --log-file XXXX-transfer.log --no-update-modtime --fast-list --ignore-checksum --size-only --checkers 30 --no-traverse --dropbox-batch-mode sync --exclude-from exclude.txt

(exclude.txt has about 20 folders)

EDIT: I should have said, I just switched to using --check-first with these transfers - I think it's possible that the tps limit is forcing the checks and transfers to step on each other, will update, thanks

thanks again

Check first will lower the number of transactions needed, so that's a good idea.

I think it's probably a question of running the sync as many times as it takes to run clean. Hopefully with --check-first, as the sync gets nearly complete it will finish off the last remaining files with no problems.
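A simple way to do that is to loop until rclone exits cleanly - something like this (sketch only, using a cut-down version of your flags):

until rclone copy dropbox:/ box:/ --check-first --tpslimit 10 --size-only --ignore-checksum --exclude-from exclude.txt; do
    echo "retrying in 10 minutes"; sleep 600
done

rclone returns a non-zero exit code if any files failed, so the loop keeps going until a pass completes with no errors.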

quick update - decided to do a check with --missing-on-dst - had it run for about 24 hours - it generated a file with 200k entries (which is hardly complete) and I fed that to --files-from-raw

seems to be cranking so far, will update
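for reference, this is roughly the shape of the two commands (the output filename is just what I called it, and the flags are still in flux):

rclone check dropbox:/ "box:/Team Folders" --one-way --size-only --missing-on-dst missing.txt --tpslimit 10
rclone copy dropbox:/ "box:/Team Folders" --files-from-raw missing.txt --transfers 25 --checkers 25 --tpslimit 10 -vP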


doing the next "check" phase and had a thought - IS there a faster way that would actually work with (say) just diff etc if I do a listing of everything in source and dest ?

UPDATE, testing this (in my scenario, large dropbox -> box)

rclone lsf dropbox:/ --fast-list --recursive --files-only > dropbox.txt
rclone lsf box:/ --fast-list --recursive --files-only > box.txt

and (so far) this seems to work - will wait till the ls finishes.....

grep -Fxv -f box.txt dropbox.txt > diff.txt

so this finished, the resulting diff file had 3333730 files in it

starting this:

rclone copy dropbox:/ "box:/Team Folders" --files-from-raw diff.txt --transfers 25 --log-file teams.log --no-traverse --fast-list -vvP --checkers 25 --tpslimit 10

2023/04/06 15:08:42 DEBUG : pacer: low level retry 1/10 (error )
2023/04/06 15:08:42 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2023/04/06 15:08:42 DEBUG : pacer: Reducing sleep to 15ms
2023/04/06 15:08:42 DEBUG : pacer: Reducing sleep to 11.25ms
2023/04/06 15:08:43 DEBUG : pacer: Reducing sleep to 10ms
2023/04/06 15:08:45 DEBUG : pacer: low level retry 2/10 (error )
2023/04/06 15:08:45 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2023/04/06 15:08:45 DEBUG : pacer: Reducing sleep to 15ms
2023/04/06 15:08:45 DEBUG : pacer: Reducing sleep to 11.25ms
2023/04/06 15:08:45 DEBUG : pacer: Reducing sleep to 10ms

not sure how I can increase the verbosity beyond debug to see what the bottleneck is?

been running about half an hour, not copying yet

EDIT: Found an old post and removed --no-traverse, and the copy started right away

EDIT 2: why does the debug log still list "excluded" files if it should just be showing stuff in the --files-from-raw list?

thanks

It looks like box is rate limiting you.

Removing the --no-traverse means that rclone chooses to search the directories for files, hence the "excluded".

For some backends --no-traverse is a lot slower than searching the directories, and box is one of those.

So far --files-from-raw seems to be cranking ok - still getting pacer errors, but it SEEMS to be working faster than when I hardcode TPS limits.

And removing --no-traverse started the copy instantly - not sure if there is anything else to optimize the process

In parallel I'm going through the --dropbox-impersonate user folders one by one into box user folders,
and the same method of lsf + diff seems to work for getting the initial sync complete (rough sketch below). Not sure if there is a better / faster rclone-specific mechanic to do this more efficiently - maybe sort the file list somehow (say by size), dunno, just spitballing
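rough shape of the per-user pass, for anyone following along (the user name and destination folder here are made up - the real paths depend on how the Box user folders are laid out):

rclone lsf dropbox:/ --dropbox-impersonate "user@example.com" -R --files-only > dropbox-user.txt
rclone lsf "box:/User Folders/user" -R --files-only > box-user.txt
grep -Fxv -f box-user.txt dropbox-user.txt > diff-user.txt
rclone copy dropbox:/ "box:/User Folders/user" --files-from-raw diff-user.txt --dropbox-impersonate "user@example.com" --transfers 20 --tpslimit 10 -vP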

I might try the main copy using the two mounts again and see how that goes

Thanks again
