Box.com small-file (i.e., file-creation) performance is not great to begin with: I'm migrating my data there and I see ~0.5 files/second (i.e., about 1 file every 2 seconds), compared to Google Drive, which gave me about 2.5 files/second (i.e., about 5 files every 2 seconds).
But the other day performance dropped to about 0.05 files/second (i.e., a file every 20 seconds) and kept getting progressively lower, so I stopped to investigate the cause: rclone was copying from a directory with about 20,000 files in it, and performance started getting ridiculous once about 1,000 files were already in the Box destination directory.
So now I'm replacing these directories (the ones with more than 1,000 files in them) with a tar.gz or ZIP file of their contents before copying them to Box; otherwise they would take my whole life plus 6 months to copy. In the above-mentioned case, the ETA went from months to minutes.
My conclusion is that file creation in Box.com must depend internally on some O(n) or worse algorithm (perhaps a sequential search of the entire directory to locate duplicates). This is CP/M-like behavior that was hardly acceptable in the 1980s, let alone in the 2020s... Box.com, if you're listening, you should consider fixing this.
I have a folder with 3,000 sub-folders and have found that my Box rclone mount becomes unusable when I try to access that folder. I suspect it's due to the performance issue described above. This behavior is not present in the web GUI. I might try to dig deeper into this problem, as it's the only one I'm facing with the service.
Box doesn't (or didn't, last time I looked) have an API to find a single file given a path. You have to descend through each directory listing, looking up the directory IDs, then list the final directory to find the file ID.
This means that rclone might be listing that 20k directory over and over again to find file IDs or to check that files don't exist, which would explain the bad performance you are seeing.
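The path-to-ID walk described above can be sketched as follows. This is a minimal illustration, not rclone's actual code: `list_folder` is a hypothetical helper standing in for Box's `GET /2.0/folders/{id}/items` listing, and all names are mine.

```python
def resolve_path(path, list_folder, root_id="0"):
    """Resolve a slash-separated path to a Box item ID by walking
    one full directory listing per path component ("0" is Box's root).
    A 20k-entry folder means every lookup fetches and scans 20k items."""
    current_id = root_id
    for name in path.strip("/").split("/"):
        items = list_folder(current_id)  # one complete listing per level
        match = next((i for i in items if i["name"] == name), None)
        if match is None:
            return None  # no such file/folder at this level
        current_id = match["id"]
    return current_id

# Toy in-memory "Box" to demonstrate the walk.
fake_box = {
    "0":   [{"name": "docs", "id": "100"}],
    "100": [{"name": "a.txt", "id": "200"}, {"name": "b.txt", "id": "201"}],
}
print(resolve_path("docs/b.txt", lambda fid: fake_box[fid]))  # → 201
```

Doing this once per uploaded file is what turns a big destination directory into quadratic work.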
Assuming this is the problem (and it should be pretty obvious from a bit of `-vv --dump bodies`), there are two ways to fix this:

1. Get rclone not to call NewObject (this is the internal method that causes the problem).
2. Research the Box APIs to see if they have a more efficient one - e.g. find a given file in a directory rather than listing the whole thing.
Regarding point #2 (the more efficient API), they have this one. It allows querying recursively by file name inside a folder (or a group of folders), and it returns the folder tree for every file found.
> The string to search for. This query is matched against item names, descriptions, text content of files, and various other fields of the different item types.
>
> This parameter supports a variety of operators to further refine the results returned.
>
> `""` - by wrapping a query in double quotes, only exact matches are returned by the API. Exact searches do not return matches based on specific character sequences; instead, they return matches based on phrases, that is, word sequences. For example: a search for `"Blue-Box"` may return results including the sequences "blue.box", "Blue Box", and "Blue-Box" - any item containing the words Blue and Box consecutively, in the order specified.
So I think this API could be made to work, but the recursive search and fuzzy string matching mean it does more work than it needs to. I can give it a go if we are sure that the current method is the problem.
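For reference, a request against that search endpoint might be built roughly like this. The endpoint (`GET /2.0/search`) and parameter names come from Box's public docs, but the exact parameter choices here are my assumption, and authentication is omitted:

```python
from urllib.parse import urlencode

def build_search_request(filename, folder_id):
    """Build the URL for Box's search endpoint, quoting the query for
    an "exact" (phrase-based) match and restricting the search to item
    names under a single ancestor folder. Sketch only: no auth headers,
    no pagination, and the fuzzy matching noted above still applies."""
    params = {
        "query": f'"{filename}"',          # double quotes → phrase match
        "ancestor_folder_ids": folder_id,  # search recursively under this folder
        "content_types": "name",           # match on item names only
        "type": "file",
    }
    return "https://api.box.com/2.0/search?" + urlencode(params)

print(build_search_request("report.pdf", "12345"))
```

Because the match is phrase-based rather than literal, the caller would still have to filter the results by exact name and parent folder, which is the extra work mentioned above.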
Hi @ncw, and thanks for your great response as usual!
This would explain the behavior I'm seeing.
I tried a ton of different options with the exact same result, but the last was this:

```
rclone -vv --transfers=16 --checkers=16 --max-size=55574528b --bwlimit=121M copy ENCRYPTED_GOOGLE_DRIVE_REMOTE: CHUNKED_ENCRYPTED_BOX_REMOTE:
```
`--max-size` is being used because I had already copied all files larger than that limit, so now I'm only copying the smaller ones; `--bwlimit` is to avoid the infamous 10TB/day ban at the source of the copy (Google Drive).
In case you need it, here are the relevant rclone.conf sections:
Its purpose is to see if files exist before you upload them, which is why I hadn't twigged that it was useful; but it does exactly what I need to avoid listing the directory, in that it turns a (fileName, directoryID) pair into a fileID. I needed to call another API method to turn the fileID into the full file info, but I think both of these API calls will be very quick compared to listing directories.
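For anyone curious, the call in question is Box's preflight check. A sketch of how the request might be assembled (the endpoint and body shape follow Box's public docs; auth, headers, and the actual HTTP round-trip are omitted):

```python
import json

def build_preflight_check(filename, parent_folder_id):
    """Build the method, URL and JSON body for Box's preflight check
    (OPTIONS /2.0/files/content). If the file already exists, Box
    answers 409 item_name_in_use and the response's context_info
    carries the conflicting file's ID -- effectively turning a
    (fileName, directoryID) pair into a fileID without listing
    the directory. Sketch only."""
    body = {"name": filename, "parent": {"id": parent_folder_id}}
    return ("OPTIONS", "https://api.box.com/2.0/files/content",
            json.dumps(body))

method, url, body = build_preflight_check("a.txt", "100")
print(method, url, body)
```

The second call mentioned above (fileID → full file info) would then be an ordinary `GET /2.0/files/{id}`; both are constant-size requests, unlike listing a 20k-entry folder.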
I haven't attempted to performance-test this, though, as exactly how it performs will depend on your workload; I'd be grateful if you'd try that, @durval, to see how much (if any!) performance gain you get when syncing large directories.
Amazing, @ncw! Thank you very much for all your great work, you and rclone are the best!
It would be my pleasure, but at this exact moment I'm STILL (for 3 days now) without both write (upload) AND read (download) access to my data. I opened a ticket back then and they just replied, basically telling me that if I want READ access to my data, I would have to first cancel the account (so they really don't want me there), and only then will they graciously provide me with 30 days to copy my data out.
So I don't think I will be able to test this. I'm really sorry about that.
Perhaps there's someone around who can still upload to Box and could test this? I'll give it a holler in the public thread and see if we can get some volunteers.
Agreed on the "INTENTIONAL" part, as per their response to my last ticket (complaining that they had cut not only write but also READ access to my data): the only alternative they gave me to get READ access back was to cancel my account (sad and incredible, but true).