PSA: Box.com has serious performance issues in directories with thousands of files

durval · August 20, 2023, 4:54pm

Box.com small-file (ie, file creation) performance is not great to begin with: I'm migrating my data there and I see ~0.5 files/second (ie, about 1 file every 2 seconds), compared to GoogleDrive which had about 2.5 files/second (ie, about 5 files every 2 seconds).

But the other day I started seeing performance dropping to about 0.05 files/second (ie, a file every 20 seconds) and getting progressively lower, so I stopped to investigate and find the cause: it was copying from a directory with about 20000 files in it, and performance started getting ridiculous when about 1000 files were already in the Box destination directory.

So now I'm replacing these directories (with more than 1000 files in them) with a tar.gz or ZIP file of their contents before copying them to Box, otherwise they would take my whole life plus 6 months to copy. In the above mentioned case, ETA went from months to minutes.
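A minimal sketch of this workaround, assuming the 1000-file threshold mentioned above and local access to the source tree. The function names are made up for illustration, and the originals are left in place, so the archived directories would still need to be excluded from the copy by hand.

```python
import os
import tarfile
from pathlib import Path


def find_large_dirs(root: Path, threshold: int = 1000) -> list[Path]:
    """Return directories under root that directly hold more than `threshold` files."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        if len(filenames) > threshold:
            large.append(Path(dirpath))
            # The whole subtree goes into one archive, so do not descend further.
            dirnames[:] = []
    return large


def archive_dir(directory: Path) -> Path:
    """Write <directory>.tar.gz next to the directory and return the archive path."""
    archive = directory.parent / (directory.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(directory, arcname=directory.name)
    return archive
```

Calling `archive_dir(d)` for each `d` in `find_large_dirs(Path("/data"))` turns every oversized directory into a single file, so the remote sees one file creation instead of thousands.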

My conclusion is that file creation in Box.com must depend internally on some O(n) or worse algorithm (perhaps a sequential search of the entire directory to locate duplicates). This is CP/M-like behavior and was hardly acceptable in the 1980s, let alone in the 2020s... Box.com, if you're listening, you should consider fixing this.

left1000 · August 22, 2023, 2:16am

What about something like --no-traverse? It might avoid the flaw?

left1000 · August 22, 2023, 5:16pm

Okay, so I was intentionally transferring large files only and getting about 40MiB/s which is bad but not awful. Then I noticed on small files it dropped to around 500KiB/s instead which is awful.

I wonder, if box.com's api is even more troublesome than dropbox's does that mean someone could design a feature like dropbox-batch-mode for box.com? That would sure be awesome if possible.

therobbiedavis · August 31, 2023, 10:02pm

I have a folder with 3000 sub-folders and have found that when trying to access this folder my Box rclone mount becomes unusable. I suspect it’s due to this performance issue as stated. This behavior is not present on the web gui. I might try to deep dive into this problem as this is the only problem I’m facing with the service.

asdffdsa · August 31, 2023, 10:19pm

welcome to the forum,

IMO, these are two somewhat different issues:

OP: transferring files.
you: rclone mount and listing files already in box.
there are workarounds/tweaks for that.

much discussed in the forum or welcome to start a new topic, answer all the questions.

ncw · September 1, 2023, 11:19am

This might be to do with box's APIs.

Box doesn't (or didn't last time I looked) have an API to find a single file given a path. You have to descend through each directory listing, looking up the directory IDs, then list the final directory to find the file ID.
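To illustrate the lookup pattern, here is a small sketch against a hypothetical in-memory tree that, like the remote described above, can only list a folder by ID. `FOLDERS`, `list_folder` and `resolve` are invented names; this is not rclone's code.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for a remote: folder ID -> list of (kind, name, id).
FOLDERS = {
    "0": [("folder", "photos", "11"), ("file", "readme.txt", "21")],
    "11": [("folder", "2023", "12")],
    "12": [("file", "a.jpg", "31"), ("file", "b.jpg", "32")],
}


def list_folder(folder_id: str):
    """One 'API call': return every entry of one folder."""
    return FOLDERS[folder_id]


def resolve(path: str, root_id: str = "0") -> Tuple[Optional[str], int]:
    """Turn a path into a file ID by listing each level; also count the listings."""
    *dirs, leaf = path.strip("/").split("/")
    folder_id, calls = root_id, 0
    for name in dirs:
        calls += 1
        folder_id = next(
            (i for kind, n, i in list_folder(folder_id) if kind == "folder" and n == name),
            None,
        )
        if folder_id is None:
            return None, calls
    calls += 1  # the final listing scans the whole directory for one name
    file_id = next(
        (i for kind, n, i in list_folder(folder_id) if kind == "file" and n == leaf),
        None,
    )
    return file_id, calls
```

Here `resolve("photos/2023/b.jpg")` needs three listings. Folder IDs can be cached, but the last listing has to read every entry of the final directory, which is the expensive step when that directory is large.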

This means that rclone might be listing that 20k directory over and over again to find file IDs or check files don't exist which would make the bad performance you are seeing.
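A rough cost model of that suspicion, assuming one full listing of the destination directory before each upload and a page size of 1000 entries per listing call (the page size is an assumption, not something measured in this thread):

```python
import math


def list_calls_per_lookup(entries: int, page_size: int) -> int:
    """Listing calls needed to read a directory holding `entries` items."""
    return max(1, math.ceil(entries / page_size))


def total_list_calls(files: int, page_size: int = 1000) -> int:
    """Total listing calls if each upload first lists the destination directory."""
    # Before upload number k, the directory already holds k files.
    return sum(list_calls_per_lookup(k, page_size) for k in range(files))


def total_entries_scanned(files: int) -> int:
    """Directory entries read overall: 0 + 1 + ... + (files - 1)."""
    return files * (files - 1) // 2


print(total_list_calls(20_000))       # listing calls for a 20k-file directory
print(total_entries_scanned(20_000))  # entries read while doing so
```

Under these assumptions a 20,000-file directory costs roughly 210,000 listing calls and about 200 million entries read, so the work grows with the square of the file count rather than linearly.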

Assuming this is the problem (and it should be pretty obvious from a bit of -vv --dump bodies) there are two ways to fix this....

1. Get rclone not to call NewObject (this is the internal method that will cause the problem).
2. Research box APIs to see if they have a more efficient API - eg find this file in a directory rather than listing the whole thing.

What rclone command line are you using?

Kaplas · September 1, 2023, 12:04pm

About the point #2 (the more efficient API), they have this one. It allows to query recursively by file name inside a folder (or a group of folders) and it returns the folder tree for every file found.

I don't know if it is easy to use in rclone.

ncw · September 1, 2023, 2:37pm

Hmm, that API could be made to work.

I'd have to use the

ancestor_folder_ids (string array, in query, optional)

Example: 4535234,234123235,2654345

Limits the search results to items within the given list of folders, defined as a comma separated lists of folder IDs.

Search results will also include items within any subfolders of those ancestor folders.

The folders still need to be owned or shared with the currently authenticated user. If the folder is not accessible by this user, or it does not exist, a HTTP 404 error code will be returned instead.

To limit the folders - I don't really want a recursive search here so that would mean it was doing more work than necessary. I just want to see if a file named X is in a given folder ID.

It looks like setting query to "filename" would do more or less what I want too

query (string, in query, optional)

Example: "sales"

The string to search for. This query is matched against item names, descriptions, text content of files, and various other fields of the different item types.

This parameter supports a variety of operators to further refine the results returns.

"" - by wrapping a query in double quotes only exact matches are returned by the API. Exact searches do not return search matches based on specific character sequences. Instead, they return matches based on phrases, that is, word sequences. For example: A search for "Blue-Box" may return search results including the sequence "blue.box", "Blue Box", and "Blue-Box"; any item containing the words Blue and Box consecutively, in the order specified.

So I think this API could be made to work, but the recursive search and fuzzy string matching will mean it does more work than it needs to. I can give it a go if we are sure that the current method is the problem.
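A sketch of how the two parameters quoted above might be combined, with the extra work filtered out on the client side. The helper names are hypothetical, and the shape of a search result entry (`type`, `name`, `parent.id`) is an assumption about the response, not something confirmed here.

```python
from typing import Dict, List, Optional


def build_search_params(filename: str, folder_id: str) -> Dict[str, str]:
    """Query parameters for a name search limited to one folder tree."""
    return {
        "query": f'"{filename}"',          # double quotes ask for phrase matching
        "ancestor_folder_ids": folder_id,  # also matches items in subfolders
    }


def pick_exact_match(entries: List[dict], filename: str, folder_id: str) -> Optional[dict]:
    """Drop fuzzy and recursive hits: keep a file with this exact name directly in the folder."""
    for entry in entries:
        parent = entry.get("parent") or {}
        if (
            entry.get("type") == "file"
            and entry.get("name") == filename
            and parent.get("id") == folder_id
        ):
            return entry
    return None
```

The filter is needed because a phrase match for "Blue-Box" can also return "Blue Box", and because hits from subfolders are included in the results.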

durval · September 1, 2023, 7:27pm

Hi @ncw, and thanks for your great response as usual!

This would explain the behavior I'm seeing.

I tried a ton of different options with the exact same result, but the last was this:
rclone -vv --transfers=16 --checkers=16 --max-size=55574528b --bwlimit=121M copy ENCRYPTED_GOOGLE_DRIVE_REMOTE: CHUNKED_ENCRYPTED_BOX_REMOTE:

--max-size is being used because I had already copied all files larger than this limit, so now I'm only copying the smaller ones; --bwlimit is to avoid the infamous 10TB/day ban at the source of the copy (Google Drive).
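For reference, the arithmetic behind those two values, assuming rclone reads the `M` suffix as MiB/s and that the daily limit is counted over 24 hours:

```python
MIB = 1024 ** 2
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(limit_mib_per_s: int) -> int:
    """Upper bound on bytes moved in a day at a constant bandwidth limit."""
    return limit_mib_per_s * MIB * SECONDS_PER_DAY


daily = bytes_per_day(121)
print(round(daily / 1024 ** 4, 2))  # TiB per day at --bwlimit=121M
print(round(daily / 10 ** 12, 2))   # the same figure in decimal TB
print(55574528 / MIB)               # --max-size expressed in MiB
```

At 121 MiB/s the ceiling is about 9.97 TiB per day (about 10.96 decimal TB), and 55574528 bytes is exactly 53 MiB.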

In case you need it, here are the relevant rclone.conf sections:

[BOX_REMOTE]
type = box
token = {"access_token":"REDACTED","token_type":"bearer","refresh_token":"REDACTED","expiry":"2023-09-01T12:03:52.831460388-04:00"}

[ENCRYPTED_BOX_REMOTE]
type = crypt
remote = BOX_REMOTE:
password = REDACTED
filename_encoding = base32768

[CHUNKED_ENCRYPTED_BOX_REMOTE]
type = chunker
remote = ENCRYPTED_BOX_REMOTE:
chunk_size = 4G
name_format = *.rcc###
hash_type = sha1all

Thanks again!

durval · September 1, 2023, 7:29pm

That would be great! Please let me know if I can help with testing, etc.

If you can minimize the number of API calls, it would surely help (besides being slow, too many API calls can lead to throttling and even an eventual ban as per their TOS).

ncw · September 9, 2023, 11:50am

I asked on the box developer forum, and it turns out there is an API which will do what I need called the preflight check.

Its purpose is to see if files exist before you upload them which is why I hadn't twigged it is useful, but it does exactly what I need to avoid listing the directory as in it turns a (fileName, directoryID) into a fileID. I needed to call another api method to turn the fileID into the full file info, but I think both these API calls will be very quick compared to listing directories.
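A sketch of the two-step lookup, with no network code. The endpoint (`OPTIONS /files/content`), the request body and the 409 response carrying the existing file's ID under `context_info.conflicts` reflect my reading of Box's documentation and should be treated as assumptions. The helper names are hypothetical and this is not rclone's implementation.

```python
import json
from typing import Optional

PREFLIGHT_METHOD = "OPTIONS"
PREFLIGHT_URL = "https://api.box.com/2.0/files/content"  # assumed endpoint


def build_preflight_body(file_name: str, directory_id: str) -> str:
    """JSON body asking whether file_name could be uploaded into directory_id."""
    return json.dumps({"name": file_name, "parent": {"id": directory_id}})


def file_id_from_preflight(status: int, body: dict) -> Optional[str]:
    """Map a preflight response to the existing file's ID, or None if the name is free."""
    if status != 409:
        return None  # no name conflict reported, so no such file
    conflicts = body.get("context_info", {}).get("conflicts")
    if isinstance(conflicts, list):  # be tolerant if a list is returned
        conflicts = conflicts[0] if conflicts else None
    return conflicts.get("id") if conflicts else None


def file_info_url(file_id: str) -> str:
    """URL for the second call, which turns the file ID into the full file info."""
    return f"https://api.box.com/2.0/files/{file_id}"
```

The point of the design is that both calls are keyed on IDs, so neither has to read the directory listing.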

I haven't attempted to performance test these though as exactly how they perform will depend on your workload, but I'd be grateful if you'd try to do that @durval to see how much (if any!) performance gain you get when syncing large directories.

v1.64.0-beta.7344.0299ce2a7.fix-box-metadata-for-path on branch fix-box-metadata-for-path (uploaded in 15-30 mins)

left1000 · September 9, 2023, 6:26pm

I am not sure it is possible or wise to test this today, given that box.com seems to be suffering from some sort of problem. Too mysterious and too INTENTIONAL to diagnose.

durval · September 11, 2023, 4:34pm

Amazing, @ncw! Thank you very much for all your great work, you and rclone are the best!

It would be my pleasure, but at this exact moment I'm STILL (for 3 days now) without both write (upload) AND read (download) access to my data. I opened a ticket back then and they just replied, basically telling me that if I want READ access to my data, I would have to first cancel the account (so they really don't want me there) and only then they will graciously provide me with 30 days to copy my data out.

So I don't think I will be able to test this. I'm really sorry about that.

Perhaps there's someone around who can still upload to Box to have that tested? I will give it a holler in the public thread and see if we can get some volunteers.

durval · September 11, 2023, 4:36pm

Agree on the "INTENTIONAL" part, as per their response to my last ticket (complaining they had cut not only write but also READ access to my data): the only alternative they gave me to have READ access back was to cancel my account (sad and incredible, but true).

I will post a complete update to the main “Unlimited” alternatives to Google Drive, what are the options? thread and tag you so as to be sure you see it.

ncw · September 11, 2023, 4:39pm

Sorry to hear your troubles with Box. That is very frustrating.

I can probably do a test here - say upload 10k files to a single directory with and without the new code.
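One way to prepare such a test directory, as a sketch only. The file size, naming scheme and function name are arbitrary choices for illustration.

```python
import os
from pathlib import Path


def make_test_files(directory: Path, count: int = 10_000, size: int = 1024) -> int:
    """Create `count` small files of random bytes in a single directory."""
    directory.mkdir(parents=True, exist_ok=True)
    width = len(str(count - 1))  # zero-pad so names sort in creation order
    for i in range(count):
        (directory / f"file-{i:0{width}d}.bin").write_bytes(os.urandom(size))
    return count
```

The same directory can then be copied once with each rclone build, comparing the files/second that each run reports.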

Though we might find out that the problem is at box's end and it is the upload preflight that is slow...

system · September 14, 2023, 4:40pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
