Azure Blob with millions of files - slow file access

Hey Ole, NCW, thank you very much for all your help!
I didn't expect this sort of help! :slight_smile:

Okay - so I think the long filenames also play a role - if they were shorter, the listing would be smaller.
I'll discuss this with the software company, and also whether we can split these files into more folders, e.g. one folder per month or something like that. Otherwise I see us adding 2-3 GB of RAM to this server every year.

Thank you very much - at first I didn't want to believe it - but it works :slight_smile:

As we are not starting too soon, I will have a look at your link and try your tuning suggestions.

That's a really good tip - I will try this on Monday.
As long as it keeps working for me, I will "stress test" this with my colleagues over the next few days
with more parallel file openings etc. - but I really think this should be OK.

There's a part of the DMS system that keeps the files for X days (maybe 150) on both our main local storage and Azure, and deletes them from our main storage after that period. After that time these files shouldn't be "hot files" any more, so we don't expect too much traffic there.

Thank you!


You may not notice a (significant) reduction with shorter file names - it seems like there is a lot of per-entry overhead. I just made the remark to let you and others know that long paths and filenames (think 1000 chars) surely would increase the memory requirement.

I don't think it will help to split into more or smaller folders; the files will still be part of the directory listing kept in memory by the mount. A split may actually need more memory due to longer paths and more overhead.

I therefore think you are going to need the extra 2-3 GB of RAM every year if you stick with a solution based on an rclone mount. Other similar tools will probably behave much the same unless specifically optimized for a very large number of files.

So I have been thinking of an alternative that may be viable if your DMS uses an external call to open the files, or can issue an external call just before opening a file. My idea is to drop the mount and instead implement a simple read-only Azure cache.

Conceptual example:

# Fetch and open a file, e.g. $file = "Data1\02000400-0000-0000-0000-0000002eaf96.dat"
rclone copyto INFP-FFFF:$file \\azureproxy\simplecache\$file
notepad  \\azureproxy\simplecache\$file

where \\azureproxy\simplecache\ is a shared folder on the azureproxy.

rclone copyto will then check directly with Azure and only copy the file if it is missing or outdated.
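
To make that callable from the DMS, something like this small PowerShell wrapper could do both steps in one go. It's only a sketch: Open-CachedFile is just a name I made up, and I'm assuming the INFP-FFFF: remote and the \\azureproxy\simplecache share from the example above.

# Hypothetical wrapper around the two-step example above: fetch a blob into the
# cache share (only if missing or outdated), then open the cached copy.
function Open-CachedFile {
    param(
        [Parameter(Mandatory = $true)]
        [string]$File   # relative path, e.g. "Data1\02000400-0000-0000-0000-0000002eaf96.dat"
    )

    # rclone remote paths use forward slashes, so convert the Windows-style separators
    $remotePath = $File -replace '\\', '/'
    $cachePath  = Join-Path '\\azureproxy\simplecache' $File

    # copyto compares size/modtime and only downloads when the cached copy is missing or outdated
    rclone copyto "INFP-FFFF:$remotePath" $cachePath

    # open the cached copy with its associated application (notepad in the example above)
    Invoke-Item $cachePath
}

# Usage:
# Open-CachedFile -File "Data1\02000400-0000-0000-0000-0000002eaf96.dat"

The DMS (or a small batch file it calls) would then invoke this with the relative file path instead of reading from the mount.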

The simplecache folder would need some periodic cleanup, but perhaps it would be fine to just delete everything in it every night/weekend.
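
For the cleanup, a scheduled task could either wipe the share completely every night/weekend, or, a bit gentler, only remove cached files older than a few days. A rough sketch, where the share path and the age are my assumptions:

# Hypothetical nightly cleanup of the cache share, e.g. run via Task Scheduler.
$cacheRoot  = '\\azureproxy\simplecache'
$maxAgeDays = 3   # assumption: how long a cached file may stay around

# CreationTime reflects when rclone downloaded the file into the cache;
# LastWriteTime would show the original blob modtime, which rclone preserves.
Get-ChildItem -Path $cacheRoot -Recurse -File |
    Where-Object { $_.CreationTime -lt (Get-Date).AddDays(-$maxAgeDays) } |
    Remove-Item -Force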

This idea would let you keep all your data/year folders in a single share, and it doesn't require the CPU or RAM needed to hold a local copy of the entire Azure directory listing.

It will require a request to Azure for every file being opened, but it doesn't require the nightly traversal of the entire Azure directory. This could increase or decrease API costs depending on the number of documents opened.
