Azure Blob with millions of files - slow file access

What is the problem you are having with rclone?

Hello, we have an archiving software which produces millions of files. We are planning
to move "old" files out of our main infrastructure to Azure Blob Storage.
We've installed a server with rclone for uploading and accessing these files.

The software creates folders like this:

Data1 -> 1.0 million files
Data2 -> 600k files
DataX -> roughly the same amount of files
One folder for each year

Unfortunately we cannot change this behaviour.
We've uploaded the first two data folders to Azure Blob Storage (~1 TB / 1.6 million files).

When we start the service to mount this, it takes up to 10 minutes (with --fast-list) or 20 minutes (without) until we can access the data.
I think rclone is building the file list during that time, because we see constant traffic of about 20 Mbit/s.
Once it's done we can access the files, but it is really very slow: it takes up to 60 seconds
until it opens a small text file. While opening just one file directly via its UNC path, the CPU load
rises to 100% and RAM usage rises to 6-8 GB.

Is there any chance to get this working in acceptable time with some parameters, or do you think this is the wrong task for an rclone mount?
I've set --dir-cache-time to 24h because that wouldn't be a problem, but even after the file list is finished, accessing these files takes too much time and produces ~90-100% CPU for a single file access.

The process is really only DMS -> upload to Azure, and later access these files from Azure
with a direct path and filename, no directory scan. These files also do not get changed anymore,
and the DMS is the only system writing into this container, so we do not need a sync or anything like that.
Is there an option to store the directory list locally instead of in RAM while keeping fast access times?
Or are there other parameters we could try?

Run the command 'rclone version' and share the full output of the command.

rclone v1.59.0

  • os/version: Microsoft Windows Server 2019 Datacenter 1809 (64 bit)
  • os/kernel: 10.0.17763.3165 (x86_64)
  • os/type: windows
  • os/arch: amd64
  • go/version: go1.18.3
  • go/linking: static
  • go/tags: cmount

Which cloud storage system are you using? (eg Google Drive)

Azure Blob Storage

The command you were trying to run (eg rclone copy /tmp remote:tmp)

mount INFP-FFFF c:\LIVE-FFFF --log-file=C:\rclonelog.txt --log-level NOTICE  --fast-list --dir-cache-time 24h --vfs-cache-mode full --vfs-cache-max-size 90G

The rclone config contents with secrets removed.

[NAME]
type = azureblob
account = INFP

A log from the command with the -vv flag

https://zerobin.net/?67706070d3e82511#hkqNZ2eCpAzzG2UV79QRsJ+/sjh8P4kp84DCfoHyo+g=

Thank you

Hey there, does anyone have an idea?

Hi PiLoT650,

That is a lot of data, but not more than I would expect possible to handle with rclone (assuming sufficient bandwidth and hardware).

I guess your primary issue is the time it takes to open a (small text) file. Is this correct?

If yes, how do you open the small text file?
Do you navigate to the folder in Windows Explorer on your Windows Server and then double-click to open in NotePad?
If not, do you see the same issue if trying that?

Hi Ole,

I think the hardware should be good to go: something like 12 GB RAM and two cores at 3.2 GHz; the bandwidth is something like 100-200 Mbit/s.

That is right, the issue is that it takes way too long to open a single, small file.
In live operation there would be many users trying to open standard-sized PDF files, for example.

So we are using this server, let's call it azureproxy, only for this. It is a Windows server with rclone and the mounted blob containers on it. In the container is a root folder which we share to the network; under this root folder are the files.

We then try to open these files via the network share with a direct UNC path.
So from my client: notepad.exe \\azureproxy\Data1\OneTextfile.txt
or, later, the DMS system, which would also only open a file directly by path.

We are not browsing through it with Windows Explorer.
I think if I tried browsing this directory from my client over the network, it would crash my Explorer.
But I'll try this tomorrow.

While doing this, the CPU rises as mentioned above, and it takes something like 60 seconds to open a 5 KB txt file.
It's like the server is looking up the virtual file/folder list and needs that much time and those resources for it.
When I first mount the blob and access the folder directly on the server, it takes 10 or more minutes before you can browse the folder in Explorer.

Thank you

I think we need to reduce complexity to be able to find the root cause. I would therefore like to check the functioning of the rclone mount without the added complexity of the network share, network, DMS, clients, multiple users, etc.

To do this I would like you to find a timeslot to execute the following simple commands on the azureproxy while there is no other activity on the mount, that is, the network share is offline or you are outside office hours:

# Assuming the mount has been running for at least 20 minutes (that is, the directory listing has fully loaded from Azure)
dir c:\LIVE-FFFF\Data1\OneTextfile.txt
dir c:\LIVE-FFFF\Data1\OneTextfile.txt
notepad c:\LIVE-FFFF\Data1\OneTextfile.txt
notepad c:\LIVE-FFFF\Data1\OneTextfile.txt

I intentionally perform the commands twice to see the effect of the cache. It would be ideal if you can replace OneTextfile.txt with another text file that isn't already in the rclone file cache.
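
If it helps, the accesses can also be timed precisely in PowerShell; this is just a sketch reusing the example file name from the commands above:

# Time a metadata lookup and a full read of the same file (file name is the example from above)
Measure-Command { Get-Item 'c:\LIVE-FFFF\Data1\OneTextfile.txt' }
Measure-Command { Get-Content 'c:\LIVE-FFFF\Data1\OneTextfile.txt' | Out-Null }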

If the mount is still slow, then I would like to see the rclone debug log for the timespan of the two commands. If the server is heavily loaded (RAM,CPU), then I would also like to see the CPU and Memory Performance graph/profile from the Task Manager.

If the mount is fine on the server, then you can try the similar commands on your client to check if it has something to do with the network share, network or client (still without other users of the mount).

If the simple commands are fine on the client then try the DMS on your client (still without other users of the mount).

If this is also fine, then I guess you can see where I am heading and possibly continue adding complexity, e.g. more users.

Please ask if in doubt and keep me posted on your progress; I will try to digest/comment along the way (when available).

Hi Ole, thank you for your assistance,

It's no problem to run all these tests because we are not using it in production right now.
I am the only one testing it.

Commands directly on the azureproxy
What I see: if I run these commands directly on the azureproxy (rclone) server,
they execute immediately, so there is no wait time and no CPU rise.
Files open like normal local files. I tried it with different files each time.
It is always the same behaviour: fast and without CPU spikes. I also tried it via Explorer
and Notepad; that is also fast.

Commands on my client via the share
If I run these commands on my client it is different: sometimes it opens the file
immediately without a CPU rise on the server, and sometimes very slowly with the 100% CPU rise
on the server. I really do try a different file each time.
When I run it with a file I opened before on the server or client, it is fast.
After a dir command with a CPU rise, the file also opens fast in Notepad.

Here is a graph of the azureproxy server during a slow access from the client; the file was 87 KiB.
The dir command took long (CPU rise) and opening it in Notepad afterwards was fast.

So it seems like it has something to do with the share / accessing the mount over it.

Thank you

Hi PiLoT650,

Very good tests and information!

I agree, it seems to have something to do with accessing over the share. Now let's see if we can get closer to finding the pattern.

If you select a new file and then perform the 4 commands without pauses for the selected file, then I understand that you sometimes see a delay for one of the commands, whereas the other 3 commands are fast. Does the delay always happen at a specific command? I guess it is always the 1st (or maybe the 3rd).

Try monitoring the in and outgoing network traffic (Ethernet) in Task Manager on both the azureproxy and your client when there is a delay. I am trying to determine if 1) the client is requesting/receiving a full directory list from the azureproxy 2) the azureproxy is requesting a full directory listing from azure 3) the client is requesting/receiving a full directory list from the azureproxy that it requests from azure. (I try to avoid exchanging and reading huge debug logs)
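
If Task Manager is awkward to watch during a short delay, roughly the same information can be sampled from PowerShell on both machines; a sketch (note that on a non-English Windows the counter names may be localized):

# Continuously print per-second network throughput while you reproduce the delay
Get-Counter -Counter '\Network Interface(*)\Bytes Received/sec','\Network Interface(*)\Bytes Sent/sec' -Continuous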

Are you saying that there are 1M files directly in the Data1 directory? E.g.

Data1/file000001.bin
Data1/file000002.bin
...
Data1/file999999.bin

That might be causing the problems you are seeing. Rclone needs to list and keep in memory 1M objects which will use lots of memory.

I suspect it will be causing the OS lots of problems too, but I'm less familiar with that side of it.

We see no issues when accessing the mount directly on the server (azureproxy).

The server has 12GB RAM, how much memory does rclone need per directory entry? (I guesstimated 1K, that is 1GB in total)

My current best guess is that it has to do with SMB2 client-side directory caching:
SMB2 Client Redirector Caches Explained | Microsoft Docs

I don't have much experience with Windows file servers, but it seems to be configurable:
Performance tuning for file servers | Microsoft Docs
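
For what it's worth, those client-side cache lifetimes can also be inspected and, for a test, disabled from an elevated PowerShell on the client; this is only a sketch, not something I have verified on your setup:

# Show the current SMB client cache lifetimes (seconds)
Get-SmbClientConfiguration | Select-Object DirectoryCacheLifetime, FileInfoCacheLifetime, FileNotFoundCacheLifetime
# For testing only: disable the directory cache (0 = no caching)
Set-SmbClientConfiguration -DirectoryCacheLifetime 0 -Confirm:$false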

Hi Ole,

When it is slow, it is always the first command; which one (dir or opening with Notepad)
doesn't matter, it is always the first. Once the first command has finished, the second operation is fast.
What I saw now: the next 4-5 commands are also fast, even with other files.
But after ~5-8 different files it is slow again for one file, with 100% CPU.

The CPU load on the server is clearly always rclone.exe.
While opening a file from my client via the azureproxy, I cannot see any big peaks in the network
section; there is only a short 2.1 Mbit/s peak when the CPU drops, which I think is just transmitting
the file. So I don't think the client gets a full directory list, nor does the server.

Because when I restart the service on the azureproxy it resyncs, and then it runs at
~20 Mbit/s for 10-20 minutes as described above. And I assume it would reach a similar speed
between my client and the azureproxy, or do you think it is hiding in the constant ~200 Kbit/s?
So I think it is not 1, 2 or 3, otherwise we should see more traffic and it should take way
longer than these ~40-60 seconds?

Hi ncw, yes, unfortunately there really are that many files in there, and we have more of these folders.
Actually it is 2 or 3 right now, but we will get another one each year. We don't like this either, nor do we know why a software
company does something like this, but anyway, we have to deal with it.

If we don't find a solution via this proxy, I think the software company will have to implement direct Azure API access.

Yes, that could become a problem. If I remove --fast-list it doesn't need as much RAM, but then the initial mount takes way longer.

This is the RAM usage for 1.6 million files:

[image]

That's why we want to move this OUT of our main infrastructure, for cases like restores, backups, etc.

It is a wild guess, but what happens if you disable the client directory cache as described in this link (please reboot after the registry changes):

while I am thinking on other possibilities...


I agree, there seems to be no (significant) network activity, so it could also be an inefficient/linear search algorithm in the azureproxy's directory cache (caused by Windows, WinFsp or rclone), but then we should also have seen issues while you performed the commands on the azureproxy :thinking:

Can I get you to repeat it 10 times over a period of at least 30 minutes with new files each time, just to make absolutely sure this never happens when accessing the directory or files directly on the azureproxy?

Edit: Try using filenames that would be at the start, end and middle of an alphabetical listing.

Hi Ole,

I will try this. I've noticed something else: I prepared a few filenames and started opening the first file, and it was slow even on the azureproxy. Also the second and third... So now I can tell you what happens when I open a file on the azureproxy:

With CMD being in the big data folder on the azureproxy

C:\LIVE-XX\XXAzure_Live\XXDEFData1>notepad C:\LIVE-XX\XXAzure_Live\XXDEFData1\02000400-0000-0000-0000-000000000a96.dat

Opening a file is reproducibly slow every time, even if I open the file again right after closing it. I opened this file more than 10 times, each time after closing it, and every time it is slow with 100% CPU.

With CMD being under C:\Users on the azureproxy

C:\Users\>notepad C:\LIVE-XX\XXAzure_Live\XXDEFData1\02000400-0000-0000-0000-000000000a96.dat

Opening the files is fast every time, like a local file.

Maybe this info helps?

I will try this again later with another 10 files, one file every 3 minutes, only from under C:\Users with notepad FILE,
but so far I think it will open them fast. I will also test your link tomorrow.

I took files from every range of the folder:

02000400-0000-0000-0000-000000000a96.dat
02000400-0000-0000-0000-0000000ebfaf.dat
02000400-0000-0000-0000-0000001aaa55.dat
02000400-0000-0000-0000-0000001bdde3.dat
02000400-0000-0000-0000-0000002eaf96.dat
02000400-0000-0000-0000-00000014bb4c.dat
02000400-0000-0000-0000-000000123ce1.dat
02000400-0000-0000-0000-0000001990da.dat
02000400-0000-0000-0000-000000083682.dat
02000400-0000-0000-0000-000000329390.dat

Thank you

Ah OK - missed that!

1K is the estimate I usually give to people. It seems to be more like 1.7K in this case, but there is an overhead for storing things in the VFS cache (it's quite inefficient in memory use really).
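
As a quick sanity check on those numbers, using the ~2.6 GB reported for 1.6 million files (plain arithmetic, runnable in PowerShell):

2.6GB / 1.6e6        # ≈ 1.7 KB per directory entry
1.7KB * 3.2e6 / 1GB  # ≈ 5.2 GB expected for 3.2 million files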

How are the names of the files in the big folder structured?

It would be possible to artificially split them into subdirectories, say using the first 3 characters, and I've thought of doing this before in cases like this. Whether it can be done efficiently depends on how the file names are structured. The Azure Blob API can list just the files sharing a common prefix.
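
To illustrate the idea, a rough sketch of sharding a flat folder by the first 3 characters of each file name during upload; the local path, remote name and container are made up for the example, and calling rclone once per file is only to show the naming scheme (a real migration would batch this, e.g. with rclone copy and --include filters per prefix):

# Copy each file into a subfolder named after its first 3 characters
Get-ChildItem -File 'D:\DMS\Data1' | ForEach-Object {
    $prefix = $_.Name.Substring(0, 3)
    rclone copyto $_.FullName "azureblob:container/Data1/$prefix/$($_.Name)"
}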

Another idea might be to export the bucket as a WebDAV share with rclone serve webdav. Windows can use WebDAV shares directly and it's possible they might not behave so badly. This will still likely have the very slow startup, but the individual file reads may be quicker.
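
In case it is useful, a minimal sketch of what the WebDAV variant could look like; the remote name is taken from the mount command above, while the port and drive letter are arbitrary examples:

# On the azureproxy: serve the remote over WebDAV instead of mounting it
rclone serve webdav INFP-FFFF --addr :8080 --dir-cache-time 24h --vfs-cache-mode full --vfs-cache-max-size 90G

# On a client (requires the WebClient service): map the share to a drive letter
net use X: http://azureproxy:8080/ /persistent:no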

I guess I can explain this:

You are probably using CMD, and it first tries to find an executable (exe, bat, ps1) named "notepad" in the current folder, which makes WinFsp ask rclone for a complete list of all files in the folder (Readdir). The list is passed back one item at a time by rclone (in this fill loop). Next WinFsp (or Windows) will scan the entire list to see if there are any files matching "notepad.*". This will take a long time with 1 million files. @ncw please correct me if mistaken.

Lessons learned:

  • Use "C:\Windows\notepad.exe" instead of just "notepad", especially when using CMD from an rclone-mounted folder with a lot of entries.
  • A complete directory list/scan takes a lot of time, CPU and RAM when performed on an rclone mount with 1 million entries in the folder being searched.

Yes, this was very helpful indeed and made me realize the above.

I agree, this isn't needed anymore.

Instead please test from C:\Users\ on the client in case the earlier test was performed from the folder shared from azureproxy (the folder with all the files).

Not needed if the above test from C:\Users\ on the client is fast.

I did a quick test and rclone still sends all the directory items when Windows searches the WebDAV folder, so I guess it still needs to collect and send a 2 GB directory listing. It may be faster, but probably not fast.


Hi Ole, that makes sense :see_no_evil:
So I can confirm there is no problem on the azureproxy.
As the tests on the client were already run from under C:\Users, this wasn't what caused the problem on the client.

So I tried the solution from that article: I set a DWORD under

Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters

with DirectoryCacheLifetime = 0, rebooted my client, and so far I cannot reproduce the issue.
So it seems like this should solve it; I did a test with over 40 files in the last hour.
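
For reference, the same change can be scripted; a sketch of the registry value described above, run from an elevated PowerShell (reboot afterwards):

# Disable the SMB client directory cache (value name and key as in the article above)
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters' -Name 'DirectoryCacheLifetime' -Value 0 -PropertyType DWord -Force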

I will test it a bit more and report again, and we also have to think about whether we want to roll this change
out to all clients.

May I ask another question about RAM: if this rclone process takes 2.6 GB RAM for these 1.6 million files, can we say it will take 5.2 GB for 3.2 million files?

Thank you

Hey there, as posted above, I didn't get this issue again :slight_smile:

I think that is a fair estimate, yes


Perfect, glad to hear it; that really was a lucky guess :sweat_smile:

I did some quick googling and found [MS-SRVS]: Per Share | Microsoft Docs, which seems to describe a possibility to disallow client directory caching using AllowNamespaceCaching=FALSE on the server/share. I know too little about Windows servers/shares to fully understand the context or details, but it is perhaps worth some additional googling, reading and experiments.

Fully agree (assuming similar lengths of paths and file names).

I have a few suggestions that may improve your mount command:

  • Use --log-level=INFO during tests and the first weeks/months after go-live. It will give you a better view of the mount activity. The current status can quickly be checked with this PowerShell command: Get-Content C:\rclonelog.txt -Tail 10 -Wait
  • Add --stats=1m to also get some useful stats in the log; it may, for example, help you see if the WAN becomes a bottleneck.

Tuning option:

  • Add --transfers=8 to allow more concurrent downloads; I guess 8 is (more than) enough to saturate your bandwidth (100-200 Mbit/s).

Advanced option:

  • Add --rc to make the mount remote controllable. You can then use the Task Scheduler to refresh the directory listing from Azure every midnight with this command: rclone rc vfs/refresh recursive=true (see the sketch below).
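
A sketch of how that nightly refresh could be scheduled; the rclone path and task name are assumptions, and it presumes the mount was started with --rc on the default localhost:5572:

# Register a daily task that refreshes the VFS directory cache shortly after midnight
$action  = New-ScheduledTaskAction -Execute 'C:\rclone\rclone.exe' -Argument 'rc vfs/refresh recursive=true'
$trigger = New-ScheduledTaskTrigger -Daily -At '00:05'
Register-ScheduledTask -TaskName 'rclone vfs refresh' -Action $action -Trigger $trigger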

Happy testing and good luck :slightly_smiling_face: