A big memory consumption during copy from Azure blob storage to another Azure blob storage

cynepco3hahue · September 29, 2024, 8:06am

What is the problem you are having with rclone?

The rclone consumes a lot of memory during running copy from Azure blob storage to another azure blob storage.

IIUC from my basic calculation I should have in memory:

TRANSFERS * RCLONE_AZUREBLOB_UPLOAD_CONCURRENCY * RCLONE_AZUREBLOB_CHUNK_SIZE = 1280Mi

Run the command 'rclone version' and share the full output of the command.

# rclone --version
rclone v1.67.0
- os/version: amazon 2 (64 bit)
- os/kernel: 3.10.0-1160.95.1.el7.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.22.4
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

azureblob

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

/usr/local/bin/rclone copy --metadata --stats 60m --stats-log-level ERROR --transfers 20 --log-level ERROR --use-json-log --retries 1 src:<container_name> dst:<container_name>

The rclone config contents with secrets removed.

I do not use any special config, all variables are paste or as command arguments or env vars, so I will add here env variables that I am using:

# related to src and dst
'AZURE_SRC_STORAGE_ACCOUNT'
'AZURE_SRC_CONTAINER'
'AZURE_SRC_STORAGE_SAS_TOKEN'
'AZURE_DST_STORAGE_ACCOUNT'
'AZURE_DST_CONTAINER'
'AZURE_DST_STORAGE_SAS_TOKEN'

# related to rclone configuration
'RCLONE_CONFIG_SRC_TYPE': 'azureblob'
'RCLONE_CONFIG_SRC_SAS_URL': <src_sas_url>
'RCLONE_CONFIG_DST_TYPE': 'azureblob'
'RCLONE_CONFIG_DST_SAS_URL': <dst_sas_url>
'RCLONE_AZUREBLOB_NO_CHECK_CONTAINER': 'true'

A log from the command with the `-vv` flag

Do not have logs, because it happening for one of our customers and contains some confidential data, but I have output from rclone stats

{
  "message": "rclone stats",
  "timestamp": "2024-09-28T04:45:39.481965Z",
  "level": "INFO",
  "pathname": "/cloud_migration_agent/cloud_agent/cloud_agent.py",
  "line_number": 232,
  "process_id": 1,
  "threadName": "MainThread",
  "bytes": 64232164651661,
  "checks": 63159142,
  "deletedDirs": 0,
  "deletes": 0,
  "elapsedTime": 248400.017623065,
  "errors": 0,
  "eta": null,
  "fatalError": false,
  "renames": 0,
  "retryError": false,
  "serverSideCopies": 0,
  "serverSideCopyBytes": 0,
  "serverSideMoveBytes": 0,
  "serverSideMoves": 0,
  "speed": 341611303.33405966,
  "totalBytes": 64221630675261,
  "totalChecks": 63169158,
  "totalTransfers": 495339,
  "transferTime": 248388.317982617,
  "transfers": 495338
}
!!!! here was a first jump in memory consumption from from 1Gb to 15Gb, see the metrics !!!!
{
  "message": "rclone stats",
  "timestamp": "2024-09-28T03:45:39.579526Z",
  "level": "INFO",
  "pathname": "/cloud_migration_agent/cloud_agent/cloud_agent.py",
  "line_number": 232,
  "process_id": 1,
  "threadName": "MainThread",
  "bytes": 63436603913425,
  "checks": 62210374,
  "deletedDirs": 0,
  "deletes": 0,
  "elapsedTime": 244800.022861132,
  "errors": 0,
  "eta": null,
  "fatalError": false,
  "renames": 0,
  "retryError": false,
  "serverSideCopies": 0,
  "serverSideCopyBytes": 0,
  "serverSideMoveBytes": 0,
  "serverSideMoves": 0,
  "speed": 395717898.7707056,
  "totalBytes": 63433873851477,
  "totalChecks": 62220389,
  "totalTransfers": 494910,
  "transferTime": 244788.393254615,
  "transfers": 494907
}

IIUC once rclone just checked the files that was already exist, it did not really consume any memory, but once did it come to files that should be transferred it started to consume a lot of memory, but to be honest it also shows some transfers before memory jump, so I unsure what happened here.

Our code restarted the pod with rclone for the same storage account and it from the beginning stated to consume a lot of memory, see next graph in timeline.

It the output from the next run:

{
  "message": "rclone stats",
  "timestamp": "2024-09-29T06:26:24.507395Z",
  "level": "INFO",
  "pathname": "/cloud_migration_agent/cloud_agent/cloud_agent.py",
  "line_number": 232,
  "process_id": 1,
  "threadName": "MainThread",
  "bytes": 372150918280,
  "checks": 84822876,
  "deletedDirs": 0,
  "deletes": 0,
  "elapsedTime": 18000.007430341,
  "errors": 0,
  "eta": 0,
  "fatalError": false,
  "renames": 0,
  "retryError": false,
  "serverSideCopies": 0,
  "serverSideCopyBytes": 0,
  "serverSideMoveBytes": 0,
  "serverSideMoves": 0,
  "speed": 209433916.89574507,
  "totalBytes": 372150918280,
  "totalChecks": 84832890,
  "totalTransfers": 81273,
  "transferTime": 5012.228343695,
  "transfers": 81273
}

kapitainsky · September 29, 2024, 8:28am

Might be the manifestation of very old rclone limitation which has not been tackled yet. Have a look at this github issue:

github.com/rclone/rclone

Excess memory use when syncing millions of files in one directory

opened 08:44AM - 25 Jul 24 UTC

zackees

bug

## Background - Amazon S3 rclone problems I'm trying to backup a datalake wit…h 100 million files at the root. They are mostly small files < 1mb. rclone was simply not designed for this use case and will eat up all available memory and then crash. There was no machine instance that I could throw at it that would fix this issue. Even running locally in a docker instance would eat up all available memory and then crash. All advice in the forums did nothing to help the situation. And a lot of people seem to be running into this. Therefore I wanted to post this here so that anyone searching for this problem can try our solution. ## Solution in a nutshell: PUT YOUR FILES INTO FOLDERS!!!! What's interesting is the behavior: rclone would never start transferring files, it would always sit there saying 0 files transferred, 0 bytes transferred, eat up all available memory before crashing with an OOM. I tried all the suggestions in the forums, reducing the buffering memory, reducing the number of checkers and transfers. Nothing worked. ## Cause & Fix Without looking at the code or doing any profiling, my hypothesis was that rclone scans an all files in a"directory" into RAM before executing on it. This seems true whether not `--fast-scan` is used or not. Obviously, having 100 million files at the root was causing our org a whole bunch of problems anyway and it's been something that I've wanted to fix for a while, so this problem gave me enough reason to go ahead and re-organize our entire datalakes. Since each file is referenced in our database with a datestamp, I was able to write python scripts that would move these files from the root into folders by the service and year-month (for example name.html -> service/2023-04/name.html) This worked extremely well and I was able to now run rclone and have it at least start transferring some files. However, there were still folders with 5+ million files, and eventually ran into the same out-of-memory error. So again, I further re-organized the files in our datalake into service/yrmo/day. And now that seems to have done the trick. rclone now consistently runs under 2GB memory and I've been able to increase the number of transfers and checkers up to 100 each and have 3mb of buffer per transfer. ## Dead ends All the advice about adjust memory buffers and number of transfers is mostly wrong. They will only cut your minimum memory usage by a constant factor, but will do very little to prevent the absolute unbounded memory that rclone uses for extremely large "directories". If you have this same problem, no amount of setting tweaking will work... you MUST re-organize your data into folders or rclone will just run out of memory every single time. If you have too many files at the root, rclone will simply never start transferring anything and just crash. If one of your subdirectories is too big, you'll get a memory pattern that looks like this: ![image](https://github.com/user-attachments/assets/cad7926e-48b7-46f7-be81-543edc896460) ## Recommendations to the Devs of rclone Please serialize your directory scans to disk if you start exceeding a certain threshold of memory or files in the current directory. You can probably just get away with just always doing that since the disk is so much faster than network anyway. I'm currently doing an inventory scan of our datalakes and 50 million files entries is only taking up 12 gb of disk without any fancy compression. I know you are storing a lot more file information, like metadata, so it could easily be double or triple that. But it is simply so much easier and cheaper to allocate disk space to a docker instance than it is to get a machine with much more ram. An additional pain point about an out of memory issue crash is that when the rclone process gets a kill signal, it will **exit 0** making it look like it succeeded. According to this thread https://github.com/rclone/rclone/issues/7966 this is a feature of linux and you must get the exit code from the operating system instead of the return value of the exited rclone process. This is super scary if you are relying on rclone to backup your datalake but in reality, it starts failing because one of your directories has millions of files in it. I know on Digital Ocean it's easy to see that a docker instance has failed, on Render.com however you'll get a "Run Succeeded" and it's not until you look at the run history that you'll see that in fact your instance ran out of memory. I'm not sure about the other hosting providers. Anyway, I'm glad this huge task is finally over with, and we have started syncing up our data for redundancy and backup purposes. So far so good! ![image](https://github.com/user-attachments/assets/5c2b7096-ade2-45d7-993e-4fd0b0f52196)

It also contains possible workarounds.

Consider rclone sponsorship if your company needs it. It can speed things up:)

cynepco3hahue · September 29, 2024, 10:12am

Thanks for such quick response. It really looks similar to GitHub issue, I will try to check if we have really folders with such amount of files.
Sponsorship is a good idea Will check internally how many
circles of hell should I pass to make it possible.

cynepco3hahue · September 29, 2024, 2:56pm

Regarding to the issue that you specified, I just want to be sure, when we are talking about number of files under the folder, to which level are we referring?
For example if I have:

container
--folder1
----file11
----file12
----file13
--folder2
----file21
----file22
--folder3
----file31
--file1
--file2
...

And I am running copy on container name level: src:container
Does it count number of files separately for each folder or does it count for whole container?

{container: [file1, file2, file11, file12, file13, file21, file22, file31]}
or
{folder1: [file11, file12, file13], folder2: [file21, file22], folder3: [file31], container: [file1, file2]}

Because from looking on our folders hierarchy under container it very slim chances that we will have single folder that will contain more that 1 million files.

cynepco3hahue · September 30, 2024, 12:30pm

Hi folks, I read a little bit the code and found that for Azure we do not use Server-Side copy and instead we are using manualCopy method. Can it affect the memory consumption?

ncw · October 2, 2024, 5:13pm

You can see that in the stats

It shouldn't but sometimes there are bugs!

Try using this flag

  --use-mmap   Use mmap allocator (see docs)

And see if that helps.

Or alternatively ask the garbage collector to work a bit harder - set the environment variable GOGC=20

cynepco3hahue · October 6, 2024, 10:33am

Thanks for the answer!
I tried also to enable server-side copy under my fork of rclone, and looks like it is working. Under azureblob.go I changed to

	f.features = (&fs.Features{
		ReadMimeType:            true,
		WriteMimeType:           true,
		BucketBased:             true,
		BucketBasedRootOK:       true,
		SetTier:                 true,
		GetTier:                 true,
		ServerSideAcrossConfigs: true,
	}).Fill(ctx, f)

and it worked fine, the only limitation that I can see it firewall configuration on the source should available access from destination, see - Copy fails (CannotVerifyCopySource) when storage account firewall is enabled · Issue #755 · Azure/azure-storage-azcopy · GitHub.
Let me know if it worth to make it possible to enable this option via some flag under azureblob, and if it worth I can prepare some PR.

ncw · October 8, 2024, 2:31pm

Using the flag --server-side-across-configs should do this for you without a patch

cynepco3hahue · October 8, 2024, 2:48pm

Oh, it is great Thanks
Another question, I saw that copy with server-side copy is much slower than regular copy with the same parameters:

--transfers 20
--checkers 8(default one)

Maybe do you know why it can be, like the difference was huge 90m vs 2m in the same time frame

ncw · October 8, 2024, 3:33pm

Weird! That must be azure being slow as rclone hands the data copying off to Azure for server side copies.

If you run with debug do you see messages about it waiting for the copy to complete?

cynepco3hahue · November 1, 2024, 12:40pm

Hi folks, we did some investigations regarding our internal folder organization, and it definitely looks like the bug specified under A big memory consumption during copy from Azure blob storage to another Azure blob storage - #2 by kapitainsky, the important thing that in our case the problem that we do not have a lot files under a single prefix, but have a lot of prefixes that each contain a small amount of files, like we have prefix that contains around 50m of sub-prefixes Currently we do not really have a way to change our folder structure, so I just increased a memory for our pod that is running rclone and it helped. I hope we will the fix under v1.69 as specified under GitHub issue.

cynepco3hahue · November 1, 2024, 12:44pm

And I have some additional question regarding to W/A specified under Big syncs with millions of files · rclone/rclone Wiki · GitHub, will the memory issue affect the lsf command. I did not really check the code, but if it works in the same way as walk method for different programming languages, it still can be a memory issue because it will have all sub-prefixes in memory until it will finish to pass over prefix.

ncw · November 14, 2024, 4:11pm

No it won't. That uses a different listing primitive.

cynepco3hahue · December 4, 2024, 4:25pm

Thanks for the answer. Sorry that I am hijack the topic, but I am trying to implement now the W/A specified here Big syncs with millions of files · rclone/rclone Wiki · GitHub and looks like just list files take much more time than copy files. I can try to fetch exact metrics for it, but I am curious if we have any kind of parallelism during rclone lsf command?

kapitainsky · December 4, 2024, 5:42pm

This works very well with S3 where rclone can list up to 1000 objects per request as supported by AWS S3 spec.

Now question would be whether Azure supports something similar and if we can implement it. The former part of this question is to some Azure gurus on this forum.

cynepco3hahue · December 5, 2024, 7:20am

In general we have two clouds, and I ran it for AWS s3

rclone lsf -vvv --files-only --format 'pst' -R remote:<remote>

and it was pretty slow, after I tried to run without modification time in format and it was much faster

rclone lsf -vvv --files-only --format 'ps' -R remote:<remote>

From what I can see both ListObjects - Amazon Simple Storage Service and
ListObjectsV2 - Amazon Simple Storage Service should return LastModified for each object as part of response, so I unsure why with t format parameter is much slower

system · January 4, 2025, 7:20am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.