High memory consumption during copy from Azure blob storage to another Azure blob storage

What is the problem you are having with rclone?

rclone consumes a lot of memory while running a copy from Azure blob storage to another Azure blob storage.

IIUC, from my basic calculation I should have in memory:

TRANSFERS * RCLONE_AZUREBLOB_UPLOAD_CONCURRENCY * RCLONE_AZUREBLOB_CHUNK_SIZE = 1280Mi
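
With --transfers 20 and what I believe are the default azureblob settings (upload concurrency 16, chunk size 4Mi), that works out to:

    20 (transfers) * 16 (upload concurrency) * 4 MiB (chunk size) = 1280 MiB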

Run the command 'rclone version' and share the full output of the command.

# rclone --version
rclone v1.67.0
- os/version: amazon 2 (64 bit)
- os/kernel: 3.10.0-1160.95.1.el7.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.22.4
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

azureblob

The command you were trying to run (eg rclone copy /tmp remote:tmp)

/usr/local/bin/rclone copy --metadata --stats 60m --stats-log-level ERROR --transfers 20 --log-level ERROR --use-json-log --retries 1 src:<container_name> dst:<container_name>

The rclone config contents with secrets removed.

I do not use any special config file; everything is passed either as command arguments or environment variables, so I will list the environment variables I am using here:

# related to src and dst
'AZURE_SRC_STORAGE_ACCOUNT'
'AZURE_SRC_CONTAINER'
'AZURE_SRC_STORAGE_SAS_TOKEN'
'AZURE_DST_STORAGE_ACCOUNT'
'AZURE_DST_CONTAINER'
'AZURE_DST_STORAGE_SAS_TOKEN'

# related to rclone configuration
'RCLONE_CONFIG_SRC_TYPE': 'azureblob'
'RCLONE_CONFIG_SRC_SAS_URL': <src_sas_url>
'RCLONE_CONFIG_DST_TYPE': 'azureblob'
'RCLONE_CONFIG_DST_SAS_URL': <dst_sas_url>
'RCLONE_AZUREBLOB_NO_CHECK_CONTAINER': 'true'
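
For reference, in shell form the rclone-related variables are just the following (values elided):

    # remotes defined purely via environment variables, no config file on disk
    export RCLONE_CONFIG_SRC_TYPE=azureblob
    export RCLONE_CONFIG_SRC_SAS_URL=<src_sas_url>
    export RCLONE_CONFIG_DST_TYPE=azureblob
    export RCLONE_CONFIG_DST_SAS_URL=<dst_sas_url>
    export RCLONE_AZUREBLOB_NO_CHECK_CONTAINER=true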

A log from the command with the -vv flag

I don't have logs, because this is happening for one of our customers and the logs contain some confidential data, but I do have the output from rclone stats:

{
  "message": "rclone stats",
  "timestamp": "2024-09-28T04:45:39.481965Z",
  "level": "INFO",
  "pathname": "/cloud_migration_agent/cloud_agent/cloud_agent.py",
  "line_number": 232,
  "process_id": 1,
  "threadName": "MainThread",
  "bytes": 64232164651661,
  "checks": 63159142,
  "deletedDirs": 0,
  "deletes": 0,
  "elapsedTime": 248400.017623065,
  "errors": 0,
  "eta": null,
  "fatalError": false,
  "renames": 0,
  "retryError": false,
  "serverSideCopies": 0,
  "serverSideCopyBytes": 0,
  "serverSideMoveBytes": 0,
  "serverSideMoves": 0,
  "speed": 341611303.33405966,
  "totalBytes": 64221630675261,
  "totalChecks": 63169158,
  "totalTransfers": 495339,
  "transferTime": 248388.317982617,
  "transfers": 495338
}
!!!! here was the first jump in memory consumption, from 1 GB to 15 GB, see the metrics !!!!
{
  "message": "rclone stats",
  "timestamp": "2024-09-28T03:45:39.579526Z",
  "level": "INFO",
  "pathname": "/cloud_migration_agent/cloud_agent/cloud_agent.py",
  "line_number": 232,
  "process_id": 1,
  "threadName": "MainThread",
  "bytes": 63436603913425,
  "checks": 62210374,
  "deletedDirs": 0,
  "deletes": 0,
  "elapsedTime": 244800.022861132,
  "errors": 0,
  "eta": null,
  "fatalError": false,
  "renames": 0,
  "retryError": false,
  "serverSideCopies": 0,
  "serverSideCopyBytes": 0,
  "serverSideMoveBytes": 0,
  "serverSideMoves": 0,
  "speed": 395717898.7707056,
  "totalBytes": 63433873851477,
  "totalChecks": 62220389,
  "totalTransfers": 494910,
  "transferTime": 244788.393254615,
  "transfers": 494907
}

IIUC, while rclone was only checking files that already exist, it did not really consume any memory, but once it reached files that needed to be transferred it started to consume a lot of memory. To be honest, it also shows some transfers before the memory jump, so I am unsure what happened here.

Our code restarted the pod running rclone for the same storage account, and from the very beginning it started to consume a lot of memory; see the next graph in the timeline.

Here is the output from the next run:

{
  "message": "rclone stats",
  "timestamp": "2024-09-29T06:26:24.507395Z",
  "level": "INFO",
  "pathname": "/cloud_migration_agent/cloud_agent/cloud_agent.py",
  "line_number": 232,
  "process_id": 1,
  "threadName": "MainThread",
  "bytes": 372150918280,
  "checks": 84822876,
  "deletedDirs": 0,
  "deletes": 0,
  "elapsedTime": 18000.007430341,
  "errors": 0,
  "eta": 0,
  "fatalError": false,
  "renames": 0,
  "retryError": false,
  "serverSideCopies": 0,
  "serverSideCopyBytes": 0,
  "serverSideMoveBytes": 0,
  "serverSideMoves": 0,
  "speed": 209433916.89574507,
  "totalBytes": 372150918280,
  "totalChecks": 84832890,
  "totalTransfers": 81273,
  "transferTime": 5012.228343695,
  "transfers": 81273
}

This might be a manifestation of a very old rclone limitation which has not been tackled yet. Have a look at this GitHub issue:

It also contains possible workarounds.

Consider rclone sponsorship if your company needs it. It can speed things up :)

Thanks for such a quick response. It really does look similar to the GitHub issue; I will check whether we really have folders with that many files.
Sponsorship is a good idea :slight_smile: I will check internally how many circles of hell I need to pass through to make it possible.


Regarding the issue that you mentioned, I just want to be sure: when we talk about the number of files under a folder, which level are we referring to?
For example if I have:

container
--folder1
----file11
----file12
----file13
--folder2
----file21
----file22
--folder3
----file31
--file1
--file2
...

And I am running the copy at the container level: src:container
Does it count the number of files separately for each folder, or for the whole container?

{container: [file1, file2, file11, file12, file13, file21, file22, file31]}
or
{folder1: [file11, file12, file13], folder2: [file21, file22], folder3: [file31], container: [file1, file2]}

Because, looking at our folder hierarchy under the container, the chances are very slim that we have a single folder containing more than 1 million files.

Hi folks, I read the code a little and found that for Azure we do not use server-side copy and instead use the manualCopy method. Can that affect the memory consumption?

You can see that in the stats

It shouldn't but sometimes there are bugs!

Try using this flag

  --use-mmap   Use mmap allocator (see docs)

And see if that helps.

Or alternatively ask the garbage collector to work a bit harder - set the environment variable GOGC=20
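
For example, applied to the command above (a sketch, keeping your original flags):

    GOGC=20 /usr/local/bin/rclone copy --use-mmap --metadata --stats 60m --stats-log-level ERROR --transfers 20 --log-level ERROR --use-json-log --retries 1 src:<container_name> dst:<container_name>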

Thanks for the answer!
I also tried enabling server-side copy in my fork of rclone, and it looks like it works. In azureblob.go I changed it to

	f.features = (&fs.Features{
		ReadMimeType:            true,
		WriteMimeType:           true,
		BucketBased:             true,
		BucketBasedRootOK:       true,
		SetTier:                 true,
		GetTier:                 true,
		ServerSideAcrossConfigs: true,
	}).Fill(ctx, f)

and it worked fine. The only limitation that I can see is that the firewall configuration on the source must allow access from the destination, see Copy fails (CannotVerifyCopySource) when storage account firewall is enabled · Issue #755 · Azure/azure-storage-azcopy · GitHub.
Let me know if it is worth making this option available via a flag in azureblob; if so, I can prepare a PR.

Using the flag --server-side-across-configs should do this for you without a patch
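
For example (again just a sketch based on your original command):

    /usr/local/bin/rclone copy --server-side-across-configs --metadata --stats 60m --stats-log-level ERROR --transfers 20 --log-level ERROR --use-json-log --retries 1 src:<container_name> dst:<container_name>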

Oh, that is great :slight_smile: Thanks
Another question: I saw that a copy with server-side copy is much slower than a regular copy with the same parameters:

--transfers 20
--checkers 8 (the default)

Do you maybe know why that could be? The difference was huge, 90m vs 2m in the same time frame.

Weird! That must be Azure being slow, as rclone hands the data copying off to Azure for server-side copies.

If you run with debug do you see messages about it waiting for the copy to complete?

Hi folks, we did some investigation of our internal folder organization, and it definitely looks like the bug described in High memory consumption during copy from Azure blob storage to another Azure blob storage - #2 by kapitainsky. The important detail is that in our case the problem is not a lot of files under a single prefix, but a lot of prefixes that each contain a small number of files; for example, we have a prefix that contains around 50m sub-prefixes :frowning: Currently we do not really have a way to change our folder structure, so I just increased the memory for the pod that runs rclone, and that helped. I hope the fix will land in v1.69 as mentioned in the GitHub issue.

And I have an additional question regarding the W/A specified in Big syncs with millions of files · rclone/rclone Wiki · GitHub: will the memory issue also affect the lsf command? I did not really check the code, but if it works the same way as the walk method in various programming languages, it could still be a memory issue, because it would keep all sub-prefixes in memory until it finishes traversing a prefix.

No it won't. That uses a different listing primitive.

Thanks for the answer. Sorry for hijacking the topic, but I am now trying to implement the W/A specified here: Big syncs with millions of files · rclone/rclone Wiki · GitHub, and it looks like just listing the files takes much more time than copying them. I can try to fetch exact metrics for it, but I am curious whether there is any kind of parallelism in the rclone lsf command?
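
For context, what I am running is roughly this, adapted from the wiki (the file names and the --no-traverse flag are my own additions):

    # list both sides, then compute what still needs to be copied
    rclone lsf --files-only -R src:<container_name> | sort > src-files
    rclone lsf --files-only -R dst:<container_name> | sort > dst-files
    comm -23 src-files dst-files > need-to-transfer
    # copy only the missing files without walking the whole tree again
    rclone copy --files-from-raw need-to-transfer --no-traverse src:<container_name> dst:<container_name>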

This works very well with S3 where rclone can list up to 1000 objects per request as supported by AWS S3 spec.

Now the question is whether Azure supports something similar and whether we can implement it. The first part of that question is for the Azure gurus on this forum.

In general we have two clouds, and I ran it against AWS S3:

rclone lsf -vvv --files-only --format 'pst' -R remote:<remote>

and it was pretty slow. I then tried running it without the modification time in the format, and it was much faster:

rclone lsf -vvv --files-only --format 'ps' -R remote:<remote>

From what I can see, both ListObjects - Amazon Simple Storage Service and
ListObjectsV2 - Amazon Simple Storage Service should return LastModified for each object as part of the response, so I am unsure why the 't' format parameter makes it so much slower.
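
One guess on my side (an assumption, I have not verified this in the code): if the 't' format makes rclone read its own modtime metadata per object instead of using the LastModified from the listing, then something like this might avoid the per-object reads:

    # assumption: --use-server-modtime tells rclone to trust the server's
    # LastModified instead of fetching per-object metadata
    rclone lsf -vvv --files-only --format 'pst' --use-server-modtime -R remote:<remote>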