How to sync S3 with millions of files at root

What is the problem you are having with rclone?

Hi,

I have the same problem that has already been reported several times... yet maybe there is a workaround that I am not aware of:
I am trying to sync an S3 bucket to a local SSD. The problem is that there are millions of files (about 4 million) in the root of the remote. RAM keeps growing, and about 1 hour after launching the command, the process is killed.
What is the way to synchronize in such a situation (tons of files in one directory)?

thanks

Run the command 'rclone version' and share the full output of the command.

rclone v1.61.1

  • os/version: ubuntu 20.04 (64 bit)
  • os/kernel: 4.4.180+ (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.19.4
  • go/linking: static
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

S3 (remote) - SSD (local)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

sudo rclone sync --s3-disable-checksum --size-only minio:xxx/media/tracks /xxx/Tracks/

The rclone config contents with secrets removed.

Paste config here

A log from the command with the -vv flag

Paste log here

hello and welcome to the forum,

this approach should not have out-of-memory issues:

https://forum.rclone.org/t/recommendations-for-using-rclone-with-a-minio-10m-files/14472/4

rclone lsf -R source:bucket | sort > source-sorted
rclone lsf -R dest:bucket | sort > dest-sorted
comm -23 source-sorted dest-sorted > to-transfer
comm -12 source-sorted dest-sorted > to-delete
rclone copy --files-from to-transfer --no-traverse source:bucket dest:bucket
rclone delete --files-from to-delete --no-traverse dest:bucket

note: for testing,

  1. add --dry-run to those two commands, i.e. rclone copy|delete --dry-run ...
  2. use a debug log: --log-level=DEBUG --log-file=/path/to/rclone.log (both points are combined in the sketch below)
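
for illustration, a minimal sketch of the dry-run step, using the same placeholder remotes source:bucket and dest:bucket as above (the log paths are just examples):

rclone copy --files-from to-transfer --no-traverse --dry-run --log-level=DEBUG --log-file=/path/to/rclone-copy.log source:bucket dest:bucket
rclone delete --files-from to-delete --no-traverse --dry-run --log-level=DEBUG --log-file=/path/to/rclone-delete.log dest:bucket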

Very interesting, many thanks. Indeed, such a "manual" sync should do the trick.

Just a small remark

Should be
comm -13 source-sorted dest-sorted > to-delete
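
For reference, the comm flags suppress output columns: -23 keeps lines only in the first file (on the source but not the destination, so they need copying), -13 keeps lines only in the second file (on the destination only, so they need deleting), and -12 keeps lines common to both. A quick sketch using the listings above:

comm -23 source-sorted dest-sorted   # only on source -> to-transfer
comm -13 source-sorted dest-sorted   # only on dest   -> to-delete
comm -12 source-sorted dest-sorted   # on both        -> already in sync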

yes, i like that wording, sounds ok to me.

sorry, that is above my skill level...
though, if you tweak the script, then please post it here and i can link to it...

Note that rclone will use roughly 1 GB of RAM per million files, so you'll need 4 GB of RAM (maybe twice that) to sync 4 million files.

Otherwise it should work fine.

Here is the corrected script
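
Something along these lines, assuming the only change is the comm -13 fix for to-delete:

rclone lsf -R source:bucket | sort > source-sorted
rclone lsf -R dest:bucket | sort > dest-sorted
comm -23 source-sorted dest-sorted > to-transfer
comm -13 source-sorted dest-sorted > to-delete
rclone copy --files-from to-transfer --no-traverse source:bucket dest:bucket
rclone delete --files-from to-delete --no-traverse dest:bucket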

I think this command can be optimized, because it is currently taking hours to "copy" fewer than 400 small files.
Is there any flag that could be used to optimize it?

--- hard to be sure without seeing the debug log.
--- perhaps split to-transfer into smaller files and run rclone copy against each one; a sketch of that is below.
--- perhaps do not use --no-traverse:
"if you are copying a large number of files, especially if you are doing a copy where lots of the files under consideration haven't changed and won't need copying then you shouldn't use --no-traverse."
