I need to copy/sync 10 million+ files

What is the problem you are having with rclone?

I need to copy several million files from Ceph VFS to Google Cloud Storage. I have already migrated terabytes of data, but I have now found one folder that at current count contains over 40 million files. When I tried this with sync and copy, the operation just seemed to sit pending for an extremely long time, which I expected; I let it run for 18 hours and still no copy had started. I am assuming this must be a known problem and was hoping I could find some guidance.

Run the command 'rclone version' and share the full output of the command.

rclone v1.65.0
- os/version: debian 11.8 (64 bit)
- os/kernel: 5.10.0-26-cloud-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.21.4
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Copying from Ceph VFS to Google Cloud Storage (buckets)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone -vvv copy -P --transfers 255 --checkers 255  --gcs-bucket-policy-only live_ceph:pf-live-bucketname/u/x GCS:/secrect_google_bucketname/contenido/u/x
rclone -vvv sync -P --gcs-bucket-policy-only live_ceph:sercret-ceph_bucketname/u/x gcs://secret_google_bucketname/u/x

Please run 'rclone config redacted' and share the full output. If you get command not found, please make sure to update rclone.

rclone config redacted
[GCS]
type = google cloud storage
project_number = XXX
service_account_file = /home/username/googleprojectname-67c4c00ea196.json
object_acl = bucketOwnerFullControl
bucket_acl = private
location = eur4
storage_class = MULTI_REGIONAL

[live_ceph]
type = swift
domain = XXX
key = XXX
region = stp
storage_url = http://live.vfs.ceph.internaldomain.com/swift/v1
tenant = XXX
tenant_domain = XXX
user = XXX
auth = http://live.vfs.ceph.internaldomain.com:80/auth/v1

A log from the command that you were trying to run with the -vv flag

rclone -vvv copy -P --transfers 255 --checkers 255 --gcs-bucket-policy-only live_ceph:pf-live-api-vfs-redacted_bucketname/u/x GCS:/redacted_bucketname/contenido/u/x
2024/01/11 12:56:23 DEBUG : rclone: Version "v1.65.0" starting with parameters ["rclone" "-vvv" "copy" "-P" "--transfers" "255" "--checkers" "255" "--gcs-bucket-policy-only" "live_ceph:pf-live-api-vfs-redactedbucketname/u/x" "GCS:/redactedbucketname/contenido/u/x"]
2024/01/11 12:56:23 DEBUG : Creating backend with remote "live_ceph:pf-live-api-vfs-redacted_bucketname/u/x"
2024/01/11 12:56:23 DEBUG : Using config file from "/home/erikhilland/.config/rclone/rclone.conf"
2024/01/11 12:56:23 DEBUG : Creating backend with remote "GCS:/redacted_bucketname/contenido/u/x"
2024/01/11 12:56:23 DEBUG : GCS: detected overridden config - adding "{_wgjs}" suffix to name
2024/01/11 12:56:23 DEBUG : fs cache: renaming cache item "GCS:/redactedbucketname/contenido/u/x" to be canonical "GCS{_wgjs}:redacted_bucketname/contenido/u/x"
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:        17hrs15min

Have a look at similar threads, e.g.:

It is not unexpected that it takes a long time before transfers start.

Have you tried adding the --fast-list option?
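
For example, something along these lines (the bucket paths here are placeholders for the redacted names, and the transfers/checkers values are only illustrative):

# --fast-list makes fewer listing API calls but buffers the whole listing in memory
rclone copy -P --fast-list --gcs-bucket-policy-only --transfers 64 --checkers 64 \
  live_ceph:source-bucket/u/x GCS:dest-bucket/contenido/u/x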

I'm testing rclone now and syncing a directory that contains 32M files to Azure Blob Storage. I'm part way through the initial sync and have about 12M files (blobs) in the destination so far. With 12M blobs in the destination it takes rclone about 2 hours to start copying.

I am using the --fast-list option. Maybe it will help you out, too.

Or maybe it won't. I read some documentation today that said it may not actually be faster in some situations. I am going to try it without the option next time I start the sync and compare the "startup" times.

I'm trying the method used in that thread, and it seems solid and very similar to my original theory. But when I try the command with the --no-traverse option I get this error:

Transferred: 133.210 KiB / 133.210 KiB, 100%, 126.048 KiB/s, ETA 0s
Errors: 10 (retrying may help)
Elapsed time: 1.4s
2024/01/12 09:29:28 INFO :
Transferred: 133.210 KiB / 133.210 KiB, 100%, 126.048 KiB/s, ETA 0s
Errors: 10 (retrying may help)
Elapsed time: 1.4s

2024/01/12 09:29:28 DEBUG : 25 go routines active
2024/01/12 09:29:28 Failed to copy with 10 errors: last error was: googleapi: Error 400: Cannot insert legacy ACL for an object when uniform bucket-level access is enabled. Read more at https://cloud.google.com/storage/docs/uniform-bucket-level-access, invalid

Here I'm only testing with a small excised portion of the list. I can't modify the uniform bucket-level access setting, so now I'm looking for another option to pass to rclone.

To follow up on that last message and share this publicly: I found that using the --gcs-bucket-policy-only flag solves the uniform bucket-level access ACL issue. I will post again, hopefully with a full solution, if this works the way I think it will.
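
For reference, a sketch of the list-based approach I am testing (the bucket paths and the filelist.txt name are placeholders, not the real ones):

# Build a flat list of files from the source, one relative path per line
rclone lsf -R --files-only live_ceph:source-bucket/u/x > filelist.txt

# Copy only the listed files; --no-traverse skips scanning the destination and
# --gcs-bucket-policy-only avoids setting per-object ACLs on a uniform-access bucket
rclone copy --no-traverse --files-from filelist.txt --gcs-bucket-policy-only \
  live_ceph:source-bucket/u/x GCS:dest-bucket/contenido/u/x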

I retested without --fast-list and rclone starts finding and syncing changes much faster. I'm going to leave it off from now on.

I finished syncing 32M files to Azure Blob. Subsequent syncs take about 20 minutes to perform the 32M "checks". I am using --checkers 64.

Maybe this doesn't apply in your case since you are using GCS.

Quick question, maybe you know: I'm using the following flags: --no-traverse --fast-list --ignore-checksum --transfers 255 --checkers 255 --gcs-bucket-policy-only

I thought --ignore-checksum would skip the post-copy check, but mine is still doing it. Any ideas why?

--ignore-checksum        Skip post copy check of checksums

This is what the flag should do. Can you post a log file showing that it does not work?

You might try --s3-disable-checksum or --s3-no-head.

Take one file and test those flags.
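
A sketch of such a single-file test (the file name and bucket paths are placeholders):

# Copy one known file with debug output to a log so you can see whether
# a post-copy checksum check still runs when --ignore-checksum is set
rclone copy -vv --ignore-checksum --gcs-bucket-policy-only --log-file one-file-test.log \
  live_ceph:source-bucket/u/x/somefile.jpg GCS:dest-bucket/contenido/u/x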