Copying only new daily files to a directory with 8 million files takes too long

Hi, I use rclone to copy new files daily from an S3 bucket to a local volume in Kubernetes. The files are all in one directory, which now holds 8 million files. Only a few new files arrive each day. Recently the process has been OOM killed every time. I tried a lot of things based on what I found in the forums here. I solved the memory problem by dumping the list of files (all of them, no max-age) to a text file. I then split this file into small chunks (50000 entries each) and process each chunk with:

rclone copy --ignore-existing --checksum --files-from-raw chunk.txt remote:bucket /local-volume/
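The list-then-chunk workflow described above could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the chunk prefix and the exact lsf options are illustrative and are not taken from the original post.

```shell
# Sketch of the list-then-chunk workflow described above.
# Function names and file names are placeholders, not from the original post.

# Dump every object name in the bucket to a text file, one path per line.
list_source() {
    # $1 = remote (e.g. remote:bucket), $2 = output file
    rclone lsf --files-only "$1" > "$2"
}

# Split the listing into chunk files of at most $2 lines each.
split_list() {
    # $1 = listing file, $2 = lines per chunk, $3 = output prefix
    split -l "$2" "$1" "$3"
}

# Copy one chunk at a time so each rclone run only handles that chunk's entries.
copy_chunks() {
    # $1 = remote, $2 = destination, $3 = chunk prefix
    for chunk in "$3"*; do
        [ -e "$chunk" ] || continue
        rclone copy --ignore-existing --checksum --files-from-raw "$chunk" "$1" "$2" || return 1
    done
}
```

A run might then look like `list_source remote:bucket all.txt`, `split_list all.txt 50000 chunk.` and `copy_chunks remote:bucket /local-volume/ chunk.`.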

It still takes too long (hours) for a few new files. I enabled --dump bodies and saw that an HTTP call is made for each file, even if it already exists in the destination. Is there a way to tell rclone to only check if the path exists on the destination, and if it does, to skip any HTTP call to the source bucket?
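One way to get the "only check the destination path" behaviour without involving rclone at all would be to filter the list locally before handing it to rclone. This is not something suggested in the thread, just a minimal sketch; `filter_missing` is a hypothetical helper, and it assumes the listed paths are relative to the destination directory.

```shell
# Print only the paths from a list that do not yet exist under a local directory.
# Hypothetical helper, not part of rclone.
filter_missing() {
    # $1 = listing file (one path per line), $2 = local destination directory
    while IFS= read -r path; do
        # Strip a leading slash so the path is treated as relative to the destination.
        rel=${path#/}
        [ -e "$2/$rel" ] || printf '%s\n' "$path"
    done < "$1"
}
```

Like --ignore-existing, this only tests for existence, so a file that changed on the source after it was copied would not be picked up.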

I guess another option would be to use max-age when listing the files. I would have to see whether listing takes much longer that way. Right now it takes 15 minutes to list them all with lsf. That's the next thing I'm gonna try, but I would prefer to list them all. This way, if for some reason the copy fails multiple days in a row, there is no need to adjust the max-age value to resume.

I'm using rclone 1.65.0 in Docker with the rclone/rclone image.

Thanks.

welcome to the forum,

maybe you have seen these flags?
https://rclone.org/s3/#reducing-costs
https://rclone.org/docs/#no-check-dest
https://rclone.org/docs/#no-traverse

Thank you for the links. I had already tested --checksum, --no-check-dest, --size-only, --no-traverse and --fast-list. I also tested --update --use-server-modtime. All of them make a request to the source for each file in the list.

I'm testing lsf with max-age right now. It's much longer with max-age. It has been running for an hour. Even if it takes 2 hours to list with max-age, it would be better than now, which takes 15 hours when processing the whole list.

Hmm, were these files uploaded with multipart upload? If so, rclone will be HEAD-ing them to attempt to read the MD5SUM metadata for --checksum.

To advise exactly which flags you'll need, can you paste a log with -vv? I don't need to see the whole thing, but I'd like to see the DEBUG messages for a few files which exist both on the source and the destination.

If you use lsf -R with --use-server-modtime and --max-age then rclone won't need to HEAD each file and it should list them without storing them in memory so it should be quick. Note that rclone can only list the files in chunks of 1000 so it is going to take 8,000 requests to list 8,000,000 files and rclone has to do those requests sequentially.
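As a rough sketch of that suggestion, the listing command and the request-count arithmetic could look like this. The function names and the added --files-only option are illustrative assumptions, not something specified in the reply.

```shell
# List only recently modified objects, using the modification time that comes
# back with the listing instead of a per-object HEAD request.
list_recent() {
    # $1 = remote, $2 = max age (e.g. 7d), $3 = output file
    rclone lsf -R --files-only --use-server-modtime --max-age "$2" "$1" > "$3"
}

# Number of sequential list requests for a given object count and page size,
# rounded up.
list_requests() {
    # $1 = number of objects, $2 = objects returned per request
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

For example, `list_requests 8000000 1000` prints 8000, matching the figure above, and `list_recent ams3:bucket 7d recent.txt` would write the recent file names to recent.txt.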

Thank you. I used --use-server-modtime to list the files with max-age 7d and it took 25 minutes. Not bad.

Here is the output with 3 files, 2 of which exist on the destination. I would like to get rid of the HEAD requests if possible.

2023-12-05T22:32:35.170852970Z + rclone version
2023-12-05T22:32:35.232580876Z rclone v1.65.0
2023-12-05T22:32:35.232605359Z - os/version: alpine 3.18.4 (64 bit)
2023-12-05T22:32:35.232612631Z - os/kernel: 6.1.0-12-amd64 (x86_64)
2023-12-05T22:32:35.232617611Z - os/type: linux
2023-12-05T22:32:35.232622464Z - os/arch: amd64
2023-12-05T22:32:35.232627755Z - go/version: go1.21.4
2023-12-05T22:32:35.232632573Z - go/linking: static
2023-12-05T22:32:35.232637333Z - go/tags: none
2023-12-05T22:32:35.235646874Z + cat test.txt
2023-12-05T22:32:35.236290059Z /bucket/000affac-3da5-4679-8541-759912521d48.jpeg
2023-12-05T22:32:35.236304717Z /bucket/000b38b8-cee9-4356-a14e-01c68174424c.jpeg
2023-12-05T22:32:35.236310492Z /bucket/002e2648-0805-4e09-95cc-58a5cf0d0a17.jpeg
2023-12-05T22:32:35.236412674Z + rclone copy --dry-run --no-traverse --stats 0 --dump bodies --log-level DEBUG --ignore-existing --files-from-raw test.txt ams3:bucket /volume/bucket
2023-12-05T22:32:35.301487155Z 2023/12/05 22:32:35 DEBUG : rclone: Version "v1.65.0" starting with parameters ["rclone" "copy" "--dry-run" "--no-traverse" "--stats" "0" "--dump" "bodies" "--log-level" "DEBUG" "--ignore-existing" "--files-from-raw" "test.txt" "ams3:bucket" "/volume/bucket"]
2023-12-05T22:32:35.301854866Z 2023/12/05 22:32:35 DEBUG : Creating backend with remote "ams3:bucket"
2023-12-05T22:32:35.301944067Z 2023/12/05 22:32:35 DEBUG : Using config file from "/config/rclone/rclone.conf"
2023-12-05T22:32:35.302515865Z 2023/12/05 22:32:35 DEBUG : You have specified to dump information. Please be noted that the Accept-Encoding as shown may not be correct in the request and the response may not show Content-Encoding if the go standard libraries auto gzip encoding was in effect. In this case the body of the request will be gunzipped before showing it.
2023-12-05T22:32:35.302711389Z 2023/12/05 22:32:35 DEBUG : Resolving service "s3" region "us-east-1"
2023-12-05T22:32:35.302726914Z 2023/12/05 22:32:35 DEBUG : You have specified to dump information. Please be noted that the Accept-Encoding as shown may not be correct in the request and the response may not show Content-Encoding if the go standard libraries auto gzip encoding was in effect. In this case the body of the request will be gunzipped before showing it.
2023-12-05T22:32:35.302797156Z 2023/12/05 22:32:35 DEBUG : Creating backend with remote "/volume/bucket"
2023-12-05T22:32:35.306988436Z 2023/12/05 22:32:35 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-12-05T22:32:35.307006576Z 2023/12/05 22:32:35 DEBUG : HTTP REQUEST (req 0xc000a98200)
2023-12-05T22:32:35.307011896Z 2023/12/05 22:32:35 DEBUG : HEAD /bucket/002e2648-0805-4e09-95cc-58a5cf0d0a17.jpeg HTTP/1.1
2023-12-05T22:32:35.307036033Z Host: bucket.ams3.digitaloceanspaces.com
2023-12-05T22:32:35.307042575Z User-Agent: rclone/v1.65.0
2023-12-05T22:32:35.307047376Z Authorization: XXXX
2023-12-05T22:32:35.307065654Z X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
2023-12-05T22:32:35.307069903Z X-Amz-Date: 20231205T223235Z
2023-12-05T22:32:35.307073928Z 
2023-12-05T22:32:35.307078325Z 2023/12/05 22:32:35 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-12-05T22:32:35.307154951Z 2023/12/05 22:32:35 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-12-05T22:32:35.307163311Z 2023/12/05 22:32:35 DEBUG : HTTP REQUEST (req 0xc0007fae00)
2023-12-05T22:32:35.307168597Z 2023/12/05 22:32:35 DEBUG : HEAD /bucket/000b38b8-cee9-4356-a14e-01c68174424c.jpeg HTTP/1.1
2023-12-05T22:32:35.307173975Z Host: bucket.ams3.digitaloceanspaces.com
2023-12-05T22:32:35.307179064Z User-Agent: rclone/v1.65.0
2023-12-05T22:32:35.307184346Z Authorization: XXXX
2023-12-05T22:32:35.307189324Z X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
2023-12-05T22:32:35.307205222Z X-Amz-Date: 20231205T223235Z
2023-12-05T22:32:35.307209810Z 
2023-12-05T22:32:35.307215216Z 2023/12/05 22:32:35 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-12-05T22:32:35.308439935Z 2023/12/05 22:32:35 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-12-05T22:32:35.308824483Z 2023/12/05 22:32:35 DEBUG : HTTP REQUEST (req 0xc00085c200)
2023-12-05T22:32:35.308839589Z 2023/12/05 22:32:35 DEBUG : HEAD /bucket/000affac-3da5-4679-8541-759912521d48.jpeg HTTP/1.1
2023-12-05T22:32:35.308845851Z Host: bucket.ams3.digitaloceanspaces.com
2023-12-05T22:32:35.308851340Z User-Agent: rclone/v1.65.0
2023-12-05T22:32:35.308856761Z Authorization: XXXX
2023-12-05T22:32:35.308869091Z X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
2023-12-05T22:32:35.308872853Z X-Amz-Date: 20231205T223235Z
2023-12-05T22:32:35.308895813Z 
2023-12-05T22:32:35.308902297Z 2023/12/05 22:32:35 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-12-05T22:32:35.579443458Z 2023/12/05 22:32:35 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023-12-05T22:32:35.579478623Z 2023/12/05 22:32:35 DEBUG : HTTP RESPONSE (req 0xc0007fae00)
2023-12-05T22:32:35.579493385Z 2023/12/05 22:32:35 DEBUG : HTTP/2.0 200 OK
2023-12-05T22:32:35.579499470Z Content-Length: 879633
2023-12-05T22:32:35.579504323Z Accept-Ranges: bytes
2023-12-05T22:32:35.579509694Z Content-Type: image/png
2023-12-05T22:32:35.579515157Z Date: Tue, 05 Dec 2023 22:32:35 GMT
2023-12-05T22:32:35.579520298Z Etag: "de14be5efdc3fcdb8554ce5b7730291b"
2023-12-05T22:32:35.579525163Z Last-Modified: Tue, 05 Dec 2023 14:09:06 GMT
2023-12-05T22:32:35.579530472Z Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
2023-12-05T22:32:35.579535475Z Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
2023-12-05T22:32:35.579545348Z X-Amz-Request-Id: tx00000193345b905802df9-00656fa503-471ae04c-ams3c
2023-12-05T22:32:35.579549630Z X-Envoy-Upstream-Healthchecked-Cluster: 
2023-12-05T22:32:35.579554321Z X-Rgw-Object-Type: Normal
2023-12-05T22:32:35.579558841Z 
2023-12-05T22:32:35.579563761Z 2023/12/05 22:32:35 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023-12-05T22:32:35.579608188Z 2023/12/05 22:32:35 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023-12-05T22:32:35.579615060Z 2023/12/05 22:32:35 DEBUG : HTTP RESPONSE (req 0xc00085c200)
2023-12-05T22:32:35.579695452Z 2023/12/05 22:32:35 DEBUG : HTTP/2.0 200 OK
2023-12-05T22:32:35.579719186Z Content-Length: 477605
2023-12-05T22:32:35.579725048Z Accept-Ranges: bytes
2023-12-05T22:32:35.579747629Z Content-Type: image/png
2023-12-05T22:32:35.579752082Z Date: Tue, 05 Dec 2023 22:32:35 GMT
2023-12-05T22:32:35.579756878Z Etag: "80b6c6e68d08d4c65d6bc9b3155b4374"
2023-12-05T22:32:35.579762258Z Last-Modified: Thu, 30 Nov 2023 06:11:51 GMT
2023-12-05T22:32:35.579767650Z Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
2023-12-05T22:32:35.579772702Z Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
2023-12-05T22:32:35.579777169Z X-Amz-Request-Id: tx000004100ff294cdd061f-00656fa503-471b1f6a-ams3c
2023-12-05T22:32:35.579782549Z X-Envoy-Upstream-Healthchecked-Cluster: 
2023-12-05T22:32:35.579787701Z X-Rgw-Object-Type: Normal
2023-12-05T22:32:35.579792169Z 
2023-12-05T22:32:35.579814567Z 2023/12/05 22:32:35 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023-12-05T22:32:35.579829855Z 2023/12/05 22:32:35 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023-12-05T22:32:35.579836263Z 2023/12/05 22:32:35 DEBUG : HTTP RESPONSE (req 0xc000a98200)
2023-12-05T22:32:35.580428428Z 2023/12/05 22:32:35 DEBUG : HTTP/2.0 200 OK
2023-12-05T22:32:35.580438339Z Content-Length: 3274
2023-12-05T22:32:35.580442037Z Accept-Ranges: bytes
2023-12-05T22:32:35.580445001Z Content-Type: image/png
2023-12-05T22:32:35.580447856Z Date: Tue, 05 Dec 2023 22:32:35 GMT
2023-12-05T22:32:35.580450821Z Etag: "07968bd1014dc49281bc2296b2b4737b"
2023-12-05T22:32:35.580453784Z Last-Modified: Wed, 29 Nov 2023 13:06:06 GMT
2023-12-05T22:32:35.580460670Z Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
2023-12-05T22:32:35.580463801Z Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
2023-12-05T22:32:35.580466781Z X-Amz-Request-Id: tx00000dcd1cf2331a87dc1-00656fa503-471ac284-ams3c
2023-12-05T22:32:35.580469765Z X-Envoy-Upstream-Healthchecked-Cluster: 
2023-12-05T22:32:35.580472864Z X-Rgw-Object-Type: Normal
2023-12-05T22:32:35.580475805Z 
2023-12-05T22:32:35.580479118Z 2023/12/05 22:32:35 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023-12-05T22:32:35.594772631Z 2023/12/05 22:32:35 DEBUG : bucket/000b38b8-cee9-4356-a14e-01c68174424c.jpeg: Need to transfer - File not found at Destination
2023-12-05T22:32:35.594801631Z 2023/12/05 22:32:35 NOTICE: bucket/000b38b8-cee9-4356-a14e-01c68174424c.jpeg: Skipped copy as --dry-run is set (size 859.017Ki)
2023-12-05T22:32:35.594950000Z 2023/12/05 22:32:35 DEBUG : bucket/000affac-3da5-4679-8541-759912521d48.jpeg: Destination exists, skipping
2023-12-05T22:32:35.594959515Z 2023/12/05 22:32:35 DEBUG : bucket/002e2648-0805-4e09-95cc-58a5cf0d0a17.jpeg: Destination exists, skipping
2023-12-05T22:32:35.595148729Z 2023/12/05 22:32:35 DEBUG : Local file system at /volume/bucket: Waiting for checks to finish
2023-12-05T22:32:35.595165292Z 2023/12/05 22:32:35 DEBUG : Local file system at /volume/bucket: Waiting for transfers to finish
2023-12-05T22:32:35.595170087Z 2023/12/05 22:32:35 DEBUG : 7 go routines active

Great

You are using --no-traverse so rclone is using HEAD to check the files exist before copying them.

You can try this flag

  --s3-no-head-object   If set, do not do HEAD before GET when getting objects
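Applied to the command from the log above, the flag might be used as in the sketch below. `copy_without_head` is a hypothetical wrapper name, and the other flags are simply carried over from the earlier command; whether they are all still needed is an assumption.

```shell
# Copy the files named in a list without a HEAD request per source object.
# Hypothetical wrapper; the flags themselves are the ones discussed in this thread.
copy_without_head() {
    # $1 = list file, $2 = remote, $3 = destination
    rclone copy --ignore-existing --no-traverse --s3-no-head-object \
        --files-from-raw "$1" "$2" "$3"
}
```

It could be called as `copy_without_head test.txt ams3:bucket /volume/bucket`. Since rclone then skips the HEAD, it has no size or modification time for the source object before the transfer, so this fits best with existence-only checks such as --ignore-existing.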

That did the trick. With this, I don't need to use max-age when fetching the list of files. The backup takes 35 minutes overall and does not exceed 1.5GB of memory usage. This will buy us some time until we complete the sharding of this bucket.

Thank you very much.
