Copy from list without checking either side before

What is the problem you are having with rclone?

We are trying to transfer a large number of objects, several hundred million, from a Swift container to an S3 bucket. A straight copy has not worked, because several days were spent just listing the source Swift container. We now have a list of the objects in a file on the local system, which we have broken into smaller chunks; the file h_files has 100M objects. The problem is that for several days no copying happens, because rclone is doing HEAD requests against the target side to see whether the objects exist; they don't. If we use --ignore-existing, the time is instead spent doing HEAD requests against the source side (Swift) to see whether the objects exist. We have tried many combinations of the flags below, along with a few others.
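For readers following along, breaking a listing into chunks for --files-from can be sketched with the standard split utility. This is only an illustration: the file name all_files, the chunk_ prefix, and the chunk size of 4 are hypothetical, and the ten-line list stands in for the real listing.

```shell
# Sketch: split a large object list into fixed-size chunks for --files-from.
# "all_files" and the "chunk_" prefix are hypothetical names.
workdir=$(mktemp -d)
list="$workdir/all_files"

# Stand-in for the real listing: 10 object names, one per line.
seq 0 9 > "$list"

# 4 names per chunk produces chunk_aa, chunk_ab and chunk_ac.
split -l 4 "$list" "$workdir/chunk_"

# Report how many object names each chunk holds (4, 4 and 2 here).
for chunk in "$workdir"/chunk_*; do
    printf '%s: %s objects\n' "$(basename "$chunk")" "$(wc -l < "$chunk" | tr -d ' ')"
done
```

Each resulting chunk file can then be passed to a separate rclone run via --files-from.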

rclone copy SWIFT:swift-test COS:cos-test --files-from h_files --fast-list --no-traverse --ignore-existing

What we really want is to be able to provide the list of objects via the --files-from flag and have rclone do a GET on the source and a PUT against the target. Nothing else. Is there a flag or combination of flags that allows for that?

What is your rclone version (output from rclone version)

rclone v1.49.3

  • os/arch: darwin/amd64
  • go version: go1.13

Which OS you are using and how many bits (eg Windows 7, 64 bit)

MacOS

Which cloud storage system are you using? (eg Google Drive)

OpenStack Swift and IBM COS (S3)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy SWIFT:swift-test COS:cos-test --files-from h_files --fast-list --no-traverse --ignore-existing

A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)

[mpcarl] $: rclone copy  ME_SWIFT:swift-test BMCOS:mpc-static-us-east --files-from h_files  --fast-list --no-traverse --ignore-existing --ignore-times -P --checkers 4 --transfers 2 -vv
2019/09/19 14:42:41 DEBUG : rclone: Version "v1.49.3" starting with parameters ["rclone" "copy" "ME_SWIFT:swift-test" "BMCOS:mpc-static-us-east" "--files-from" "h_files" "--fast-list" "--no-traverse" "--ignore-existing" "--ignore-times" "-P" "--checkers" "4" "--transfers" "2" "-vv"]
2019/09/19 14:42:41 DEBUG : Using config file from "/Users/mpcarl/.config/rclone/rclone.conf"

I think rclone is probably trying to read the modified time...

Try adding the flag --use-server-modtime and I think that should fix that behaviour...

Thanks for the suggestion. I added that flag with and without the others, but no luck. Although the end output reports 0 checks, there is still a HEAD request against every object in the source (Swift) then the GET (after all the heads). On the target side there is a PUT followed by the HEAD. It would be nice to get this down to just 1 GET (source) and 1 PUT (target), though I understand the extra HEAD on the target side. I am not too worried about that as I understand the reason.

Somewhat related, it does not look like any of the copies (GET/PUT) happen until all of the HEAD requests against the source complete. If the transfers could start in parallel to the source HEAD requests, at least we would be making some progress instead of spending several days just doing HEAD requests.

[mpcarl] $: rclone copy  ME_SWIFT:swift-test BMCOS:mpc-static-us-east --files-from h_files1  --fast-list --no-traverse  -P --checkers 4 --transfers 2 -vv --use-server-modtime
2019/09/20 09:18:15 DEBUG : rclone: Version "v1.49.3" starting with parameters ["rclone" "copy" "ME_SWIFT:swift-test" "BMCOS:mpc-static-us-east" "--files-from" "h_files1" "--fast-list" "--no-traverse" "-P" "--checkers" "4" "--transfers" "2" "-vv" "--use-server-modtime"]
2019/09/20 09:18:15 DEBUG : Using config file from "/Users/mpcarl/.config/rclone/rclone.conf"
2019-09-20 09:18:20 INFO  : S3 bucket mpc-static-us-east: Waiting for checks to finish
2019-09-20 09:18:20 INFO  : S3 bucket mpc-static-us-east: Waiting for transfers to finish
2019-09-20 09:18:20 DEBUG : 1: MD5 = b026324c6904b2a9cb4b88d6d61c81d1 OK
2019-09-20 09:18:20 INFO  : 1: Copied (new)
2019-09-20 09:18:20 DEBUG : 0: MD5 = 897316929176464ebc9ad085f31e7284 OK
2019-09-20 09:18:20 INFO  : 0: Copied (new)
....
2019-09-20 09:18:23 DEBUG : 110: MD5 = 2fe51daae840593fb0f4076b307ccefb OK
2019-09-20 09:18:23 INFO  : 110: Copied (new)
Transferred:   	        54 / 54 Bytes, 100%, 14 Bytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:           15 / 15, 100%
Elapsed time:        3.6s
2019/09/20 09:18:23 INFO  :
Transferred:   	        54 / 54 Bytes, 100%, 14 Bytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:           15 / 15, 100%
Elapsed time:        3.6s

2019/09/20 09:18:23 DEBUG : 18 go routines active
2019/09/20 09:18:23 DEBUG : rclone: Version "v1.49.3" finishing with parameters ["rclone" "copy" "ME_SWIFT:swift-test" "BMCOS:mpc-static-us-east" "--files-from" "h_files1" "--fast-list" "--no-traverse" "-P" "--checkers" "4" "--transfers" "2" "-vv" "--use-server-modtime"]

I was thinking about this earlier and yes I'm wrong about the HEAD requests...

In swift you need a HEAD request to figure out if a file is a dynamic or static large object. So if the file is 0 length then you'll need a HEAD request. However rclone also needs to know if it is a large object or not to produce the hash (which is used to check the upload is correct).

So if there was a mode in the swift backend which disabled large object support then we could get away with no HEAD requests with --fast-list --ignore-existing --use-server-modtime. Note that this is not using --no-traverse.

These HEAD requests are done in parallel, so increasing --checkers may speed them up.
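To get a feel for how much --checkers matters at this scale, here is a rough back-of-the-envelope sketch. The 50 ms per HEAD request is purely an assumed figure, not a measurement from this thread, and the estimate assumes the Swift cluster scales linearly with concurrency, which it may not. The function name estimate_head_hours is hypothetical.

```shell
# estimate_head_hours OBJECTS LATENCY_MS CHECKERS
# Whole hours needed to HEAD every object, assuming each checker issues
# one request at a time and every request takes LATENCY_MS milliseconds.
estimate_head_hours() {
    echo $(( $1 * $2 / 1000 / $3 / 3600 ))
}

# 100M objects (the size of h_files) at an ASSUMED 50 ms per HEAD request.
echo "4 checkers:  $(estimate_head_hours 100000000 50 4) hours"    # 347 hours
echo "64 checkers: $(estimate_head_hours 100000000 50 64) hours"   # 21 hours
```

Under those assumptions, 4 checkers would need roughly two weeks for the HEAD phase alone, while 64 would need under a day.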

rclone's architecture means that --files-from together with --no-traverse needs to build up the list of objects to transfer first, whereas --files-from on its own won't.

Have you tried the --files-from without --no-traverse and without --fast-list? That will do individual directory listings but it will only list the directories it needs to and it will start immediately.
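Spelled out, that suggestion corresponds to a command along these lines. This is a sketch that reuses the remote names and the --checkers/--transfers values seen elsewhere in this thread; the copy_with_listing wrapper and the DRY_RUN switch are hypothetical additions so the command can be inspected before it is run.

```shell
# DRY_RUN=1 (default) prints the command; DRY_RUN=0 runs rclone for real.
if [ "${DRY_RUN:-1}" = 1 ]; then runner=echo; else runner=env; fi

# copy_with_listing FILES_FROM
# Note the absence of both --no-traverse and --fast-list.
copy_with_listing() {
    "$runner" rclone copy ME_SWIFT:swift-test BMCOS:mpc-static-us-east \
        --files-from "$1" -P --checkers 4 --transfers 16 -vv
}

copy_with_listing h_files
```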

Here are the combinations I tried and the results:

--files-from: gets the entire container listing from the swift side before any copies are done
--files-from --fast-list: gets the entire container listing from the swift side before any copies are done
--files-from --no-traverse: HEAD on the source followed by GET. All of the HEADs complete before the first GET, so there is no HEAD/GET parallelism.
--files-from --no-traverse --fast-list: HEAD on the source side for all objects, followed by a HEAD on the destination for all objects. Finally a GET -> PUT, HEAD.

Have you tried the --files-from without --no-traverse and without --fast-list ? That will do individual directory listings but it will only list the directories it needs to and it will start immediately.

This one has the most potential, but the GETs/PUTs don't start until all the HEAD requests are done on both the source and target sides. In this case the objects are all at the top level of the container, so there is no directory structure. Perhaps that is why. On the target side, it does a HEAD on every object, followed by the PUT+HEAD combination.

rclone copy ME_SWIFT:swift-test BMCOS:mpc-static-us-east --files-from h_files -P --checkers 4 --transfers 16 -vv --no-traverse

I used more transfers than checkers, hoping to see the transfers start while the checkers were still running. No luck.
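One possible workaround, not something suggested in this thread and offered only as a sketch, is to run one rclone process per chunk file with several processes at a time. Each process would still finish its own HEAD requests before transferring, but with small chunks the first transfers start sooner and different processes overlap their HEAD and transfer phases. The copy_chunks_in_parallel function, the chunk_ prefix, the chunk size, and the parallelism of 3 are all hypothetical.

```shell
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 runs rclone for real.
if [ "${DRY_RUN:-1}" = 1 ]; then runner=echo; else runner=env; fi

workdir=$(mktemp -d)
seq 0 9 > "$workdir/all_files"        # stand-in for the real object list
split -l 4 "$workdir/all_files" "$workdir/chunk_"

# copy_chunks_in_parallel DIR
# One rclone process per chunk file in DIR, at most 3 running at once.
copy_chunks_in_parallel() {
    printf '%s\n' "$1"/chunk_* |
        xargs -P 3 -I {} "$runner" rclone copy ME_SWIFT:swift-test BMCOS:mpc-static-us-east \
            --files-from {} --no-traverse --checkers 4 --transfers 2
}

copy_chunks_in_parallel "$workdir"
```

Whether this helps in practice depends on how much concurrent load the Swift and S3 endpoints tolerate, so the numbers would need tuning.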

Thanks for your investigations on the combinations.

We could probably make rclone do what you want with some more flags. A flag not to check that a file exists when doing --files-from would do it. That would need co-operation from the backend though (and wouldn't work for all backends). Not sure what you'd call it: --no-check-existence or something like that.

Yes that will be why - rclone works a directory at a time.
