2021/11/04 08:35:05 DEBUG : rclone: Version "v1.58.0-beta.5848.454574e2c" starting with parameters ["rclone" "copy" "s3:a" "s3:b" "-P" "--files-from=segment1M" "--no-traverse" "--ignore-existing" "-vv"]
2021/11/04 08:35:05 DEBUG : Creating backend with remote "s3:a"
2021/11/04 08:35:05 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2021/11/04 08:35:05 DEBUG : Creating backend with remote "s3:b"
It's clear that it takes a moment to read the file, but that should be done in under a minute. What I observe is the following:
With 1M entries, it takes about 5 minutes after the last debug message before checking starts. There's no output, little CPU usage, and no disk usage during that time, so it's unclear what's happening in those 5 minutes (after loading the file and before checking, and later transferring, begins).
With a file that has 10M entries it took 40 minutes for the first checks to start happening.
It seems like something linear in the number of entries in the --files-from file is happening, and it generates no log output.
I'm sure there is some limit to the number of files in a practical sense and 10M in a file seems a bit excessive. Probably need @ncw or someone to chime in as that's a bit beyond me.
Oh, 10M is tiny. In total I have transfer jobs with about 200M entries. I was already trimming it down so that it actually starts at all. I think it would be nice if --files-from were read in a streaming fashion as well: no need to keep the whole file in memory, just fetch the first 1k entries, get going, and then fetch the next 1k.
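Until something like that exists, a workaround along these lines might help: split the big list into chunks and run one copy per chunk, so transfers start after the first chunk instead of after the whole list. This is just a sketch; the file names (`segment_list`, `chunk_*`) are illustrative, and the rclone command is printed rather than run (drop the `echo` to execute it):

```shell
#!/bin/sh
# Stand-in for a large --files-from list: 2500 fake paths.
seq 1 2500 | sed 's/^/file/' > segment_list

# Break the list into 1000-line chunks (chunk_aa, chunk_ab, chunk_ac, ...).
split -l 1000 segment_list chunk_

# One rclone copy per chunk; each chunk starts transferring as soon as
# its (much smaller) set of objects has been checked.
for f in chunk_*; do
  echo rclone copy s3:a s3:b --files-from="$f" --no-traverse --ignore-existing
done
```

This doesn't reduce the total work, it just front-loads the first transfers instead of paying the whole startup cost at once.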
What is happening is that rclone is creating an internal object for each entry in the --files-from list. This involves running a HEAD request on each item in S3 to read info about the object.
If you don't want this then set --s3-no-head or the equivalent config file entry.
This may have some consequences (like breaking stuff), I'm not sure! Try it on a small list first.
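For reference, my understanding of the "equivalent config file entry" is a `no_head` line in the remote's section of rclone.conf (the remote name `s3` here matches the log above; your provider and credential lines stay as they are):

```ini
[s3]
type = s3
# existing provider/credential settings unchanged
no_head = true
```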
Could @dreamflasher also use --no-check-dest for the same effect? With S3 there isn't an issue of duplicates, and they are already using --ignore-existing.
So basically, if you have a --files-from with, say, 1M or 10M files, rclone is going to HEAD each file before anything prints to the log, and that's the delay, since it's doing a check of every file?