--files-from --ignore-existing with many entries takes a long time to start

What is the problem you are having with rclone?

Starting the transfers takes a very long time, which becomes apparent with a --files-from file with 10M entries.

What is your rclone version (output from rclone version)

rclone v1.58.0-beta.5848.454574e2c

Which OS you are using and how many bits (eg Windows 7, 64 bit)

  • os/version: ubuntu 20.04 (64 bit)
  • os/kernel: 5.4.0-1045-aws (x86_64)

Which cloud storage system are you using? (eg Google Drive)

s3

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy s3:a s3:b -P --files-from=segment1M --no-traverse --ignore-existing -vv
rclone copy s3:a s3:b -P --files-from=segment10M --no-traverse --ignore-existing -vv

The rclone config contents with secrets removed.

[s3]
type = s3
provider = AWS
env_auth = false
access_key_id = …
secret_access_key = …

A log from the command with the -vv flag

2021/11/04 08:35:05 DEBUG : rclone: Version "v1.58.0-beta.5848.454574e2c" starting with parameters ["rclone" "copy" "s3:a" "s3:b" "-P" "--files-from=segment1M" "--no-traverse" "--ignore-existing" "-vv"]
2021/11/04 08:35:05 DEBUG : Creating backend with remote "s3:a"
2021/11/04 08:35:05 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2021/11/04 08:35:05 DEBUG : Creating backend with remote "s3:b"

It's clear that it takes a moment to read the file, but that should be done in under a minute. What I observe is the following:
With 1M entries it takes about 5 minutes after the last debug message for the checking to start. There's no output, not much CPU usage and no disk usage during that time, so it's unclear what's happening in those 5 minutes, after loading the file and before the checking (and, later, the transfers) begins.
With a file that has 10M entries it took 40 minutes for the first checks to start.
It seems like something linear in the number of entries in the --files-from file is happening, and it does not generate any log output.

I'm sure there is some limit to the number of files in a practical sense, and 10M entries in a file seems a bit excessive. Probably need @ncw or someone to chime in, as that's a bit beyond me.

Oh, 10M is tiny. In total I have transfer jobs with about 200M entries. I was already trimming it down so that it actually starts at all. I think it would be nice if --files-from were read in a streaming fashion as well: no need to keep the whole file in memory, fetch the first 1k entries, get going, and then fetch the next 1k.

Size/time/etc. is all relative to your situation, so words like tiny/small/taking long don't really mean anything.

Your solution there is easy to script: go that route and feed in 1k entries at a time, something along the lines of the sketch below.
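For example, a rough sketch (the 1000-line chunk size and the chunk_ prefix are just placeholders, and the flags are simply copied from your commands above):

# split the big list into 1000-line chunks, then feed them to rclone one at a time
split -l 1000 segment10M chunk_
for f in chunk_*; do
  rclone copy s3:a s3:b --files-from="$f" --no-traverse --ignore-existing -P
done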

Trying to filter 10M+ lists doesn't seem like a great idea as I think you are hitting scaling issues.

Speaking of.

Everything is relative.

Like humor when it falls flat 🙂

What is happening is that rclone is creating internal objects for each entry in the --files-from list. This involves running HEAD on each item in S3 to read info about the object.

If you don't want this then set --s3-no-head or the equivalent config file entry.

This may have some consequences (like it breaks stuff), I'm not sure! Try it on a small list.
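Something like this, for example (untested; it's just your command from above with the flag added, and I believe the config-file equivalent is no_head = true in the [s3] section, but check the s3 backend docs for that):

rclone copy s3:a s3:b -P --files-from=segment10M --no-traverse --ignore-existing --s3-no-head -vv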


Could @dreamflasher also use --no-check-dest for the same effect? With S3 there isn't an issue of duplicates, and they are already doing --ignore-existing.

According to the doc (--no-traverse vs --no-check-dest - #2 by ncw), --no-check-dest would transfer everything.

Thanks a lot @ncw, I'll give that a try!

Yes, sorry. I was thinking of --ignore-times, which also ignores size and checksum and always transfers.

So basically, if you have a --files-from list with, say, 1M or 10M files, it's going to HEAD each file before anything prints out to the log, and that's the delay, as it's doing a check of every file?

Yes, that is right. It needs to make sure the files you want to copy in --files-from exist.
