Inverse of --exclude-if-present

hakong · January 29, 2021, 8:55pm

What is the problem you are having with rclone?

I'm trying to copy only those directories that contain a certain file.
For example, a complete directory should contain three files:

dir1/file.tar.gz.md5
dir1/file.tar.gz.gpg
dir1/file.tar.gz.gpg.md5

I would like to copy the directory only if it has the .gpg.md5 file, otherwise leave it alone.

An ideal parameter would be --include-if-present.
That way I could specify something like --include-if-present="*.tar.gz.gpg.md5". If a pattern isn't possible there, the client uploading to the bucket could mark each directory as completed by placing a marker file such as ".completed" (the inverse of ".ignore"), and I would then specify --include-if-present=".completed".

The scenario: an S3 bucket is continuously being populated with directories and files.
A client retrieves these files continuously and should avoid moving (copying and then deleting) files from a directory that has not been fully uploaded.

Am I maybe missing something with the filters that makes this possible already?

What is your rclone version (output from `rclone version`)

v1.53.4

Which OS you are using and how many bits (eg Windows 7, 64 bit)

RHEL 7 x64

Which cloud storage system are you using? (eg Google Drive)

s3

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

n/a

The rclone config contents with secrets removed.

n/a

A log from the command with the `-vv` flag

n/a

ncw · January 30, 2021, 3:43pm

I don't think you can do this with filters directly.

So I take it the .tar.gz.gpg.md5 is the last one written in the process?

So what you could do is a two-pass copy: in the first pass, find all the directories you could copy, then massage that list into an include file for the copy, something like

rclone lsf -R --absolute --files-only /path/to/source --include "*.tar.gz.gpg.md5" | sed 's/\/[^\/]*$/\/*/g' > include-dirs

Then do the copy with that

rclone copy /path/to/source remote:dest --include-from include-dirs
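
The two passes can be wrapped into one small script. This is a minimal sketch, not an rclone feature: the function names, the default marker pattern and the temp-file handling are illustrative assumptions. It also skips the copy when no marker file was found, so an empty rule file is never handed to rclone.

```shell
#!/bin/sh
# Sketch only: function names and the temp-file handling are illustrative
# assumptions, not part of rclone.

# Read marker-file paths (one per line, with a leading slash) on stdin and
# print one include rule per directory,
# e.g. /dir1/file.tar.gz.gpg.md5 -> /dir1/*
marker_paths_to_rules() {
    sed 's|/[^/]*$|/*|' | sort -u
}

# Usage: copy_completed_dirs SRC DST [MARKER_PATTERN]
copy_completed_dirs() {
    src=$1
    dst=$2
    marker=${3:-*.tar.gz.gpg.md5}
    rules=$(mktemp) || return 1

    # Pass 1: list the marker files and derive the include rules.
    rclone lsf -R --absolute --files-only "$src" --include "$marker" \
        | marker_paths_to_rules > "$rules"

    # Pass 2: copy only when at least one directory has its marker.
    status=0
    if [ -s "$rules" ]; then
        rclone copy "$src" "$dst" --include-from "$rules" || status=$?
    fi
    rm -f "$rules"
    return "$status"
}
```

A call would look like `copy_completed_dirs /path/to/source remote:dest`. A rule ending in `/*` matches only the files directly inside that directory; ending it in `/**` instead would also take in subdirectories. Directory names containing glob characters such as `*` or `[` would be read as patterns, so they would need escaping first.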

Depending on exactly how many files that is, that might make a really long filter, so you might be better off making a list of directories to exclude.

You could also use the --min-age flag: if you set it to, say, 5m, and the process that generates a directory never takes longer than 5 minutes, that might be a solution.
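
As a rough sketch of the --min-age variant (the wrapper name `copy_settled` and its default of 5m are assumptions made for illustration):

```shell
#!/bin/sh
# Sketch only: copy_settled is a hypothetical wrapper name.
# --min-age makes rclone skip files younger than the given age.
# Usage: copy_settled SRC DST [AGE]
copy_settled() {
    rclone copy "$1" "$2" --min-age "${3:-5m}"
}
```

For example, `copy_settled /path/to/source remote:dest 10m`. Note that --min-age is tested per file rather than per directory, so this only helps when every file in a batch is older than the threshold by the time the batch is complete.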

hakong · January 30, 2021, 5:42pm

Thanks for the quick reply!

> So I take it the .tar.gz.gpg.md5 is the last one written in the process?

It probably will be in my case, but it might as well be the .completed file marker that would be uploaded after the entire batch of data is done uploading.

> So what you could do is a two-pass copy: in the first pass, find all the directories you could copy, then massage that list into an include file for the copy

Great idea! I'll give this a try. The number of files going through this pipeline will be in the hundreds of thousands, totalling a few PB, but if I pull the data away quickly enough the filter size will be manageable.
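
If the include file does grow too long, one option is to split it and run one copy per chunk. A minimal sketch, assuming an include file with one rule per line as produced by the lsf/sed pass; the function name `copy_in_batches` and the default batch size of 1000 are made up for illustration:

```shell
#!/bin/sh
# Sketch only: copy_in_batches and the default batch size are assumptions.
# Usage: copy_in_batches SRC DST RULES_FILE [LINES_PER_BATCH]
copy_in_batches() {
    src=$1
    dst=$2
    rules=$3
    batch_size=${4:-1000}
    workdir=$(mktemp -d) || return 1

    # split writes chunk files named batch.aa, batch.ab, ... into workdir.
    split -l "$batch_size" "$rules" "$workdir/batch."

    status=0
    for chunk in "$workdir"/batch.*; do
        # The glob stays unexpanded when the rules file was empty.
        [ -e "$chunk" ] || continue
        rclone copy "$src" "$dst" --include-from "$chunk" || status=$?
    done
    rm -rf "$workdir"
    return "$status"
}
```

Each batch is a separate rclone run with its own listing overhead, so the batch size is a trade-off between filter length and the number of runs.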

> [...] you might be better off making a list of directories to exclude.

This won't work, since the bucket is continuously being uploaded to.

> You could also use the --min-age flag

I discussed this with the client uploading the data, and it might not be the best idea: if the upload process fails on a Friday and isn't fixed until Monday, this would not work.

I opened an issue on GitHub a while ago about the file transfer order: "File transfer order pattern", Issue #3975, rclone/rclone.

In the GitHub issue I was trying to solve:

upload files for each batch in a certain order so the receiver knows when the batch is done

Now I'm on the receiving end and need to solve:

only download data batches that have been uploaded completely

A combination of these two features would make large-scale continuous file transfers much easier with rclone, without the need to script around it and maintain file and directory lists.

Thank you for your work on rclone. I use it a lot!

ncw · January 31, 2021, 1:07pm

Have a go with the filter approach and see if that works.

Ah yes!

That probably isn't too difficult; the hardest part will be working out the command line interface!

A --include-if-present flag is probably quite hard, though, as it would subvert the directory scanning.
