I would like to copy the contents of a bucket to a local filesystem.
I have two folders that the data needs to end up in:
/Archive
/InUse
Archive is read-only once data is in there, but InUse gets data deleted from it quickly.
I need the data to end up in both, so I thought I'd run a copy to InUse with --compare-dest=Archive, then a copy to Archive.
But what guarantees that the second run won't find additional files on S3, which would land in Archive and therefore be skipped for InUse on the next run? (Unfortunately the min-age parameter only accepts relative durations like 5 seconds, not a Unix timestamp that could be passed to both commands.)
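For reference, the two commands I had in mind look roughly like this (the remote and path names are placeholders):

# 1) copy only what Archive doesn't already have into InUse
rclone copy s3:bucket /InUse --compare-dest /Archive
# 2) then mirror the bucket into Archive; anything uploaded between
#    these two runs ends up in Archive without ever reaching InUse
rclone copy s3:bucket /Archive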
How should I set this up?
Is my best option really to list the source bucket and the destination Archive folder, do a diff,
and then do two copy operations based on those files, first to InUse, then to Archive?
Well, InUse is used by customers to take the data; they delete files once the data has been copied over, in order to track the progress of the copy. (Don't ask me why they can't compare against their own folder instead.)
I hoped to find an easy way to copy data from S3 into both folders, while detecting, based on Archive, what still needs to end up in InUse.
Both folders get the same data, but since InUse has data deleted from it, I need to compare what has been downloaded so far against another folder, namely Archive.
Without the timestamp feature there's no way to "atomically" copy to InUse based on both the source and the Archive folder, and then copy those same files into Archive as well.
If you want two copies, you'll have to run two rclone commands... There isn't anything which will duplicate a file at the moment.
Maybe you should do something like this:
rclone lsf --files-only -R Archive | sort > before
rclone sync bucket: Archive
rclone lsf --files-only -R Archive | sort > after
comm -13 before after > new-files # only the lines unique to "after", i.e. files the sync just added
rclone copy --files-from new-files Archive InUse # no need to copy from bucket here
This makes Archive a complete copy of the bucket, and the files copied to InUse are exactly those newly created in Archive.
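A minimal sketch of that sequence as one script, assuming a local /Archive and /InUse and an s3:bucket remote (all names are placeholders):

#!/usr/bin/env bash
set -euo pipefail

rclone lsf --files-only -R /Archive | sort > before
rclone sync s3:bucket /Archive
rclone lsf --files-only -R /Archive | sort > after

# files present only in "after" are the ones the sync just added
comm -13 before after > new-files

# copy just those files locally from Archive into InUse
if [ -s new-files ]; then
    rclone copy --files-from new-files /Archive /InUse
fi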
Yes, that's what I thought as well.
I think what makes this a bit harder is that Archive also gets its files expired after some days, but I can do a max-age trick for that.
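What I mean by the max-age trick is roughly this, assuming a 7-day expiry window (the remote name and the duration are just examples):

# only consider bucket objects newer than the expiry window,
# so files already expired from Archive aren't re-downloaded
rclone sync s3:bucket /Archive --max-age 7d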
Unfortunately lsf is extremely slow compared to a regular ls. On a bucket with ~50k objects, ls takes 40 seconds while lsf takes about 10 minutes.
Hmm, a bit of digging into that shows that lsf is reading the MimeType for each object even when the user didn't ask for it - that costs an extra transaction on an S3 backend.
I attempted to fix that here - can you have a go? This should make lsf the same speed as ls (provided you don't ask for the MIME type).