Blind S3 to SFTP Copies

Looking for some recommendations on a unique file transfer scenario. The source is an S3 bucket with date folders (e.g., "2020-04-28"), and new files are written throughout the day, every day. The destination is an SFTP endpoint where new files are immediately moved to a different location once the transfer is complete. This means that the destination filesystem cannot be used to maintain the state of file transfers (i.e., blind copies). The goal is to periodically transfer new files to the destination, and provide a way to redo files that failed.

I can probably write a script to periodically "detect" new files, but does rclone provide a clever way to do this? Once a list of new files is generated, would using --files-from be an efficient way to transfer the files using rclone? How can I capture failures, so that I might manually redo them later?

When using SFTP as a destination, is it possible to transfer files without the directory? For example, if the source file is "2020-04-28/unique.jpg", then transfer to destination as "unique.jpg". All of my files have unique names, so there is no chance of conflict.

hello and welcome to the forum,
it would be helpful if you posted using the question template, as it asks you some basic questions - like what is your operating system?

if you want to copy all the files from a folder and its subfolders you can do this.
i am no linux expert and there might be a better way

rclone lsf --files-only -R "$source" > source.txt
while IFS= read -r f; do
  rclone copy "$source/$f" "$dest/" -vv
done < source.txt

Apologies for not using the question template! I would be running rclone from Linux. I will review your suggested approach.

You can use the --max-age flag to only report files which are less than, say, 1h old - that could be part of the solution.

If you don't mind sometimes copying the files more than once you could (say) run this on the crontab once an hour.

rclone copy --max-age 1h10m s3:bucket sftp:server

Note the 10m overlap - decrease to decrease chance of copying twice but increase chance of not copying at all!
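If you go the hourly cron route, a crontab entry might look like this (the schedule, binary path and log location are illustrative, not from this thread - but the log file gives you something to grep later for failures):

```shell
# m h dom mon dow  command
# Run at five past every hour; keep a log so failed transfers can be redone manually.
5 * * * * /usr/bin/rclone copy --max-age 1h10m s3:bucket sftp:server --log-file /var/log/rclone-s3-sftp.log --log-level INFO
```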

If you want to make a list of all files in the bucket then do

rclone lsf --files-only s3:bucket > files

You could script this: say you had an old-files listing from your last transfer, then you could run this to discover new files which haven't been transferred yet.

rclone lsf --files-only s3:bucket > files
comm -13 old-files files > new-files

You can then transfer them like this

rclone copy --files-from new-files s3:bucket sftp:path

Then finally you'd do

mv files old-files
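The detection step can be sanity-checked without touching either remote: comm -13 suppresses the lines unique to the first file and the lines common to both, leaving only the lines unique to the second file - i.e. the new files. (Both inputs must be sorted, which rclone lsf output already is.) The file names here are made up:

```shell
# Fabricate two listings in the format `rclone lsf --files-only` would produce
printf '2020-04-27/a.jpg\n2020-04-28/b.jpg\n' > old-files
printf '2020-04-27/a.jpg\n2020-04-28/b.jpg\n2020-04-28/c.jpg\n' > files

# Lines only in `files` = files that appeared since the last run
comm -13 old-files files > new-files
cat new-files
# → 2020-04-28/c.jpg
```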

Not currently. If you search the forum and issues you'll see discussion of the --flatten flag which is what you'd need.

You'd probably need to fix that up in the copy phase - stealing @calisro's shell script

while IFS= read -r f; do
  rclone copyto "s3:bucket/${f}" "sftp:$(basename "$f")" -vv
done < new-files

Note that it would be more efficient not to stop and start rclone lots of times, as it will do the SFTP negotiation each time, so using rclone rcd and rclone rc operations/copyfile would be more efficient - but that can be for phase 2!
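A minimal sketch of that phase-2 setup - the bucket, path and file name are the placeholders from earlier in the thread, and --rc-no-auth is only sensible when the daemon is on a trusted localhost:

```shell
# Start a long-lived daemon once, so the SFTP connection is negotiated a single time
rclone rcd --rc-no-auth &

# Per new file, ask the running daemon to copy it, flattening the date folder as it goes
rclone rc operations/copyfile \
  srcFs=s3:bucket srcRemote=2020-04-28/unique.jpg \
  dstFs=sftp:path dstRemote=unique.jpg
```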

@ncw The strategy of capturing state using new-files/old-files will work for me, and I like the rclone rcd idea!

With respect to rclone rcd mode, does the daemon work on all async jobs in parallel, or can I control it so that only one job is worked at a time, but additional jobs are queued? Also, how do you control per-job parallel transfers (i.e., --transfers) - would I use an options/set command? In my case, due to limited bandwidth, I would only want a fixed number of transfers to be active at once.

You can submit jobs synchronously or asynchronously - see the _async flag in the docs.

If you submit them asynchronously then the rc will not obey --transfers and will transfer as many jobs as you submit at once! Maybe it should obey --transfers - I'm not sure.

You'll have to control this yourself for the moment... You could always set --bwlimit to cap the total bandwidth used?
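If I remember the rc API right, the bandwidth cap can also be set (or changed) on an already-running daemon via core/bwlimit, which applies to everything in flight - the 1M rate here is just illustrative:

```shell
# Cap the running daemon's total bandwidth at 1 MiB/s; use rate=off to remove the cap
rclone rc core/bwlimit rate=1M
```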

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.