I have approximately 10 million files (split into directories of around 100 files each) in Amazon S3. I am looking for a way to copy them to another S3 location and add a file extension at the same time. For example, the current files are named xyz.json and I want them to be xyz.json.gz on the target (they are already gzipped).
The "copyto" command could work fine for a few files, but I have millions, spread out over various directories. Granted, the directories do follow a naming structure, so I could script something perhaps, but even then it would likely be quite slow. Unfortunately, the "copyto" command does not do anything different than "copy" when the source is a directory.
rclone sync to local -> script to rename -> rclone sync to S3
I do not think there is a better option unless you are happy to customise rclone for your specific task and make the renaming part of the code.
You need dev resources to make it automatic - it does not matter whether it is rclone or some other tool. If you do not have them, then do it step by step - especially if it is a one-off task. Disks are cheap nowadays, including fast SSDs. Dev time is not.
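A rough sketch of those three steps (the remote names, local path, prefix and --transfers value are only examples - and I use copy rather than sync for the upload so that earlier batches on the target are not deleted):
# 1. pull a batch of files down from S3
rclone sync remote:source-bucket/some-prefix /data/batch --transfers 64
# 2. append .gz to every .json file locally
find /data/batch -type f -name '*.json' -exec sh -c 'mv "$1" "$1.gz"' _ {} \;
# 3. push the renamed files to the new S3 location
rclone copy /data/batch remote:target-bucket/some-prefix --transfers 64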
Thanks... That's actually not a bad solution (not as bad as it sounds). Using 1000 transfers, I can grab an entire day's worth of files very quickly at 3 Gbps, run a rename command locally, and send them back up.
Yep, if you can push data through at that kind of network speed then it will be much faster than going down the development path (coding, testing, bugs). Rclone can easily do the heavy lifting of the S3 <-> local transfers - and 10 million files is actually not that much.
My thinking, though, is that I never want to modify the source when doing things like this - if something goes wrong I would be toast.
It all depends - maybe it is not a problem at all. Not enough input to decide :)
An interesting idea is to use an EC2 cloud VM - it can still be used in my approach. AWS charges everything pro rata, so maybe it makes sense to pay for block storage just for the duration of this job? It is not something I have done - just thinking out loud.
Though we are lacking the exact use case and some important details,
another option is to do it 100% server-side - no downloading/uploading of files; let AWS do the actual copy.
# mass copy the files without renaming
rclone copy remote1: remote2: --server-side-across-configs
# run rclone mount
rclone mount remote2: /mnt/remote2
# use whatever rename tool you want to rename the files at /mnt/remote2, in place; the operations are all server-side.
You can mass rename in one step or do it in batches.
AWS will do the actual copy operations server-side, instead of rclone.
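A sketch of the rename step over the mount (assuming a single level of directories as in the example; use find or adjust the glob for deeper trees):
# append .gz to every .json via the mount; S3 executes each mv as a server-side copy + delete
cd /mnt/remote2
for f in */*.json; do mv "$f" "$f.gz"; done
In the rclone mount log, each rename shows up as a server-side copy followed by a delete: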
DEBUG : xyz.json: Rename: newPath="/source/xyz.json.gz"
DEBUG : xyz.json: md5 = a00f3e567db4ad8a89fc112d695c320e OK
INFO : xyz.json: Copied (server-side copy) to: source/xyz.json.gz
INFO : xyz.json: Deleted
This might actually be the fastest approach when using EC2... I would test with e.g. 1k files first to see which works best, as theory is often not reflected in practice.
I like the idea of rclone copy src: dst: and then using the mount to rename.
You could also set up an rclone rc server with rclone rcd, then use the operations/copyfile API to copy and rename each file in one step. This can do a server-side copy directly to the new name, which will save some server-side copies (these are relatively expensive API hits).
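Roughly like this (the remote and bucket names are placeholders, and --rc-no-auth is only to keep the sketch short - normally you would set rc credentials):
# start the rc server
rclone rcd --rc-no-auth &
# copy + rename one file in a single server-side operation
rclone rc operations/copyfile srcFs=remote1:source-bucket srcRemote=dir0001/xyz.json dstFs=remote2:target-bucket dstRemote=dir0001/xyz.json.gz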
You could also script rclone copyto, which is equivalent but will be less efficient than using the rc, as it does extra transactions every time rclone is started.
You'd use rclone lsf -R to get the list of files to operate on.
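A minimal sketch of the scripted variant (bucket names are placeholders; with the rc approach you would call operations/copyfile in the loop body instead of starting rclone each time):
rclone lsf -R --files-only remote1:source-bucket | while read -r f; do
    rclone copyto "remote1:source-bucket/$f" "remote2:target-bucket/$f.gz"
done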
Note that a rename operation on S3 is implemented and charged as a server-side copy (that is just the way it is on S3!).