Copy and add file extension to ten million files

I have approximately 10 million files (separated into directories of around 100 files each) in Amazon S3. I am looking for a way to copy them to another S3 location and add a file extension at the same time. For example, current files are named xyz.json; I want them to be xyz.json.gz on the target (they're already gzipped).

The "copyto" command would work fine for a few files, but I have millions, spread out over various directories. Granted, the directories do follow a naming structure, so I could perhaps script something, but even then it would likely be quite slow. Unfortunately, the "copyto" command does nothing different from "copy" when the source is a directory.

Open to ideas. Thx

rclone sync to local -> script to rename -> rclone sync to S3

I do not think there is a better option unless you are happy to customise rclone for your specific task and make the renaming part of the code.

You need dev resources to make it automatic - it does not matter whether it is rclone or some other tool. If you do not have them, then do it step by step - especially if it is a one-off task. Disks are cheap nowadays, including fast SSDs. Dev time is not.
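
A minimal sketch of that sync -> rename -> sync round trip, assuming a bucket named my-bucket, a per-day prefix, and a local staging path /data (all names made up for illustration):

#pull one day's worth of files down with many parallel transfers
rclone sync remote:my-bucket/2024-01-01 /data/2024-01-01 --transfers 1000

#append .gz to every downloaded .json file (bash 4+ for globstar)
shopt -s globstar
for f in /data/2024-01-01/**/*.json; do mv "$f" "$f.gz"; done

#push the renamed files up to the new location
rclone sync /data/2024-01-01 remote:new-bucket/2024-01-01 --transfers 1000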

Thanks... That's actually not a bad solution (not as bad as it sounds). Using 1000 transfers, I can grab an "entire day's" worth of files very quickly at 3Gbps, run a rename command locally, and send them back up.

Thanks!

Yep, if you can throw decent network speed at this, then it will be much faster than going the development path (coding, testing, bugs). Rclone can easily do the heavy lifting of the S3 <-> local transfers - and actually 10 million files is not that many.

welcome to the forum,

aws charges for egress and for api calls. tho i think you can avoid most of those costs by using an EC2 vm in the same datacenter/region.

and in any event, you might prefer to do it in one shot, rather than continually repeating download+rename+upload.
using something like this:

  1. rclone mount remote: /mnt/remote
  2. use whatever mass copy/rename tool you want to copy/rename files from /mnt/remote/folder to /mnt/remote/folder2 (see the sketch after this list)
  3. then rclone move remote:folder2 remote2:folder
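
For step 2, one hedged way to do the copy/rename with plain shell on the mount - folder names are placeholders, and this handles a single directory at a time, so you would repeat it (or use find) across the directory structure:

#write a .gz-suffixed copy of each .json file into the staging folder
cd /mnt/remote/folder
for f in *.json; do cp "$f" "/mnt/remote/folder2/$f.gz"; done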

It is an alternative approach, for sure.

My thinking though is that I never want to modify the source when doing things like this - if something goes wrong, I would be toast.

It all depends - maybe it is not a problem at all. Not enough input to decide :)

An interesting point is using an EC2 vm - it can still be used in my approach. AWS charges everything pro rata, so maybe it makes sense to pay for block storage just for the duration of this job? It is not something I have done - so just thinking out loud.

tho we are lacking the exact use-case and important details,
another option is to do it 100% server-side - no downloading/uploading of files; let AWS do the actual copy.

#mass copy the files without renaming
rclone copy remote1: remote2: --server-side-across-configs

#run rclone mount
rclone mount remote2: /mnt/remote2

#use whatever rename tool you want to rename the files at /mnt/remote2, in place; the operations are all server-side.
you can mass rename in one step or do it in batches.

AWS will do the actual copy operations, server-side, instead of rclone
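
For the rename itself, a minimal sketch against the mount (the path is a placeholder); each mv shows up as a server-side copy plus a delete, as in the log below:

#append .gz to every .json file, in place, via the mount
find /mnt/remote2 -name '*.json' -exec sh -c 'mv "$1" "$1.gz"' _ {} \;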

DEBUG : xyz.json: Rename: newPath="/source/xyz.json.gz"
DEBUG : xyz.json: md5 = a00f3e567db4ad8a89fc112d695c320e OK
INFO  : xyz.json: Copied (server-side copy) to: source/xyz.json.gz
INFO  : xyz.json: Deleted

This might actually be the fastest approach when using EC2... I would test with e.g. 1k files to see which is best, as theory is often not reflected in practice.

I like the idea of rclone copy src: dst: and then using the mount to rename.

You could also set up an rclone rc server with rclone rcd, then use the operations/copyfile api to copy and rename each file in one step. This could do a server-side copy directly to the new name, which will save some server-side copies (which are relatively expensive API hits).
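
A rough sketch of the rc route - remote and file names are made up, and this assumes the two remotes can server-side copy between each other:

#start the rc server without auth, listening on localhost
rclone rcd --rc-no-auth &

#one copy+rename per file over the rc api, no new rclone process each time
rclone rc operations/copyfile srcFs=remote1: srcRemote=path/xyz.json dstFs=remote2: dstRemote=path/xyz.json.gz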

You could also script rclone copyto which is equivalent but will be less efficient than using the rc as it will do extra transactions every time rclone is started.

You'd use rclone lsf -R to get the list of files to operate on.
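
Putting those together, a rough copyto loop (bucket names hypothetical); swapping the copyto call for the rc call above gives the more efficient variant:

#list every file recursively, then copy each one to the same path plus .gz
rclone lsf -R --files-only remote1:src-bucket | while IFS= read -r p; do
  rclone copyto "remote1:src-bucket/$p" "remote2:dst-bucket/$p.gz" --server-side-across-configs
done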

Note that a rename operation on S3 is implemented and charged as a server side copy (that is just the way it is on S3!).