Sync with S3 inventory files

Has this been talked about? I think it would be a great idea to be able to run a sync to/from S3 using its inventory files instead of making API calls for the listings. This would be especially beneficial for buckets with millions of objects, and probably faster too.

Sounds like a great idea. Not heard of inventory files. How do you get them - can you point me at some docs?

Sure!

https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html

I believe you have to configure which buckets you want inventory files generated for, and how often. S3 then writes a manifest.json file into the bucket which lists the S3 locations of compressed CSV files containing the inventory. So retrieving these files can get a bit involved. I'm not sure whether you would want rclone to fetch all of these for you, or to take a local filesystem path to the files via an argument.
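To give a feel for the layout, here is a rough sketch of reading a manifest and its CSV parts, assuming everything has already been downloaded locally and that the JSON field names ("files", "key", "fileSchema", etc.) match my reading of the docs — this hasn't been tested against a real inventory:

```go
// Rough sketch: read a locally downloaded manifest.json and print the rows
// from the gzipped CSV inventory parts it lists. Field names are assumptions
// based on the S3 inventory docs, not verified against a real manifest.
package main

import (
	"compress/gzip"
	"encoding/csv"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
)

type manifest struct {
	SourceBucket string `json:"sourceBucket"`
	FileFormat   string `json:"fileFormat"`
	FileSchema   string `json:"fileSchema"` // e.g. "Bucket, Key, Size, LastModifiedDate, ETag"
	Files        []struct {
		Key         string `json:"key"`
		Size        int64  `json:"size"`
		MD5Checksum string `json:"MD5checksum"`
	} `json:"files"`
}

func main() {
	raw, err := os.ReadFile("manifest.json")
	if err != nil {
		log.Fatal(err)
	}
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		log.Fatal(err)
	}
	fmt.Println("schema:", m.FileSchema)

	for _, f := range m.Files {
		// Assume each gzipped CSV part has already been fetched into the current dir.
		fh, err := os.Open(filepath.Base(f.Key))
		if err != nil {
			log.Fatal(err)
		}
		gz, err := gzip.NewReader(fh)
		if err != nil {
			log.Fatal(err)
		}
		r := csv.NewReader(gz)
		for {
			rec, err := r.Read()
			if err == io.EOF {
				break
			}
			if err != nil {
				log.Fatal(err)
			}
			// Columns follow fileSchema; typically bucket, key, size, ...
			fmt.Println(rec)
		}
		gz.Close()
		fh.Close()
	}
}
```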

The CSV files provide a lot of the metadata for objects, but not all of it. For example, I don’t believe MD5 checksums are available in these files, so some rclone functionality would be lost.

OK, let’s assume for the sake of argument that we have a manifest file on disk for a bucket. We could then supply it to rclone as --s3-manifest /path/to/file. rclone would then use that file for listing the bucket rather than using the API.
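For illustration (the flag is purely hypothetical at this point), usage might look something like: rclone sync --s3-manifest /path/to/manifest.json s3:bucket /local/dest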

Does that sound like the sort of thing?

Is it possible to get the object metadata in the manifest? Rclone won’t have access to either the modtime or the checksum for large files, which will make it less useful. The docs say “You can configure what object metadata to include in the inventory”, so I thought it could, but I couldn’t see how to do it.

That sounds like a plan.

As far as the metadata goes, I don’t think you can request additional metadata for the files. The inventory does provide the ETag though, which “may or may not be the md5 checksum” of the file. Does rclone use the ETag as the MD5 check, or does it create its own metadata value for this too?
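As a side note, a single-part upload normally has an ETag that is just the hex MD5, while a multipart ETag contains a dash and a part count, so a consumer of the inventory could at least guess which case it is dealing with. A rough sketch (the helper name is mine):

```go
// Sketch: guess whether an S3 ETag can be used as the object's MD5.
// Single-part uploads (without SSE-C/KMS) have a plain 32-hex-digit MD5 ETag;
// multipart ETags look like "<md5-of-part-md5s>-<part count>" and are not the
// MD5 of the whole object.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var plainMD5 = regexp.MustCompile(`^[0-9a-f]{32}$`)

func etagIsMD5(etag string) bool {
	etag = strings.Trim(etag, `"`) // ETags are often returned quoted
	return plainMD5.MatchString(etag)
}

func main() {
	fmt.Println(etagIsMD5(`"9e107d9d372bb6826bd81d3542a419d6"`))    // true: plain MD5
	fmt.Println(etagIsMD5(`"0c78b9b5f5f1a9a1e3c1a2b3c4d5e6f7-12"`)) // false: multipart
}
```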

This could also still be highly effective if used with the --update and --use-server-modtime flags to sync from one remote to another.
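For example, something along the lines of rclone sync --update --use-server-modtime s3:source-bucket remote:dest (both flags already exist) — the idea being that the server-side LastModified, which the inventory CSV can include, is enough to decide what to copy.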

It does, where appropriate (which is for small files).

For large files it stores the MD5 as metadata on the object.

Yes it could.

Do you fancy making a new issue on GitHub about this and summarising the thread in it?

Maybe you’d like to help work on the issue?