Proposal: Metadata Remote

There has been a lot of discussion in various places, most notably #3667, about hashes on crypt. I would say the problem goes even further: you need to be really careful about which features you use based on the remote (e.g. does it support hashes? Which ones? ModTime? What ModTime resolution?).

Therefore, I propose a "metadata" remote. The idea is simple: it is a meta-remote like chunker or compression, where the file metadata is stored in a sidecar file (JSON? XML? YAML?). That sidecar file would hold the hashes (likely all of them, since they can be computed as the data passes through rclone, or just the source one if that is too complicated), the modification times, and anything else rclone may need.
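To make the sidecar idea concrete, here is a rough sketch of what building one could look like if JSON were chosen. Everything in it is an assumption for illustration: the field names, the `version` key, and the `build_sidecar` helper are hypothetical and not part of rclone. It only shows that several hashes can be computed in a single pass as the data streams through.

```python
import hashlib
import json


def build_sidecar(name, chunks, mod_time_ns):
    """Hash the data in one pass and return the sidecar as JSON text.

    `chunks` is any iterable of bytes, standing in for the data stream.
    All field names below are hypothetical.
    """
    hashers = {
        "md5": hashlib.md5(),
        "sha1": hashlib.sha1(),
        "sha256": hashlib.sha256(),
    }
    size = 0
    for chunk in chunks:
        size += len(chunk)
        # Feed every hasher from the same chunk, so the data is read once.
        for hasher in hashers.values():
            hasher.update(chunk)

    metadata = {
        "version": 1,
        "name": name,
        "size": size,
        "mod_time_ns": mod_time_ns,
        "hashes": {algo: h.hexdigest() for algo, h in hashers.items()},
    }
    return json.dumps(metadata, sort_keys=True)


if __name__ == "__main__":
    print(build_sidecar("notes.txt", [b"hello ", b"world"], 1_700_000_000_000_000_000))
```

Storing the modification time as integer nanoseconds is just one way to avoid depending on the resolution of the underlying remote.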

A bonus would be an option to also shorten the file names by using the hash (or a random value such as a uuid4) as the stored name and recording the original name in the metadata file.

This solves the hash-for-crypt issue as well as, potentially, the long-filenames issue (especially if the shortened name is something really short like a CRC32 or xxhash). It also means there doesn't need to be a new crypt remote incorporating the various proposed ideas.
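As a rough illustration of the short-name option, the sketch below derives a stored name either from a CRC32 or from a random uuid4. It assumes the hash is taken over the original name (it could just as well be over the content), and both helper names are hypothetical.

```python
import uuid
import zlib


def short_name_crc32(original_name):
    """Return an 8-character hex name derived from the original name."""
    crc = zlib.crc32(original_name.encode("utf-8")) & 0xFFFFFFFF
    return format(crc, "08x")


def short_name_random():
    """Return a 32-character random hex name (uuid4)."""
    return uuid.uuid4().hex


if __name__ == "__main__":
    long_name = "a very long file name " * 20 + ".txt"
    print(len(long_name), "->", short_name_crc32(long_name))
    print(short_name_random())
```

A 32-bit value is short but can collide within a large directory, so a real implementation would need a collision check or a longer name. The original name lives only in the sidecar either way.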

Past Discussion

Doing my homework: some of these ideas were discussed before, for example in "Unified method of preserving file and directory metadata" #977, where @ncw said:

  • files should be stored in a way that other tools can retrieve them (eg web interface etc)

but he also noted that crypt already breaks that. So do chunker and compression (the latter of which also uses a sidecar file).

[Discussion] Metadata - How, Which & Where #1336 also talks about this with a BoltDB (which I think is now dead?) but that loses some of the features and atomicity of a single sidecar file.

This would also close #1033, #1712 and the aforementioned #3667 all relating to hashes on crypt.

And #1337, and #1202 (if desired)

And it could potentially close #949.

Potential Issues

No plan is perfect so I think it is worth thinking through potential issues and mitigations. Overall, I think these are minor.

Modifications outside of rclone: There is already precedent for warning against this in the docs for things like chunker (filenames) and compression (sidecar file), so that isn't new. It can also be partially mitigated by validating the remote-specific metadata, such as size (on all remotes), hash (when supported and cheap to request), and even ModTime (even when the remote does not support setting it, we can still check that it has not changed).
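The validation described above could look roughly like the following. This is only a sketch under assumptions: it supposes the sidecar keeps a `remote` section recording what the underlying remote reported at upload time, and the `validate_sidecar` helper and its field names are hypothetical.

```python
def validate_sidecar(sidecar, remote_size, remote_hashes=None, remote_mod_time_ns=None):
    """Compare what the remote reports now with what the sidecar recorded.

    Returns a list of mismatching fields; an empty list means the object
    looks untouched. Checks the remote cannot support are skipped.
    """
    recorded = sidecar["remote"]
    problems = []

    # Size is available on every remote, so always check it.
    if recorded["size"] != remote_size:
        problems.append("size")

    # Only compare hash types that both the remote and the sidecar have.
    for algo, value in (remote_hashes or {}).items():
        expected = recorded.get("hashes", {}).get(algo)
        if expected is not None and expected != value:
            problems.append("hash:" + algo)

    # The remote's own ModTime may not be settable, but it should not change.
    if remote_mod_time_ns is not None and "mod_time_ns" in recorded:
        if recorded["mod_time_ns"] != remote_mod_time_ns:
            problems.append("mod_time")

    return problems


if __name__ == "__main__":
    sidecar = {"remote": {"size": 11, "hashes": {"md5": "abc"}, "mod_time_ns": 5}}
    print(validate_sidecar(sidecar, 11, {"md5": "abc"}, 5))
    print(validate_sidecar(sidecar, 12, {"md5": "xyz"}, 6))
```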

Lost Sidecar: Either present the file without its metadata or ignore the file entirely.
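The two lost-sidecar policies can be sketched as a listing step that pairs each data file with its sidecar. The `.meta.json` suffix, the policy names, and the `pair_listing` helper are all hypothetical choices made for this example.

```python
SIDECAR_SUFFIX = ".meta.json"  # hypothetical naming convention


def pair_listing(names, on_missing="hide"):
    """Map each data file to its sidecar name.

    on_missing="hide": files without a sidecar are left out of the listing.
    on_missing="bare": they are listed with None, i.e. without metadata.
    """
    if on_missing not in ("hide", "bare"):
        raise ValueError("on_missing must be 'hide' or 'bare'")

    present = set(names)
    listing = {}
    for name in names:
        if name.endswith(SIDECAR_SUFFIX):
            continue  # sidecars themselves are never shown to the user
        sidecar = name + SIDECAR_SUFFIX
        if sidecar in present:
            listing[name] = sidecar
        elif on_missing == "bare":
            listing[name] = None
    return listing


if __name__ == "__main__":
    names = ["a.txt", "a.txt.meta.json", "b.txt"]
    print(pair_listing(names, "hide"))
    print(pair_listing(names, "bare"))
```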

Reading file contents for listing: This one stings. To list the files (especially if the name is in the metadata file), the sidecar file has to be read. This is a new cost on some remotes, but on S3, for example, getting metadata already requires an additional API call. It is also similar to some of the proposed crypt changes, where the header has to be read. It will hurt more on remotes like B2, which need more API calls to download, but I think that is part of the compromise.

Alternatives

Just to think through them

on-remote metadata db: Either one per directory or one per remote. This solves the reading problem but is very risky, as there is no atomicity and a broken transfer means you lose the changes.

on-machine metadata db: This is similar to some of the proposed methods to save checksums. I like the idea for some use cases, but it also means that if you do not have the db (or lose it), you are out of luck. Worth thinking about, but I suspect it is not a great path forward.

Use remote-specific fields: This was exhaustively documented in #3667, but it does not solve the problem of being uniform across all remotes. It just trades one issue for another.


Thoughts? I know this isn't perfect, but it is a good path forward to solve the crypt and filename issues, and to add feature support to otherwise unsupported remotes. Plus, it is a wrapper remote, so nobody has to use it.

Thanks!
