I noticed (from experimentation and documentation) that the chunker overlay's chunking procedure streams the upload of a modified file to the remote. For an existing chunked file, this can cause massive re-uploads of chunks that were never modified.
For checksum-enabled remotes, it would be neat if chunker offered a flag/option to use chunk checksums to re-upload only the chunks that have been modified. This could save a lot of time and bandwidth by cutting down on the total number of chunk files rclone has to write to a remote.
The example use case is to use rclone to efficiently re-upload modified disk image files by only uploading the individual chunks which were modified.
rclone sync has the --checksum flag to upload only files that were modified - and it works great! Could this idea be extended to chunks? It would be awesome if a mounted chunker remote could leverage a similar checksum-based comparison to upload only the modified/updated chunks.
Would a feature like this be technically possible? Are there any obvious hurdles that would need to be overcome for a feature like this?
I proposed an entire backend that chunks, but based on content-defined chunking. See Feature Idea: Chunk-Based Dedupe backend - #3 for discussion. I still think that would be awesome and arguably even better than what you propose since, depending on how the disk image is made, data inserted or removed in the middle would otherwise shift all of the subsequent chunk boundaries.
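To illustrate why content-defined chunking survives insertions where fixed-size chunking does not, here is a minimal sketch of a FastCDC-style gear-hash chunker. This is not rclone code - the function name, parameters, and gear table are all my own invented illustration - but it shows the core trick: a boundary is cut wherever the low bits of a rolling hash are zero, so boundaries depend only on nearby content and re-synchronise shortly after an edit.

```python
import random

# A FastCDC-style "gear" table: one arbitrary fixed 32-bit constant per
# byte value. The constants themselves don't matter, only that they are
# fixed across runs.
_rng = random.Random(42)
_GEAR = [_rng.getrandbits(32) for _ in range(256)]

def cdc_chunks(data, mask_bits=10, min_size=64, max_size=8192):
    """Split `data` at content-defined boundaries.

    A boundary is declared where the low `mask_bits` bits of the rolling
    gear hash are zero. Because the hash's low bits depend only on the
    last few input bytes, inserting data in the middle of a file only
    disturbs chunks near the edit before boundaries re-synchronise.
    """
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = ((h << 1) + _GEAR[data[i]]) & 0xFFFFFFFF
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

With a fixed-size chunker, an 8-byte insertion at offset 5000 would change every chunk from that point on; with the sketch above, only the one or two chunks around the edit differ and the rest dedupe.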
It is an interesting idea... Rclone would likely have to buffer the chunks on disk so it could discover their MD5s before uploading them, but apart from that I think it is a workable idea.
Pardon my ignorance here, but would resumable uploads be on the critical path to unlock the ability to be selective about which chunks need to be updated on a remote?
Are there any obvious hurdles that would need to be overcome for a feature like this?
You are actually proposing an extension of parallel multipart upload. The first and foremost hurdle is a hashing algorithm that can cope with parallelism. Most if not all cloud storage providers simply stop checksumming when it comes to multipart uploads.
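For a concrete sense of the hurdle: S3, for example, does not return the MD5 of the whole object for a multipart upload. Its documented ETag for such objects is the MD5 of the concatenated binary MD5 digests of the parts, suffixed with the part count - a composite that parallelises nicely but cannot be compared against a plain whole-file MD5. A sketch of that scheme:

```python
import hashlib

def s3_multipart_etag(parts):
    """Compute the ETag S3 assigns to a multipart upload: the MD5 of the
    concatenated binary MD5 digests of each part, plus '-<part count>'.

    Each part can be hashed independently and in parallel, but the
    result is NOT the MD5 of the whole object, so it can't be checked
    against an ordinary single-stream checksum.
    """
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return hashlib.md5(digests).hexdigest() + "-" + str(len(parts))
```

Any chunk-level checksum feature would face the same trade-off: per-chunk hashes compose cheaply, but a whole-file hash for verification has to be computed separately.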
I discussed it with @ncw recently and expressed my ideas.