Intelligent (faster) chunker file updates on checksum-enabled remotes

I noticed (from experimentation and documentation) that the chunker overlay streams the upload of a modified file to the remote. For an existing chunked file, this can cause a massive re-upload of chunks that may never have been modified.

For checksum-enabled remotes, it would be neat if chunker would offer a flag/option to use chunk checksums to only reupload chunks which have been modified. This could save a lot of time and bandwidth by cutting down on the total number of chunk files rclone will have to write to a remote.
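To make the idea concrete, here is a minimal sketch of the comparison step, assuming fixed-size chunks and assuming the remote can report an MD5 per chunk. The names `chunk_md5s` and `chunks_to_upload` are hypothetical, and this is not how chunker works today:

```python
import hashlib


def chunk_md5s(data: bytes, chunk_size: int) -> list[str]:
    """Return the MD5 hex digest of each fixed-size chunk of data."""
    return [
        hashlib.md5(data[off:off + chunk_size]).hexdigest()
        for off in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(local: bytes, remote_md5s: list[str], chunk_size: int) -> list[int]:
    """Return indexes of local chunks that are new or differ from the remote."""
    changed = []
    for index, digest in enumerate(chunk_md5s(local, chunk_size)):
        if index >= len(remote_md5s) or remote_md5s[index] != digest:
            changed.append(index)
    return changed


# Hypothetical 3-chunk "disk image" where only the middle chunk changes.
old_image = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
new_image = b"A" * 4096 + b"X" * 4096 + b"C" * 4096
remote = chunk_md5s(old_image, 4096)  # stands in for hashes listed from the remote
print(chunks_to_upload(new_image, remote, 4096))  # [1]
```

A real implementation would also have to delete remote chunks that lie past the end of a file that shrank, which this sketch ignores.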

The example use case is to use rclone to efficiently re-upload modified disk image files by only uploading the individual chunks which were modified.

rclone sync has the option to use the --checksum flag to only upload files which were modified - and it works great! Could this idea be expanded to chunks? It would be awesome if a mounted chunker remote can leverage a similar checksum-based chunk upload to only upload the modified/updated chunks.

Would a feature like this be technically possible? Are there any obvious hurdles that would need to be overcome for a feature like this?

I proposed an entire backend that chunks but based on content-defined chunking. See Feature Idea: Chunk-Based Dedupe backend - #3 for discussion. I still think that would be awesome and arguably even better than what you propose since, depending on how the disk image is made, it could have some data in the middle changed that would otherwise throw off all of the chunk boundaries.
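As a rough illustration of why content-defined boundaries survive an insertion, here is a toy sketch. It uses a simple polynomial rolling hash over a 16-byte window; the constants and the names `cut_points` and `split` are made up for this example, and real implementations also enforce minimum and maximum chunk sizes, which are left out here:

```python
import random

WINDOW = 16          # number of trailing bytes that decide a boundary
BASE = 257
MOD = (1 << 31) - 1
DIVISOR = 64         # roughly one boundary per 64 bytes on random data


def cut_points(data: bytes) -> list[int]:
    """Return chunk end offsets chosen from content, not from position."""
    cuts = []
    h = 0
    top = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD
        # Cut only once the window is full, so the decision depends on
        # exactly the last WINDOW bytes and nothing before them.
        if i + 1 >= WINDOW and h % DIVISOR == DIVISOR - 1:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # the final chunk always ends at end of data
    return cuts


def split(data: bytes) -> list[bytes]:
    """Split data at the content-defined cut points."""
    chunks, start = [], 0
    for end in cut_points(data):
        chunks.append(data[start:end])
        start = end
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = original[:1000] + b"!" + original[1000:]  # one byte inserted mid-file
shared = set(split(original)) & set(split(edited))
print(len(shared), "distinct chunks are identical before and after the insertion")
```

With fixed-size chunks, the same one-byte insertion would shift every later chunk boundary, so every chunk after the edit would hash differently.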

It is an interesting idea... Rclone would likely have to buffer the chunks on disk so it could discover their MD5s before uploading them, but apart from that I think it is a workable idea.
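A minimal sketch of that buffering step, assuming the source is a stream and the remote's per-chunk MD5s are already known. The helper names, the `put` callback and the `remote_md5s` mapping are hypothetical, not rclone APIs:

```python
import hashlib
import io
import tempfile
from typing import BinaryIO, Callable


def buffered_chunks(src: BinaryIO, chunk_size: int, block: int = 64 * 1024):
    """Yield (index, md5_hex, buffer) for each chunk read from src."""
    index = 0
    while True:
        md5 = hashlib.md5()
        # Stays in memory up to max_size, then spills to a file on disk.
        buf = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)
        remaining = chunk_size
        while remaining > 0:
            piece = src.read(min(block, remaining))
            if not piece:
                break
            md5.update(piece)
            buf.write(piece)
            remaining -= len(piece)
        if remaining == chunk_size:  # nothing was read: end of stream
            buf.close()
            return
        buf.seek(0)
        yield index, md5.hexdigest(), buf
        index += 1


def upload_changed(src: BinaryIO, chunk_size: int, remote_md5s: dict[int, str],
                   put: Callable[[int, bytes], None]) -> list[int]:
    """Upload only the chunks whose MD5 is missing or different on the remote."""
    uploaded = []
    for index, digest, buf in buffered_chunks(src, chunk_size):
        with buf:
            if remote_md5s.get(index) != digest:
                put(index, buf.read())
                uploaded.append(index)
    return uploaded


# Tiny demo with 4-byte chunks and an in-memory dict standing in for the remote.
store: dict[int, bytes] = {}
remote = {0: hashlib.md5(b"aaaa").hexdigest(), 1: hashlib.md5(b"bbbb").hexdigest()}
sent = upload_changed(io.BytesIO(b"aaaaBBBBcc"), 4, remote, store.__setitem__)
print(sent)  # [1, 2]
```

The cost is one extra local write and read per chunk, in exchange for skipping the upload whenever the hash already matches.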

@ivandeex - you have any thoughts?

Support for resumable uploads is planned.
No ETA yet.

Pardon my ignorance here, would resumable uploads be in the critical path to unlock the ability to be selective about which chunks need to be updated on a remote?

I will unlock my ability to answer your question after I break my path through the other critical tickets assigned to me.

Are there any obvious hurdles that would need to be overcome for a feature like this?

You are actually proposing an extension of parallel multi-part upload. The first and foremost hurdle is a hashing algorithm that can tame parallelism. Most if not all providers of cloud storage simply stop checksumming when it comes to multipart uploads.
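To illustrate the hurdle: a plain MD5 of the whole file has to be computed sequentially, so it cannot be assembled from parts hashed in parallel. One possible workaround, shown here purely as an assumption and not as anything rclone or a particular provider implements, is a composite hash built from the ordered per-part digests. The name `composite_md5` is hypothetical:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def composite_md5(parts: list[bytes], workers: int = 4) -> str:
    """Hash each part independently, then hash the ordered part digests."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the result does not depend
        # on which worker finishes first.
        digests = list(pool.map(lambda p: hashlib.md5(p).digest(), parts))
    return hashlib.md5(b"".join(digests)).hexdigest()


parts = [b"chunk-one", b"chunk-two", b"chunk-three"]
whole = hashlib.md5(b"".join(parts)).hexdigest()

# Parallel and single-worker runs agree.
print(composite_md5(parts) == composite_md5(parts, workers=1))  # True
# The composite value is a different kind of checksum from the whole-file MD5,
# so both sides must agree on the scheme and on the part boundaries.
print(composite_md5(parts), whole)
```

The catch is that such a hash only verifies a file if both ends split it at the same offsets, which is part of why it is a hurdle rather than a drop-in fix.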

I discussed it with @ncw recently and expressed my ideas.

Rclone would likely have to buffer the chunks on disk so it could discover their MD5s before uploading them.

Chunker has intrinsic mechanisms to do it more effectively: transactions and control chunks.