Filename too long

ivandeex · November 9, 2019, 5:33pm

chunker doesn't translate to unicode. it adds two suffixes: a user-configurable one and a hardcoded temporary suffix.

the former serves two purposes - it marks the chunk number and helps to tell chunks apart from ordinary files that have numbers in the name.
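To illustrate how a chunk suffix can separate chunks from ordinary files with digits in their names, here is a minimal sketch. It assumes names shaped like "file.chunk01.aaa", the example used later in this post; the regex and the `parse_chunk_name` helper are hypothetical and not chunker's actual code.

```python
import re
from typing import Optional, Tuple

# Assumed layout: <base>.chunk<number>[.<temp id>], e.g. "file.chunk01.aaa".
CHUNK_NAME = re.compile(
    r"^(?P<base>.+)\.chunk(?P<number>\d+)(?:\.(?P<temp_id>[a-z0-9]+))?$"
)

def parse_chunk_name(name: str) -> Optional[Tuple[str, int, Optional[str]]]:
    """Return (base name, chunk number, temp id or None), or None for a non-chunk."""
    match = CHUNK_NAME.match(name)
    if match is None:
        return None
    return match["base"], int(match["number"]), match["temp_id"]
```

With this pattern a file such as "report2019.txt" is not mistaken for a chunk, because the digits only count when they follow the chunk marker.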

the latter preserves data integrity and allows for parallel operations on large files (and chunker was initially conceived as a way to handle extra large files and overcome storage limits).

consider a large file located on remote storage. a process on the first box starts uploading a new version and has finished two chunks, with more yet to come. another process on a 2nd box starts downloading over a faster link. the worst thing that can happen is that the 2nd process gets a mix of old and new data. the remote storage can at best confirm hash sums for individual chunk objects, but even that is useless here, since each chunk is valid on its own while the combination is not.

chunker solves this problem in two ways: (1) parallel modifications never touch the original file before they end, but create temporary chunks and then use a (relatively) fast server-side move to commit the operation (almost) atomically; (2) chunker implements its own hash-summing over the whole file.
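The two-step idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain dict stands in for the remote, a key rename stands in for the server-side move, SHA-256 is an arbitrary choice of hash, and the `upload` function and its parameters are hypothetical rather than chunker's real API.

```python
import hashlib
from typing import Dict, List

def split_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into fixed-size pieces; an empty file still yields one chunk."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]

def upload(remote: Dict[str, bytes], name: str, data: bytes,
           chunk_size: int, temp_id: str) -> str:
    """Write chunks under temporary names, then commit them by renaming."""
    pending = []
    for number, chunk in enumerate(split_chunks(data, chunk_size), start=1):
        final_name = f"{name}.chunk{number:02d}"
        temp_name = f"{final_name}.{temp_id}"
        remote[temp_name] = chunk  # readers of the old version never see this name
        pending.append((temp_name, final_name))

    # Commit phase: one cheap rename per chunk (stand-in for a server-side move).
    # Cleanup of leftover chunks from a longer old version is omitted here.
    for temp_name, final_name in pending:
        remote[final_name] = remote.pop(temp_name)

    # Whole-file hash, so a mix of old and new chunks can be detected on read.
    return hashlib.sha256(data).hexdigest()
```

Until the commit phase starts, a concurrent reader only ever sees the old chunk names, which is why the window for mixed data shrinks to the renames themselves.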

additional goals were to avoid any external registry (eg a local index that would let us give very short names like "tmp2" to temporary chunks) and to be efficient, ie reduce extra remote operations as much as possible without sacrificing the major goals (chunker has some sort of remote metadata but reads it lazily). so chunker tries to encode as much info as possible in the fastest metadata source - the file name.

so parallel operations get temporary IDs which are appended to file names. IDs can be numbers or letters (lowercase only, since many filesystems are case insensitive), and they must be unique between parallel operations running on multiple computers. the simplest / least efficient way is to try "aaa" first, list the remote directory to probe whether "file.chunk01.aaa" already exists, then try "aab" etc. i just use some randomness sources to reduce or (almost) avoid probes.
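The two ID strategies described above can be compared in a short sketch. Everything here is an assumption made for illustration: the function names, the 6-character length and the use of Python's `secrets` module are hypothetical and do not reflect how chunker actually derives its IDs.

```python
import itertools
import secrets
import string
from typing import Set

# Lowercase letters and digits only, to stay safe on case-insensitive filesystems.
ALPHABET = string.ascii_lowercase + string.digits

def probe_temp_id(existing: Set[str], chunk_name: str, length: int = 3) -> str:
    """Simple but slow: walk "aaa", "aab", ... and check each against a listing."""
    for letters in itertools.product(string.ascii_lowercase, repeat=length):
        candidate = "".join(letters)
        if f"{chunk_name}.{candidate}" not in existing:
            return candidate
    raise RuntimeError("no free temporary ID left")

def random_temp_id(length: int = 6) -> str:
    """Pick a random ID so that a collision, and thus a probe, is unlikely."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The probing variant needs a directory listing for every attempt, while the random variant trades a longer suffix for (almost) no remote round trips, which is the trade-off behind the filename length discussed in this thread.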

bottom line: the temporary suffix can be substantially shortened but cannot be completely avoided given the current goals.

thestigma · November 9, 2019, 10:23pm

Thanks for the good breakdown
And hi by the way

I think part of this issue comes down to the fact that the chunker seems to have been made primarily to overcome file-size limitations, but the OP wants it for obfuscation purposes. Even though the answer to both of those goals might be chunking, the strategy employed may end up being a bit different.

ivandeex · November 10, 2019, 8:38am

This could be solved differently by offloading chunker's per-file commit ids (and hashsums) or crypt's key material from filenames/etc to a metadata registry, if rclone provided some sort of registry.

That has yet to be thought through and carefully designed, eg it needs a type system, support for local fs as well as remote registry storage, or a per-remote registry (possibly encrypted), since rclone can run from multiple nodes.

I saw some attempts to extend rclone's configuration for something like that (eg here https://github.com/rclone/rclone/pull/3582 and here https://github.com/rclone/rclone/issues/3706). I doubt @ncw will be merging it soon as the patches are immature, fail tests, break rclone's "12-factor" property (configuration through environment) etc.
