Every 5 years I copy ~15TB of data from one set of HDDs to a fresh set of diversified but newer HDDs. This is a source of extreme stress and anxiety, but I have developed a set of scripts to reduce that.
Now in 2021, it's time again to do this for my 20 TB of data from HDDs I purchased in 2015. I would like to use rclone for this.
What I will do is copy the files from the old HDDs (OLD) to these new HDDs (NEW). My question is:
While rclone copies from OLD to NEW, is there any way I can also have it generate MD5 or SHA hashes and store those in a file (since it has already read the data from OLD)?
The objective is to have a hashlist generated on the fly by rclone so that I don't have to do a second pass after the copy step just to generate a hashlist.
If this feature is not yet available, what would be involved in adding it for local FS (local HDD) copies only? Would I have to write a plugin in Go, or would I have to modify rclone itself (i.e., does rclone not support external plugins)?
asdffdsa
(jojothehumanmonkey)
October 19, 2021, 7:29pm
hi,
there might be another way, but this will work.
use a debug log; for each file copied, the output will include a line like
2021/10/19 15:27:40 DEBUG : file.txt: md5 = c4ca4238a0b923820dcc509a6f75849b OK
then iterate through the log file and regex out the needed info.
in python, the regex `.*DEBUG : (.*): md5 = (.*) OK$`
would match
file.txt
c4ca4238a0b923820dcc509a6f75849b
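for example, a small python script along these lines would turn the log into an md5sum-style list (the log file name here is just an example; produce the log with something like `rclone copy OLD NEW -vv --log-file=rclone.log`):
```
import re

# matches rclone debug lines like:
# 2021/10/19 15:27:40 DEBUG : file.txt: md5 = c4ca4238a0b923820dcc509a6f75849b OK
PATTERN = re.compile(r".*DEBUG : (.*): md5 = (.*) OK$")

with open("rclone.log") as log:
    for line in log:
        m = PATTERN.match(line.strip())
        if m:
            path, digest = m.groups()
            # md5sum-compatible output: "<hash>  <path>"
            print(f"{digest}  {path}")
```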
@asdffdsa, really appreciate this! This is one way of getting the MD5 hash.
2 questions:
Is there a way to obtain SHA instead (or in addition to the MD5 hash)?
That MD5 hash is of the file read from the source rather than the file written out, right? This would make more sense than any other way, but just confirming.
Again, this response is very useful.
ncw
(Nick Craig-Wood)
October 20, 2021, 11:44am
I think the not-quite-merged-yet hasher backend would help with this. Its job is to cache hashes so they don't get re-computed unnecessarily.
rclone:master ← ivandeex:pr-hasher, opened 09 Sep 2021
## What is the purpose of this change?
This PR adds a new `hasher` backend after a preliminary design proposal in https://github.com/rclone/rclone/issues/949#issuecomment-849122505. The actual design is described below and solves the following **Goals**:
1. Fix slow hashing of large files on LocalFS, FTP, SFTP
2. Emulate hash types unimplemented by backends
3. Warm up checksum cache from external SUM files for large pre-hashed data arrays
4. Implemented as a transitive backend because VFS does **NOT** need checksums but CLI operations **DO**
### Getting Started
So you want to cache or otherwise handle checksums for existing `myRemote:path` or `/local/path`?
Install [patched rclone](https://github.com/ivandeex/rclone/releases).
Open `~/.config/rclone/rclone.conf` in a text editor and add new section(s):
```
[Hasher]
type = hasher
remote = myRemote:path
hashes = md5
max_age = 30d

[Hasher2]
type = hasher
remote = /local/path
hashes = dropbox,sha1
max_age = 24h
```
The backend takes basically the following parameters:
- `remote` (like `alias`, required)
- `hashes` - comma-separated list of hash names (by default `md5,sha1`)
- `max_age` - maximum time to keep a checksum value, e.g. `1h30m` or `30d`;
  `0` will disable caching completely,
  `off` will cache "forever" (i.e. until the files get changed)
**UPDATE: `hash_names` renamed to `hashes`; values can be separated by comma only (blanks are not allowed anymore)**
Use it as `Hasher2:subdir/file` instead of the base remote. Hasher will transparently update the cache with new checksums when a file is fully read or overwritten, like:
```
rclone copy External:path/file Hasher:dest/path
rclone cat Hasher:path/to/file > /dev/null
```
The way to refresh **all** cached checksums (even ones unsupported by the base backend) for a subtree is to **re-download** all files in the subtree. For example, use `hashsum --download` with **any** supported hash on the command line (we just care to re-read):
```
rclone hashsum MD5 --download Hasher:path/to/subtree > /dev/null
rclone backend dump Hasher:path/to/subtree
```
### How It Works
`rclone hashsum` (or `md5sum` or `sha1sum`):
1. if the requested hash is supported by the lower level, just pass it through.
2. if the object size is below `auto_size`, download the object and calculate the _requested_ hashes on the fly.
3. if the hash is unsupported and the size is big enough, build the object `fingerprint` (including size, modtime if supported, and the first-found _other_ hash if any).
4. if a strict match is found in the cache for the requested remote, return the stored hash.
5. if the remote is found but the fingerprint mismatches, purge the entry and proceed to step 6.
6. if the remote is not found, has no hash of the requested type, or arrived here via step 5: download the object, calculate all _supported_ hashes on the fly and store them in the cache; return the requested hash.
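As a rough illustration only (not rclone's actual Go code), the flow above could be sketched in Python for plain local files; the in-memory `cache` dict, the `AUTO_SIZE` value and the helper names are invented here, and step 1 is omitted because local files have no backend-native hash to pass through:
```
import hashlib
import os

SUPPORTED_HASHES = ("md5", "sha1")
AUTO_SIZE = 1 << 20        # invented threshold, standing in for the `auto_size` option
cache = {}                 # path -> (fingerprint, {hash_name: digest})

def fingerprint(path):
    # size + modtime, like the PR's object fingerprint (step 3)
    st = os.stat(path)
    return (st.st_size, st.st_mtime)

def compute_all(path):
    # read the file once, feeding every supported hasher on the fly
    hashers = {name: hashlib.new(name) for name in SUPPORTED_HASHES}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}

def hashsum(path, hash_type):
    # step 2: small objects are downloaded and hashed directly
    if os.path.getsize(path) < AUTO_SIZE:
        return compute_all(path)[hash_type]
    fp = fingerprint(path)                  # step 3
    entry = cache.get(path)
    if entry is not None:
        cached_fp, hashes = entry
        if cached_fp == fp and hash_type in hashes:
            return hashes[hash_type]        # step 4: strict match
        del cache[path]                     # step 5: stale entry, purge
    hashes = compute_all(path)              # step 6: read once, hash everything
    cache[path] = (fp, hashes)
    return hashes[hash_type]
```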
Other operations:
- whenever a file is uploaded or downloaded **in full**, capture the stream to calculate all supported hashes on the fly and update database
- server-side `move` will update keys of existing cache entries
- `delete` will remove cache entry
- `purge` will remove all cache entries underneath
- periodically prune entries (if `max_age` is not `off`); this helps against bit-rot or database thrashing when a file got removed externally
### Pre-Seed from a SUM File
Hasher supports two backend commands: generic SUM file `import` and faster but less consistent `stickyimport`.
```
rclone backend import Hasher:dir/subdir SHA1 /path/to/SHA1SUM [--checkers 4]
```
Instead of SHA1 it can be any hash supported by the remote. The last argument can point to either a local or an `other-remote:path` text file in SUM format. The command will parse the SUM file, then walk down the path given by the first argument, snapshot current fingerprints and fill in the cache entries correspondingly.
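For reference, a SUM file is the familiar coreutils checksum layout, i.e. what `md5sum`/`sha1sum` print (the digests and paths below are made-up examples):
```
7c4a8d09ca3762af61e59520943dc26494f8941b  photos/2015/img_0001.jpg
da39a3ee5e6b4b0d3255bfef95601890afd80709  docs/notes.txt
```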
- Paths in the SUM file are treated as relative to `hasher:dir/subdir`.
- The command will **not** check that supplied values are correct. You **must know** what you are doing.
- This is a one-time action. The SUM file will not get "attached" to the remote. Cache entries can still be overwritten later, should the object's fingerprint change.
- The tree walk can take a long time depending on the tree size. You can increase `--checkers` to make it faster, or use `stickyimport` if you don't care about fingerprints and consistency.
```
rclone backend stickyimport hasher:path/to/data sha1 remote:/path/to/sum.sha1
```
`stickyimport` is similar to `import` but works much faster because it does not need to stat existing files and skips initial tree walk. Instead of binding cache entries to file fingerprints it creates _sticky_ entries bound to the file name alone ignoring size, modification time etc. Such hash entries can be replaced only by `purge`, `delete`, `backend drop` or by full re-read/re-write of the files.
### Cache Storage
**Note: Details updated after recent KV redesign**
Cached checksums are stored as `bolt` database files under `~/.cache/rclone/kv/`, one per _base_ backend, named like `BaseRemote~hasher.bolt`. Checksums for multiple `alias`-es into a single base backend will be stored in a single database. All local paths are treated as aliases into the `local` backend (unless crypted or chunked) and stored in `~/.cache/rclone/kv/local~hasher.bolt`.
By default the database is opened in read-only mode and can be shared by multiple rclone instances. When a change is needed, rclone briefly reopens the database in exclusive read-write mode, then reopens it in shared mode so other instances can access it again.
The `bolt` engine shows good performance for databases with up to a million entries on ordinary hardware. Reopening is cheap and takes less than a millisecond.
You can print or drop database using custom backend commands:
```
rclone backend dump Hasher:dir/subdir
rclone backend drop Hasher:
```
### Boilerplate
#### Was the change discussed in an issue or in the forum before?
Fixes #949
Fixes #157
Fixes #626
#### Checklist
- [x] I have read the [contribution guidelines](https://github.com/rclone/rclone/blob/master/CONTRIBUTING.md#submitting-a-pull-request).
- [x] I have added tests for all changes in this PR if appropriate.
- [x] I have added documentation for the changes if appropriate.
- [x] All commit messages are in [house style](https://github.com/rclone/rclone/blob/master/CONTRIBUTING.md#commit-messages).
- [ ] I'm done, this Pull Request is ready for review :-)
@ivandeex will be along in a moment to explain more I'm sure
ivandeex
(Ivan Andreev)
October 20, 2021, 11:48am
After the refactoring, its main job is to bring free bolt to her majesty VFS. Hash caching is just a coincidental byproduct.
I really appreciate this response. By the way, this is not originally my idea; it was proposed years ago:
@thestigma mentioned this at When does rclone compute a file hash? - #2 by thestigma
Any time you already have to read the entire file anyway though - calculating the hash can be done almost for free (such as for the upload of a file). The only extra cost is a fairly trivial amount of CPU, so I think rclone generally tends to do this to verify a successful transfer.
… but I didn't see any follow-ups on this matter.
I have also asked about this for ref: https://www.reddit.com/r/rclone/comments/qbgd9x/hash_on_the_fly_with_rclone/
system
(system)
Closed
November 19, 2021, 7:58pm
This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.