Every 5 years I copy ~15TB of data from one set of HDDs to a fresh set of diversified but newer HDDs. This is a source of extreme stress and anxiety, but I have developed a set of scripts to reduce that.
Now in 2021, it's time again to do this for my 20 TB of data from HDDs I purchased in 2015. I would like to use rclone for this.
What I will do is copy the files from the old HDDs (OLD) to these new HDDs (NEW). My question is:
While rclone copies from OLD to NEW, is there any way I can also have it generate MD5 or SHA hashes and store those in a file (since it has already read the data from OLD)?
The objective is to have a hashlist generated on the fly by rclone so that I don't have to do a second pass after the copy step just to generate a hashlist.
If this feature is not yet available, what would be involved in adding it for local FS (local HDD) copies only? Would I have to write a plugin in Go, or would I have to modify rclone itself (i.e., does rclone not support external plugins)?
asdffdsa
(jojothehumanmonkey)
October 19, 2021, 7:29pm
hi,
there might be another way, but this will work.
use a debug log; for each file copied, the output will include a line like
2021/10/19 15:27:40 DEBUG : file.txt: md5 = c4ca4238a0b923820dcc509a6f75849b OK
then iterate through the log file and regex out the needed info.
in python, the regex `.*DEBUG : (.*): md5 = (.*) OK$`
would match
file.txt
c4ca4238a0b923820dcc509a6f75849b
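for example, a small python script along these lines would turn the log into an md5sum-style list (the log file name here is just an example; produce the log with something like `rclone copy OLD NEW -vv --log-file=rclone.log`):
```
import re

# matches rclone debug lines like:
# 2021/10/19 15:27:40 DEBUG : file.txt: md5 = c4ca4238a0b923820dcc509a6f75849b OK
PATTERN = re.compile(r".*DEBUG : (.*): md5 = (.*) OK$")

with open("rclone.log") as log:
    for line in log:
        m = PATTERN.match(line.strip())
        if m:
            path, digest = m.groups()
            # md5sum-compatible output: "<hash>  <path>"
            print(f"{digest}  {path}")
```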
@asdffdsa, really appreciate this! This is one way of getting the MD5 hash.
2 questions:
Is there a way to obtain SHA instead (or in addition to the MD5 hash)?
That MD5 hash is of the file read from the source rather than the file written out, right? This would make more sense than any other way, but just confirming.
Again, this response is very useful.
ncw
(Nick Craig-Wood)
October 20, 2021, 11:44am
I think the not-quite-merged-yet hasher backend would help with this. Its job is to cache hashes so they don't get re-computed unnecessarily.
rclone:master ← ivandeex:pr-hasher, opened 09 Sep 2021
## What is the purpose of this change?
This PR adds a new `hasher` backend after a preliminary design proposal in https://github.com/rclone/rclone/issues/949#issuecomment-849122505. The actual design is described below and solves the following **Goals**:
1. Fix slow hashing of large files on LocalFS, FTP, SFTP
2. Emulate hash types unimplemented by backends
3. Warm up checksum cache from external SUM files for large pre-hashed data arrays
4. Implemented as a transitive backend because VFS does **NOT** need checksums but CLI operations **DO**
### Getting Started
So you want to cache or otherwise handle checksums for existing `myRemote:path` or `/local/path`?
Install [patched rclone](https://github.com/ivandeex/rclone/releases).
Open `~/.config/rclone/rclone.conf` in a text editor and add new section(s):
```
[Hasher]
type = hasher
remote = myRemote:path
hashes = md5
max_age = 30d

[Hasher2]
type = hasher
remote = /local/path
hashes = dropbox,sha1
max_age = 24h
```
The backend takes basically the following parameters:
- `remote` (like `alias`, required)
- `hashes` - comma-separated list of hash names (by default `md5,sha1`)
- `max_age` - maximum time to keep a checksum value, e.g. `1h30m` or `30d`;
  `0` will disable caching completely,
  `off` will cache "forever" (i.e. until the files get changed)
**UPDATE: `hash_names` renamed to `hashes`; values can be separated by comma only (blanks are not allowed anymore)**
Use it as `Hasher2:subdir/file` instead of the base remote. Hasher will transparently update the cache with new checksums when a file is fully read or overwritten, like:
```
rclone copy External:path/file Hasher:dest/path
rclone cat Hasher:path/to/file > /dev/null
```
The way to refresh **all** cached checksums (even ones unsupported by the base backend) for a subtree is to **re-download** all files in the subtree. For example, use `hashsum --download` with **any** supported hash on the command line (we just care to re-read):
```
rclone hashsum MD5 --download Hasher:path/to/subtree > /dev/null
rclone backend dump Hasher:path/to/subtree
```
### How It Works
`rclone hashsum` (or `md5sum` or `sha1sum`):
1. if the requested hash is supported by the lower level, just pass it through.
2. if the object size is below `auto_size`, download the object and calculate the _requested_ hashes on the fly.
3. if the hash is unsupported and the size is big enough, build the object `fingerprint` (including size, modtime if supported, and the first-found _other_ hash if any).
4. if a strict match is found in the cache for the requested remote, return the stored hash.
5. if the remote is found but the fingerprint mismatches, purge the entry and proceed to step 6.
6. if the remote is not found, has no hash of the requested type, or arrived here via step 5: download the object, calculate all _supported_ hashes on the fly and store them in the cache; return the requested hash.
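As a rough illustration only (not rclone's actual Go code), the flow above could be sketched in Python for plain local files; the in-memory `cache` dict, the `AUTO_SIZE` value and the helper names are invented here, and step 1 is omitted because local files have no backend-native hash to pass through:
```
import hashlib
import os

SUPPORTED_HASHES = ("md5", "sha1")
AUTO_SIZE = 1 << 20        # invented threshold, standing in for the `auto_size` option
cache = {}                 # path -> (fingerprint, {hash_name: digest})

def fingerprint(path):
    # size + modtime, like the PR's object fingerprint (step 3)
    st = os.stat(path)
    return (st.st_size, st.st_mtime)

def compute_all(path):
    # read the file once, feeding every supported hasher on the fly
    hashers = {name: hashlib.new(name) for name in SUPPORTED_HASHES}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}

def hashsum(path, hash_type):
    # step 2: small objects are downloaded and hashed directly
    if os.path.getsize(path) < AUTO_SIZE:
        return compute_all(path)[hash_type]
    fp = fingerprint(path)                  # step 3
    entry = cache.get(path)
    if entry is not None:
        cached_fp, hashes = entry
        if cached_fp == fp and hash_type in hashes:
            return hashes[hash_type]        # step 4: strict match
        del cache[path]                     # step 5: stale entry, purge
    hashes = compute_all(path)              # step 6: read once, hash everything
    cache[path] = (fp, hashes)
    return hashes[hash_type]
```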
Other operations:
- whenever a file is uploaded or downloaded **in full**, capture the stream to calculate all supported hashes on the fly and update database
- server-side `move` will update keys of existing cache entries
- `delete` will remove cache entry
- `purge` will remove all cache entries underneath
- periodically prune entries (if `max_age` is not `off`); this helps against bit-rot or database thrashing when a file got removed externally
### Pre-Seed from a SUM File
Hasher supports two backend commands: generic SUM file `import` and faster but less consistent `stickyimport`.
```
rclone backend import Hasher:dir/subdir SHA1 /path/to/SHA1SUM [--checkers 4]
```
Instead of SHA1 it can be any hash supported by the remote. The last argument can point to either a local or an `other-remote:path` text file in SUM format. The command will parse the SUM file, then walk down the path given by the first argument, snapshot current fingerprints and fill in the cache entries correspondingly.
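For reference, a SUM file is the familiar coreutils checksum layout, i.e. what `md5sum`/`sha1sum` print (the digests and paths below are made-up examples):
```
7c4a8d09ca3762af61e59520943dc26494f8941b  photos/2015/img_0001.jpg
da39a3ee5e6b4b0d3255bfef95601890afd80709  docs/notes.txt
```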
- Paths in the SUM file are treated as relative to `hasher:dir/subdir`.
- The command will **not** check that supplied values are correct. You **must know** what you are doing.
- This is a one-time action. The SUM file will not get "attached" to the remote. Cache entries can still be overwritten later, should the object's fingerprint change.
- The tree walk can take a long time depending on the tree size. You can increase `--checkers` to make it faster, or use `stickyimport` if you don't care about fingerprints and consistency.
```
rclone backend stickyimport hasher:path/to/data sha1 remote:/path/to/sum.sha1
```
`stickyimport` is similar to `import` but works much faster because it does not need to stat existing files and skips initial tree walk. Instead of binding cache entries to file fingerprints it creates _sticky_ entries bound to the file name alone ignoring size, modification time etc. Such hash entries can be replaced only by `purge`, `delete`, `backend drop` or by full re-read/re-write of the files.
### Cache Storage
**Note: Details updated after recent KV redesign**
Cached checksums are stored as `bolt` database files under `~/.cache/rclone/kv/`, one per _base_ backend, named like `BaseRemote~hasher.bolt`. Checksums for multiple `alias`-es into a single base backend will be stored in a single database. All local paths are treated as aliases into the `local` backend (unless crypted or chunked) and stored in `~/.cache/rclone/kv/local~hasher.bolt`.
By default the database is opened in read-only mode and can be shared by multiple rclone instances. When a change is needed, rclone briefly reopens the database in exclusive read-write mode, then reopens it in shared mode so other instances can access it again.
The `bolt` engine shows good performance for databases with up to a million entries on ordinary hardware. Reopening is cheap and takes less than a millisecond.
You can print or drop database using custom backend commands:
```
rclone backend dump Hasher:dir/subdir
rclone backend drop Hasher:
```
### Boilerplate
#### Was the change discussed in an issue or in the forum before?
Fixes #949
Fixes #157
Fixes #626
#### Checklist
- [x] I have read the [contribution guidelines](https://github.com/rclone/rclone/blob/master/CONTRIBUTING.md#submitting-a-pull-request).
- [x] I have added tests for all changes in this PR if appropriate.
- [x] I have added documentation for the changes if appropriate.
- [x] All commit messages are in [house style](https://github.com/rclone/rclone/blob/master/CONTRIBUTING.md#commit-messages).
- [ ] I'm done, this Pull Request is ready for review :-)
@ivandeex will be along in a moment to explain more I'm sure
ivandeex
(Ivan Andreev)
October 20, 2021, 11:48am
After the refactoring, its main job is to bring free bolt to her majesty VFS. Hash caching is just a coincidental byproduct.
I really appreciate this response. By the way, this is not originally my idea; it was proposed years ago:
@thestigma mentioned this at When does rclone compute a file hash? - #2 by thestigma
Any time you already have to read the entire file anyway though - calculating the hash can be done almost for free (such as for the upload of a file). The only extra cost is a fairly trivial amount of CPU, so I think rclone generally tends to do this to verify a successful transfer.
… but I didn't see any follow-ups on this matter.
I have also asked about this for ref: https://www.reddit.com/r/rclone/comments/qbgd9x/hash_on_the_fly_with_rclone/
system
(system)
Closed
November 19, 2021, 7:58pm
This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.