Option to append to hashsum output file on subsequent runs

One thing I'm kind of surprised to find that Rclone can't do yet is append hashes to an output file created with the hashsum command.

I'd like to keep a list of hashes that I can periodically use to check for data corruption, but overwriting the file each time the hashsum command is run would defeat the purpose. If data corruption had occurred, you'd just be replacing the good hash with the new bad one.

Has anyone in the same situation figured out a method of doing this? I thought about using Awk combined with the --exclude-from option in Rclone, but that leaves a lot of room for error.
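
Roughly what I had in mind, just as a sketch (hashes.md5 and /mnt/data are placeholder names, and the glob matching is exactly where the room for error is):

# rclone hashsum output is "<hash>  <path>", separated by two spaces,
# so strip the checksums to get one already-hashed path per line
awk -F'  ' '{print $2}' hashes.md5 > already-hashed.txt
# hash only files not already listed and append the results
# caveat: --exclude-from treats each line as a glob pattern, so paths
# containing *, ?, [ or { will not be matched literally
rclone hashsum md5 /mnt/data --exclude-from already-hashed.txt >> hashes.md5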

That's really easy to script if you want that (rough sketch below).

  1. Move old file
  2. Run command
  3. Cat old new > combined
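
A literal sketch of those three steps (checkfile and /mnt/data are just placeholder names):

mv checkfile checkfile.old                                   # 1. move the old file aside
rclone hashsum md5 /mnt/data --output-file checkfile.new     # 2. run the command
cat checkfile.old checkfile.new > checkfile                  # 3. cat old new > combined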

I thought about that, but you'd have to re-hash every file on each run instead of only files that have been added since the last one.

hello and welcome to the forum,

not sure this is what you want, but wanted to share
https://rclone.org/hasher

I can't quite follow your flow.

If you want to check for corruption, I'm assuming you mean a local disk and not a cloud remote.
If you have corruption, you want to compare an old log file to a new log file and check for differences, I'd imagine.

So you'd get a baseline in a log file; that becomes your check file, and you check against that. Anything new, you'd append/add to the checksum file as your 'gold' status.

Unless I don't get your use case / flow:

felix@gemini:~/test$ rclone hashsum md5 /home/felix/test --output-file ~/checkfile
felix@gemini:~/test$ rclone hashsum md5 -C /home/felix/checkfile /home/felix/test
= four
= jellyfish-30-mbps-hd-h264.mkv
= three
= two
2022/01/03 20:13:27 NOTICE: Local file system at /home/felix/test: 0 differences found
2022/01/03 20:13:27 NOTICE: Local file system at /home/felix/test: 4 matching files
felix@gemini:~/test$ echo blah >>four
felix@gemini:~/test$ rclone hashsum md5 -C /home/felix/checkfile /home/felix/test
2022/01/03 20:13:36 ERROR : four: files differ
* four
= jellyfish-30-mbps-hd-h264.mkv
= three
= two
2022/01/03 20:13:37 NOTICE: Local file system at /home/felix/test: 1 differences found
2022/01/03 20:13:37 NOTICE: Local file system at /home/felix/test: 1 errors while checking
2022/01/03 20:13:37 NOTICE: Local file system at /home/felix/test: 3 matching files
2022/01/03 20:13:37 Failed to hashsum: 1 differences found
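
And for the append/add part, one rough way to pick up only recently added files (a sketch; it assumes modification times can be trusted, and the 7d window is just an example):

# hash only files modified within the last 7 days and append them to the check file
# (anything older than the window but added after the last run would be missed)
rclone hashsum md5 --max-age 7d /home/felix/test >> /home/felix/checkfile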

Maybe that would be an easier way of doing it.

My original goal was to keep a file that contains hashes that I could frequently append new items to, and only verify those hashes every few months. I have a lot of data, and a Raspberry Pi is doing the hashing, so re-hashing all of the data and then comparing the resulting files is a slow process.

I'm finding it difficult to explain, but my end goal is to devise a way to make Rclone hash only the data that has been added since the last run. I could then add those new hashes to the base file and, every few months or so, use that file to verify that the data hasn't changed.

It sounds like your data is important; maybe putting it in some cheap cloud storage and having it sit there as a backup would be a better solution.

I think @asdffdsa likes Wasabi, and it seems to be not that expensive if your goal is to guard against bitrot or corruption along those lines.

I dunno. I've had drives for years and just replace them and never noticed anything, but my data on local storage is throwaway / backed up elsewhere so data loss for me is not an issue.

If you can think of how you'd want something like this to work, it's an edge case, but you can always submit a feature request on GitHub. There's a huge backlog though, so being realistic, I'd look for a scripted solution along the lines above, or flesh out your use case a bit more and I'm sure some folks can pitch in ideas as well.

i agree.

i have many Pi, all types, that i use on a daily basis.
imho, would not trust it for anything other than a cheap media server.
i tend to recycle old desktop computers.

yes, in any location i support, i have found nothing better than this combo.
--- wasabi, s3 clone, known for hot storage, US$6.00/TB/month. for recent backups.
--- aws s3 deep glacier, for cold storage, US$1.00/TB/month. for older backups

that is the way to do it.

in my case, no need to do that.
in addition to the wasabi/aws combo, always have a very cheap server, dedicated for backups.
a used desktop computer, new RAM, and some hard drives.
i use the free windows server 2019 hyper-v edition; the server uses the ReFS filesystem, windows' version of ZFS.
so no worries about bit-rot and/or manually computing hashes.

let's say you implement your approach and find a corrupted file, then what?

I think @asdffdsa is right and Hasher can be helpful for this workflow.
Say you keep a large archive under /mnt/archive on your box.
Add a section to your ~/.config/rclone/rclone.conf on the same box:

[archive]
type = hasher
remote = /mnt/archive
hashes = sha1
max_age = 365d

Now running

rclone sha1sum archive: --output-file archive.sha1

will produce a full sum file in standard format every time you run it... BUT it will actually rehash only new/changed files, taking the rest from its internal cache.
Files with the same name/modtime as on the last run will be rehashed just once a year.
You can even set max_age = off to prevent rehashing unchanged files completely (but beware of bitrot).


Thanks for that. I'd read about hasher, but didn't think of using it this way.

One more note. To validate the checksums in the archive, you will run

rclone checksum sha1 ./archive.sha1 /mnt/archive

Don't try rclone checksum sha1 ./archive.sha1 archive: because it will not access the files, just compare the internal hasher cache against the sum file.
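
If you want to automate the two steps, a possible crontab sketch (the schedule and paths are made up, adjust to taste):

# refresh the sum file nightly - hasher only rehashes new/changed files
0 2 * * * rclone sha1sum archive: --output-file /home/user/archive.sha1
# full verification against the real files once a quarter
0 3 1 */3 * rclone checksum sha1 /home/user/archive.sha1 /mnt/archive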

