Comparing checksums between source & destination with different directory structure

RyanH · September 12, 2021, 5:16am

The source & destination have different hierarchical structures. The aim is to ignore the structure but confirm that all files at source match the destination. How can I achieve this?

ncw · September 13, 2021, 10:08am

I'd do this by using rclone md5sum (or whatever hasher your destination supports).

You can then check that the checksums are the same.

Something like this:

rclone md5sum source: > source-sums
rclone md5sum dest: > dest-sums

Then extract just the checksums from each sums file and sort them

awk '{print $1}' < source-sums | sort > source-sums-sorted
awk '{print $1}' < dest-sums | sort > dest-sums-sorted

Then find sums that are missing

diff source-sums-sorted dest-sums-sorted

That will give you checksums which you'll need to look up in the original source-sums and dest-sums files.

dif <(aw
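
The lookup step described above (going from a differing checksum back to the file paths) can also be sketched in Python. This is only an illustration, not part of rclone: `parse_sums` and `only_in` are hypothetical helper names, and the code assumes each line of a sums file is a hash, then whitespace, then the path.

```python
from collections import defaultdict


def parse_sums(text):
    """Map each hash to the list of paths that carry it."""
    by_hash = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: hash, whitespace, then the path (which may contain spaces).
        digest, path = line.split(None, 1)
        by_hash[digest].append(path)
    return by_hash


def only_in(left, right):
    """Return the paths whose hash appears in `left` but nowhere in `right`."""
    return sorted(
        path
        for digest, paths in left.items()
        if digest not in right
        for path in paths
    )
```

To use it, read the contents of source-sums and dest-sums into strings, pass each through `parse_sums`, and call `only_in` in both directions. Because the comparison is keyed on the hash alone, the directory structure on either side does not matter.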

RyanH · September 13, 2021, 6:57pm

Thanks @ncw. This is super helpful. I'm on a Windows device and awk isn't an option. Is there an alternative?

asdffdsa · September 13, 2021, 7:27pm

hi,

if you have WSL installed, you can run that script as is.

albertony · September 13, 2021, 8:40pm

Is using PowerShell an alternative? I'm playing around with ncw's strategy, something like this seems similar:

$Source = .\rclone.exe md5sum source: | ConvertFrom-String -PropertyNames Hash,Path -Delimiter '(?<=^[\w]+)\s+' | Sort-Object -Property Hash
$Destination = .\rclone.exe md5sum destination: | ConvertFrom-String -PropertyNames Hash,Path -Delimiter '(?<=^[\w]+)\s+' | Sort-Object -Property Hash
Compare-Object $Source $Destination -Property Hash -PassThru

(I have only tested on tiny sample sets, so no idea if it blows up if you try it on huge directories)

ncw · September 15, 2021, 2:57pm

Here is a Python script which does the same and should run on Windows:

#!/usr/bin/env python3
"""
Discover differences by checksum between two rclone paths

Run it like

python3 hashdiff.py MD5 remote1:path remote2:path
"""

import sys
import subprocess

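def sums(remote, hash_type):
    """Return a dict mapping hash -> path for every file under remote."""
    cmd = ["rclone", "hashsum", hash_type, remote]
    print(f"Running: {' '.join(cmd)}")
    # Decode so that paths print as text rather than as bytes literals
    out = subprocess.check_output(cmd).decode("utf-8")
    # Each line is "<hash>  <path>"; if a hash occurs twice, only the last path is kept
    return dict(line.split("  ", 1) for line in out.split("\n") if line)

def missing(remote, hashes, sums):
    """Print the paths on remote whose hash is absent from the other side."""
    print(f"Files on {remote} whose hash is not present on the other side")
    for hash in sorted(hashes):
        print(sums[hash])
    print()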

def main():
    if len(sys.argv) != 4:
        print(f"Syntax: {sys.argv[0]} HASH remote1:path remote2:path")
        raise SystemExit(1)
    hash_type, source, dest = sys.argv[1:]
    source_sums = sums(source, hash_type)
    dest_sums = sums(dest, hash_type)
    source_hashes = set(source_sums.keys())
    dest_hashes = set(dest_sums.keys())
    missing(source, source_hashes - dest_hashes, source_sums)
    missing(dest, dest_hashes - source_hashes, dest_sums)
    
if __name__ == "__main__":
    main()

RyanH · September 16, 2021, 6:18pm

Thanks @albertony. Is there a way for me to pass the configuration password? Also, I'm assuming that it's a recursive lookup? When you say large directories, do you mean deeply nested paths or large files or both?

RyanH · September 16, 2021, 6:20pm

Thanks @ncw. The device unfortunately does not have Python and may be a challenge to install (subject to policy). Assuming it can be deployed, is it recursive if I start at the root of the source & destination? Also, how can I pass the configuration password?

albertony · September 16, 2021, 6:22pm

Have a look at this, if you haven't already:

https://rclone.org/docs/#configuration-encryption
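
As one possible approach based on that page: rclone reads the config password from the `RCLONE_CONFIG_PASS` environment variable, so a wrapper script can set it only for the child process. The sketch below is a minimal illustration under that assumption; `rclone_env` and `run_hashsum` are hypothetical helper names, not rclone APIs.

```python
import os
import subprocess


def rclone_env(password):
    """Copy the current environment and add the config password to the copy."""
    env = dict(os.environ)
    env["RCLONE_CONFIG_PASS"] = password
    return env


def run_hashsum(remote, hash_type, password):
    """Run `rclone hashsum` with the password visible only to the child process."""
    cmd = ["rclone", "hashsum", hash_type, remote]
    return subprocess.check_output(cmd, env=rclone_env(password))
```

Prompting for the password at run time (for example with Python's `getpass` module) avoids storing it in the script itself. The same environment variable should work for the PowerShell snippet if it is set in the session before rclone is called.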

albertony · September 16, 2021, 9:21pm

Ran a quick test of the Python script, comparing the result with the PowerShell snippet: if one of the remotes contains copies of the same file (same hash) that the other remote doesn't have, then the Python script will not show it as a diff, but the PowerShell version will. It depends on what you want. Both can be adjusted to do the opposite.
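
To illustrate the difference, here is a rough sketch of the two ways of comparing. A set of hashes ignores extra copies, while a count per hash reports them. The function names are made up for this example, and the input is assumed to be lines of hash, whitespace, path, as in the sums files earlier in the thread.

```python
from collections import Counter


def hash_counts(lines):
    """Count how many files carry each hash."""
    return Counter(line.split(None, 1)[0] for line in lines if line.strip())


def set_only(left, right):
    """Hashes present in left but absent from right; extra copies are ignored."""
    return sorted(set(left) - set(right))


def count_only(left, right):
    """Hashes that left has more copies of than right, once per surplus copy."""
    # Counter subtraction keeps only the positive differences.
    return sorted((left - right).elements())
```

With two copies of a file on the source and one on the destination, `set_only` returns nothing while `count_only` reports the hash once, which is the kind of difference described above.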
