Comparing checksums between source & destination with different directory structure

The source & destination have different hierarchical structures. The aim is to ignore the structure but confirm that all files at source match the destination. How can I achieve this?

I'd do this by using rclone md5sum (or whatever hasher your destination supports).

You can then check that the checksums are the same.

Something like this

rclone md5sum source: > source-sums
rclone md5sum dest: > dest-sums

Then process the sums files and find differences

awk '{print $1}' < source-sums | sort > source-sums-sorted
awk '{print $1}' < dest-sums | sort > dest-sums-sorted

Then find sums that are missing

diff source-sums-sorted dest-sums-sorted

That will give you checksums which you'll need to look up in the original source-sums and dest-sums files.
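The lookup step can also be scripted. Here is a minimal sketch, assuming source-sums and dest-sums were written by the rclone md5sum commands above and that each line has the form "<hash>  <path>" (two spaces). The helper names load_sums and only_in are made up for illustration.

```python
from collections import defaultdict

def load_sums(lines):
    """Map each checksum to the list of paths that carry it.

    Assumes every line looks like "<hash>  <path>"; lines without
    that separator, or with an empty hash, are skipped.
    """
    by_hash = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        digest, sep, path = line.partition("  ")
        if not sep or not digest:
            continue
        by_hash[digest].append(path)
    return by_hash

def only_in(first, second):
    """Return {hash: paths} for checksums in `first` that are absent from `second`."""
    return {h: paths for h, paths in first.items() if h not in second}

# Example use with the files created above:
#   with open("source-sums", encoding="utf-8") as f:
#       source = load_sums(f)
#   with open("dest-sums", encoding="utf-8") as f:
#       dest = load_sums(f)
#   for digest, paths in only_in(source, dest).items():
#       print(digest, paths)
```

Because paths are kept per checksum, this reports the file names directly and ignores where each file sits in the directory tree.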

dif <(aw

Thanks @ncw. This is super helpful. I'm on a Windows device and awk isn't an option. Is there an alternative?

hi,

if you have WSL installed, you can run that script as is.

Is using PowerShell an alternative? I'm playing around with ncw's strategy, something like this seems similar:

$Source = .\rclone.exe md5sum source: | ConvertFrom-String -PropertyNames Hash,Path -Delimiter '(?<=^[\w]+)\s+' | Sort-Object -Property Hash
$Destination = .\rclone.exe md5sum destination: | ConvertFrom-String -PropertyNames Hash,Path -Delimiter '(?<=^[\w]+)\s+' | Sort-Object -Property Hash
Compare-Object $Source $Destination -Property Hash -PassThru

(I have only tested on tiny sample sets, so no idea if it blows up if you try it on huge directories)

Here is a Python script which does the same and should run on Windows

#!/usr/bin/env python3
"""
Discover differences by checksum between two rclone paths

Run it like

python3 hashdiff.py MD5 remote1:path remote2:path
"""

import sys
import subprocess

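def sums(remote, hash_type):
    """Return a dict mapping each checksum to one path that has it."""
    cmd = ["rclone", "hashsum", hash_type, remote]
    print(f"Running: {' '.join(cmd)}")
    # Decode so that paths print as text rather than as bytes literals
    out = subprocess.check_output(cmd).decode("utf-8")
    # Each line is "<hash>  <path>"; files sharing a hash collapse to one entry
    return dict(line.split("  ", 1) for line in out.splitlines() if line)

def missing(remote, hashes, sums):
    """Print the paths on remote whose checksums are absent from the other side."""
    print(f"Hashes changed or missing on {remote}")
    for hash in hashes:
        print(sums[hash])
    print()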

def main():
    if len(sys.argv) != 4:
        print(f"Syntax: {sys.argv[0]} HASH remote1:path remote2:path")
        raise SystemExit(1)
    hash_type, source, dest = sys.argv[1:]
    source_sums = sums(source, hash_type)
    dest_sums = sums(dest, hash_type)
    source_hashes = set(source_sums.keys())
    dest_hashes = set(dest_sums.keys())
    missing(source, source_hashes - dest_hashes, source_sums)
    missing(dest, dest_hashes - source_hashes, dest_sums)
    
if __name__ == "__main__":
    main()

Thanks @albertony. Is there a way for me to pass the configuration password? Also, I'm assuming that it's a recursive lookup? When you say large directories, do you mean deeply nested paths or large files or both?

Thanks @ncw. The device unfortunately does not have Python, and installing it may be a challenge (subject to policy). Assuming it can be deployed, is it recursive if I start at the root of the source & destination? Also, how can I pass the configuration password?

Check out these, if you haven't already?

https://rclone.org/docs/#configuration-encryption
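One option described on that page is the RCLONE_CONFIG_PASS environment variable. A minimal sketch of passing it from a Python wrapper, assuming it is acceptable to hold the password in the child process environment. The helper names rclone_env and hashsum are made up for illustration.

```python
import os
import subprocess

def rclone_env(config_pass, base_env=None):
    """Return a copy of the environment with RCLONE_CONFIG_PASS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["RCLONE_CONFIG_PASS"] = config_pass
    return env

def hashsum(remote, hash_type, config_pass):
    """Run `rclone hashsum`, supplying the config password via the environment."""
    cmd = ["rclone", "hashsum", hash_type, remote]
    out = subprocess.check_output(cmd, env=rclone_env(config_pass))
    return out.decode("utf-8")
```

Only the rclone child process sees the variable, so the password does not need to appear on the command line.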

Ran a quick test of the Python script, comparing the result with the PowerShell snippet: if one of the remotes contains extra copies of the same file (same hash) that the other remote doesn't have, the Python script will not show them as a diff, but the PowerShell version will. It depends on what you want. Both can be adjusted to do the opposite.
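The difference comes down to whether the hashes are compared as a set or as a multiset. Here is a sketch of both, assuming the rclone output has already been parsed into lists of (hash, path) pairs. The function names are made up for illustration.

```python
from collections import Counter

def diff_as_set(source, dest):
    """Hashes present on one side and entirely absent from the other.

    Duplicate copies are ignored, as in the Python script above.
    """
    s = {h for h, _ in source}
    d = {h for h, _ in dest}
    return s - d, d - s

def diff_as_multiset(source, dest):
    """Per-hash surplus copies on each side, with duplicates counted."""
    s = Counter(h for h, _ in source)
    d = Counter(h for h, _ in dest)
    # Counter subtraction keeps only positive counts
    return s - d, d - s
```

With two copies of a file on the source and one on the destination, diff_as_set reports nothing while diff_as_multiset reports one surplus copy on the source, which is closer to the behaviour described for the PowerShell version.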