The source & destination have different hierarchical structures. The aim is to ignore the structure but confirm that all files at source match the destination. How can I achieve this?
I'd do this by using rclone md5sum
(or whatever hasher your destination supports).
You can then check that the checksums are the same.
Something like this:
rclone md5sum source: > source-sums
rclone md5sum dest: > dest-sums
Then process the sums files and find differences
awk '{print $1}' < source-sums | sort > source-sums-sorted
awk '{print $1}' < dest-sums | sort > dest-sums-sorted
Then find sums that are missing
diff source-sums-sorted dest-sums-sorted
That will give you checksums which you'll need to look up in the original source-sums and dest-sums files.
dif <(aw
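For anyone without awk, the same comparison and the lookup step can be sketched in Python. This is only a rough sketch: it assumes the sums files use the md5sum layout (hash, two spaces, path), and the names `parse_sums` and `only_in` as well as the sample hashes and paths are made up for illustration.

```python
from collections import defaultdict


def parse_sums(text):
    """Map each checksum to the list of paths that carry it.

    Assumes md5sum-style lines: the hash, two spaces, then the path.
    """
    paths_by_hash = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        checksum, _, path = line.partition("  ")
        paths_by_hash[checksum].append(path)
    return dict(paths_by_hash)


def only_in(first, second):
    """Return sorted (checksum, path) pairs whose checksum is in first but not in second."""
    return sorted(
        (checksum, path)
        for checksum, paths in first.items()
        if checksum not in second
        for path in paths
    )


# Made-up sample data standing in for the contents of source-sums and dest-sums.
source_text = (
    "0cc175b9c0f1b6a831c399e269772661  docs/a.txt\n"
    "92eb5ffee6ae2fec3ad71c777531578f  docs/b.txt\n"
)
dest_text = "0cc175b9c0f1b6a831c399e269772661  archive/2020/a.txt\n"

source = parse_sums(source_text)
dest = parse_sums(dest_text)
missing_on_dest = only_in(source, dest)    # on source, no match on dest
missing_on_source = only_in(dest, source)  # on dest, no match on source
print(missing_on_dest)
```

To use it on real output, read the source-sums and dest-sums files into strings and pass them to `parse_sums` in place of the sample text. Because the paths are kept next to each hash, there is no separate lookup step afterwards.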
Thanks @ncw. This is super helpful. I'm on a Windows device and awk
isn't an option. Is there an alternative?
hi,
if you have WSL installed, you can run that script as is.
Is using PowerShell an alternative? I'm playing around with ncw's strategy, something like this seems similar:
$Source = .\rclone.exe md5sum source: | ConvertFrom-String -PropertyNames Hash,Path -Delimiter '(?<=^[\w]+)\s+' | Sort-Object -Property Hash
$Destination = .\rclone.exe md5sum destination: | ConvertFrom-String -PropertyNames Hash,Path -Delimiter '(?<=^[\w]+)\s+' | Sort-Object -Property Hash
Compare-Object $Source $Destination -Property Hash -PassThru
(I have only tested on tiny sample sets, so no idea if it blows up if you try it on huge directories)
Here is a Python script which does the same and should run on Windows:
#!/usr/bin/env python3
"""
Discover differences by checksum between two rclone paths

Run it like

    python3 hashdiff.py MD5 remote1:path remote2:path
"""

import subprocess
import sys


def sums(remote, hash_type):
    """Return a dict mapping each checksum under remote to one path that has it."""
    cmd = ["rclone", "hashsum", hash_type, remote]
    print(f"Running: {' '.join(cmd)}")
    out = subprocess.check_output(cmd, encoding="utf-8", errors="replace")
    # Each output line is the hash, two spaces, then the path (md5sum format).
    # If several files share a hash, only the last path listed is kept.
    return dict(line.split("  ", 1) for line in out.splitlines() if line)


def missing(remote, hashes, sums):
    """Print the path on remote for each checksum in hashes."""
    print(f"Hashes changed or missing on {remote}")
    for checksum in hashes:
        print(sums[checksum])
    print()


def main():
    if len(sys.argv) != 4:
        print(f"Syntax: {sys.argv[0]} HASH remote1:path remote2:path")
        raise SystemExit(1)
    hash_type, source, dest = sys.argv[1:]
    source_sums = sums(source, hash_type)
    dest_sums = sums(dest, hash_type)
    source_hashes = set(source_sums.keys())
    dest_hashes = set(dest_sums.keys())
    # Checksums present on one side only, reported with a path from that side.
    missing(source, source_hashes - dest_hashes, source_sums)
    missing(dest, dest_hashes - source_hashes, dest_sums)


if __name__ == "__main__":
    main()
Thanks @albertony. Is there a way for me to pass the configuration password? Also, I'm assuming that it's a recursive lookup? When you say large directories, do you mean deeply nested paths or large files or both?
Thanks @ncw. The device unfortunately does not have Python and may be a challenge to install (subject to policy). Assuming it can be deployed, is it recursive if I start at the root of the source & destination? Also, how can I pass the configuration password?
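On the configuration password: rclone reads it from the RCLONE_CONFIG_PASS environment variable if that is set, so one option is to set it only for the rclone child process. A minimal sketch, assuming Python is available; the helper names `rclone_env`, `hashsum` and `prompt_and_run` are made up for illustration and are not part of rclone.

```python
import getpass
import os
import subprocess


def rclone_env(config_password, base_env=None):
    """Return a copy of the environment with RCLONE_CONFIG_PASS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["RCLONE_CONFIG_PASS"] = config_password
    return env


def hashsum(remote, hash_type, env):
    """Run `rclone hashsum` with the given environment and return its output as text."""
    cmd = ["rclone", "hashsum", hash_type, remote]
    return subprocess.check_output(cmd, env=env, encoding="utf-8", errors="replace")


def prompt_and_run(remote, hash_type="MD5"):
    """Ask for the password once, then run hashsum (hypothetical helper)."""
    password = getpass.getpass("rclone config password: ")
    return hashsum(remote, hash_type, rclone_env(password))
```

Calling `prompt_and_run("source:")` would prompt once and return the hashsum output as a string. Passing the password through the child's environment keeps it off the command line and out of the script itself.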
I ran a quick test of the Python script and compared the result with the PowerShell snippet. If one of the remotes contains extra copies of a file (same hash) that the other remote doesn't have, the Python script will not show them as a difference, but the PowerShell version will. It depends on what you want; both can be adjusted to do the opposite.
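The duplicate-copies case comes down to comparing sets of hashes versus comparing counts of hashes. A small sketch of the count-based variant, assuming md5sum-style lines (hash, two spaces, path); the helper names and the sample lines are made up for illustration.

```python
from collections import Counter


def count_hashes(lines):
    """Count how many times each checksum occurs in md5sum-style lines."""
    return Counter(line.partition("  ")[0] for line in lines if line.strip())


def extra_copies(first, second):
    """Return {checksum: n} for checksums occurring n more times in first than in second."""
    # Counter subtraction keeps only the positive differences.
    return dict(first - second)


# Made-up sample: the source holds two copies of a file, the destination one.
source_lines = [
    "0cc175b9c0f1b6a831c399e269772661  a.txt",
    "0cc175b9c0f1b6a831c399e269772661  copy-of-a.txt",
]
dest_lines = ["0cc175b9c0f1b6a831c399e269772661  elsewhere/a.txt"]

source_counts = count_hashes(source_lines)
dest_counts = count_hashes(dest_lines)

# Set-based comparison: the extra copy is invisible.
set_difference = set(source_counts) - set(dest_counts)
# Count-based comparison: the extra copy is reported.
count_difference = extra_copies(source_counts, dest_counts)
```

With the set-based comparison the result is empty, while the count-based one reports one extra copy on the source side. Which of the two is right depends on whether duplicates matter for the check.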