Can not use --track-renames when syncing to and from Crypt remotes?

thestigma · July 30, 2019, 5:25pm

I have an issue where if I try to sync with --track-renames between two crypt remotes then I get an error that "--track-renames ignored because the two remotes do not have a common hash".

The sync works, but not with the feature I want, which I think is pretty important for use.
Also, I can run with --track-renames if I try between two non-crypt remotes.

Is this some sort of intended limitation or is there a way to work around it?

ncw · July 30, 2019, 7:58pm

--track-renames uses the hashes of the files to check they are really the same file. Without the hash, it would just be comparing the sizes of the files to see if they are the same, which I thought was too inaccurate.

The crypt backend doesn't support hashes as the hash it would read from the backend would be a hash of the encrypted data. Even if the data was the same for two files, the hash of the encrypted data would differ because the files will be encrypted with different nonces.
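The nonce effect described above can be reproduced with a minimal sketch. It uses AES-GCM from the Go standard library as a stand-in cipher (an assumption made to keep the example self-contained, not the scheme the crypt backend uses), and encryptAndHash is a hypothetical helper name.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// encryptAndHash seals plaintext under key with a fresh random nonce and
// returns the SHA-256 of nonce||ciphertext, i.e. the hash a storage backend
// would report for the encrypted blob. Hypothetical helper for illustration.
func encryptAndHash(key, plaintext []byte) ([32]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return [32]byte{}, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return [32]byte{}, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return [32]byte{}, err
	}
	// Seal appends the ciphertext to its first argument, so blob = nonce||ciphertext.
	blob := gcm.Seal(nonce, nonce, plaintext, nil)
	return sha256.Sum256(blob), nil
}

func main() {
	key := make([]byte, 32) // all-zero demo key, for illustration only
	data := []byte("identical file contents")

	h1, err := encryptAndHash(key, data)
	if err != nil {
		panic(err)
	}
	h2, err := encryptAndHash(key, data)
	if err != nil {
		panic(err)
	}

	fmt.Printf("plaintext hash:   %x\n", sha256.Sum256(data))
	fmt.Printf("encrypted copy 1: %x\n", h1)
	fmt.Printf("encrypted copy 2: %x\n", h2)
	fmt.Println("encrypted hashes equal:", h1 == h2)
}
```

Same key, same plaintext, yet the two stored blobs hash differently because each encryption picks its own nonce.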

So dedupe works with the equivalent of the --checksum flag. It would be possible to have it check modification times I guess, not sure how useful that would be though. It could also revert back to --size-only which might also be acceptable depending on your data, not sure!

thestigma · July 30, 2019, 8:03pm

Ah yes, that makes sense. It can't compare hashes even though they are in the same format because the hashes of 2 different crypts won't be the same on identical files. I see how that is an inevitable limitation.

But yes, it would be nice to have the option (probably not default behavior though) to only perform the check based on name/size/modtime. That should avoid the problem at the cost of not guaranteeing 100% accuracy like with checksums. It may not be ideal for everyone (and thus probably should be option-non-default to avoid accidental damage) but for my use I think it would be perfectly adequate.

Is that possible to do right now - or are you saying you could enable as an option relatively easily? If so I would very much like that.

ncw · July 30, 2019, 8:12pm

Is it likely that your duplicate files have the same modtime? I would say that it probably isn't, so I think we are left with these as possibilities for matching identical files:

size
hash
leaf name

At the moment we use size and hash. size and leaf name would work quite well too. size alone would be a little risky...

It wouldn't be hard to make different strategies - the actual hash is generated in the renameHash function - it is just a string, so it could have the leaf name in or just the size in easily enough.

At the moment you can see it is size+hash

You could experiment with that and see what works for you?
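To make the "it is just a string" idea concrete, here is a rough sketch of matching files through a string key. It is not rclone's renameHash; fileInfo, key and index are hypothetical names, and the hash values are made up.

```go
package main

import "fmt"

// fileInfo is a hypothetical stand-in for the metadata known about a file.
type fileInfo struct {
	Remote string // path relative to the root of the remote
	Size   int64
	Hash   string // made-up values here
}

// key builds the string used to match files: size+hash, or size alone.
func key(f fileInfo, useHash bool) string {
	if useHash {
		return fmt.Sprintf("%d,%s", f.Size, f.Hash)
	}
	return fmt.Sprintf("%d", f.Size)
}

// index groups paths by key. A key that maps to more than one path is
// ambiguous: a rename could not be pinned to a single candidate.
func index(files []fileInfo, useHash bool) map[string][]string {
	out := make(map[string][]string)
	for _, f := range files {
		k := key(f, useHash)
		out[k] = append(out[k], f.Remote)
	}
	return out
}

func main() {
	files := []fileInfo{
		{Remote: "photos/a.jpg", Size: 1048576, Hash: "aaa111"},
		{Remote: "photos/b.jpg", Size: 1048576, Hash: "bbb222"},
		{Remote: "docs/c.txt", Size: 42, Hash: "ccc333"},
	}
	for _, useHash := range []bool{true, false} {
		fmt.Println("useHash:", useHash)
		for k, paths := range index(files, useHash) {
			fmt.Printf("  %-16s -> %v\n", k, paths)
		}
	}
}
```

With size alone, the two equally sized photos share one key, which is the risk mentioned above; adding the hash (or another component such as the leaf name) separates them.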

thestigma · July 30, 2019, 10:09pm

This is a lot to wrap my head around, so excuse me if I miss something that to you may be obvious but...

I don't understand why you think modtime isn't a good candidate (in combination with others obviously). I would think it would be very rare for a file to have its modtime updated and yet not have any of its contents changed. Yes, it could technically happen I suppose, like making a change to it and then reversing it after. Then you could end up with a binary-identical file with a different modtime. But if this happens and it causes the comparison to fail when it shouldn't in rare cases, is that such a big problem? As far as I understand the only thing that would happen is that sync ends up re-transferring that one file rather than more efficiently moving the existing one. As long as we don't risk losing data that occasional inefficiency seems like it would be perfectly acceptable. And the more factors we can include in the comparison the less risk we have of an actual serious error of a false-positive where a file that should have been synced is not because it is confused with another. To me that seems like it would be a good tradeoff.

Hash we can't use in this context because of the reason you stated for crypt remotes, so if we could use all of name/size/modtime that would probably be "accurate enough" I think - for users that consent to the risk by enabling such an alternate comparison method.

I don't know what a "leaf name" is or if/how that would be different from a filename... can't easily find something on google to explain it. Feel free to illuminate me if you want.

I assume by experiment you mean changing the code myself. Have to admit that is intimidating, but that is such a short snippet that I might actually comprehend it even though I've never touched go before =P But that would involve having to set up all the devtools to edit and recompile go code I assume. I might conceivably be able to do that, or at least try, especially if there exists some sort of list or guide for what I will need to have for the job.

ncw · July 31, 2019, 8:20am

I see what you mean, the feature is called --track-renames and renaming a file doesn't change the modification time.

If you think about it in terms of entropy... size is about 20 bits for a 1MB file and 30 bits for a 1GB file. Modtime has quite a lot of entropy: most OSes store mod times accurate to the ns or 100 ns, Drive stores mod times accurate to 1 ms, and quite a lot of cloud storage providers use 1 s accuracy. Assuming all your files are within 1 year of each other that gives another 24 bits (at 1 s precision) or 34 bits (at 1 ms precision). So a total of 40-64 bits entropy. So size+modtime might be equivalent to a hash of 40-64 bits.
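The back-of-envelope figures can be checked with a few lines of Go. This sketch simply takes log2 of the number of distinguishable values; treating a size of N bytes as log2(N) bits and assuming modtimes spread evenly over a 365-day window are simplifications, and bits and modtimeBits are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// bits returns log2 of the number of distinct values, i.e. how many bits
// are needed to tell that many values apart.
func bits(distinctValues float64) float64 {
	return math.Log2(distinctValues)
}

// modtimeBits estimates the information in a modtime that falls somewhere
// inside window when it is stored at the given precision.
func modtimeBits(window, precision time.Duration) float64 {
	return bits(float64(window) / float64(precision))
}

func main() {
	year := 365 * 24 * time.Hour

	fmt.Printf("size, 1 MiB file:       %.1f bits\n", bits(1<<20))
	fmt.Printf("size, 1 GiB file:       %.1f bits\n", bits(1<<30))
	fmt.Printf("modtime, 1 s precision:  %.1f bits\n", modtimeBits(year, time.Second))
	fmt.Printf("modtime, 1 ms precision: %.1f bits\n", modtimeBits(year, time.Millisecond))
}
```

The modtime results come out a shade under 25 and 35 bits, so the rounded-down 24 and 34 quoted above are slightly conservative.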

That might be good enough...

The leaf name is the last part of the path, so if you had /home/ncw/file.txt the leaf is file.txt.

If we don't include the leaf name then we can detect renames of the file from file.txt to file2.txt; if we do include the leaf name then we can only detect the file being moved into a new directory /home/ncw2/file.txt
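The trade-off can be sketched in Go, using path.Base to extract the leaf. renameKey is a hypothetical helper, not an rclone function, and the size is an arbitrary example value.

```go
package main

import (
	"fmt"
	"path"
)

// renameKey builds a matching key from the size and, optionally, the leaf
// name (the last element of the path). Hypothetical helper for illustration.
func renameKey(remote string, size int64, includeLeaf bool) string {
	if includeLeaf {
		return fmt.Sprintf("%d,%s", size, path.Base(remote))
	}
	return fmt.Sprintf("%d", size)
}

func main() {
	const size = 1234
	original := "home/ncw/file.txt"
	renamed := "home/ncw/file2.txt" // new name, same directory
	moved := "home/ncw2/file.txt"   // same name, new directory

	for _, includeLeaf := range []bool{false, true} {
		base := renameKey(original, size, includeLeaf)
		fmt.Printf("includeLeaf=%v: rename detected=%v, move detected=%v\n",
			includeLeaf,
			base == renameKey(renamed, size, includeLeaf),
			base == renameKey(moved, size, includeLeaf))
	}
}
```

Without the leaf, both changes match (but so would any unrelated file of the same size); with the leaf, only the move into a new directory still matches.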

There are some super brief instructions here: Install if you want to have a go! You need go, git and a text editor - go is quite self contained containing the compiler, linker, standard library etc.

thestigma · July 31, 2019, 2:14pm

Ah, I didn't quite think about the fact that this feature also considers actual renaming (which the name does imply obviously). I thought its most useful feature was handling cases where files were simply moved - as reorganizing is pretty common for me.

Obviously it would be optimal if it could handle both - which would leave us with size/modtime. But perhaps having an option for both variants with a simple flag to switch it would be a good idea since it probably wouldn't be much more work to do than just doing one. Having the option to get (I assume quite significantly) higher accuracy at the cost of not being able to handle renaming by using all 3 factors of leafname/size/modtime would be nice too. I could very much see myself using both variants for different needs.

I think that would at least be worth a shot. I will put it on my to-do list to look into this. No guarantees as I might get hopelessly stuck on the code, but if the documentation is good enough and since I use the existing function as a template of sorts I might have a chance even though my programming skills are rusty as all hell =P

Thanks so much for answering this in such detail NCW. You are a gem
When I get around to trying to make this, would it be appropriate to open an issue for it as a place to coordinate it? I will no doubt need to ask for some clarifications on technical details along the way. Don't know if there is a more appropriate way of doing it.

ncw · July 31, 2019, 9:30pm

Perhaps a --track-renames-strategy flag which could be a comma separated list of the following (size would always be included):

"hash"
"modtime"
"leaf"

The default would be hash, and if --track-renames failed because of no hash it could output a message like "try --track-renames-strategy modtime or modtime,leaf - see docs".
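A rough sketch of how such a comma-separated value might be parsed. The flag is only a proposal at this point in the thread, so everything here is an assumption: the strategy struct and parseStrategy are hypothetical names, and treating an empty value as the default "hash" is a guess at the intended behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// strategy records which components (besides size, which is always used)
// go into the rename-matching key. Hypothetical type for illustration.
type strategy struct {
	hash, modtime, leaf bool
}

// parseStrategy parses a value such as "modtime,leaf". An empty value is
// treated as the default "hash" (an assumption based on the proposal above).
func parseStrategy(s string) (strategy, error) {
	if strings.TrimSpace(s) == "" {
		return strategy{hash: true}, nil
	}
	var out strategy
	for _, tok := range strings.Split(s, ",") {
		switch strings.TrimSpace(tok) {
		case "hash":
			out.hash = true
		case "modtime":
			out.modtime = true
		case "leaf":
			out.leaf = true
		default:
			return strategy{}, fmt.Errorf("unknown track-renames strategy %q", tok)
		}
	}
	return out, nil
}

func main() {
	for _, value := range []string{"", "modtime", "modtime,leaf", "bogus"} {
		st, err := parseStrategy(value)
		fmt.Printf("%q -> %+v, err=%v\n", value, st, err)
	}
}
```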

It would be a good idea to make an issue about it now while it is fresh in our minds! There are some complexities handling the modtime (we need to make the precision correct for both the source and destination so we need to find the highest precision and divide the modtime expressed in ns by that). I'd be delighted to talk you through it if you wanted to have a go!
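The precision handling could look roughly like this. It reads "highest precision" as the larger (coarser) of the two precision values, and it assumes the coarser remote truncates rather than rounds modtimes; coarsestNanos and quantizeNanos are hypothetical helper names.

```go
package main

import (
	"fmt"
	"time"
)

// coarsestNanos returns the larger of two precisions given in nanoseconds,
// i.e. the granularity both sides can represent.
func coarsestNanos(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// quantizeNanos expresses a modtime (nanoseconds since the Unix epoch) as a
// whole number of precision-sized ticks. Integer division truncates toward
// zero, which is fine for times after 1970.
func quantizeNanos(modtimeNs, precisionNs int64) int64 {
	return modtimeNs / precisionNs
}

func main() {
	srcPrecision := time.Nanosecond  // e.g. a local disk
	dstPrecision := time.Millisecond // e.g. a remote that keeps 1 ms modtimes
	p := coarsestNanos(int64(srcPrecision), int64(dstPrecision))

	src := time.Date(2019, 7, 31, 21, 30, 0, 123456789, time.UTC)
	dst := src.Truncate(time.Millisecond) // what the coarser remote kept (assumed truncation)

	fmt.Println("raw modtimes equal:      ", src.UnixNano() == dst.UnixNano())
	fmt.Println("quantized modtimes equal:",
		quantizeNanos(src.UnixNano(), p) == quantizeNanos(dst.UnixNano(), p))
}
```

The quantized value, rather than the raw timestamp, is what would go into the matching key, so the same file compares equal on both sides.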

thestigma · July 31, 2019, 10:36pm

Ok, I'll make an issue on it shortly then and summarize the idea from the discussion we had here.

Getting around to attempting this myself might take a little time though as I have a few other things I need to get off my to-do list before I dare take on a new pet project, but I keep close tabs on any issues I open so I won't forget. Promise

MistarMuffin · September 4, 2019, 4:06am

Please, please on the --track-renames-strategy. I found this thread on Google after determining I needed something like this. I have a huge media library and I often do filename cleanups. My cloud provider (GDrive) is strictly for backup. I've done renames on upwards of 1TB worth of media in one sitting. --track-renames has been a welcome feature in these situations. That being said, having to read ~1TB of media from my NAS to calculate checksums is very time consuming. I also wonder about unnecessary wear and tear on the disks. I was brainstorming a solution and came to the conclusion that mod-time and size would be more than enough for me to accurately track renames. I already run a separate rclone command to back up important files where I would want checksum matching.

thestigma, did you get an issue open for this request? I did not see one.

Thanks!

thestigma · September 4, 2019, 4:20am

I have not yet, no. Haven't forgotten about it, but also haven't yet had the time to properly think this through + do some testing in modifying the existing code and see how it fares against a large collection of real files.

If you want to open up an issue, feel free, and I can add my thoughts and research into that when I have the time to look more into it.

MistarMuffin · November 2, 2019, 6:43pm

thestigma,

I have opened an issue as you requested:

Thank you for your consideration and work on this.

thestigma · November 2, 2019, 7:09pm

Thanks!

But I think this problem may soon be solved in a better and more robust alternative way.

I have been talking to Nick about a way to bake hashing (and other) metadata into encrypted files.
If we can implement this, then rclone will know both the encrypted and unencrypted hashes of files - and it can then easily compare unencrypted files to encrypted ones. This will make the normal function of --track-renames work out of the box.

It will even make it possible to --checksum compare and --track-renames between crypted remotes using different crypt-keys.

I don't know much about a timetable for this, but Nick hinted it may be on the agenda for one of the next few versions. Can't make any promises though because I'm not the one who will be implementing it (too advanced for me to mess with lol).
