What does it look like when S3 is missing a hash?

I am working on (yet another) rclone-based bi-directional sync tool. (I am the author of PyFiSync, which supports rclone on one side but was designed around efficient rsync transfers, so a lot of its work goes into tracking moves and modifications, which is unneeded with rclone.)

Anyway, I have an extensive set of tests covering different situations and remotes with different capabilities. However, one situation I read about in the S3 docs is an object lacking a hash. While I think I account for it, I need to test it and think it through more.

I do not have access to S3 (though I do have B2, so I may be able to test with their new S3 interface), but can anyone tell me what lsjson --hash looks like when S3 doesn't have an MD5?

Are there any other oddities I should be aware of for certain remotes? I already ran into the fact that while WebDAV (tested via rclone serve webdav) claims not to support ModTime, it still reports modification times; they just should not be trusted.

If you have used a recent rclone and you haven't used --s3-disable-checksum, then you will have a hash on both small and large files.

However, multipart files not uploaded with rclone won't have a hash.

If a hash isn't present, then rclone returns it as an empty string.
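
For example, a quick check from Python (the remote name and path here are made up) could flag objects that come back without a usable MD5 and fall back to size/modtime for those:

import json
import subprocess

# Hypothetical remote/bucket names; list recursively with hashes included.
out = subprocess.run(
    ["rclone", "lsjson", "--hash", "-R", "s3remote:bucket/prefix"],
    check=True, capture_output=True, text=True,
).stdout

for entry in json.loads(out):
    if entry.get("IsDir"):
        continue
    # A missing hash shows up as an empty (or absent) MD5 entry.
    md5 = entry.get("Hashes", {}).get("MD5", "")
    if not md5:
        print(f"no MD5 for {entry['Path']} - compare by size/modtime instead")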

Backends do their best to return useful data, so the times they return will be the modification time of the file on the server.

How much you trust the times for syncing purposes should be discovered using the Precision call, e.g.

$ rclone backend features webdav:
{
	"Name": "webdav",
	"Root": "",
	"String": "webdav root ''",
	"Precision": 3153600000000000000,
	"Hashes": [],
	"Features": {
		"About": true,
		"BucketBased": false,
		"BucketBasedRootOK": false,
		"CanHaveEmptyDirectories": true,
		"CaseInsensitive": false,
		"ChangeNotify": false,
		"CleanUp": false,
		"Command": false,
		"Copy": true,
		"DirCacheFlush": false,
		"DirMove": true,
		"Disconnect": false,
		"DuplicateFiles": false,
		"GetTier": false,
		"IsLocal": false,
		"ListR": false,
		"MergeDirs": false,
		"Move": true,
		"OpenWriterAt": false,
		"PublicLink": false,
		"Purge": true,
		"PutStream": false,
		"PutUnchecked": false,
		"ReadMimeType": false,
		"ServerSideAcrossConfigs": false,
		"SetTier": false,
		"SetWrapper": false,
		"UnWrap": false,
		"UserInfo": false,
		"WrapFs": false,
		"WriteMimeType": false
	}
}

vs

$ rclone backend features owncloud:
{
	"Name": "owncloud",
	"Root": "",
	"String": "webdav root ''",
	"Precision": 1000000000,
	"Hashes": [
		"MD5",
		"SHA-1"
	],
	"Features": {
		"About": true,
		"BucketBased": false,
		"BucketBasedRootOK": false,
		"CanHaveEmptyDirectories": true,
		"CaseInsensitive": false,
		"ChangeNotify": false,
		"CleanUp": false,
		"Command": false,
		"Copy": true,
		"DirCacheFlush": false,
		"DirMove": true,
		"Disconnect": false,
		"DuplicateFiles": false,
		"GetTier": false,
		"IsLocal": false,
		"ListR": false,
		"MergeDirs": false,
		"Move": true,
		"OpenWriterAt": false,
		"PublicLink": false,
		"Purge": true,
		"PutStream": true,
		"PutUnchecked": false,
		"ReadMimeType": false,
		"ServerSideAcrossConfigs": false,
		"SetTier": false,
		"SetWrapper": false,
		"UnWrap": false,
		"UserInfo": false,
		"WrapFs": false,
		"WriteMimeType": false
	}
}

The precisions are in ns, and that really large one (~100 years) means don't trust the timestamp at all!
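
For example, a sync tool could read that up front and decide whether ModTime comparisons are worth anything; the one-hour cut-off below is just an illustrative choice, not anything rclone defines:

import json
import subprocess

# Query the backend's capabilities (same command as above).
out = subprocess.run(
    ["rclone", "backend", "features", "webdav:"],
    check=True, capture_output=True, text=True,
).stdout

features = json.loads(out)
precision_ns = features["Precision"]   # nanoseconds; ~3.15e18 means "don't trust"
hashes = features["Hashes"]            # empty list means no hash available

trust_modtime = precision_ns <= 3600 * 10**9   # arbitrary 1-hour cut-off
print(f"hashes: {hashes or 'none'}, trust modtime: {trust_modtime}")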

Thanks. I will play with those as well as S3 via B2.

I am excited for this new sync tool. Sync is fundamentally opinionated in how you do it, and while I do not doubt the quality of the other tools out there, I want certain things done my way (including move tracking, automatic backups, etc.). My old tool is great when using rsync since it tries so hard to track moves with modifications, but rclone came a lot later and I wasn't seeing a path forward for rclone-to-rclone.

It's just that (and this is like me being in the choir, preaching to the choirmaster) there is so much variability, and if it isn't something I use often, it's hard to test.

Rclone does its best to provide a consistent view of cloud providers but alas differences show through. The info above tells you exactly what each provider can and can't do and you'll need it!

I've been thinking recently about adding bidirectional sync to rclone. Does your tool do that?

Yes it does.

The thing with bi-directional sync is that it can't be stateless like one-way sync; otherwise you have no way to know what's new vs deleted. You can do a "union" sync where all files end up on both sides, but then you can't do deletes. This is actually doable now via two rclone calls.
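
Something like the following is what I mean by two rclone calls (remote names are made up; I am assuming --update here so the newer version of a common file wins and nothing is ever deleted):

import subprocess

# Copy newer files in both directions; --update skips files that are already
# newer on the destination, so this converges to a "union" of both sides.
for src, dst in [("remoteA:path", "remoteB:path"),
                 ("remoteB:path", "remoteA:path")]:
    subprocess.run(["rclone", "copy", "--update", src, dst], check=True)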

One nice thing about my tool is that since I do store the previous state, I can also use that to track moves without compatible hashes between remotes. And of course, delete vs new.

It’s very close to finished. Just needs polish, a name, and some real world testing.

But alas, it is Python, so I doubt you're interested directly. However, I am happy to talk about the intricacies of bi-sync since I have lots of experience between this new tool and my old one. (The old one, like I said, was more complex since it tried to track moves and modifications, adding tons of edge cases to account for and test. This new one is much simpler.)

I wrote up the sync algorithm in a doc as I thought through it before coding. I can share that too!

I note in passing that you can use size+modtime now with the new --track-renames-strategy flag on remotes that don't have a hash...
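
As a rough illustration in Python (remote names made up), a one-way sync that detects renames by size+modtime rather than by hash would look something like:

import subprocess

# --track-renames-strategy modtime matches renamed files on size+modtime,
# which helps on remotes that return no checksum.
subprocess.run(
    ["rclone", "sync", "--track-renames", "--track-renames-strategy", "modtime",
     "source:path", "dest:path"],
    check=True,
)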

Yes I would be interested to learn more about bi-sync.

Maybe I can tempt you into learning some Go :wink:

Yes please!

I have a bit of background work to do with rclone first - I want to be able to persist rclone objects to disk so I can persist a recursive listing of the source and the destination. Once that is done, I think all the tools for doing a bi-sync will be in rclone.

How do you resolve conflicts BTW? Presumably they need some kind of user action?

It's rough, but the algorithm is here. Lots of blanks and THISCODE placeholders (I haven't come up with a name I like).

If you don't care about move tracking, then it's really not too hard. Again, persistence is the problem. In my new tool, I have the user specify a (hopefully) unique name for the sync pair and save the state under it. Reading up on how Unison works, for example, it also stores the previous state. That does add a bit of complexity when you want a simple tool.
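
The persistence itself is nothing fancy; roughly like this (the location and helper names here are hypothetical, not my actual code):

import json
from pathlib import Path

STATE_DIR = Path.home() / ".mysync"   # hypothetical state location

def save_state(pair_name: str, listing: dict) -> None:
    """Store the post-sync listing under the user-chosen sync-pair name."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / f"{pair_name}.json").write_text(json.dumps(listing))

def load_state(pair_name: str) -> dict:
    """Return the previous listing, or an empty dict on the first run."""
    path = STATE_DIR / f"{pair_name}.json"
    return json.loads(path.read_text()) if path.exists() else {}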

Again, ignoring moves, you need history to know if:

  • A file is new
  • A file has been deleted
  • A file is deleted on one side and modified on the other

Comparing ModTime, size, or hash to decide whether same-named files match is the easy part; the sketch below shows the history-based classification.
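
A rough version (illustrative names, not my real code), where prev is the stored listing from the last sync and a/b are the current listings, each a dict of path -> (size, modtime):

def classify(prev, a, b):
    actions = []
    for path in set(prev) | set(a) | set(b):
        in_prev, in_a, in_b = path in prev, path in a, path in b
        if not in_prev and in_a and not in_b:
            actions.append(("copy A -> B", path))              # new on A
        elif in_prev and not in_a and in_b:
            if b[path] != prev[path]:
                actions.append(("conflict: deleted on A, modified on B", path))
            else:
                actions.append(("delete on B", path))          # deleted on A
        # ... plus the mirrored cases for B and the modified-on-both conflict
    return actions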

Maybe I can tempt you into learning some Go :wink:

You've certainly been trying :wink:! I love the idea, but time is limited. I've had to claw back time to even work on this new sync tool, and I am pushing my "happy wife, happy life" limits. But we'll see!

What makes it harder is that hard-core coding is a hobby. I use Python extensively for my job, so I use it often and can "leverage" some time.

How do you resolve conflicts BTW? Presumably they need some kind of user action?

No. This is one of the "opinionated" parts of a sync algorithm. I have the user specify how to handle conflicts. Options are: choose the older (or smaller, when not using ModTime), the newer (or larger, when not using ModTime), always keep one side or the other, tag both (append the date and a name to the file), or tag only the older.

The reason this is safe is that in my code, unless it is explicitly disabled, the equivalent of rclone's --backup-dir is always used.
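
As an illustration only (the mode names here are invented for the example, not my actual option strings):

import datetime

def resolve(mode, a, b):
    """a/b: dicts with at least "path" and "mtime"; returns the file(s) to keep."""
    if mode == "newer":
        return [max(a, b, key=lambda f: f["mtime"])]
    if mode == "older":
        return [min(a, b, key=lambda f: f["mtime"])]
    if mode == "keep_a":
        return [a]
    if mode == "tag_both":                   # keep both, tagged with date + side
        stamp = datetime.date.today().isoformat()
        return [{**a, "path": f"{a['path']}.{stamp}.A"},
                {**b, "path": f"{b['path']}.{stamp}.B"}]
    raise ValueError(f"unknown conflict mode {mode!r}")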

BTW, again harking back to Unison, it has a batch mode and an interactive mode.

One more thing to add: it is very easy to accidentally get into an O(n^2) situation. This was worse in my older code since I had to look at every file from the previous state, while the newer one only has to look at those that changed (similar to how rclone does track-renames). Still, I developed a Python data object with O(1) lookups (so the whole pass is O(n)) that makes these lookups fast. I suspect this is nothing new to you, but it is worth mentioning for anyone else following along.
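
The idea in miniature (illustrative code, not the actual data object):

from collections import defaultdict

def find_moves(deleted, created):
    """deleted/created: dicts of path -> (size, modtime), built from changes only."""
    by_attrs = defaultdict(list)
    for path, attrs in created.items():      # O(n) to build the index
        by_attrs[attrs].append(path)

    moves = {}
    for old_path, attrs in deleted.items():  # each lookup is O(1)
        candidates = by_attrs.get(attrs)
        if candidates and len(candidates) == 1:   # only accept unambiguous matches
            moves[old_path] = candidates[0]
    return moves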

Thanks for writing that up - it makes sense! I have a notes file with some of the same things in it :slight_smile:

Yes, that sounds like a great idea. No data loss is possible, but the right thing happens automatically, with a manual work-around if needed.

I shall have to study your tool some more (and unison!).

If I ever get round to the bisync for rclone, I'd certainly like to start with a well-proven algorithm rather than invent it myself. Those corner cases are out to get you!

Yes that will bite you when the user has 100,000 files...

Thank you for taking the time to explain how your code works. I shall continue putting the foundations into rclone :slight_smile:
