Local and Hasher don't work nicely with each other

This is similar to #5934, but from a different perspective.

Currently, the local backend declares that all checksums are available and computes them on demand.

Because it declares all checksums as available, hasher always does pass-through (i.e. calls local's checksum) and doesn't cache anything.
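
To make the difference concrete, here is a simplified Python model of the behaviour described above. It is not rclone's code (rclone is written in Go): the class `HasherModel`, its fields, and the `work_done` counter are hypothetical names for illustration. The only assumption taken from this report is that hash types declared by the wrapped backend are passed through uncached, while undeclared ones are computed by hasher and cached.

```python
import hashlib


class HasherModel:
    """Toy model of hasher: pass through declared hashes, cache the rest."""

    def __init__(self, declared_hashes):
        self.declared = set(declared_hashes)  # hash types the wrapped backend declares
        self.cache = {}                       # (name, hash_type) -> hex digest
        self.work_done = 0                    # how many times data was actually hashed

    def _compute(self, data, hash_type):
        self.work_done += 1
        return hashlib.new(hash_type, data).hexdigest()

    def get_hash(self, name, data, hash_type):
        if hash_type in self.declared:
            # Pass-through: the wrapped backend hashes on demand, nothing is stored.
            return self._compute(data, hash_type)
        key = (name, hash_type)
        if key not in self.cache:
            # Not declared by the wrapped backend: compute once and remember it.
            self.cache[key] = self._compute(data, hash_type)
        return self.cache[key]
```

In this model, a wrapped backend that declares `sha1` (like local) causes every lookup to hash the data again and leaves the cache empty, whereas a wrapped backend that declares nothing (like the crypt-based union) hashes once and serves later lookups from the cache.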

I can come up with some fixes/workarounds, such as:

  • Option to disable/choose checksums on the local backend
  • Hasher should cache all checksums returned by the wrapped backend, and always return them unless invalidated
  • (workaround) Make a union of the local path and a read-only crypt with remote=:memory:
    • This (ab)uses the fact that the crypt backend declares that no checksums are available. The union then reports the same, so hasher calculates the hashes itself and caches them
    • hasher must also be set to auto_size = 1000G (or anything larger) for this to work
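
The second option could look roughly like the sketch below. It is a Python model rather than rclone code: `CachingWrapper`, `backend_hash`, and the `fingerprint` argument are hypothetical, and treating a changed fingerprint (for example a size and modification time pair) as the invalidation signal is an assumption made for illustration.

```python
class CachingWrapper:
    """Caches digests returned by a wrapped backend until the file's fingerprint changes."""

    def __init__(self, backend_hash):
        # backend_hash is a callable (path, hash_type) -> digest, standing in
        # for the wrapped backend's own checksum call.
        self.backend_hash = backend_hash
        self.entries = {}  # (path, hash_type) -> (fingerprint, digest)

    def get_hash(self, path, hash_type, fingerprint):
        key = (path, hash_type)
        cached = self.entries.get(key)
        if cached is not None and cached[0] == fingerprint:
            # Fingerprint unchanged: reuse the stored digest.
            return cached[1]
        # Missing or invalidated: ask the wrapped backend and store the result.
        digest = self.backend_hash(path, hash_type)
        self.entries[key] = (fingerprint, digest)
        return digest
```

The point of the design is that the wrapped backend is only asked again when the fingerprint passed in differs from the stored one, regardless of which hashes that backend declares.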

Can this be made better without resorting to a nasty workaround like that?

I thought this was how hasher was supposed to work.

Caching the local hashes is one of its use cases, I think.

No, it currently doesn't.
I carried out a test here:

Version:

$ rclone version
rclone v1.59.2
- os/version: ubuntu 22.04 (64 bit)
- os/kernel: 5.15.0-48-generic (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.18.6
- go/linking: static
- go/tags: none

Config:

[sqfs-union]
type = union
upstreams = queue/:ro done/:ro sqfs-dummy::ro

[sqfs-hasher]
type = hasher
remote = sqfs-union:
hashes = md5,sha1,sha256
max_age = off
auto_size = 1000G

[sqfs-dummy]
type = crypt
remote = :memory:
filename_encryption = off
directory_name_encryption = false
password = xtQ3DAvNBi87vip07G468lDIbFdVch3XIyjI

[sqfs-union-nodummy]
type = union
upstreams = queue/:ro done/:ro

[sqfs-hasher-nodummy]
type = hasher
remote = sqfs-union-nodummy:
hashes = md5,sha1,sha256
max_age = off
auto_size = 1000G

Here, sqfs-hasher is the working one, while sqfs-hasher-nodummy (which is also a typical configuration) is not.

Create large enough sparse files:

$ mkdir queue/
$ fallocate -l50G queue/50Gtest
$ fallocate -l100G queue/100Gtest

Hash these files twice via the working hasher:

$ time rclone sha1sum sqfs-hasher: 
2bc1c231f26aab1f688c429f92135c23dfe7c65f  50Gtest
941efac846d49c595f11bc3444fda75e5a263247  100Gtest

real    11m40.002s
user    16m54.984s
sys     2m16.848s
$ time rclone sha1sum sqfs-hasher: 
941efac846d49c595f11bc3444fda75e5a263247  100Gtest
2bc1c231f26aab1f688c429f92135c23dfe7c65f  50Gtest

real    0m0.111s
user    0m0.105s
sys     0m0.015s

Again, with the failing one:

$ time rclone sha1sum sqfs-hasher-nodummy: 
2bc1c231f26aab1f688c429f92135c23dfe7c65f  50Gtest
941efac846d49c595f11bc3444fda75e5a263247  100Gtest

real    3m46.047s
user    4m28.722s
sys     2m10.564s
$ time rclone sha1sum sqfs-hasher-nodummy: 
2bc1c231f26aab1f688c429f92135c23dfe7c65f  50Gtest
941efac846d49c595f11bc3444fda75e5a263247  100Gtest

real    3m42.466s
user    4m17.443s
sys     2m7.644s

Although the working one takes longer overall on its first run (because it's calculating three hashes at a time), the failing one's second attempt takes just as long as its first.

See what's inside the DB:

$ rclone backend dump sqfs-hasher:
ok  md5:09cd755eb35bc534487a5796d781a856 sha1:941efac846d49c595f11bc3444fda75e5a263247 sha256:f0b14a8da7f1c48a0846647a078b97956edd8df451a62fc4b466879aa24d4fd7     10m9s 100Gtest
ok  md5:e7f4706922e1edfdb43cd89eb1af606d sha1:2bc1c231f26aab1f688c429f92135c23dfe7c65f sha256:ab743e145f643a1f6237b7390baf2e6edc71d83997f5bf4ed40d975fb50ba423    15m17s 50Gtest
$ rclone backend dump sqfs-hasher-nodummy:

So, hasher isn't saving hashes even with the latest version.

Edit: I should have added the debug logs too:

$ time rclone -vv sha1sum sqfs-hasher: 
2022/09/25 01:18:56 DEBUG : rclone: Version "v1.59.2" starting with parameters ["rclone" "--transfers=5" "--retries=999999" "-vv" "sha1sum" "sqfs-hasher:"]
2022/09/25 01:18:56 DEBUG : Creating backend with remote "sqfs-hasher:"
2022/09/25 01:18:56 DEBUG : Using config file from "/home/lesmi/.config/rclone/rclone.conf"
2022/09/25 01:18:56 INFO  : Hasher is EXPERIMENTAL!
2022/09/25 01:18:56 DEBUG : Creating backend with remote "sqfs-union:"
2022/09/25 01:18:56 DEBUG : Creating backend with remote "sqfs-dummy:"
2022/09/25 01:18:56 DEBUG : Creating backend with remote "queue/"
2022/09/25 01:18:56 DEBUG : Creating backend with remote "done/"
2022/09/25 01:18:56 DEBUG : fs cache: renaming cache item "done/" to be canonical "***/sqfs/done"
2022/09/25 01:18:56 DEBUG : fs cache: switching user supplied name "done/" for canonical name "***/sqfs/done"
2022/09/25 01:18:56 DEBUG : fs cache: renaming cache item "queue/" to be canonical "***/sqfs/queue"
2022/09/25 01:18:56 DEBUG : fs cache: switching user supplied name "queue/" for canonical name "***/sqfs/queue"
2022/09/25 01:18:57 DEBUG : Creating backend with remote ":memory:"
2022/09/25 01:18:57 DEBUG : union root '': actionPolicy = *policy.EpAll, createPolicy = *policy.EpMfs, searchPolicy = *policy.FF
2022/09/25 01:18:57 DEBUG : hasher::sqfs-hasher:: Groups by usage: cached [md5, sha1, sha256], passed [], auto [md5, sha1, sha256], slow [], supported [md5, sha1, sha256]
2022/09/25 01:18:57 DEBUG : sqfs-union~hasher.bolt: Opened for reading in 572.303µs
2022/09/25 01:18:57 DEBUG : 100Gtest: getHash: fingerprint changed
2022/09/25 01:18:57 DEBUG : 50Gtest: getHash: fingerprint changed
2022/09/25 01:18:57 DEBUG : sqfs-union~hasher.bolt: released
2022/09/25 01:25:06 DEBUG : sqfs-union~hasher.bolt: Opened for writing in 80.269µs
2bc1c231f26aab1f688c429f92135c23dfe7c65f  50Gtest
2022/09/25 01:25:06 DEBUG : sqfs-union~hasher.bolt: released
2022/09/25 01:29:52 DEBUG : sqfs-union~hasher.bolt: Opened for writing in 71.935µs
941efac846d49c595f11bc3444fda75e5a263247  100Gtest
2022/09/25 01:29:52 DEBUG : 3 go routines active
2022/09/25 01:29:52 DEBUG : sqfs-union~hasher.bolt: released

real    10m55.867s
user    15m55.350s
sys     2m8.510s
$ time rclone -vv sha1sum sqfs-hasher-nodummy: 
2022/09/25 01:19:03 DEBUG : rclone: Version "v1.59.2" starting with parameters ["rclone" "--transfers=5" "--retries=999999" "-vv" "sha1sum" "sqfs-hasher-nodummy:"]
2022/09/25 01:19:03 DEBUG : Creating backend with remote "sqfs-hasher-nodummy:"
2022/09/25 01:19:03 DEBUG : Using config file from "/home/lesmi/.config/rclone/rclone.conf"
2022/09/25 01:19:03 INFO  : Hasher is EXPERIMENTAL!
2022/09/25 01:19:03 DEBUG : Creating backend with remote "sqfs-union-nodummy:"
2022/09/25 01:19:03 DEBUG : Creating backend with remote "done/"
2022/09/25 01:19:03 DEBUG : Creating backend with remote "queue/"
2022/09/25 01:19:03 DEBUG : fs cache: renaming cache item "done/" to be canonical "***/sqfs/done"
2022/09/25 01:19:03 DEBUG : fs cache: renaming cache item "queue/" to be canonical "***/sqfs/queue"
2022/09/25 01:19:03 DEBUG : fs cache: switching user supplied name "done/" for canonical name "***/sqfs/done"
2022/09/25 01:19:03 DEBUG : fs cache: switching user supplied name "queue/" for canonical name "***/sqfs/queue"
2022/09/25 01:19:03 DEBUG : union root '': actionPolicy = *policy.EpAll, createPolicy = *policy.EpMfs, searchPolicy = *policy.FF
2022/09/25 01:19:03 DEBUG : hasher::sqfs-hasher-nodummy:: Groups by usage: cached [md5, sha1, sha256], passed [md5, sha1, whirlpool, crc32, sha256, dropbox, hidrive, mailru, quickxor], auto [md5, sha1, sha256], slow [], supported [md5, sha1, whirlpool, crc32, sha256, dropbox, hidrive, mailru, quickxor]
2022/09/25 01:19:03 DEBUG : 50Gtest: pass sha1
2022/09/25 01:19:03 DEBUG : 100Gtest: pass sha1
2bc1c231f26aab1f688c429f92135c23dfe7c65f  50Gtest
941efac846d49c595f11bc3444fda75e5a263247  100Gtest
2022/09/25 01:22:49 DEBUG : 3 go routines active

real    3m45.746s
user    4m26.331s
sys     2m11.271s

Do you want to have a go at fixing this, @Lesmiscore?

I think this is probably the best plan.

How does hasher define invalidated?

I think it'll just blow up passHashes.
Is that okay?
It turns out SlowHash isn't propagated to Hasher because of Union.

If we did that, would it fix your problem?

At the moment the feature mask is the AND of all the features. Maybe SlowHash in the union should be the OR instead, i.e. if any member of the union has slow hashes, the union has slow hashes?
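
As a rough illustration of the AND versus OR idea, here is a small Python model. It is not the union backend's actual code: `combine_features` and the `or_flags` parameter are hypothetical, and the feature dictionaries are made-up inputs that only reuse rclone feature names.

```python
def combine_features(upstreams, or_flags=frozenset()):
    """Combine per-upstream feature flags into one set of flags for the union.

    upstreams: list of dicts mapping feature name -> bool.
    or_flags:  feature names combined with OR; all others are combined with AND.
    """
    names = set()
    for features in upstreams:
        names.update(features)
    combined = {}
    for name in sorted(names):
        values = [features.get(name, False) for features in upstreams]
        # OR: true if any upstream has it. AND: true only if every upstream has it.
        combined[name] = any(values) if name in or_flags else all(values)
    return combined


upstreams = [
    {"SlowHash": True, "CanHaveEmptyDirectories": True},
    {"SlowHash": False, "CanHaveEmptyDirectories": True},
]
and_only = combine_features(upstreams)
slow_hash_or = combine_features(upstreams, or_flags={"SlowHash"})
```

With pure AND, a single upstream without slow hashes hides SlowHash for the whole union; with SlowHash in `or_flags`, one slow upstream is enough to mark the union as slow.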

The goal is to stop Hasher from always doing pass-through for the Union backend, so yes. It looks like SlowHash=true will disable pass-through.

I guess it should be. While non-SlowHash backends return checksums faster, it's very frustrating to hash local files without a cache and without being able to resume.