Hasher with GDrive backend does not return SHA1 & SHA256 for old files

What is the problem you are having with rclone?

I am using Hasher with a GDrive backend. Hasher is set up to have MD5, SHA1, and SHA256 saved in the backend.

In GDrive, there are some old files for which GDrive doesn't provide SHA1/SHA256.

"Note, however, that a small fraction of files uploaded may not have SHA1 or SHA256 hashes especially if they were uploaded before 2018."

Even though the SHA1 & SHA256 hashsums are saved in the backend, they don't show up when running commands.

Expected behaviour: when GDrive does not return SHA1 and SHA256 for old files, Hasher should return the hashsums from the backend.

Run the command 'rclone version' and share the full output of the command.

rclone v1.65.2

  • os/version: ubuntu 23.10 (64 bit)
  • os/kernel: 6.5.0-1015-oracle (aarch64)
  • os/type: linux
  • os/arch: arm64 (ARMv8 compatible)
  • go/version: go1.21.6
  • go/linking: static
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Hasher with GDrive as backend

The command you were trying to run (eg rclone copy /tmp remote:tmp)

$ rclone lsjson --hash 'gdhash:/Google Photos/2013/_2.jpeg'
[
{"Path":"_2.jpeg","Name":"_2.jpeg","Size":165206,"MimeType":"image/jpeg","ModTime":"2014-07-07T00:06:00.000Z","IsDir":false,"Hashes":{"md5":"03c6c126c60cc85620e77757ea538550","whirlpool":"1990038b3b8c61ff8acd4c7fa1073562dc7fd03760e0040b5c7f94a2f6b2da1730e542b1a2e3d7bcb6288efa0c898853f40083d8c90163334b5d62e6580d8290"},"ID":"1_wNymqIvOgZL2UT-dqdsLvubqmMwsvtROg"}
]

The rclone config contents with secrets removed.

[gdrive]
type = drive
client_id = <<client_id_removed>>
client_secret = <<client_secret_removed>>
scope = drive
root_folder_id = <<root_folder_id_removed>>
token = <<token_removed>>
team_drive =

[gdhash]
type = hasher
remote = gdrive:
hashes = md5,sha1,sha256,whirlpool
max_age = off


A log from the command with the -vv flag

$ rclone lsjson --hash 'gdhash:/Google Photos/2013/_2.jpeg' -vv
2024/02/20 04:37:41 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "lsjson" "--hash" "gdhash:/Google Photos/2013/_2.jpeg" "-vv"]
2024/02/20 04:37:41 DEBUG : Creating backend with remote "gdhash:/Google Photos/2013/_2.jpeg"
2024/02/20 04:37:41 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/20 04:37:41 INFO  : Hasher is EXPERIMENTAL!
2024/02/20 04:37:41 DEBUG : Creating backend with remote "gdrive:/Google Photos/2013/_2.jpeg"
2024/02/20 04:37:44 DEBUG : fs cache: adding new entry for parent of "gdrive:/Google Photos/2013/_2.jpeg", "gdrive:Google Photos/2013"
2024/02/20 04:37:44 DEBUG : hasher::gdhash:/Google Photos/2013: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/20 04:37:44 DEBUG : gdrive~hasher.bolt: Opened for reading in 85.8µs
2024/02/20 04:37:44 DEBUG : fs cache: adding new entry for parent of "gdhash:/Google Photos/2013/_2.jpeg", "gdhash:/Google Photos/2013"
[
2024/02/20 04:37:44 DEBUG : gdrive~hasher.bolt: released
2024/02/20 04:37:45 DEBUG : _2.jpeg: pass md5
2024/02/20 04:37:45 DEBUG : _2.jpeg: pass sha1
2024/02/20 04:37:45 DEBUG : gdrive~hasher.bolt: Opened for reading in 81.32µs
2024/02/20 04:37:45 DEBUG : _2.jpeg: cached whirlpool = "1990038b3b8c61ff8acd4c7fa1073562dc7fd03760e0040b5c7f94a2f6b2da1730e542b1a2e3d7bcb6288efa0c898853f40083d8c90163334b5d62e6580d8290"
2024/02/20 04:37:45 DEBUG : _2.jpeg: pass sha256
{"Path":"_2.jpeg","Name":"_2.jpeg","Size":165206,"MimeType":"image/jpeg","ModTime":"2014-07-07T00:06:00.000Z","IsDir":false,"Hashes":{"md5":"03c6c126c60cc85620e77757ea538550","whirlpool":"1990038b3b8c61ff8acd4c7fa1073562dc7fd03760e0040b5c7f94a2f6b2da1730e542b1a2e3d7bcb6288efa0c898853f40083d8c90163334b5d62e6580d8290"},"ID":"1_wNymqIvOgZL2UT-dqdsLvubqmMwsvtROg"}
]
2024/02/20 04:37:45 DEBUG : 7 go routines active
2024/02/20 04:37:45 DEBUG : gdrive~hasher.bolt: released
2024/02/20 04:37:45 Failed to lsjson: database stopped

When running the command to query the backend, rclone does return SHA1 & SHA256:

$ rclone backend dump 'gdhash:' -vv | grep _2.jpeg
2024/02/20 04:42:42 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "backend" "dump" "gdhash:" "-vv"]
2024/02/20 04:42:42 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/20 04:42:42 INFO  : Hasher is EXPERIMENTAL!
2024/02/20 04:42:42 DEBUG : Creating backend with remote "gdrive:"
2024/02/20 04:42:42 DEBUG : hasher::gdhash:: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/20 04:42:42 DEBUG : gdrive~hasher.bolt: Opened for reading in 61.52µs
ok  md5:03c6c126c60cc85620e77757ea538550 sha1:8391e852a8da44f5b8df99149314dd95a3cb36d4 whirlpool:1990038b3b8c61ff8acd4c7fa1073562dc7fd03760e0040b5c7f94a2f6b2da1730e542b1a2e3d7bcb6288efa0c898853f40083d8c90163334b5d62e6580d8290 sha256:963df079beaefcddbb94aa43d65b1eea51d21e2c045d3c1b3921cf6097351018       90h Google Photos/2013/_2.jpeg
2024/02/20 04:42:43 INFO  : /home/ubuntu/.cache/rclone/kv/gdrive~hasher.bolt: 19068 records
2024/02/20 04:42:43 DEBUG : 5 go routines active

I think what is happening here is that because Hasher thinks Drive natively supports SHA1 and SHA256, it expects that Drive will always return one when asked. Hasher therefore skips doing its own calculation, because getting one that Drive is storing natively is much more efficient, when possible.

What Hasher does not know is that while Drive does advertise support for these hash types, some files have them and some do not -- because of the pre-2018 thing that you cited. Hasher probably should be changed to check if the returned hash is blank (and this would be a fairly easy change to make) -- but currently it does not.
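To illustrate the kind of check I mean, here is a toy Go sketch (not the actual Hasher code -- every name in it is made up):

package main

import (
    "errors"
    "fmt"
)

// lookupHash sketches the fallback idea: prefer the hash the wrapped backend
// (Drive) reports, but if the backend advertises the hash type yet returns an
// empty sum for this particular object, fall back to the value stored in the
// Hasher database instead of reporting nothing.
func lookupHash(backendSum string, cached map[string]string, name string) (string, error) {
    if backendSum != "" {
        return backendSum, nil // Drive had it natively -- the cheap path
    }
    if sum, ok := cached[name]; ok {
        return sum, nil // blank from Drive (e.g. pre-2018 file): use the stored hash
    }
    return "", errors.New("no hash available")
}

func main() {
    cached := map[string]string{"_2.jpeg": "8391e852a8da44f5b8df99149314dd95a3cb36d4"}
    sum, err := lookupHash("", cached, "_2.jpeg")
    fmt.Println(sum, err) // prints the stored SHA1 and a nil error
}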

So as a workaround, what I'd suggest instead is to do a one-time rclone hashsum sha256 --download to force hasher to fill in all the blanks. You should probably only need to do this once, since you are using max_age = off and all future files should have them natively.

I would try to narrow down the total number of files to be processed.
I do not use GDrive much, so there might be a better way, but perhaps:

--drive-use-created-date --min-age=2018-01-01
or
maybe using metadata such as btime?
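
Putting the two suggestions together, the one-off command could look something like this (untested, just an illustration -- adjust the remote and filters to taste):

$ rclone hashsum sha256 --download --drive-use-created-date --min-age 2018-01-01 'gdhash:'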


Yes, or maybe feed it a --files-from list of only the files that have blank hashes.

@nielash, I have already pre-hashed the file, and both SHA1 and SHA256 are stored in the backend. See the last command I posted above.

$ rclone backend dump 'gdhash:' | grep _2.jpeg
ok  md5:03c6c126c60cc85620e77757ea538550 sha1:8391e852a8da44f5b8df99149314dd95a3cb36d4 whirlpool:1990038b3b8c61ff8acd4c7fa1073562dc7fd03760e0040b5c7f94a2f6b2da1730e542b1a2e3d7bcb6288efa0c898853f40083d8c90163334b5d62e6580d8290 sha256:963df079beaefcddbb94aa43d65b1eea51d21e2c045d3c1b3921cf6097351018      108h Google Photos/2013/_2.jpeg

I think Hasher's algorithm needs to change: if the returned hash is blank, return the hash stored in the backend instead.

Ahh -- you are right. Sorry I missed that before.

I think this small change would fix it:

Any chance you are able to test it out?

Hi @nielash ,

How can I test it out?
How do I download and install your fix to test it on my machine?

You could clone my branch and then install from source. This requires having Go installed. If you don't want to do that, I could make a binary and send a download link.

Hi @nielash ,

Yes, this fixes the issue. It now returns the hash from the backend when the hash is missing from the server.

~/rclone.fix/rclone$ ./rclone --version
rclone v1.66.0-DEV
- os/version: ubuntu 23.10 (64 bit)
- os/kernel: 6.5.0-1015-oracle (aarch64)
- os/type: linux
- os/arch: arm64 (ARMv8 compatible)
- go/version: go1.21.1
- go/linking: dynamic
- go/tags: none
~/rclone.fix/rclone$ ./rclone lsjson --hash 'gdhash:/Google Photos/2013/_2.jpeg'
[
{"Path":"_2.jpeg","Name":"_2.jpeg","Size":165206,"MimeType":"image/jpeg","ModTime":"2014-07-07T00:06:00.000Z","IsDir":false,"Hashes":{"md5":"03c6c126c60cc85620e77757ea538550","sha1":"8391e852a8da44f5b8df99149314dd95a3cb36d4","sha256":"963df079beaefcddbb94aa43d65b1eea51d21e2c045d3c1b3921cf6097351018","whirlpool":"1990038b3b8c61ff8acd4c7fa1073562dc7fd03760e0040b5c7f94a2f6b2da1730e542b1a2e3d7bcb6288efa0c898853f40083d8c90163334b5d62e6580d8290"},"ID":"1_wNymqIvOgZL2UT-dqdsLvubqmMwsvtROg"}
]
2024/02/22 12:22:40 Failed to lsjson: database stopped

Separate question, do you have any idea about the last line?

2024/02/22 12:22:40 Failed to lsjson: database stopped

No matter what command I run, Hasher returns:

Failed to <<command>>: database stopped

Glad to hear that 🙂

This looks like a bug. db.Stop is getting called again after the db is already stopped. I think this should fix it:
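Roughly, the idea is just to make Stop safe to call a second time. A toy sketch of that idea (not the actual lib/kv patch):

package main

import (
    "fmt"
    "sync"
)

// db stands in for Hasher's key-value store. Stop becomes a no-op when the
// db is already stopped, instead of failing with "database stopped".
type db struct {
    mu      sync.Mutex
    stopped bool
}

func (d *db) Stop() error {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.stopped {
        return nil // already stopped: ignore the repeated call
    }
    d.stopped = true
    // ... close the underlying bolt file here ...
    return nil
}

func main() {
    d := &db{}
    fmt.Println(d.Stop(), d.Stop()) // the second call no longer returns an error
}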

Thanks.
Yes, confirming that this fixes the issue.
No more failed messages.


@nielash, I'm also observing another issue with Hasher. Instead of creating a new topic, I'll just add it here.

I have a very large file of 20 GB. Hasher is not storing the hash after calculating the whirlpool hash. This happens when the file is large: I have a few files larger than 6 GB, and Hasher failed to save the hash for all of them.

Before calculating the hash

$ rclone lsjson --hash 'gdhash:/Documents/MP4.mp4' -vv
2024/02/23 09:13:55 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "lsjson" "--hash" "gdhash:/Documents/MP4.mp4" "-vv"]
2024/02/23 09:13:55 DEBUG : Creating backend with remote "gdhash:/Documents/MP4.mp4"
2024/02/23 09:13:55 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/23 09:13:55 INFO  : Hasher is EXPERIMENTAL!
2024/02/23 09:13:55 DEBUG : Creating backend with remote "gdrive:/Documents/MP4.mp4"
2024/02/23 09:13:56 DEBUG : fs cache: adding new entry for parent of "gdrive:/Documents/MP4.mp4", "gdrive:Documents"
2024/02/23 09:13:56 DEBUG : hasher::gdhash:/Documents: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/23 09:13:56 DEBUG : gdrive~hasher.bolt: Opened for reading in 75.84µs
2024/02/23 09:13:56 DEBUG : fs cache: adding new entry for parent of "gdhash:/Documents/MP4.mp4", "gdhash:/Documents"
[
2024/02/23 09:13:57 DEBUG : gdrive~hasher.bolt: released
2024/02/23 09:13:57 DEBUG : MP4.mp4: pass md5
2024/02/23 09:13:57 DEBUG : MP4.mp4: pass sha1
2024/02/23 09:13:57 DEBUG : gdrive~hasher.bolt: Opened for reading in 69.52µs
2024/02/23 09:13:57 DEBUG : MP4.mp4: getHash: no record
2024/02/23 09:13:57 DEBUG : MP4.mp4: pass sha256
{"Path":"MP4.mp4","Name":"MP4.mp4","Size":20245158941,"MimeType":"video/mp4","ModTime":"2024-01-02T12:03:47.010Z","IsDir":false,"Hashes":{"md5":"f8d805ee9e098960e2e751690de95eb7","sha1":"2284f3217858916080708526fef12d8111a9a313","sha256":"88391535c6fc6f58f174516eb088a37c535c6b921a7524411f061918603a72de"},"ID":"1FCmCS5QjMFmH0Ck2BiAipOpDQl9VaMge"}
]
2024/02/23 09:13:57 DEBUG : 7 go routines active
2024/02/23 09:13:57 DEBUG : gdrive~hasher.bolt: released
2024/02/23 09:13:57 Failed to lsjson: database stopped

Calculating the hash

$ rclone hashsum whirlpool --download 'gdhash:/Documents/MP4.mp4' -vv
2024/02/23 09:18:14 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "hashsum" "whirlpool" "--download" "gdhash:/Documents/MP4.mp4" "-vv"]
2024/02/23 09:18:14 DEBUG : Creating backend with remote "gdhash:/Documents/MP4.mp4"
2024/02/23 09:18:14 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/23 09:18:14 INFO  : Hasher is EXPERIMENTAL!
2024/02/23 09:18:14 DEBUG : Creating backend with remote "gdrive:/Documents/MP4.mp4"
2024/02/23 09:18:16 DEBUG : fs cache: adding new entry for parent of "gdrive:/Documents/MP4.mp4", "gdrive:Documents"
2024/02/23 09:18:16 DEBUG : hasher::gdhash:/Documents: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/23 09:18:16 DEBUG : gdrive~hasher.bolt: Opened for reading in 82.8µs
2024/02/23 09:18:16 DEBUG : fs cache: adding new entry for parent of "gdhash:/Documents/MP4.mp4", "gdhash:/Documents"
2024/02/23 09:18:16 DEBUG : gdrive~hasher.bolt: released
2024/02/23 09:30:41 INFO  : MP4.mp4: auto hashing error: database stopped
6be80181bd5ce2214e3b176b18836d0aa0fe2f062aff4568dddbfa770e547222b62a5afdea49c234b0f9d040763119b3dc7b279418f927bd785689da1eb17a13  MP4.mp4
2024/02/23 09:30:41 DEBUG : 7 go routines active
2024/02/23 09:30:41 Failed to hashsum: database stopped

After calculating the hash

$ rclone lsjson --hash 'gdhash:/Documents/MP4.mp4' -vv
2024/02/23 09:33:50 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "lsjson" "--hash" "gdhash:/Documents/MP4.mp4" "-vv"]
2024/02/23 09:33:50 DEBUG : Creating backend with remote "gdhash:/Documents/MP4.mp4"
2024/02/23 09:33:50 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/23 09:33:50 INFO  : Hasher is EXPERIMENTAL!
2024/02/23 09:33:50 DEBUG : Creating backend with remote "gdrive:/Documents/MP4.mp4"
2024/02/23 09:33:51 DEBUG : fs cache: adding new entry for parent of "gdrive:/Documents/MP4.mp4", "gdrive:Documents"
2024/02/23 09:33:51 DEBUG : hasher::gdhash:/Documents: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/23 09:33:51 DEBUG : gdrive~hasher.bolt: Opened for reading in 83.32µs
2024/02/23 09:33:51 DEBUG : fs cache: adding new entry for parent of "gdhash:/Documents/MP4.mp4", "gdhash:/Documents"
[
2024/02/23 09:33:51 DEBUG : gdrive~hasher.bolt: released
2024/02/23 09:33:52 DEBUG : MP4.mp4: pass md5
2024/02/23 09:33:52 DEBUG : MP4.mp4: pass sha1
2024/02/23 09:33:52 DEBUG : gdrive~hasher.bolt: Opened for reading in 67.32µs
2024/02/23 09:33:52 DEBUG : MP4.mp4: getHash: no record
2024/02/23 09:33:52 DEBUG : MP4.mp4: pass sha256
{"Path":"MP4.mp4","Name":"MP4.mp4","Size":20245158941,"MimeType":"video/mp4","ModTime":"2024-01-02T12:03:47.010Z","IsDir":false,"Hashes":{"md5":"f8d805ee9e098960e2e751690de95eb7","sha1":"2284f3217858916080708526fef12d8111a9a313","sha256":"88391535c6fc6f58f174516eb088a37c535c6b921a7524411f061918603a72de"},"ID":"1FCmCS5QjMFmH0Ck2BiAipOpDQl9VaMge"}
]
2024/02/23 09:33:52 DEBUG : 7 go routines active
2024/02/23 09:33:52 DEBUG : gdrive~hasher.bolt: released
2024/02/23 09:33:52 Failed to lsjson: database stopped

rclone backend dump also shows that no hash is saved:

$ rclone backend dump gdhash: -vv | grep MP4.mp4
2024/02/23 09:35:59 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "backend" "dump" "gdhash:" "-vv"]
2024/02/23 09:35:59 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/23 09:35:59 INFO  : Hasher is EXPERIMENTAL!
2024/02/23 09:35:59 DEBUG : Creating backend with remote "gdrive:"
2024/02/23 09:35:59 DEBUG : hasher::gdhash:: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/23 09:35:59 DEBUG : gdrive~hasher.bolt: Opened for reading in 60.56µs
2024/02/23 09:36:00 INFO  : /home/ubuntu/.cache/rclone/kv/gdrive~hasher.bolt: 19094 records
2024/02/23 09:36:00 DEBUG : 5 go routines active

Hmm. Looks like the error is on this line:

2024/02/23 09:30:41 INFO  : MP4.mp4: auto hashing error: database stopped
6be80181bd5ce2214e3b176b18836d0aa0fe2f062aff4568dddbfa770e547222b62a5afdea49c234b0f9d040763119b3dc7b279418f927bd785689da1eb17a13  MP4.mp4

I'll see if I can reproduce it to figure out why it's happening.

Does it only happen on large files? It would be helpful to know what your config settings are for auto_size and max_age.

I did not set any value for auto_size; it should be disabled by default.
max_age is off.

Yes. So far from my testing, this scenario only happens with large files of 6 GB or more.

[gdhash]
type = hasher
remote = gdrive:
hashes = md5,sha1,sha256,whirlpool
max_age = off

[gdrive]
type = drive
client_id = XXX
client_secret = XXX
scope = drive
root_folder_id = XXX
token = XXX
team_drive =

OK, I have a theory. Not totally sure yet if it's right.

Can you try with:

--fs-cache-expire-duration 0 

And let me know if that changes the behavior?

Yes!!!
This solved the problem.

For the record, here are the debug lines.

$ rclone hashsum whirlpool --download 'gdhash:/Documents/MP4.mp4' -vv --fs-cache-expire-duration 0
2024/02/23 23:20:50 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "hashsum" "whirlpool" "--download" "gdhash:/Documents/MP4.mp4" "-vv" "--fs-cache-expire-duration" "0"]
2024/02/23 23:20:50 DEBUG : Creating backend with remote "gdhash:/Documents/MP4.mp4"
2024/02/23 23:20:50 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/23 23:20:50 INFO  : Hasher is EXPERIMENTAL!
2024/02/23 23:20:50 DEBUG : Creating backend with remote "gdrive:/Documents/MP4.mp4"
2024/02/23 23:20:51 DEBUG : fs cache: adding new entry for parent of "gdrive:/Documents/MP4.mp4", "gdrive:Documents"
2024/02/23 23:20:51 DEBUG : hasher::gdhash:/Documents: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/23 23:20:51 DEBUG : gdrive~hasher.bolt: Opened for reading in 82.68µs
2024/02/23 23:20:51 DEBUG : fs cache: adding new entry for parent of "gdhash:/Documents/MP4.mp4", "gdhash:/Documents"
2024/02/23 23:20:51 DEBUG : gdrive~hasher.bolt: released
2024/02/23 23:34:20 DEBUG : gdrive~hasher.bolt: Opened for writing in 473.56µs
6be80181bd5ce2214e3b176b18836d0aa0fe2f062aff4568dddbfa770e547222b62a5afdea49c234b0f9d040763119b3dc7b279418f927bd785689da1eb17a13  MP4.mp4
2024/02/23 23:34:20 DEBUG : 8 go routines active
$ rclone lsjson --hash 'gdhash:/Documents/MP4.mp4' -vv
2024/02/23 23:35:00 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "lsjson" "--hash" "gdhash:/Documents/MP4.mp4" "-vv"]
2024/02/23 23:35:00 DEBUG : Creating backend with remote "gdhash:/Documents/MP4.mp4"
2024/02/23 23:35:00 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/23 23:35:00 INFO  : Hasher is EXPERIMENTAL!
2024/02/23 23:35:00 DEBUG : Creating backend with remote "gdrive:/Documents/MP4.mp4"
2024/02/23 23:35:02 DEBUG : fs cache: adding new entry for parent of "gdrive:/Documents/MP4.mp4", "gdrive:Documents"
2024/02/23 23:35:02 DEBUG : hasher::gdhash:/Documents: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/23 23:35:02 DEBUG : gdrive~hasher.bolt: Opened for reading in 107.36µs
2024/02/23 23:35:02 DEBUG : fs cache: adding new entry for parent of "gdhash:/Documents/MP4.mp4", "gdhash:/Documents"
[
2024/02/23 23:35:02 DEBUG : gdrive~hasher.bolt: released
2024/02/23 23:35:03 DEBUG : MP4.mp4: pass md5
2024/02/23 23:35:03 DEBUG : MP4.mp4: pass sha1
2024/02/23 23:35:03 DEBUG : gdrive~hasher.bolt: Opened for reading in 75.76µs
2024/02/23 23:35:03 DEBUG : MP4.mp4: cached whirlpool = "6be80181bd5ce2214e3b176b18836d0aa0fe2f062aff4568dddbfa770e547222b62a5afdea49c234b0f9d040763119b3dc7b279418f927bd785689da1eb17a13"
2024/02/23 23:35:03 DEBUG : MP4.mp4: pass sha256
{"Path":"MP4.mp4","Name":"MP4.mp4","Size":20245158941,"MimeType":"video/mp4","ModTime":"2024-01-02T12:03:47.010Z","IsDir":false,"Hashes":{"md5":"f8d805ee9e098960e2e751690de95eb7","sha1":"2284f3217858916080708526fef12d8111a9a313","sha256":"88391535c6fc6f58f174516eb088a37c535c6b921a7524411f061918603a72de","whirlpool":"6be80181bd5ce2214e3b176b18836d0aa0fe2f062aff4568dddbfa770e547222b62a5afdea49c234b0f9d040763119b3dc7b279418f927bd785689da1eb17a13"},"ID":"1FCmCS5QjMFmH0Ck2BiAipOpDQl9VaMge"}
]
2024/02/23 23:35:03 DEBUG : 7 go routines active
2024/02/23 23:35:03 DEBUG : gdrive~hasher.bolt: released
2024/02/23 23:35:03 Failed to lsjson: database stopped
ubuntu@instance-20240104-1254:~$ rclone backend dump gdhash: -vv | grep MP4.mp4
2024/02/23 23:35:22 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "backend" "dump" "gdhash:" "-vv"]
2024/02/23 23:35:22 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2024/02/23 23:35:22 INFO  : Hasher is EXPERIMENTAL!
2024/02/23 23:35:22 DEBUG : Creating backend with remote "gdrive:"
2024/02/23 23:35:22 DEBUG : hasher::gdhash:: Groups by usage: cached [md5, sha1, whirlpool, sha256], passed [md5, sha1, sha256], auto [md5, sha1, whirlpool, sha256], slow [], supported [md5, sha1, whirlpool, sha256]
2024/02/23 23:35:22 DEBUG : gdrive~hasher.bolt: Opened for reading in 57.4µs
2024/02/23 23:35:22 INFO  : /home/ubuntu/.cache/rclone/kv/gdrive~hasher.bolt: 19095 records
2024/02/23 23:35:22 DEBUG : 5 go routines active
ok  md5:f8d805ee9e098960e2e751690de95eb7 sha1:2284f3217858916080708526fef12d8111a9a313 whirlpool:6be80181bd5ce2214e3b176b18836d0aa0fe2f062aff4568dddbfa770e547222b62a5afdea49c234b0f9d040763119b3dc7b279418f927bd785689da1eb17a13 sha256:88391535c6fc6f58f174516eb088a37c535c6b921a7524411f061918603a72de      1m2s Documents/MP4.mp4

Thanks for confirming! I think this is probably not the proper long-term fix, but it at least confirms my theory about what is happening. To elaborate a bit:

This change added a finalizer function that calls Hasher's Shutdown method upon cache expiry. This Shutdown method stops the bolt db which Hasher uses to store the hashes.

The default --fs-cache-expire-duration is 5 minutes, and something like hashing a 20 GB file can easily take longer than that (it looks like it was about 14 minutes according to your logs). So 5 minutes into this 14-minute operation, Shutdown gets called in another goroutine, and when the hashing completes at minute 14, there's no db left to store the result in.

This is also why it only affects large files and not smaller files which could be hashed in under 5 minutes.

I think in order to properly fix this I need to better understand how cache and Shutdown are supposed to work together. Is the problem that cache is calling Shutdown too early? Or is the problem that Hasher is taking that Shutdown instruction too literally? What is supposed to happen when a backend task takes longer than the --fs-cache-expire-duration? (@ncw in case you have guidance on this)

What is happening here is that the backend is being dropped from the backend cache. At that point the Shutdown method is called.

The proper fix for this is to Pin the backend while it is in use. If you examine the other wrapping backends (eg union, combine) for Pin calls you'll see what I mean hopefully!

So to fix, the hasher backend needs to pin the backend it is wrapping in the cache.

Not at my desk right now so can't check the code but hopefully that helps!


Very helpful, thank you! That was the piece I was missing. I think I know how to fix it now.
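
To sketch the idea for anyone following along, here is a toy illustration of pin-aware expiry (my own example, not rclone's actual fs/cache code): entries with a non-zero pin count survive expiry, so a backend that is mid-operation does not get Shutdown called underneath it.

package main

import (
    "fmt"
    "sync"
)

// fsCache is a toy stand-in for the backend cache: the value is a pin count.
type fsCache struct {
    mu   sync.Mutex
    pins map[string]int
}

func (c *fsCache) pin(name string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.pins[name]++
}

func (c *fsCache) unpin(name string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.pins[name]--
}

// expire drops (and would call Shutdown on) every entry that is not pinned.
func (c *fsCache) expire() {
    c.mu.Lock()
    defer c.mu.Unlock()
    for name, pins := range c.pins {
        if pins <= 0 {
            fmt.Println("expiring and shutting down", name)
            delete(c.pins, name)
        }
    }
}

func main() {
    c := &fsCache{pins: map[string]int{"hasher": 0}}
    c.pin("hasher")   // pinned while a long-running hash is in progress
    c.expire()        // prints nothing: the pinned entry survives
    c.unpin("hasher") // hash finished and stored, safe to expire now
    c.expire()        // now prints "expiring and shutting down hasher"
}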


Just an update: I have opened a PR with what I think is the more proper fix for the underlying issue:

This turned out to be a much deeper rabbit hole than I anticipated!
