"check" command, "--download"/"checksum" flags and differences regarding google documents conversions

Greetings, everyone.

I would like to (cordially) request some information/help.

As expected, by default, rclone converts Google documents when they are downloaded by, e.g., using the copy command, and adds the corresponding extension to the files' names (docx, pptx...) [https://rclone.org/drive/#import-export-of-google-documents].

I use the following command to copy my whole Google Drive to my local device: rclone copy gdrive: ~/A-Drive --progress --exclude "Google Photos/**" (sometimes using the --checksum flag).

As someone who wants to be really sure about things, I like to count on the check command and its logging-related flags (specifically, --combined, --differ, --error, --missing-on-dst, --missing-on-src) [https://rclone.org/commands/rclone_check/].

NOTE: I omitted these flags from the commands below for the sake of keeping things organized.

I tried 3 "variations" of the check command, using: 1- the --download flag; 2- the --checksum flag (which is redundant in that case, I assume); and 3- no flags.

I have also generated logs with --log-file and --log-level INFO.

NOTE: the files generated by the flags --error, --missing-on-dst, --missing-on-src contained zero entries.

=======================================

Using the --download flag (rclone check gdrive: ~/A-Drive --download --progress --exclude "Google Photos/**") led to the following results:

2021/01/21 18:53:23 NOTICE: Local file system at /home/myuser/A-Drive: 153 differences found
2021/01/21 18:53:23 NOTICE: Local file system at /home/myuser/A-Drive: 153 errors while checking
2021/01/21 18:53:23 NOTICE: Local file system at /home/myuser/A-Drive: 4740 matching files
2021/01/21 18:53:23 INFO  : 
Transferred:   	   28.781G / 28.781 GBytes, 100%, 6.014 MBytes/s, ETA 0s
Errors:               154 (retrying may help)
Checks:              4893 / 4893, 100%
Transferred:         9786 / 9786, 100%
Elapsed time:    1h22m0.3s

2021/01/21 18:53:23 Failed to check with 154 errors: last error was: 153 differences found

The file generated by the --differ flag contained 153 file names.

So, 154 errors were found, and 153 of them were related to differences, which makes me wonder what the missing error ("number 154", so to speak) is.

Also, even though they were referred to as "errors", the files generated by the flag --error had zero entries, as stated above. Maybe I didn't properly understand the "terminology".

Searching for "*" followed by a space (which indicates different files) in the file generated by the --combined flag fittingly yielded 153 results.
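The manual search can also be automated. The following is a minimal sketch, assuming the --combined report was written to a file (combined.txt in the usage line is a placeholder) and that every line starts with one of the symbols documented for rclone check, followed by a space and the path. The function name tally_combined is made up for this example.

```sh
# tally_combined FILE
# Count the lines of an "rclone check --combined" report by leading symbol.
# Symbols as documented for rclone check:
#   =  identical in source and destination
#   *  present on both sides but different
#   !  error reading or hashing
#   +  missing on the destination (only in the source)
#   -  missing on the source (only in the destination)
tally_combined() {
    awk 'length($0) > 0 { count[substr($0, 1, 1)]++ }
         END { for (s in count) print s, count[s] }' "$1" | sort
}
```

Running tally_combined combined.txt prints one line per symbol with its count, so the number next to "*" can be compared directly with the "differences found" line in the log.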

=======================================

Using the --checksum flag (rclone check gdrive: ~/A-Drive --checksum --progress --exclude "Google Photos/**") led to the following results:

2021/01/21 21:03:16 NOTICE: Local file system at /home/myuser/A-Drive: 0 differences found
2021/01/21 21:03:16 NOTICE: Local file system at /home/myuser/A-Drive: 153 hashes could not be checked
2021/01/21 21:03:16 NOTICE: Local file system at /home/myuser/A-Drive: 4893 matching files
2021/01/21 21:03:16 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:              4893 / 4893, 100%
Elapsed time:      8m47.9s

So, unlike the previous command, no "errors" were found, but 153 hashes could not be checked.

Also, no entries were found in the file generated by the flag --differ.

Searching for "*" in the --combined file, as described above, yielded no results.

However, searching for "=" led to 4893 results, which corresponds to the total amount of files checked.

As such, even though 153 hashes weren't checked, all files were reported as matching.

The same results were observed without the --checksum flag (rclone check gdrive: ~/A-Drive --progress --exclude "Google Photos/**").

NOTE2: only the command containing the --download flag actually provided a list with the names of the "different" (--differ) files, which amounted to 153. The other variations only mentioned that 153 hashes weren't checked. Considering that the amount is the same (153), I assume that the "different" files are the same ones whose hashes could not be checked.

=======================================

All of those different files were in one of the following formats after copying them to my local drive: docx, xlsx, pptx.
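This can be double-checked without opening each entry by hand. A minimal sketch, assuming the --differ report holds one path per line (differ.txt in the usage line is a placeholder); the function name non_export_paths is invented for this example.

```sh
# non_export_paths FILE
# Print every path in FILE that does NOT end in .docx, .xlsx or .pptx
# (the comparison is case-insensitive).
# Prints nothing when all paths carry one of these extensions.
non_export_paths() {
    awk 'tolower($0) !~ /\.(docx|xlsx|pptx)$/' "$1"
}
```

If non_export_paths differ.txt prints nothing, every differing file has an Office extension. That alone does not prove the files are converted Google documents, because a .docx uploaded as a regular file has the same extension.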

Considering I am not aware of an efficient ("automated") way of comparing all of them, I did some manual work.

Suitably, the files I compared were all Google documents (in the cloud) that got converted when copied to the local drive (as expected).

Just to be sure, I manually downloaded some files directly from my Google Drive (which got converted to the same aforementioned formats) and compared their hashes with the ones of the corresponding files downloaded and converted by rclone.

For some reason, none of the hashes matched. In other words, a document copied to my local drive with rclone had different hashes than the same document downloaded manually from my drive.

I inspected the log generated by using the -vv flag along with the check command (without --download), compared some files and noticed that:
a) 153 hashes could not be checked;
b) the files whose hashes could not be checked match the ones listed as "different" when using the --download flag (at least the ones I manually compared).

=======================================

The questions:

1- how can I ensure (aside from manually checking) that those 153 files whose hashes could not be checked really are the converted Google docs documents?

2- when the --download flag is provided to the check command, how are the files compared (e.g. MD5 hashes calculated on the go and compared)?

3- while using the flag --download, rclone warned about "153 errors while checking", but the corresponding --error file contained zero entries. Is this behavior expected?

4- still about the --download flag: the message displayed after completing the operation ("Failed to check with 154 errors: last error was: 153 differences found") states that 154 errors were found, but only specifies 153 of those. What could the missing error be? Or is the existence of errors considered "one" error by itself?

5- the results of the check differ according to the use of the --download flag: using it leads to errors being mentioned and to files being listed as different in the corresponding logs; not using it just leads to hashes that "could not be checked", but files are still reported as matching. Is this behavior expected?

6- would setting rclone to get the links to the Google documents (instead of downloading and converting them) avoid this kind of error in the future?

=======================================

Sorry for the long post. I wanted to be as specific as possible. I hope the provided information is enough.

Thanks a lot for your help and attention.

rclone version:
v1.53.2 (latest version is 1.53.4)
os/arch: linux/amd64
go version: go1.15.3

OS:

Ubuntu 20.04 LTS 64-bit

Storage system

Google Drive + local (HDD)

That is a bug, now fixed in the beta!

An ERROR is when there was a problem reading or downloading the file

These 153 files are almost definitely Google docs.

They don't have hashes, and really, really annoyingly they can be different lengths and have different contents when you download them multiple times.

The files matched as well as could be matched. Not having a hash isn't an error. Note that with Google docs:

  • they don't have a size (the size is indeterminate)
  • they don't have a hash

So all rclone check is doing here is checking the file exists :frowning:

The only files without hashes on google drive are google docs so I think we can be 100% sure that they are the docs.

The file is downloaded (streamed into memory) and the MD5SUM of the actual data is computed.
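Roughly the same comparison can be reproduced by hand for a single file. This is only an outside approximation of what --download does, not rclone's internal code: it streams the remote file with rclone cat, hashes the stream with md5sum, and compares the result with the hash of the local copy. The function names and the paths in the usage line are made up for illustration.

```sh
# remote_md5 REMOTE:PATH
# Print the MD5 of the data streamed from the remote (nothing is saved to disk).
remote_md5() {
    rclone cat "$1" | md5sum | cut -d' ' -f1
}

# local_md5 FILE
# Print the MD5 of a local file.
local_md5() {
    md5sum < "$1" | cut -d' ' -f1
}

# same_content REMOTE:PATH FILE
# Exit status 0 only if both MD5 sums are equal.
# Caveat: if "rclone cat" fails, the hash of an empty stream is compared,
# so a failed download shows up as a mismatch rather than as an error.
same_content() {
    [ "$(remote_md5 "$1")" = "$(local_md5 "$2")" ]
}
```

Usage would look like same_content "gdrive:potato.docx" ~/A-Drive/potato.docx && echo match. For an exported Google document a mismatch is the likely result, since the export can differ from one download to the next.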

Yes. The --error flag is defined like this:

   --error string            Report all files with errors (hashing or reading) to this file

The files read fine and created hashes that were wrong. This is reported as a top-level error but not logged in the --error file.

Yes, you are right, the existence of errors is considered an error :slight_smile: This is now fixed!

I think I've explained that above. The root causes are

  • google docs don't have MD5SUMS
  • the contents differ when you download them multiple times

I don't think rclone has a setting to do that.

So, backing up google docs is a bit of a sorry state of affairs. The best we can do is assume that if the download completed successfully then the doc is OK. This is effectively what rclone check will do without --download. If you use --download then you'll get the checksum errors you see as the files differ each time you download them.

Greetings, sir Craig-Wood

Thanks for (thoroughly) answering my question so fast, and sorry for taking so long to reply.

I will probably delete those documents from the local storage, since I don't need to have them available offline.

Also, I want to avoid those "errors", so I can always be sure that everything went right ><

Now, about this:

I don't think rclone has a setting to do that.

I may be wrong, but I think it actually has.

Take a look at the end of the "Import/Export of google documents" section in https://rclone.org/drive/#import-export-of-google-documents. It states the following (and, afterwards, lists the supported formats):

Google documents can also be exported as link files. These files will open a browser window for the Google Docs website of that document when opened. The link file extension has to be specified as a --drive-export-formats parameter. They will match all available Google Documents.

If I recall correctly, this was the behavior of Google's Backup/Sync in Windows: the docs would show up like regular files, but "opening" them would open a browser window. So, they were pretty much links to the cloud file.

Not sure how the checking would proceed in this case, though. I will try it later.

Again, thank you so much for the clarification.

Oh, yeah! I'd forgotten about that!

No, me neither!

Hey there.

A quick follow-up about the results, in case someone eventually needs information about it. I will keep it short.

I used the --drive-export-formats argument with the parameter link.html, described in the documentation as "An HTML Document with a redirect", compatible with all OS.

Using the check command (regardless of providing the --download argument) led to two errors for each Google document:

  • One for each ".link.html" file (created by Rclone), since it exists only in the local storage and, as such, is reported as missing from Google Drive;
  • One for each Google document file, because it exists only in Google Drive, while the corresponding local file, as stated above, is just a link.

So, for example, a Google Docs document named "potato" would be exported from the cloud as "potato.link.html". The results of check --download would report that "potato" is missing from the local storage, and that "potato.link.html" is missing from the cloud.

I am not a programmer myself, so I am sorry if this suggestion is infeasible or just plain stupid, but maybe an argument could be provided to ignore those mismatches related to "link" files.
Also, rclone could just check if the files' names (maybe location, too) match when dealing with docs exported as links.

Of course, this suggestion/idea does not apply to Google Documents exported as actual files, such as ".docx".

Anyways, thanks again for your help and attention (and, of course, for developing rclone).

I think you want to use --drive-export-formats on the rclone check command too; then it should work...

$ rclone check --drive-export-formats link.html drive:GDocs /tmp/GDocs
2021/02/09 10:16:08 NOTICE: Local file system at /tmp/GDocs: 0 differences found
2021/02/09 10:16:08 NOTICE: Local file system at /tmp/GDocs: 20 hashes could not be checked
2021/02/09 10:16:08 NOTICE: Local file system at /tmp/GDocs: 26 matching files

Note the 20 hashes could not be checked - that is the google docs.
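To keep the copy and check commands from drifting apart, the shared options can live in one place. A minimal sketch, assuming the gdrive: remote and the local path used earlier in this thread; the function and variable names are invented, and only flags already shown in this thread are used.

```sh
# Options that must be identical for copy and check; otherwise the file
# names produced by the export (e.g. "potato.link.html") will not line up
# between the two runs.
SHARED_OPTS='--drive-export-formats link.html'

# backup_and_check SRC DST
# Copy SRC to DST, then verify the result with the same export options.
# The check only runs if the copy succeeded.
backup_and_check() {
    # $SHARED_OPTS is left unquoted on purpose so it splits into words.
    rclone copy "$1" "$2" $SHARED_OPTS &&
    rclone check "$1" "$2" $SHARED_OPTS
}
```

It would be called as backup_and_check gdrive: ~/A-Drive. Options containing spaces or wildcards, such as the --exclude pattern from the first post, would need to be added to both commands explicitly, since the unquoted variable is split on whitespace.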
