Summary of Problems
- Rclone does not resume (or re-download) a previously interrupted or incomplete file download from Google Photos.
- There seems to be no way to detect incomplete or corrupt files downloaded from Google Photos.
rclone version
rclone v1.54.0
- os/arch: linux/amd64
- go version: go1.15.7
OS
Linux 64 bit (Ubuntu 20.04)
Cloud storage system
Google Photos
Background
I have been running rclone sync over the last few weeks to back up ~500k photos locally (~1+ TB down so far). This has been very fun (note: sarcasm) due to Google Photos' daily API limits and due to Google starting to return 404s after ~1 hour of run time (even if the API limit has not been reached).
I have been running rclone sync on sub-directories like by-month/2020/2020-01 so I don't run out of API quota. To address the 404 errors, I have been using the Linux command-line tool timeout to kill rclone sync after ~1 hour (3602 seconds, to be exact).
I settled on timeout after I had already tried the rclone flag --max-duration=1h. I found that the Google API gives intermittent errors on different API calls, so rclone never exited (even though no new transfers were scheduled after 1 hour) and kept eating up my daily API quota.
I have rclone sync set up to run via cron a few times a day so I don't have to watch it constantly. The cron runs are spaced evenly throughout the day so that Google doesn't give me 404s for being too aggressive with the API.
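For illustration, the setup above can be sketched as a small wrapper script. The remote layout (gphotos:media/by-month/...) and local root (/data/photos) are assumptions for the sketch, not my exact paths:

```shell
#!/bin/sh
# Sketch of the cron wrapper described above: each run syncs one month
# directory under a hard timeout so a hung rclone can't burn API quota.
# Remote layout and local root below are illustrative assumptions.

# Build the timeout-wrapped sync command for one YYYY-MM directory.
sync_cmd() {
    printf '%s' "timeout -v --preserve-status -k 1m 3602s \
rclone sync --transfers=10 --fast-list --gphotos-include-archived \
gphotos:media/by-month/2020/$1 /data/photos/2020/$1"
}

for month in 2020-01 2020-02 2020-03; do
    echo "$(sync_cmd "$month")"   # swap echo for eval to actually run it
done
```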
Problem
1. I discovered a corrupted image.
This was a jpg file in the feature/favorites folder that, when opened in an image viewer, was only partially viewable - most of the image was grayed out.
Here is the corrupted file info.
-rw-rw-r-- 1 rclone rclone 126976 Feb 13 04:59 IMG_20190713_073837-COLLAGE.jpg
There are at least two plausible reasons why this photo was corrupted:
- timeout abruptly killed rclone sync even though Google Photos hadn't started returning 404s yet.
- A network interruption or outage.
There's a very good chance there are many other files that are also corrupted.
2. Running rclone sync did not detect or fix the corruption.
Running: timeout -v --preserve-status -k 1m 3602s rclone sync --transfers=10 --fast-list --gphotos-include-archived --log-level INFO --log-file=REDACTED.txt gphotos:feature feature/
2021/03/07 16:38:48 INFO : There was nothing to transfer
2021/03/07 16:38:48 INFO :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks: 4 / 4, 100%
Elapsed time: 1.7s
3. Renaming the corrupted file and running rclone sync re-downloaded the image:
Running: timeout -v --preserve-status -k 1m 3602s rclone sync --transfers=10 --fast-list --gphotos-include-archived --log-level INFO --log-file=REDACTED.txt gphotos:feature /REDACTED/feature
2021/03/07 16:40:45 INFO : favorites/IMG_20190713_073837-COLLAGE.jpg: Copied (Rcat, new)
2021/03/07 16:40:45 INFO : favorites/IMG_20190713_073837-COLLAGE_corrupted.jpg: Deleted
2021/03/07 16:40:45 INFO :
Transferred: 693.541k / 693.541 kBytes, 100%, 1.400 MBytes/s, ETA 0s
Checks: 4 / 4, 100%
Deleted: 1 (files), 0 (dirs)
Transferred: 2 / 2, 100%
Elapsed time: 1.4s
After this, the photo was correctly viewable and, most importantly, both the file size and the modtime had changed. The modtime now seemed accurate (based on the file name and on checking the Google Photos web interface):
-rw-rw-r-- 1 rclone rclone 710186 Jul 13 2019 IMG_20190713_073837-COLLAGE.jpg
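Once suspect files are identified, the rename trick can be scripted. A sketch (the _corrupted suffix mirrors the example above; it assumes every path has an extension):

```shell
#!/bin/sh
# Move a suspect local file aside so the next `rclone sync` re-downloads a
# fresh copy and then deletes the renamed one, as in step 3 above.
# Assumes the path has an extension (name.ext).

rename_corrupt() {
    f="$1"
    mv -- "$f" "${f%.*}_corrupted.${f##*.}"
}

# Usage: feed it suspect paths (one per line), then re-run the sync:
#   while IFS= read -r p; do rename_corrupt "$p"; done < corrupt-files.txt
#   rclone sync gphotos:feature feature/
```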
4. Experimentation found no quick/easy way to detect corruption.
I restored the corrupt file from backup and ran a few rclone commands, but most of the commands I tried did not detect the corruption!
I knew from the rclone features page that hashes and modtimes are not supported by Google Photos, but I was hoping something special was being done to help detect these types of errors.
$ rclone check --log-level INFO "gphotos:feature" feature/
2021/03/07 16:53:01 NOTICE: Local file system at /REDACTED/feature: 0 differences found
2021/03/07 16:53:01 NOTICE: Local file system at /REDACTED/feature: 4 hashes could not be checked
2021/03/07 16:53:01 NOTICE: Local file system at /REDACTED/feature: 4 matching files
2021/03/07 16:53:01 INFO :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks: 4 / 4, 100%
Elapsed time: 1.1s
$ rclone check --size-only --log-level INFO "gphotos:feature" feature/
2021/03/07 16:55:00 NOTICE: Local file system at /REDACTED/feature: 0 differences found
2021/03/07 16:55:00 NOTICE: Local file system at /REDACTED/feature: 4 matching files
2021/03/07 16:55:00 INFO :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks: 4 / 4, 100%
Elapsed time: 1.2s
$ rclone hashsum MD5 --log-level INFO "gphotos:feature"
UNSUPPORTED favorites/IMG_20191226_174523.jpg
UNSUPPORTED favorites/VID_20200202_124028.mp4
2021/03/07 16:59:05 ERROR : favorites/VID_20200202_124028.mp4: Hash unsupported: hash type not supported
UNSUPPORTED favorites/IMG_20190713_073837-COLLAGE.jpg
UNSUPPORTED favorites/PXL_20210204_090657534.jpg
2021/03/07 16:59:05 ERROR : favorites/PXL_20210204_090657534.jpg: Hash unsupported: hash type not supported
2021/03/07 16:59:05 ERROR : favorites/IMG_20190713_073837-COLLAGE.jpg: Hash unsupported: hash type not supported
2021/03/07 16:59:05 ERROR : favorites/IMG_20191226_174523.jpg: Hash unsupported: hash type not supported
2021/03/07 16:59:05 Failed to hashsum with 8 errors: last error was: Hash unsupported: hash type not supported
5. Using the --download flag did detect the corruption.
Given the size of my photos library, doing this for the entire library is infeasible (and very painful), so I'm now looking for easier or faster ways. Hence this post.
$ rclone check --download --log-level INFO "gphotos:feature" feature/
2021/03/07 22:38:20 NOTICE: Local file system at /REDACTED/feature: 1 differences found
2021/03/07 22:38:20 NOTICE: Local file system at /REDACTED/feature: 1 errors while checking
2021/03/07 22:38:20 NOTICE: Local file system at /REDACTED/feature: 3 matching files
2021/03/07 22:38:20 INFO :
Transferred: 248.010M / 248.010 MBytes, 100%, 8.737 MBytes/s, ETA 0s
Errors: 1 (retrying may help)
Checks: 4 / 4, 100%
Transferred: 8 / 8, 100%
Elapsed time: 29.9s
2021/03/07 22:38:20 Failed to check: 1 differences found
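One way to make the --download check less painful might be to spread it over the daily quota, verifying one month directory per cron slot. A sketch (the command is echoed rather than executed, and the by-month path layout is an assumption):

```shell
#!/bin/sh
# Build a per-month `rclone check --download` command so the expensive
# byte-for-byte verification is amortized over many days of API quota.
# The by-month remote layout and local root are assumptions.

check_cmd() {
    printf '%s' "rclone check --download gphotos:media/by-month/$1 /data/photos/$1"
}

echo "$(check_cmd 2020/2020-01)"   # run via eval (or cron), one month per day
```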
6. I was surprised I could not use --size-only (or modtime).
I'm not familiar with the rclone code base but I looked around and found that ModTime and Size methods seem to be implemented for objects but not for the filesystem.
I looked around the Google Photos API documentation and there doesn't seem to be support for retrieving this info via any official API call.
However, the modtime info is definitely retrieved by rclone from somewhere, since it is correctly set for all my photos. Looking at the source code, it seems to be stored in and retrieved from here.
I don't see the file size being available within the metadata, so I'm not sure how rclone handles this. Is the file streamed until Google Photos signals EOF, and only then is the file size discovered?
I ran rclone check with DEBUG logging to see whether Size or ModTime was retrieved as part of the check, but found that it wasn't:
$ rclone check --size-only --log-level DEBUG --stats-log-level DEBUG "gphotos:feature" feature/
2021/03/07 17:05:43 DEBUG : rclone: Version "v1.54.0" starting with parameters ["rclone" "check" "--size-only" "--log-level" "DEBUG" "--stats-log-level" "DEBUG" "gphotos:feature" "feature/"]
2021/03/07 17:05:43 DEBUG : Using config file from "/home/rclone/.config/rclone/rclone.conf"
2021/03/07 17:05:43 DEBUG : Creating backend with remote "gphotos:feature"
2021/03/07 17:05:43 DEBUG : Creating backend with remote "feature/"
2021/03/07 17:05:43 DEBUG : fs cache: renaming cache item "feature/" to be canonical "/REDACTED/feature"
2021/03/07 17:05:43 DEBUG : Local file system at /REDACTED/feature: Waiting for checks to finish
2021/03/07 17:05:43 DEBUG : Google Photos path "feature": List: dir=""
2021/03/07 17:05:43 DEBUG : Google Photos path "feature": >List: err=<nil>
2021/03/07 17:05:43 DEBUG : Google Photos path "feature": List: dir="favorites"
2021/03/07 17:05:44 DEBUG : Google Photos path "feature": >List: err=<nil>
2021/03/07 17:05:44 DEBUG : favorites/PXL_20210204_090657534.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/PXL_20210204_090657534.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/VID_20200202_124028.mp4: Size:
2021/03/07 17:05:44 DEBUG : favorites/VID_20200202_124028.mp4: >Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20191226_174523.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20191226_174523.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20190713_073837-COLLAGE.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20190713_073837-COLLAGE.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/PXL_20210204_090657534.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/PXL_20210204_090657534.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20191226_174523.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20191226_174523.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20191226_174523.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20191226_174523.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/PXL_20210204_090657534.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20190713_073837-COLLAGE.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/VID_20200202_124028.mp4: Size:
2021/03/07 17:05:44 DEBUG : favorites/VID_20200202_124028.mp4: >Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20190713_073837-COLLAGE.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/PXL_20210204_090657534.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20191226_174523.jpg: OK
2021/03/07 17:05:44 DEBUG : favorites/VID_20200202_124028.mp4: Size:
2021/03/07 17:05:44 DEBUG : favorites/PXL_20210204_090657534.jpg: OK
2021/03/07 17:05:44 DEBUG : favorites/VID_20200202_124028.mp4: >Size:
2021/03/07 17:05:44 DEBUG : favorites/VID_20200202_124028.mp4: OK
2021/03/07 17:05:44 DEBUG : favorites/IMG_20190713_073837-COLLAGE.jpg: Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20190713_073837-COLLAGE.jpg: >Size:
2021/03/07 17:05:44 DEBUG : favorites/IMG_20190713_073837-COLLAGE.jpg: OK
2021/03/07 17:05:44 NOTICE: Local file system at /REDACTED/feature: 0 differences found
2021/03/07 17:05:44 NOTICE: Local file system at /REDACTED/feature: 4 matching files
2021/03/07 17:05:44 DEBUG :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks: 4 / 4, 100%
Elapsed time: 1.1s
2021/03/07 17:05:44 DEBUG : 4 go routines active
What I'm looking for
Given the number of files, there's no way I can manually check that all photos downloaded corruption-free, so I'm trying to figure out what I can do here without having to resort to the --download flag on either rclone check or rclone checksum.
It seems to me that checking based on modtime (instead of a simple file-existence check) should be possible, so I'm looking to understand whether there is a reason this isn't already implemented (what am I missing?). If it could be implemented but just hasn't been for some reason, would it be a simple addition?
I'm also looking for advice or guidance on methods I could employ outside of rclone to help detect all corrupt files so I can force a re-download of them.
I came across the following tool that can check media integrity: https://github.com/ftarlao/check-media-integrity. A quick test showed it detects the file in question above, so I am currently running it on my entire library; it looks like it will take ~2+ days to complete. This should help me find programmatically detectable corrupted files, though I'm not sure it will be 100% fool-proof.
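For JPEGs specifically, a much cheaper (and much weaker) heuristic is possible: a complete JPEG ends with the EOI marker FF D9, so a file truncated mid-download usually won't. This sketch only catches truncation, not corruption in the middle of a file:

```shell
#!/bin/sh
# Returns success if the file's last two bytes are the JPEG EOI marker
# (FF D9). A truncated download will normally fail this test, but a file
# corrupted in the middle can still pass - it is a heuristic only.

is_jpeg_complete() {
    [ "$(tail -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "ffd9" ]
}

# Usage:
#   find feature/ -name '*.jpg' | while IFS= read -r f; do
#       is_jpeg_complete "$f" || echo "suspect: $f"
#   done
```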
I feel that checking against file size and modtime would be much more reliable. Obviously the best option would be a hashsum, but Google Photos doesn't support that, and I really don't want to use --download until the API limits improve (I'd happily pay if that were an option) and the 404s go away.
Any and all help would be greatly appreciated!