Detecting incomplete file downloads from Google Photos

Thanks very much for taking the time to respond and for providing the excellent suggestions. I've tried them all and here are my findings and thoughts so far.

1. --gphotos-read-size works!

At least for detecting image corruption, this worked. I don't know how I missed trying this flag in my experimentation!

I haven't tried it with videos yet. I found a few incomplete video downloads as well, but they're in directories with thousands of other items, so I won't try until I'm sure this is the method for me. Otherwise I'll run out of API quota too fast (I'm still running sync in the background :wink: )

rclone sync --gphotos-read-size
Running: timeout -v --preserve-status -k 1m 3602s rclone sync --gphotos-read-size --transfers=10 --fast-list --gphotos-include-archived --log-level INFO --log-file=REDACTED.txt gphotos:feature /REDACTED/feature
2021/03/08 16:02:57 INFO  : favorites/IMG_20190713_073837-COLLAGE.jpg: Copied (replaced existing)
2021/03/08 16:02:57 INFO  : 
Transferred:   	  693.541k / 693.541 kBytes, 100%, 1.902 MBytes/s, ETA 0s
Checks:                 4 / 4, 100%
Transferred:            1 / 1, 100%
Elapsed time:        11.5s

2. rclone lsf --max-age looks very promising.

Since I've been downloading for a few weeks already and I need to detect corruption from all that time, I'll probably need to use --max-age of 1 month. This way I get a list of likely corrupt files. Since almost all photo filenames since I started my rclone sync marathon contain the date, I can programmatically or manually remove those files from the list of potentially corrupt files.

Obviously this means that any corrupt files for photos taken in the last month won't be detected, but I can run rclone sync --gphotos-read-size for those and that's manageable from an API quota perspective. For albums and shared albums that have thousands of items, I'll probably build a file list of those photos from the last month and download each file individually with --gphotos-read-size (a sketch follows) so I don't have to call rclone sync on the whole album or shared directory (which would eat my quota).
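Roughly what I have in mind (a sketch only: the start date and the local path are placeholders, and I've used rclone copyto for the per-file download since sync operates on directories rather than single files):

```bash
#!/usr/bin/env bash
# Sketch: list files changed on the remote in the last month, skip the ones
# whose filename embeds a date from after the sync marathon started (those
# were downloaded with the current, working setup), and re-download the rest
# individually with --gphotos-read-size.
set -euo pipefail

START=20210201                  # placeholder: day the sync marathon began (YYYYMMDD)
REMOTE="gphotos:feature"
LOCAL="/REDACTED/feature"       # placeholder local directory

rclone lsf --files-only -R --max-age 1M "$REMOTE" |
while IFS= read -r path; do
    # Pull a YYYYMMDD date out of names like IMG_20190713_073837.jpg,
    # PXL_20210204_090657534.jpg or VID_20200202_124028.mp4.
    date=$(basename "$path" | grep -oE '20[0-9]{6}' | head -n 1 || true)
    if [ -n "$date" ] && [ "$date" -ge "$START" ]; then
        continue    # taken after the marathon began: assume it downloaded cleanly
    fi
    echo "re-downloading possibly corrupt file: $path"
    # copyto rather than sync, since sync wants directories, not single files
    rclone copyto --gphotos-read-size "$REMOTE/$path" "$LOCAL/$path"
done
```

Files whose names carry no date fall through to a re-download, which is fine: the worst case is a wasted fetch.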

3. rclone lsf -Ftp --files-only -R --csv seems to be the best.

Ideally I would run this for my entire gphotos collection, but I know (based on past experience) that just running rclone lsf on my collection eats up my API quota for the day, so this is likely something I'll do much later down the track, once I have finished all my syncing.

But this would be easy once I have the data: save the csv output for remote and local, sort, diff, delete the local files that don't match, and finally run rclone sync on those files only (rather than the parent directory); see the sketch after the sample output below.

$ rclone lsf -Ftp --files-only -R --csv --log-level INFO "gphotos:feature"
2019-07-13 19:07:14,favorites/IMG_20190713_073837-COLLAGE.jpg
2019-12-26 17:45:23,favorites/IMG_20191226_174523.jpg
2021-02-04 10:06:57,favorites/PXL_20210204_090657534.jpg
2020-02-02 12:40:28,favorites/VID_20200202_124028.mp4
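Here's a sketch of how I imagine that compare step (paths are placeholders; it assumes both listings print modtimes at the same precision, and a precision mismatch would only produce harmless false positives):

```bash
#!/usr/bin/env bash
# Sketch of the compare step: dump remote and local listings, diff them, and
# re-fetch only the entries whose (modtime,path) pair doesn't match.
set -euo pipefail

REMOTE="gphotos:feature"
LOCAL="/REDACTED/feature"       # placeholder local directory

rclone lsf -Ftp --files-only -R --csv "$REMOTE" | sort > remote.csv
rclone lsf -Ftp --files-only -R --csv "$LOCAL"  | sort > local.csv

# Lines only in remote.csv: files missing locally or with a differing modtime.
# cut strips the timestamp column (assumes no commas in filenames, which
# lsf --csv would otherwise quote).
comm -23 remote.csv local.csv | cut -d, -f2- > stale.txt

# Delete the suspect local copies so they get fully re-downloaded...
while IFS= read -r path; do
    rm -f -- "$LOCAL/$path"
done < stale.txt

# ...then fetch only those files, verifying sizes as we go.
rclone copy --files-from stale.txt --gphotos-read-size "$REMOTE" "$LOCAL"
```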

Next steps

I realise I actually need to solve two different problems. The first is to find all files that are likely corrupt and re-download them. That's been my focus so far in this thread.

The second is to make sure future syncs don't cause this problem. These two will require different approaches.

Fixing existing corruptions

In the short term, I'm happy to hack things together and mix and match some of the different ways of detecting corruption (as per my rantings above). It's totally fine if I detect some false positives, since the worst case here is a re-download. As long as the number of false positives is relatively small, it won't eat up my API quota.

Preventing future corruptions

By using --gphotos-read-size and limiting the download to /by-month/$CURRENT_YEAR/$CURRENT_YEAR-$CURRENT_MONTH, I can keep things sane without exhausting quota, as long as I run this only once or twice a day.
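Something like this in a daily cron job (assuming the gphotos backend's media/by-month layout; the local destination path is a placeholder):

```bash
#!/usr/bin/env bash
# Sketch: daily cron job that verifies only the current month's photos with
# --gphotos-read-size, so the quota cost stays small and bounded.
set -euo pipefail

CURRENT_YEAR=$(date +%Y)
CURRENT_MONTH=$(date +%m)

rclone sync --gphotos-read-size --gphotos-include-archived \
    "gphotos:media/by-month/$CURRENT_YEAR/$CURRENT_YEAR-$CURRENT_MONTH" \
    "/REDACTED/media/by-month/$CURRENT_YEAR/$CURRENT_YEAR-$CURRENT_MONTH"
```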

However this is not enough, since I need to detect corruption even when:

  1. I sync albums and shared albums, because I can't hardcode their names in my script;
  2. older photos that are not from `$CURRENT_YEAR` are added, edited, or removed.

In these two cases, using --gphotos-read-size would definitely exhaust my API quota, so I need a long-term solution for these use cases.

I could also run rclone lsf -Ftp --files-only -R --csv once a month (sacrificing one day a month of API quota) and detect corruption that way. Although I think this is not sustainable, because I expect that at some point in the future my gphotos collection will get big enough that I won't even be able to complete rclone lsf without running out of quota.

I think this likely needs better programmatic support within rclone. I'll create separate posts for what I think could be feature additions to make this easier (I'll post links back on this thread when done).

Open questions

  1. Does rclone only set the modtime after it has successfully finished downloading? Can I rely on this?