Understanding Rclone output of Google Photos repository


This is probably more of a question about the Google Photos API and how rclone interprets it than about how to use rclone. I want to be able to log which images are in which albums and which are in none. To do that I ran the following commands and piped the results into a file:

rclone lsf --csv --files-only --format pst remote:album -R
rclone lsf --csv --files-only --format pst remote:media/by-year -R

I'm using rclone v1.62.2 against a "Google Photos" repository on a Windows 10 computer. My config is called remote.

Initial wisdom seems to indicate that the "by-year" list will give me a unique list of the names of all media items uploaded to the repository. About 37,000 items. Then the "albums" list will include a list of each media item once per album in which the item is contained. That might be once, many or no occurrences per media item. Matching the two lists should highlight which media items are not in any albums.

Life is not that simple, because Google Photos enforces uniqueness on names by shoving a very long code between braces into certain media item names. This "uniqueness factor" (UF) appears in both the "by-year" list and the "albums" list, but not in any consistent way. I've failed to identify consistent rules on when this UF is applied. It may be when a file is uploaded a second time or when an item has been edited within Google Photos. I've seen examples of both but, again, not consistently. Some duplicates simply have the suffix "copy" added.

Regardless of the cause of the UF being applied to the "by-year" list, there should absolutely not be a media item in the "albums" list that does not exactly match an item in the "by-year" list, which contains absolutely everything exactly once. By eyeballing, I would estimate that around 10% of files in the "albums" list have a name that does not appear in the "by-year" list. With the UF stripped out, that drops to below 1%, though mismatches are still present.

If it helps, the biggest mess is with older media items that pre-date Google Photos (the picasa era). The early Google Photos / Google Drive uploader did untold damage to my collection. But, regardless, there are unmatched examples of media items uploaded this month. So this is no historic artefact. It is something affecting very new items.

Thank you if you are still reading this far. I've gone into considerable detail to save the time of those who may have assumed my problem is simply pulling data out of rclone. It's not. The lsf command seems to be doing what it says on the tin. It's just not entirely clear how its Weltanschauung applies to what is not a standard file repository.


One idea:

To make sure that you compare apples to apples, why not use hashes?

From the rclone lsf docs:

$ rclone lsf -R --hash MD5 --format hp --separator "  " --files-only swift:bucket

7908e352297f0f530b84a756f188baa3  bevajer5jef
cd65ac234e6fea5925974a51cdd865cc  canole
03b5341b4f234b9d984d03ad076bae91  diwogej7
8fd37c3810dd660778137ac3a66cc06d  fubuwic
99713e14a4c4ff553acaf1930fad985b  gixacuh7ku

Thanks, sounds like an option.

Thanks for the suggestion. Sadly it didn't work; rclone said that there's no such hash available. Re-reading the notes on the rclone lsf command, it appears that this is a known possibility.

However it has seeded an idea. There is also the "i" format option for file-id. This adds a variable to the output containing a 40 odd character (guessing - I didn't count) string of random characters that may do the same job.

The process takes a long time. I have 37k photos and each can appear in one or many albums, so running once for by-year and once for albums takes about 5 hours. Once I have generated the two data sets I'll feed them into Excel and try to match up the file-ids.
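For anyone who would rather script the matching than do it in Excel, here is a minimal Python sketch of the idea. It assumes both listings were produced with --csv and a format string that puts the file ID first and the path second (for example one beginning with "ip"), and that the same photo carries the same ID in both listings. The function names and the sample data are made up for illustration.

```python
import csv
import io


def read_ids(csv_text):
    """Map file ID -> list of paths (assumes ID in column 0, path in column 1)."""
    ids = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ids.setdefault(row[0], []).append(row[1])
    return ids


def not_in_any_album(by_year_csv, albums_csv):
    """Return by-year paths whose file ID never appears in the albums listing."""
    by_year = read_ids(by_year_csv)
    albums = read_ids(albums_csv)
    return sorted(
        paths[0] for file_id, paths in by_year.items() if file_id not in albums
    )


# Hypothetical sample data standing in for the two lsf output files.
by_year_sample = "ID_A,2023/IMG_0001.jpg\nID_B,2023/IMG_0002.jpg\n"
albums_sample = "ID_A,Holiday/IMG_0001.jpg\nID_A,Family/IMG_0001.jpg\n"

print(not_in_any_album(by_year_sample, albums_sample))
```

With real data, the two sample strings would be replaced by the contents of the files the lsf output was piped into.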

Here's hoping.

Thanks again.

Unfortunately I know nothing about Google Photos but good to hear that my random thought helped at least in spinning new ideas:)

Also if your "i" format option for file-id works let us know. Can be useful for somebody in the future.

It would be relatively easy to make an option for rclone to always put the {ID} markers in. Rclone puts them in when there are two or more files in the same "directory" with the same name which is why you see them sometimes and why you don't sometimes.

Would that help?

Thanks. I’m not sure other than if it only sometimes appears then it would cause a deal of confusion. Better to either be there or be absent.

I think you’ve just given me the solution to my problem!

The issue I had was that I thought Google Photos was changing the names of some of my photos by shoving a {gobbledygook} into various file names, after the file name but before the file type. I now believe that that is nothing to do with Google Photos but instead is the function you describe above.

My reasoning follows.

I’ve got my first lsf result formatted as ipst, so I have the ID for the file as well as the fully qualified name. Because I know that photos will appear in multiple albums, I fully expect the file ID to be non-unique. So I loaded the resultant CSV into Excel, sorted it, and highlighted duplicates. The first to catch my eye was three files in three different albums. The first had the IMG_0846.jpg name, as you might expect, but the other two were annotated with the gobbledygook before the file type. All three had identical file IDs.
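The same check can be sketched in Python rather than done by eye in Excel: group the rows by file ID and report any ID that shows up under more than one file name. This is only a rough sketch, assuming the ipst column order (ID first, path second) and forward slashes in the paths. The function name, the sample rows, and the exact spacing of the braces in the sample names are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath


def names_by_id(csv_text):
    """Return {file ID: sorted file names} for IDs seen under several names."""
    grouped = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        file_id, path = row[0], row[1]
        # Keep only the file name, dropping the album part of the path.
        grouped[file_id].add(PurePosixPath(path).name)
    return {fid: sorted(names) for fid, names in grouped.items() if len(names) > 1}


# Hypothetical rows in ipst order: ID, path, size, modification time.
sample = (
    "ID1,Holiday/IMG_0846.jpg,1234,2023-05-01 10:00:00\n"
    "ID1,Family/IMG_0846 {ID1}.jpg,1234,2023-05-01 10:00:00\n"
    "ID2,Family/IMG_0900.jpg,999,2023-05-02 11:00:00\n"
)

print(names_by_id(sample))
```

Any ID reported here is one photo that appears under different names in different albums, which is the pattern described above.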

So, as I suggested before, enforcing uniqueness by injecting the gobbledygook, in my case, is what caused confusion. Though, if I had mounted it as a drive then I have no idea how that would cope with duplicate file names.

Now that I know about this behaviour, I can simply do a global find and replace to remove the {*} injection.
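If the find and replace needs to be scripted, something like the following Python sketch might do. It assumes the marker is a single {...} group sitting directly before the file extension (or at the very end of a name with no extension), optionally preceded by whitespace. That placement is my reading of the output, not a documented guarantee, and the function name is made up.

```python
import re

# A {...} group, with optional leading whitespace, that is immediately
# followed by the file extension or by the end of the name.
ID_MARKER = re.compile(r"\s*\{[^{}]+\}(?=(\.[^./\\]+)?$)")


def strip_id_marker(name):
    """Remove a trailing {ID} marker from a file name or path, if present."""
    return ID_MARKER.sub("", name)


print(strip_id_marker("Family/IMG_0846 {AbC123}.jpg"))
print(strip_id_marker("Holiday/IMG_0846.jpg"))
```

Braces elsewhere in a name are left alone, so a file that genuinely has braces in the middle of its name is not touched.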

I’m making progress. Thanks for your time.

Yes it is rclone.

The {gobbledygook} is the internal ID of the photo, which is guaranteed unique (unlike the file name). So you can use that for your deduplication task if you want (using the output of lsf above).

Rclone (and file systems in general) don't deal well with identical file names in the same directory!