Rclone does not see all files in a directory on Google Drive

Hi,

I’ve been using rclone for well over a year to pull down md5sums of various directories of backup files that I have uploaded.

To this end, my procedure is:

  1. Build a local text file containing md5sums of all files on the local side for a particular backup directory (using rhash). I run this periodically to pick up updates as new backup files are created within the directory.

  2. I use the Syncovery app, which, on a scheduled basis, does a sync to mirror the local files to the remote Google Drive directory.

  3. To see what’s at Google, I use “rclone md5sum” to download the md5sums for a particular directory on Google Drive.

  4. To assume nothing – that everything has made its way to Google successfully – and not to have to trust anyone else about it, I compare the two md5sum files to see if there is a difference: do all files on the local side that rhash logged exist at Google? Do all files on Google that I am pulling with “rclone md5sum” exist in the local directory? (See the sketch after this list.)
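
A rough sketch of what these steps look like as commands (the remote name “gdrive:”, the paths, and the file names are just placeholders, and the exact rhash update syntax may vary by version):

rhash --md5 -r --update=local.md5 /backups/set1    # step 1: append hashes for newly created backup files
rclone md5sum gdrive:backups/set1 > remote.md5     # step 3: pull the md5sums Google Drive reports for that directory
# step 4: compare local.md5 and remote.md5 in both directions to spot anything missing or different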

This has all worked very well for over a year now, without any notable wrinkles. However, something has changed in the past month or so. When I use:

“rclone md5sum”, it’s not pulling down all md5sums. That is, rclone is not putting out a complete list of files. Sometimes one file is missing from the list, sometimes two or three, say out of 1200 files or so.

How do I know this? Because when I go to the Google Drive web interface, the file is there, and I can download it, and the md5sum even checks out.

I’ve also tried:
“rclone ls”, and verified that the file is not making its way into this list.

In addition, if I use another 3rd-party tool such as “odrive”, it ALSO cannot see the file that rclone is “missing”.

But all along, both Google Drive, and Syncovery can see the file.

Now, to perhaps give a little more insight into this, if I upload another file (say via web browser) to Google Drive into that same directory, the behavior changes. Now, rclone can see BOTH this newly uploaded file as well as the one that previously was not listed on the remote end.

What happens when I delete the newly uploaded file? The old file goes “missing” again on the remote end. You can literally watch an odrive folder show the file appear / disappear.

I contacted Google, but they were generally uninterested because the problem involved 3rd party tools.
I contacted the Syncovery developer, but he didn’t really have any new insight.

Any tips here?

BTW, I am running rclone v1.34, and I have also tried the latest beta version as of two days ago:
v1.34-71-g4482e75β-osx-amd64

The result was the same.

My OS is OS X 10.11.6.
Syncovery version 7.68.
odrive v6110

Either I must have changed one of the apps in my workflow, or perhaps something has changed in the Drive API, or a bug has crept in somewhere?

Any insights or things to try would be appreciated.

Thanks,

– madison

This sounds like something strange going on at Google Drive.

The best thing to do would be to run the rclone ls command when you have a missing file and add the -v --dump-bodies flags. This will generate a lot of data so use the --log-file log.txt option too.

Ideally you’d send that file to me for me to investigate. It will have all the metadata for all your files in it though (name, size, date, who uploaded, etc). If you are happy for me to see that then email it to nick@craig-wood.com with the subject “rclone forum log from madison437” and a note of the name of the file that is missing.

If you want to investigate yourself then grep the log file for the name of the file that is missing. It will be in a JSON blob which you could post.
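
For concreteness, that would look something like this (“drive:backups” and “missing-file.dat” are just placeholders for your remote path and for the file that has gone missing):

rclone ls -v --dump-bodies --log-file log.txt drive:backups
grep -n "missing-file.dat" log.txt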

Thanks for taking the time to reply. I’ll get you some more data as you suggest. I very much appreciate you taking a look at it.

This problem is a bit like playing “whack-a-mole” – if I put more files into a directory, the old file that was noted as missing reappears, and a new one or two that were previously there go “missing”. The problem seems to crop up more often in two of the directories I have on Google Drive. I know on the backend it’s really just a bunch of objects, but still, I am referencing them based on their advertised path.

Since odrive is also having this problem, as well as rclone, this thing is a bit of a curiosity.

Again, many thanks.

– madison

rclone uses the drive v2 API. It is possible that Syncovery uses the v3 API which might make the difference. I don’t know if you can find that out easily.

I am just about to run the test, but in the meantime, I did contact Tobias, the developer of Syncovery. He was kind enough to share the following, which I have permission to post.

The following is from his email:

Hello,
it seems we use the v2 API also. We do all the REST programming manually, so not
using any library from Google. Our API URL is:
https://www.googleapis.com/drive/v2/files

The typical JSON of a file uploaded with Syncovery looks like this. It looks normal to me and I can’t see anything that would be specific to Syncovery.

{
  "id": "3833556",
  "fileId": "0By9Hwo0NrudTYVFobFY3M1R5c1E",
  "deleted": false,
  "file": {
    "title": "Abfallkalender-2012 nk!.pdf",
    "mimeType": "application/pdf",
    "labels": {
      "trashed": false
    },
    "createdDate": "2016-12-01T15:32:18.629Z",
    "modifiedDate": "2012-01-12T18:58:51.000Z",
    "parents": [
      {
        "id": "0By9Hwo0NrudTcG83d0ZHWHVhRW8",
        "isRoot": false
      }
    ],
    "downloadUrl": "https://doc-0k-6k-docs.googleusercontent.com/docs/securesc/h1lk5ckf3b2vpmgfm84dhsgatjgophjq/e621jvuhrdi0fctu5iet692r6sadhcu3/1480600800000/12959692258638513246/12959692258638513246/0By9Hwo0NrudTYVFobFY3M1R5c1E?e=download&gd=true",
    "originalFilename": "Abfallkalender-2012 nk!.pdf",
    "md5Checksum": "55bb391daf517e0441a483713c4cf8c4",
    "fileSize": "97308"
  }
},

Kind Regards,

Tobias Giesen
Super Flexible Software Ltd. & Co. KG

Ok, Nick, I managed to get a test case where I (seemingly) broke the behavior of Google Drive or rclone as described in this thread. I sent you logs of two rounds of testing, as well as my basic test procedure.

I also emailed this to Tobias for completeness, though this time, the file in question (the one that “goes missing” in the rclone ‘ls’ output) was actually copied via an rclone sync command.

Hopefully this uncovers something, even if it’s me doing something silly.

Thanks for your time.

Regards,

– madison437

There seems to be another developer with this issue who has logged a bug against the Google Drive API on Google’s issue tracker:

Please reference the following bug report another developer has written (v3 API in their case):
https://code.google.com/a/google.com/p/apps-api-issues/issues/detail?can=2&start=0&num=100&q=Type%3DDefect%20API%3DDrive&colspec=Stars%20Opened%20ID%20Type%20Status%20Summary%20API%20Owner&groupby=&sort=&id=5009

Hi Madison,

Thanks for reporting this and working with @ncw to run it down. I also think that there are strange, data-threatening things happening with Google Drive (as I reported in the topic you noticed), and the more people we have on this, the better for everyone (including Google, who is getting a free, high-quality audit of their service).

Two questions:

  1. Is your Google Drive remote encrypted or plain? Sorry if you stated it somewhere, but it’s not immediately obvious to me.

  2. Why are you using rhash? Any advantage over “find … -type f -print0 | xargs -0 md5sum”? I can see it calculates other hashes besides MD5 and would make for a shorter command line, but for that case we only need MD5, and I always try to use more “standard”/established/commonly available commands over newer ones whenever possible (please note that I’m not criticizing you in any way, I’m just curious about your motives).

Cheers,

Durval.

Yes, the more people we have on this issue the better.

As you can see, I've also been communicating with odrive -- see the below link:

One of the folks there has been helpful in getting me started with the "Try this API" Google web interface for executing commands directly against their API.

Last evening, I was able to show that, as Nick had suspected, when you ask for a directory listing, the results can come back differently depending on how many "results per page" you ask for. Rclone by default asks for "maxResults" of 1000, and so does odrive. If you have more than 1000 files in a directory, the API returns a token for the next page. You request the next page by passing that token along, and so on, until you have all the files.

Now, if your results are "missing" a file using the "1000" setting (the maximum page size the API allows), and you ask for a smaller page size (e.g. 300), you will most likely get back the file that was previously missing, but other files may instead "go missing." Whack-a-mole, so to speak.

This problem seems to happen more often with the v2 API (which rclone & Syncovery are using). I believe odrive is also on the v2 API. However, much like the "Issue #5009" logged on Google's issue tracker, I was able to reproduce a case where the v3 API is broken as well.

Another small note: this "page size" parameter is called:
"maxResults" in v2 Drive API
"pageSize" in v3 Drive API

This can be demonstrated using the files.list command.
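
For illustration, paging through a v2 files.list by hand looks roughly like this (the access token, folder ID, and page token are placeholders):

curl -s -H "Authorization: Bearer ACCESS_TOKEN" \
  "https://www.googleapis.com/drive/v2/files?q='FOLDER_ID'+in+parents&maxResults=1000"
# if the JSON response contains a "nextPageToken", request the next page with it:
curl -s -H "Authorization: Bearer ACCESS_TOKEN" \
  "https://www.googleapis.com/drive/v2/files?q='FOLDER_ID'+in+parents&maxResults=1000&pageToken=NEXT_PAGE_TOKEN"
# repeat until no "nextPageToken" comes back; with the v3 API the endpoint is /drive/v3/files and the parameter is "pageSize" instead of "maxResults"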

As far as your questions are concerned:

  1. My Google Drive is plain, at least with respect to rclone. The data in question here is encrypted locally by my backup application (Retrospect), and then pushed to GDrive. I still have all of this data locally, and am able to verify md5sums locally and remotely, and I have yet to come across any corruption issues.

  2. Let me humbly say that I am not a coder by trade. I just cobble together whatever bridges a gap I might have in my little IT back-end. So, to be honest, without doing too much digging, as I understand the command you suggested, it would build a list of md5sums for a file or directory.

The advantage with rhash is that it can keep a simple text file with filenames and md5 hashes, and when you query a directory in the future you merely ask it to "update" the text file. It will then write any newly found files that are not in the "text database" into that database.

After each local backup, and after the material has been pushed to Google, I run rhash, updating this local database of md5sums. I then ask rclone for all md5sums for the corresponding directory at Google. The output of this latter command is saved fresh, into a new file.

I then do:
'grep -v -f rclone_downloaded_Google_md5sums rhash_generated_local_database_md5sums'
AND
'grep -v -f rhash_generated_local_database_md5sums rclone_downloaded_Google_md5sums'

This will check to make sure in BOTH directions whether or not a file is missing or different.

About data integrity / possible silent corruption:
Now, truth be told, once rhash has built the text table of hashes, those are not re-checked each time. Re-checking would be accomplished by removing entries from the database, which would trigger a recompute of the md5sums. And at Google, when rclone asks for the file md5sums, it's just reading back the metadata for those files at Google. I'm not sure how Google crawls / scrubs their data, or if in fact there is any way that metadata field for a file could be updated in the future.

However, locally, I am storing my files on a ZFS backend. So I know I have integrity there... those files are verified each time they are used. And my understanding is that Google's Colossus filesystem uses Reed-Solomon coding in an RS(6,3) arrangement. (See a paper that makes reference to this at https://www.usenix.org/node/188447).
Hence, I would really not expect silent corruption there. I guess anything is possible, but I suspect there is quite a bit of checksumming going on at Google. If the data got there OK, I suspect it will stay OK.

What I can tell you from my experience is that over a year of uploads / downloads / checking md5sums (sometimes rebuilding the rhash table just to double-check), across thousands of files (maybe around 6-7k, guessing), I can't recall any data corruption. This includes re-downloading the files in question (the "missing" files) and doing a 'diff' by hand, so to speak.

My limited knowledge suggests to me that the lower one goes in the storage stack, the more robust the code probably is. So, in this case, although we're asking Google Drive through their API to deliver a list of files, it actually involves fairly high-level abstractions -- treating the list of objects (a database) somewhat as a filesystem through the API -- and hence it would not surprise me that something could be broken here for a 'files.list' command.

What I am surprised about is that this has been going on for a few months now, and it has not been fixed. I'm stupefied, actually.

-- madison

A Google API tester has been in contact with me as of today regarding the issue.
I walked him through the steps.

He called back later to tell me the issue was verified, and that this is now being sent directly to engineering.

They will update me on the status. If there is anything appreciable to report, I will post again here.

– madison

Well done - that took some persistence!

Cross fingers.

Thanks, Nick, for your help in getting to this point.

Keeping my fingers crossed too – we’ll be out of the woods when it’s actually fixed!

P.S. About the rclone ‘--drive-list-chunk’ parameter for GDrive – I’m assuming that’s merely in the build you gave me, and that it is not going back to the main tree? Just checking, 'cause if it’s not headed for the main codebase, then I’ll keep the test build you made with that parameter running for a bit longer until the dust settles.

Great job, @madison437! Please keep us posted.

Cheers,

Durval.

I wasn’t planning to merge it, but I can easily if you think that would be helpful.

Hey Nick,

My hunch is it’s better to not merge it unless this thing becomes fixed and then re-appears yet again, i.e. a persistent problem.

Not merging means less code to manage, however trivial the change might be.

Regards,

– madison

It is a bit of an esoteric option yes. I have it in a branch if it is ever needed again though!

Ok, Nick, in the latest communications with Google, they were asking me to try to do a directory listing of all the files in my account without specifying a ‘q’ parameter or even a maxResults parameter.

I first replied that this was equivalent to the ‘default’ of 100, and yet the tester claimed that it’s not necessarily the case.

I replied that I could not easily do this, as my method of traversing the tree using their web-based API tool requires me to manually go through several pages, collecting the results. Whether or not leaving the argument blank is exactly equivalent to putting in ‘100’ as the page size for the results, it will be returning pages of 100 results, and that would be a lot of pages to traverse by hand.

I could try and fire up the python API calls, but I’m really trying to avoid that if I can.

Using rclone, I don’t know if specifying --drive-list-chunk 0 would be the same or not. Is it possible to have your option run a query without putting in any argument for maxResults?

Also, to my chagrin, I am thinking that perhaps temporarily this parameter could be merged, as I’m running other testing with rclone, and it might be good to be on the latest version.

Perhaps when this thing is fixed, you could pull it out again?

Sorry to ask!

– madison

I’ve merged this and made the change so that --drive-list-chunk 0 doesn’t add the parameter at all (you can verify this with --dump-headers)

http://beta.rclone.org/v1.35-59-g7679620/ (uploaded in 15-30 mins)

That is still going to use a q parameter though so it isn’t exactly what they asked for, but the maxResults parameter is gone if you set --drive-list-chunk to 0

If you use rclone to do a listing with -v --dump-auth then you can cut and paste things into curl to get a listing

rclone size --drive-list-chunk 0 drive:1000files -v --dump-auth --log-file z.log

Now look in z.log for the listing line (search for q=). Use the authorization header in a curl request like this

curl -H "Authorization: Bearer ya29.GlvmA4...xR0E" "https://www.googleapis.com/drive/v2/files?alt=json"

And you’ll get a pile of JSON to your screen!

Thanks, Nick, I will soon take a look at this!

I’ll probably add both you and the odrive developer to the next email exchange with them.
And thanks for the suggestions as well.