[bug] corruption of modtime via VFS layer

What is your rclone version (output from rclone version)

rclone v1.49.4-003-gb5ea6af6-v1.49-fixes-beta

  • os/arch: windows/amd64
  • go version: go1.12.10

Windows 10
Google Drive

@NCW This is mostly intended for you, but I figure I might as well air the topic in public.

I would like to bring your attention to a fairly significant problem of modified dates being corrupted (effectively left blank and thus showing default values such as 1601 on Windows).
I touched on this subject last time you fixed the cache reference bug, but I haven't had time to delve into the detail until now.

I've needed some time to find what the triggers for the problem is, and I think I have found some good clues now that may give us something to go on in terms of debugging.

When uploaded via VFS write-cache, some files have their mod-time corrupted. the following steps appear to happen in those cases:

  • File is copied to write-cache (with current date, which is wrong but this may be intentional?)
  • The moment this finishes the file shows up on the mount as expected, but there the date is blank
  • After the upload completes, some mechanism seems to trigger a re-download of the same file (some cache-refresh mechanism in the VFS?). This may be happening because it sees a mismatch between the dates I suspect.
  • After file download the file in the cache now also has an empty moddate (it had current date before this)

2019/10/08 20:10:41 INFO : TESTING/test2/1GB.bin: Copied (new)
2019/10/08 20:11:57 INFO : TESTING/test2/1GB.bin: Copied (replaced existing)

Obviously this causes all sorts of havoc as modtimes are not only permanently lost (and relied upon by many rclone functions for proper operation), but also the cache is actually doing more harm to efficiency than not using it at all.

To contrast - On unaffected files this happens:

  • file is copied to write-cache (with current date, which is wrong, but maybe intentional?)
  • after landing in the cache it shows correct date on the mount (even though the date is wrong in the cache, so that's why I expect the above is not necessarily a bug)
  • No apparent change when upload is complete
  • File is not re-downloaded and everything seems to be ok.

There is also another layer to this as there seems to be a consistent pattern of certain file-extensions always being affected while others are never affected. This is based on limited testing and observations, but it is far too consistent to be random.

extensions that appear consistently affected in testing:

  • .jpg
  • .bin
  • .zip
  • .rar (but not other files in the same archive using 001, 002 extensions for example - indicating the extension is the trigger and not the file-format)

I suspect many more based on what patterns I see, but these are the ones I have tested properly.

A contrasting example of an extension I can not replicate the problem with (for replication of the issue) would be .bmp - this never fails under the exact same conditions.

Additional observations:

  • The issue does not seem to occur outside of the VFS layer - such as by an upload directly via console. Pretty sure it's a problem in the VFS, or happening in the VFS-to-Remote interactions.
  • Could potentially be a Windows-specific issue just based on the lack of reports (can anyone confirm on Linux?)
  • Pretty sure this is a fairly old problem dating back to at least 1.47-1.48 or more so I don't suspect recent changes - but I haven't done the detailed testing until now to be fair ...

If anyone has ideas on how to proceed or narrow down the problem even further - I am all ears :slight_smile:

Full Debug log of issue as it is happening. 1 file "100MB.bin" was handled so you can search for that to make the job easier:

i believe that a date of with year 1601 is a value of zero.

Very possible. I think modtime is technically an offset from 01.01.1601.

I'm not really concerned about that spesific number. That is a weird but known historical quirk. The question is why (and where exactly) the corruption or non-transference of that value happens.

Quick addendum:

Copied and renamed 100MB.bin to 100MB.bmp
uploaded both at the same time (via VFS write cache)
.bin fails and .bmp is correct (consistently through several tests).

Very strange indeed - but these files are bit-identical aside from name, so what else can it be besides the extension at that point? I figure this must be a strong clue towards finding the root cause.

How are you copying files to the mount? With Explorer? Can you try with CP and see if you get the same effect?

I suspect this is Explorer doing something unexpected.

Can you post a log with -vv somewhere for me to have a look at?

Up until now I have been using explorer drag&drop primarily yes.
testing now with commandline copy - will edit with results in 2min.
I posted a full debug log in case you missed it (bottom of OP)
Let me know if you need something more or different in terms of logging results.

EDIT - test results:

C:>copy F:\TESTFILES\100MB.bin Z:\TESTING\test2
1 file(s) copied.

C:>copy F:\TESTFILES\100MB.bmp Z:\TESTING\test2
1 file(s) copied.

Identical results - .bin modtime is corrupted/missing. .bmp is fine (bit identical files)

I will test now with not using a encrypted remote just in case. I haven't checked that. Just a hunch.

EDIT2 - results:

C:>copy F:\TESTFILES\100MB.bin Z:
1 file(s) copied.

C:>copy F:\TESTFILES\100MB.bmp Z:
1 file(s) copied.

In this case Z is unencrypted
same results as before.
At least we have eliminated some possibilities.

The file should get its correct date once it is closed - see this line in the log

2019/10/08 20:51:04 DEBUG : TESTING/test2/100MB.bin: File.applyPendingModTime OK

I made a test with some more logging - can you have a go with this? -It should print some INFO logs as to exactly the value of the modified time when it sets it in various places.

https://beta.rclone.org/branch/v1.49.5-189-ga45a11af-fix-vfs-modtime-beta/ (uploaded in 15-30 mins)

When in doubt - bruteforce the shit out of it ...
made a script to generate all possible alphabetical 3-char extensions
uploading now to see if I can see a pattern in the results.

With 17K files it might take a while though ... :sweat_smile:

Will test as soon as it's done.
Since it's INFO-level, I guess you just want a --verbose log to keep it from being cluttered? I will assume so unless you specify otherwise.

Result of bruteforce test - here are all the file extensions that failed to keep modtime. I cut away the others for the sake of all our sanity :stuck_out_tongue:
376 fails out of 17.576 or about 2%
As for pattern .... fuck if I know. I was hoping for something clear to become apparent, but we might need a codebreaker for this one :sweat_smile:

Odd modtime though. Expected 1601, but it's 2185. This might just be a way Windows commandline dir interprets the default though.

I ran a small of this set twice and got consistent results to exclude randomness:

Directory of F:\extensionfails

08.10.2019 23:49 .
08.10.2019 23:49 ..
22.07.2185 01:34 8 test.acm
22.07.2185 01:34 8 test.ade
22.07.2185 01:34 8 test.adp
22.07.2185 01:34 8 test.aht
22.07.2185 01:34 8 test.ani
22.07.2185 01:34 8 test.apk
22.07.2185 01:34 8 test.app
22.07.2185 01:34 8 test.asd
22.07.2185 01:34 8 test.asf
22.07.2185 01:34 8 test.asp
22.07.2185 01:34 8 test.asx
22.07.2185 01:34 8 test.awx
22.07.2185 01:34 8 test.bas
22.07.2185 01:34 8 test.bat
22.07.2185 01:34 8 test.bht
MOved full list to pastebin to not spam so much

Here is the output of you test-build using INFO level logging.
Test includes copying of 100MB.bin and 100MB.bmp - for contrasting.

2019/10/09 00:19:06 INFO  : Using --user test --pass XXXX as authenticated user
2019/10/09 00:19:06 NOTICE: Serving remote control on
2019/10/09 00:19:06 INFO  : Starting bandwidth limiter at 18MBytes/s
The service rclone has been started.
2019/10/09 00:19:08 INFO  : Cleaned the cache: objects 2569 (was 0), total size 2.735G (was 0)
2019/10/09 00:19:25 INFO  : Setting pending mod time to 2019-09-13 04:40:53.1383421 +0200 CEST
2019/10/09 00:19:25 INFO  : Setting pending mod time to 2019-09-13 04:40:53.1383421 +0200 CEST
2019/10/09 00:19:29 INFO  : Setting pending mod time to 1601-01-01 00:59:59.9999999 +0100 CET
2019/10/09 00:19:39 INFO  : TESTING/test2/100MB.bin: Copied (new)
2019/10/09 00:19:39 INFO  : TESTING/test2/100MB.bin: Setting Modtime to 1601-01-01 00:59:59.9999999 +0100 CET (zero false)
2019/10/09 00:19:39 INFO  : TESTING/test2/100MB.bmp: Copied (new)
2019/10/09 00:19:39 INFO  : TESTING/test2/100MB.bmp: Setting Modtime to 2019-09-13 04:40:53.1383421 +0200 CEST (zero false)
2019/10/09 00:19:46 INFO  : TESTING/test2/100MB.bin: Copied (replaced existing)
The service rclone has been stopped.

EDIT: Cleaned up log slightly with a cleaner run

OK we can see the bad times being set there... Now I know what I'm looking for I can see these in the original log

2019/10/08 20:50:56 DEBUG : /TESTING/test2/100MB.bin: Utimens: tmsp=[{Sec:-11644473600 Nsec:-100} {Sec:-11644473600 Nsec:-100}]

That decodes to: 1600-12-31 23:59:59.9999999 +0000 UTC so it is the time we are looking for.

That time came straight from the Windows kernel via WinFSP then cgofuse.

However it is possible that rclone passed this time to cgofuse/WinFSP/WIndows kernel at some point, so here is an rclone with even more INFO statements which tries to work that out!

https://beta.rclone.org/branch/v1.49.5-189-ge343240e-fix-vfs-modtime-beta/ (uploaded in 15-30 mins)

I note that I opened an issue about this on the cgofuse bug tracker but closed it again as I came to the conclusion that it was Windows passing the time to rclone. I'm not sure that was correct now, but the log should show!

I'm beginning to feel there will be a fairly simple fix for this :slight_smile:

They all look like known files of some kind so Windows understands their extensions. I don't recognize all of them but there are a lot I do which makes me think they aren't just random.

That time appears in your original log too

2019/10/08 20:50:37 DEBUG : TESTING/100MB.bin: updateTime: setting atime to 2185-07-22 00:34:33.7085516 +0100 CET

I can't find a significance for it though!

Can you double-check if the build was ok here? Can't see any windows version, and this was 7 hrs ago so I imagine it's not going to appear by itself.

Once available I will test again.

Sorry, I broke the Windows build... (more haste less speed as my Dad always used to say!)

Try this

https://beta.rclone.org/branch/v1.49.5-195-g0906a914-fix-vfs-modtime-beta/ (uploaded in 15-30 mins)

Here is new logging of the same test as above.

This one fails in the same pattern as we expected:

However I want to add a second log because I noticed something else when testing.
If the cache is deleted / did not exist it seems to work at least the first time, though the problem crops up again quickly. I also saw one result where I'm sure I saw the moddate empty on .bin on the mount, but after just restarting the mount (no other action taken) it was correct. I have not nailed down the extract trigger here, but it may point towards the cache being poisoned with wrong data - and sometimes this may propagate further into the system such as if it gets re-downloaded.

This log is the same test, but with a freshly deleted cache and first copy operation being done. The result is correct:

In that log we can see the 1601 date doesn't originate from rclone it comes straight from windows.

I can see in the logs that windows sets the correct time then overwrites it with the 1601 time which is odd...

Anyway, I've made an attempted fix. This filters out all dates before 1601-01-02.

Can you check the file ends up with the correct datetime in the VFS and on the remote cloud.

Can you have a go with this and record a log with -vv. You should see some debugs starting with "Utimens:" which tell you what is happening!

     fs.Debugf(path, "Utimens: Not setting out of range time: %v", t)
     fs.Debugf(path, "Utimens: SetModTime: %v", t)

https://beta.rclone.org/branch/v1.49.5-196-gf3e1d9ce-fix-vfs-modtime-beta/ (uploaded in 15-30 mins)

Thank you

Copy that. Testing when done and updating this post, so results will be here in 30-45min
(but that doesn't mean you shouldn't go sleep :stuck_out_tongue: )

EDIT: Slightly confusing results here initially. Will need more time with this to unpack what is really happening and present a summary... expect it tomorrow

1 Like

Testing report:

This definitely made a change, but not the one we wanted. The results are messy this time so I will summarize the general overview:

  • Some files do end up with correct dates, but the vast majority of them (out of the 17K test-files) end up with the current date instead - which is not much of an improvement. Oddly this breaks the moddate on way more files that consistently were unaffected before. 47 files correct out of just about 8000 tested in the set before aborting due to time.
  • Seems a bit more consistent if just copying a few files at a time, but it's still not reliable. I think I saw at least one failure (resulting in current date) in 4 attempts of copying only the previous .bin and .bmp files used in earlier tests.
  • Results no longer seem consistent and easily reproducible. I used to get identical results if I ran the files through VFS to a local remote in the interest of time/speed. Now I don't. 100% of files to the local remote end up with the current date. I also ran an overnight test to the actual drive on the same dataset (the 47 out of 8000 mentioned above) that didn't produce the same results. If there is still a systematic pattern here I can't see it now.
  • Transfers (anecdotally) seem much more sluggish and slow. Hypothesis: Many retries of some sort on catching a date that isn't out of range due to the workaround you implemented?

I did debug-log a bunch of this, but since it's no longer easy to pin down the failure-point they are raw and bulky. I have sent them to you on gmail if you want to take a look as they are too unwieldy to post. At least you have easy keywords to search for ...

Perhaps this workaround strategy is not the way to solve it. I still think it is too weird that the corruption was consistent based on the extension type. I did consider as you did that it may somehow be related to filetypes Windows recognizes and somehow treats special in some way - but there are lots of recognized extensions among the untouched files too - and also many I have never seen among the affected ones - so I don't know if this hypothesis holds much water. I will do some cursory research on a pattern there to see if I can find any sort of commonality - but I wouldn't hold my breath on that panning out to something useful.

Not saying we should necessarily abandon the idea of a simple workaround fix immediately - but if we struggle to get it to behave as we want we should be wary of investing too much time into that approach and maybe look to solve the problem at the root instead, assuming it's something in WinFSP and not Windows itself.

Thanks for your comprehensive test writeup.

If the modtimes are not correct with that patch then it means that the modtime got corrupted somehow (into that epoch time) on its way from Windows to rclone either in WinFSP or cgofuse.

I think you are right, we should investigate WinFSP further.

Can you write up a step by step reproduction (as simple as possible) that I can share with the author of WinFSP. Bill knows more about file systems than anyone I've ever met so if we can make a reproduction I bet he can work out what is happening. I have tried to reproduce the problem but I haven't succeeded yet on a Windows 2019 server VM.

WinFSP has some comprehensive debugging which we could turn on as does cgofuse (with --debug-fuse).

One thing that might be worth a try is to see if sshfs-win has the same problem, so use that as the remote filing system instead of rclone.

Huh, well that is pretty weird. I'm on Windows 10, but that seems unlikely to matter much.
I will send you a zip on gmail of my "test set". If you can't reproduce it with that then I will be scratching my head. I definitely want to eliminate any possibility of it being my spesific system that is at fault somehow.

When I have the time I will check on a fresh VM of Windows 10 and Windows Server 2019 to compare results (using otherwise same rclone config).

The last thing I can think about is that i have been using a setup that uses a long attr timeout - but I can't logically link that to only affecting certain file extensions. I am also fairly certain I saw this issue happen from the very start of using rclone - long before I understood and utilized advanced timeout parameters.

To eliminate this I will try to run a really basic rclone setup and see if it makes any change.

Last hunch - could antivirus with real-time scanning somehow be involved in this? Will test that too. I do have Avira on this system I have been testing on...

I can do a reproduction writeup sure, no problem. Just want to eliminate a few more things before I proceed with that.