Rclone mount cache constantly expiring for no reason

I’m mounting a quite large directory (roughly 40,000 files) in order to run a script that renames all the files based on their st_mtime. The backing remote (a Google Drive) is not being written to or read from outside of the mount at all.
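
The script boils down to roughly this (a simplified sketch; the actual directory, extension, and date format are just examples):

cd freezer/YouTube/ChannelFireball
for f in *.mp4; do
  # prefix each file with its modification date, e.g. "(2016-08-27) foo.mp4"
  ts=$(date -d "@$(stat -c %Y "$f")" '+%Y-%m-%d')
  mv -n "$f" "($ts) $f"
done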

However, even with very aggressive values of over 24 hours for both --dir-cache-time and --cache-info-age, I’m still finding that rclone does a cold listing of the big directory (which takes over 5 minutes) every thirty or forty renames. So it’ll block a mv for 5-150 minutes to warm up the cache, rename a couple of files in the folder for maybe 30 seconds, and then suddenly decide to invalidate the cache again for the exact same directory it just listed. This means it will likely take me days or weeks to finish renaming all the files in here.

Here’s the relevant section of the configured remotes:

[gdrive]
type = drive
client_id = ***
client_secret = ***
scope = drive
token = ***

[gdrive-cache]
type = cache
remote = gdrive:freezer/
chunk_size = 10M
workers = 10
info_age = 24h

Here’s the command I’m using to mount:

rclone mount -vvv --log-file=/home/apetresc/tmp/rclone-mount.log --dir-cache-time=240h0m gdrive-cache: freezer/

The rename script is just doing a single ls -al freezer/path/to/subdir and then a long sequence of mv commands within that subdir. When one of those mysterious cache misses happens, the rclone mount log looks something like this:

2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball/(2015-08-14) #GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].mkv: notify: expired 'YouTube/ChannelFireball'
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball/(2016-08-27) #GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].mp4: move: file moved
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball/#GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].mp4: move: removed from cache
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball: cache: expired YouTube/ChannelFireball
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball: cache: expired YouTube
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball: cache: expired
2018/12/16 19:46:26 INFO  : YouTube/ChannelFireball: move: cache expired
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball/(2016-08-27) #GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].mp4: move: added to cache
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball: cache: expired YouTube/ChannelFireball
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball: cache: expired YouTube
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball: cache: expired
2018/12/16 19:46:26 INFO  : YouTube/ChannelFireball: move: cache expired
2018/12/16 19:46:26 INFO  : YouTube/ChannelFireball/#GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].mp4: Moved (server side)
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball/#GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].mp4: Updating file with YouTube/ChannelFireball/(2016-08-27) #GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].mp4 0xc003c4f040
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball/: >Rename: err=<nil>
2018/12/16 19:46:26 DEBUG : YouTube/ChannelFireball/: Lookup: name="#GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].jpg"
2018/12/16 19:46:26 DEBUG : Cache remote gdrive-cache:: list 'YouTube/ChannelFireball'
2018/12/16 19:46:27 DEBUG : YouTube/ChannelFireball: list: cold listing: 2018-12-15 19:46:26.709783659 -0500 EST
2018/12/16 19:47:25 DEBUG : Cache remote gdrive-cache:: starting cleanup
2018/12/16 19:47:25 DEBUG : Google drive root 'freezer': Checking for changes on remote
2018/12/16 19:47:26 INFO  : YouTube/ChannelFireball/(2015-08-14) #GPSD Quarterfinals - Ben Weitz (Jeskai) vs. Paul Rietzl (Abzan Control) [k7xGJhUOSrs].jpg: received cache expiry notification
2018/12/16 19:48:25 DEBUG : Cache remote gdrive-cache:: starting cleanup
2018/12/16 19:49:25 DEBUG : Cache remote gdrive-cache:: starting cleanup
2018/12/16 19:50:15 DEBUG : YouTube/ChannelFireball: list: read 39301 from source
2018/12/16 19:50:15 DEBUG : YouTube/ChannelFireball: list: source entries: [... 39K files omitted ...]

I really can’t figure out what I’m doing wrong here. Here’s the weirdest part: the line that says:

2018/12/16 19:46:27 DEBUG : YouTube/ChannelFireball: list: cold listing: 2018-12-15 19:46:26.709783659 -0500 EST

It’s completely bogus that there was a listing at that time, because I had definitely deleted the entire ~/.cache/rclone directory right before starting this particular run, so there is no way it had a cold listing from 2018-12-15. It’s like rclone just made up a listing from almost exactly 24 hours ago to justify why it needed to re-list. And this message appears every time; five minutes later there will be another phantom cold listing supposedly from 2018-12-15 19:50 or so. If I change --cache-info-age, the timestamps on these phantom listings change so that they’re always just outside the window, even though it’s a brand-new cache backend I just created.

Very much at a loss here, hoping someone can point out what I’m doing wrong. Thanks!

Maybe a dumb question, but can you have fewer files in a directory? At some point, it has to invalidate the directory when you move files around, and having that many files in a single spot will be slow.

If the info_age is 24h, it would invalidate the cache and do a cold listing, which is what it looks like it did if the last listing was from 12-15.

I’m not following how you deleted the cache, as you have a lot of cache entries showing; that tells me the cache.db file was still there when this ran and wasn’t deleted.

Are you able to start fresh? I did a little testing with info age and it seemed to properly cold list in my tests.

Also, what version are you running?

Maybe a dumb question, but can you have fewer files in a directory? At some point, it has to invalidate the directory when you move files around, and having that many files in a single spot will be slow.

That will be my last resort if it comes to that, yeah :confused: But I’d really rather not have to redo all my scrapers to deepen the hierarchy. (Though I’m not sure what you meant by the second sentence: why would moving a file around inside a directory require invalidating the directory? Surely writes are passed through the cache?)

If the info_age is 24h, it would invalidate the cache and do a cold listing, which is what it looks like it did if the last listing was from 12-15.
I’m not following how you deleted the cache, as you have a lot of cache entries showing; that tells me the cache.db file was still there when this ran and wasn’t deleted.

Yeah, that’s the confusing part. Unless the cache backend stores its data somewhere other than the default ~/.cache/rclone, I’m 100% sure I deleted everything before starting this run, so there is no way there was an actual cold listing from 12-15 (the log above is from 12-16). And even if there was, it doesn’t explain why there is always another cold listing from 5 minutes later. Surely if there was a listing from five minutes ago, it wouldn’t matter that there was also a cold listing that had just expired.

Are you able to start fresh? I did a little testing with info age and it seemed to properly cold list in my tests.

Yup, I’ve done a lot of fresh starts (if by fresh start you mean cleaning out ~/.cache/rclone and re-mounting). Just did one ten minutes ago, after posting this thread. Still seeing the same thing; it’s already done 2 cold listings since then.

Also, what version are you running?

Sorry, should have mentioned that at the start. It’s the latest v1.45 running on Arch, go version 1.11.2.

Thanks for looking into it, @Animosity022 :slight_smile:

I think you are in the right spot. If the mount is stopped and you remove that area, you should be getting rid of:

felix@gemini:~/.cache/rclone/cache-backend$ ls -al
total 13980
drwxr-xr-x 3 felix felix     4096 Aug 25 19:26 .
drwxr-xr-x 3 felix felix     4096 Aug 25 19:26 ..
drwxr-xr-x 3 felix felix     4096 Aug 25 19:31 gcache
-rw-r--r-- 1 felix felix 30584832 Dec 17 11:04 gcache.db

The .db file (gcache.db here) should be the item that has the structure in it.
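
To be sure you’re really starting fresh, I’d stop the mount first and then wipe that whole cache-backend directory, something like this (your mountpoint will differ):

fusermount -u /path/to/freezer
rm -rf ~/.cache/rclone/cache-backend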

The odd thing I’m seeing is the dates in these timestamps:

2018/12/17 11:06:30 DEBUG : smu5ej34ujbdoip1cm3mlk92q4/bm9l1ogqbq5r5su3vui0a9891p4kt9di9o4knhe4tcmhmklq6uh0: list: cold listing: 2018-08-25 19:27:47.507278103 -0400 EDT
2018/12/17 11:06:31 DEBUG : smu5ej34ujbdoip1cm3mlk92q4/lprfoi8lkc2951vbhcos3sfehc: list: cold listing: 2018-08-25 19:27:47.248239489 -0400 EDT
2018/12/17 11:06:31 DEBUG : smu5ej34ujbdoip1cm3mlk92q4/a05oaqg2j4nc3fj7bisudf8mmc: list: cold listing: 2018-08-25 19:27:47.004165525 -0400 EDT
2018/12/17 11:06:34 DEBUG : smu5ej34ujbdoip1cm3mlk92q4/n6q9bdf974oqtgmrcgal3khqtc: list: cold listing: 2018-08-25 19:27:43.857859795 -0400 EDT
2018/12/17 11:06:36 DEBUG : smu5ej34ujbdoip1cm3mlk92q4/njpqsvdd572afhrvjpn1p158s4: list: cold listing: 2018-08-25 19:27:47.93723428 -0400 EDT

The dates are back from 2018-08-25 which seems just a bit odd.

So here are my other logs. When I started fresh, the first date was some odd number from a few months back, as I shared in the logs above.

If I kept it running, though, and set my info_age to 2 minutes, it would cold list at the right time.

2018/12/17 11:12:14 DEBUG : tnvepu36qiohcun8v84ddhsam0: list: cold listing: 2018-12-17 11:04:29.970026383 -0500 EST
2018/12/17 11:17:47 DEBUG : tnvepu36qiohcun8v84ddhsam0: list: cold listing: 2018-12-17 11:12:54.936325501 -0500 EST
2018/12/17 11:22:47 DEBUG : tnvepu36qiohcun8v84ddhsam0: list: cold listing: 2018-12-17 11:17:55.713036156 -0500 EST

For me to list out ~2k files takes about 3 minutes:

real	2m57.174s
user	0m0.012s
sys	0m0.024s

I think your best bet is to not use the cache and just use the regular vfs backend. I get much better results as it doesn’t have to check the cache for each item.

I tested with defaults:

felix@gemini:~$ rclone mount gcrypt: /Test --log-level DEBUG --log-file /home/felix/logs/test.log --dir-cache-time 2m

and turned the dir cache time down to see how long those listings took as well.

felix@gemini:/Test/Movies$ time ls | tail -5
Zombieland (2009).mp4
Zookeeper (2011).mp4
Zoolander (2001).mkv
Zoolander 2 (2016).mkv
Zootopia (2016).mkv

real	0m4.190s
user	0m0.000s
sys	0m0.000s

4 seconds is much more tolerable than 2 minutes.

The time to run the listing is not the thing I’m trying to optimize; I can accept that it takes ~5 minutes. What I’m trying to prevent is having to re-list every few minutes when the only write operations happening to that folder, from anywhere, are simple mv operations within the mount. There’s no reason to do any listings (beyond the first one), regardless of how long they take.

(If you’re trying to reproduce this issue, I would recommend mounting the cache with a much longer info_age, say 24h, and then just doing something like $ touch a; while true; do mv a b && mv b a ; done and watching the logs to see how often you get a cold listing. My understanding of the cache semantics is that it should only need to re-list every 24 hours, but it’s actually re-listing every 30-40 mv operations instead.)
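
Spelled out, the repro is just this (any small file in an already-cached directory of the mount will do):

cd /path/to/mount/some-cached-dir
touch a
while true; do mv a b && mv b a; done
# in another terminal: tail -f /home/apetresc/tmp/rclone-mount.log | grep 'cold listing'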

Do you have a fresh log of the issue with everything caught up properly? In the one I saw, the cold listing time had expired, so that’s why it did a cold list.

If the timestamp on the cold list is wrong, that would be the cause.

But that’s precisely my point: every time this cold list happens (every few minutes), it prints a timestamp that is just over 24 hours old, yet there could not have been a listing at that time because I specifically cleared out the cache dir.

And even if there was an outdated listing 24 hours ago, why would that matter? There was also a listing made 5 minutes ago, wouldn’t the existence of the 5-minute-old listing negate the need to refresh the 24-hour-old one?

Unless you didn’t clear it out but you think you did?

Can you recreate it with the log?

If I understand your question, I thought the way it worked is that if you have a 24-hour info age, it refreshes the listing at the end of those 24 hours by asking for a cold list. I could be wrong though.

@ncw - can you maybe chime in?

One addition:

I unmounted and remounted and that caused a new cold listing.

2018/12/17 14:28:32 DEBUG : : list: cold listing: 2018-12-16 14:26:33.461695327 -0500 EST
2018/12/17 14:28:33 DEBUG : tnvepu36qiohcun8v84ddhsam0: list: cold listing: 2018-12-16 14:26:33.461695327 -0500 EST

That was specifically after a remount, as remounting seems to expire it. So I think you deleted the cache, remounted, and it shows an expired message based on your cache time; mine is 24 hours, so it shows yesterday.

Okay, I think I found the culprit: it’s actually working as designed; the cache backend explicitly invalidates both the old and the new parent directory when doing a Move operation. It’s in the source code right here: https://github.com/ncw/rclone/blob/master/backend/cache/cache.go#L1671-L1693

This seems overly aggressive. Surely if one or both of the parent directories (or the single directory, in the case of a simple rename) were already cached, we could just update the corresponding cache entries directly, the same as we would for a write. It’s not like the server has any more information about the result of the operation than the client does. (At the very least, we shouldn’t need to invalidate the cache twice when the old parent is the same as the new parent.)

@ncw Is what I’m saying making sense, or am I missing some important reason why it is necessary to invalidate the cache every time a move happens?

You are correct, the mv operation invalidates the caches. As to why? Well, that is because it is easy. It is always correct to throw away the cache, whereas updating elements in the cache can be tricky with locking, etc.

This could of course be fixed; however, I think in the case of Google Drive you’d run into another problem, namely that the change polling (from the list changes API) would come in, and that would cause the directories to be flushed too.

You might be better off changing your script to output rclone moveto commands and then running those separately.

Or alternatively, you could use rclone --rc and the operations/movefile API.
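
For example, something along these lines (file names are just illustrative, and the rc variant needs a running rclone started with --rc to talk to):

rclone moveto "gdrive:freezer/YouTube/ChannelFireball/old name.mp4" "gdrive:freezer/YouTube/ChannelFireball/(2016-08-27) old name.mp4"

rclone rc operations/movefile srcFs="gdrive:freezer" srcRemote="YouTube/ChannelFireball/old name.mp4" dstFs="gdrive:freezer" dstRemote="YouTube/ChannelFireball/(2016-08-27) old name.mp4"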

Interesting - thanks for the authoritative explanation @ncw :slight_smile: I didn’t realize the cache backend was also polling the list changes API, separately from its internal invalidation timer. But now that I know that, it perfectly explains all the symptoms I’m seeing.

Is there any way to disable the changes listing? In the case where I’m confident my mount is the only possible writer for the backing storage layer, it seems entirely redundant considering I already have cache_info_age.

Sure, --poll-interval 0 should disable it.
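
i.e. just add it to the mount command you already have, something like:

rclone mount -vvv --log-file=/home/apetresc/tmp/rclone-mount.log --dir-cache-time=240h0m --poll-interval=0 gdrive-cache: freezer/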