I have an rclone mount with a directory that contains ~3500 child directories. After clearing any caches (through HUP’ing, using vfs/expire and/or cache/forget), a directory listing can take upwards of 6 minutes. The logs show it iterating through the child directories, and rclone ostensibly returns no data to the waiting thread until the process completes. Some applications like “ls” handle this fine, while others lock up and become unresponsive, requiring the process to be killed.
Oftentimes the kernel logs will warn about stalled threads or the application that issued the call.
This can be mitigated somewhat by using large cache expiry timeouts, but this only causes the issue to happen less often while causing other issues of its own.
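For reference, the sort of mount I mean is along these lines (the mount point and timeout here are illustrative, not my exact setup):

rclone mount GD:/Crypt /mnt/gd --dir-cache-time 72h --verbose=2

A long --dir-cache-time just means the multi-minute stall happens once per expiry instead of on every listing.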
I’ve run numerous tests using “rclone ls” to see whether some combination of flags would ameliorate the issue, to no avail. The two flags that had the most effect were “--fast-list” (positive) and “--dir-chunk-size” with values less than 1000 (negative).
time ( rclone ls GD:/Crypt/qtven3b8rorak2le5h4htl55hg --verbose=2 )
2018/04/12 10:55:23 DEBUG : rclone: Version "v1.40-065-ge82452ceβ" starting with parameters ["/usr/bin/rclone" "--verbose=2" "ls" "GD:/Crypt/qtven3b8rorak2le5h4htl55hg" "--verbose=2"]
2018/04/12 10:55:23 INFO : Google drive root 'Crypt/qtven3b8rorak2le5h4htl55hg': Modify window is 1ms
4166 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/1a2s21cj9phssceallhrcoslas41eilelh5m4jthc8kqcckbleolala3nqpv5gmkeutccrtmvieq2l5ktvoo02f55o6d0c84ptf53jo
340039 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/ai2tdefgc62g8k47o1kocejbqo
1012 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/h7j2a7n8rbr2lem75pqob0aei6r6ih8n30igbbdc8jph6m92mh6unre9v3mscjh555ig49b7d1t3pdc3u829hegjeviuv5hlvltrvhg
340039 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/mdraftr9ginhbjln4ecgkubcu8dcvso4qlfqhqatoh9b2fai4bd47nv6sar24oktvr9d49c7di97r1gbf99bejgiolj9ab5jcdihdgo
...
69745 vvmclgoise9dt3097pumqn57uo0s3svndf9tb9l6t0ruj2122mog/k8pi3dnm3fjtkhkbp4mhjm8100q11v97masuhc0s33n42os1omb524msr1ca7750bhb7p6440d1p2m2pj8h8spjqfp7r0sc3kt8lurg
504953 vvmclgoise9dt3097pumqn57uo0s3svndf9tb9l6t0ruj2122mog/nfsgi32ocqq19sucqst60gj9qo
2018/04/12 11:04:03 DEBUG : 18 go routines active
2018/04/12 11:04:03 DEBUG : rclone: Version "v1.40-065-ge82452ceβ" finishing with parameters ["/usr/bin/rclone" "--verbose=2" "ls" "GD:/Crypt/qtven3b8rorak2le5h4htl55hg" "--verbose=2"]
( rclone ls GD:/Crypt/qtven3b8rorak2le5h4htl55hg --verbose=2; ) 2.77s user 1.15s system 0% cpu 7:05.92 total
With all that said, I don’t know if the bug lies with rclone, fuse, or the applications that handle the long directory listing times poorly. I was curious if anyone else had run into this and found a solution.
What’s the mount and the commands you are trying to do when you pull the directory?
My biggest dir is only 1750 entries and it doesn’t work well if I do the first cache build by doing something like “ls -alR | wc -l” in there. If I let Plex scan the dir, it seems to step through in a nicer fashion as I think it does an ls at the top level and then goes through each folder.
I wonder if the problem scales exponentially rather than linearly once you cross a threshold.
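Something like this (just a sketch, with a made-up mount point) approximates that step-through behaviour by listing the top level first and then touching each child directory in turn:

ls /mnt/gd/media > /dev/null
for d in /mnt/gd/media/*/; do ls "$d" > /dev/null; done

That warms the cache one directory at a time instead of forcing one big recursive listing up front.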
I used rclone ls only to test what kind of effect various parameters had on the speed of recursively listing a directory, but that did not yield anything useful. As you’ve further pointed out, most variation was probably due to outside variables and not relevant.
I can reproduce the results reliably and have a verbose log saved. This is the part containing what was asked:
Of note: immediately before performing this directory listing, I had rm -r’d the entire rclone cache dir rather than forcing a cache flush through rclone itself.
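Concretely, the difference is between wiping the on-disk data directly (the path below is the default cache-backend location; adjust if yours differs):

rm -r ~/.cache/rclone/cache-backend

versus asking the running mount to drop its directory cache over the remote control interface (assuming it was started with --rc), e.g.:

rclone rc vfs/forget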
Comparing the two log outputs suggests that an “empty listing” completes much faster than a “cold listing.”
Thanks. From that log (the directory listings inside the ReadDirAll and the >ReadDirAll) it is very clear that this is a problem with the cache backend recursing (or maybe looking only one more directory deep) on the directory read.
I had a look at the code, but I couldn’t see anything obviously wrong… I think it is probably best if you summarise this in a new issue on GitHub and we can get @remus.bunduc to take a look at it.