ReadDirAll, Long Directory Listings, and "blocked for more than 120 seconds"


#1

I have an rclone mount with a directory that contains ~3500 child directories. After clearing any caches (through HUP’ing, using vfs/expire and/or cache/forget), a directory listing can take upwards of 6 minutes. The logs show it iterating through the child directories, and rclone ostensibly returns no data to the waiting thread until the process completes. Some applications like “ls” handle this fine, while others lock up and become unresponsive, requiring the process to be killed.

Oftentimes the kernel logs will alert about the stalled threads or application that issued the call.

INFO: task Threadpool work:51008 blocked for more than 120 seconds.
   Not tainted 4.15.0-2-amd64 #1 Debian 4.15.11-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Threadpool work D    0 51008      1 0x00000004
Call Trace:
 ? __schedule+0x297/0x8b0
 schedule+0x28/0x80
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.1+0x1a0/0x4e0
 ? __d_alloc+0x134/0x1d0
 ? fuse_lookup+0x30/0x120 [fuse]
 fuse_lookup+0x30/0x120 [fuse]
 ? d_alloc_parallel+0x9a/0x4d0
 ? call_rwsem_down_read_failed+0x14/0x30
 ? call_rwsem_down_read_failed+0x14/0x30
 lookup_slow+0xa9/0x170
 walk_component+0x1c4/0x340
 ? link_path_walk+0x68/0x510
 ? path_init+0x177/0x2f0
 path_lookupat+0x84/0x1f0
 filename_lookup+0xb6/0x190
 ? locks_dispose_list+0x45/0x60
 ? __check_object_size+0xaf/0x1b0
 ? strncpy_from_user+0x48/0x160
 ? getname_flags+0x6a/0x1e0
 ? vfs_statx+0x73/0xe0
 vfs_statx+0x73/0xe0
 SYSC_newstat+0x39/0x70
 ? do_fcntl+0x436/0x550
 ? SyS_fcntl+0x57/0xa0
 do_syscall_64+0x6e/0x130
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f1200828755
RSP: 002b:00007f11938692c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
RAX: ffffffffffffffda RBX: 00007f11e4022f20 RCX: 00007f1200828755
RDX: 00007f1193869330 RSI: 00007f1193869330 RDI: 00007f11e4022f20
RBP: 00007f1193869330 R08: 0000000000000000 R09: 0000000000000000
R10: 000055afc694f808 R11: 0000000000000246 R12: 00007f11e4022f20
R13: 00007f1193869330 R14: 0000000000000000 R15: 00007f1193869548

All of this seems to be preceded by ReadDirAll.

Apr 12 10:11:33 DEBIAN rclone[88980]: 2018/04/12 10:11:33 DEBUG : LargeDir/: Attr:
Apr 12 10:11:33 DEBIAN rclone[88980]: 2018/04/12 10:11:33 DEBUG : LargeDir/: >Attr: attr=valid=1s ino=0 size=0 mode=drwxrwxr-x, err=<nil>
Apr 12 10:11:33 DEBIAN rclone[88980]: 2018/04/12 10:11:33 DEBUG : LargeDir/: ReadDirAll:
Apr 12 10:11:33 DEBIAN rclone[88980]: 2018/04/12 10:11:33 DEBUG : LargeDir: Re-reading directory (10m30.606120832s old)
Apr 12 10:11:33 DEBIAN rclone[88980]: 2018/04/12 10:11:33 DEBUG : Cache remote gdcache:Crypt: list 'qtven3b8rorak2le5h4htl55hg'
Apr 12 10:11:33 DEBIAN rclone[88980]: 2018/04/12 10:11:33 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cold listing: 2018-04-12 04:07:42.187117885
Apr 12 10:11:39 DEBIAN rclone[88980]: 2018/04/12 10:11:39 DEBUG : qtven3b8rorak2le5h4htl55hg: list: read 3459 from source
...
Apr 12 10:15:24 DEBIAN rclone[88980]: 2018/04/12 10:15:24 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/hb3f880eoshg3qfvbdptueplvuisrqe012t3274s26s92tm14avg
Apr 12 10:15:24 DEBIAN rclone[88980]: 2018/04/12 10:15:24 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/r19ahrq1pp4si5trvsb6scktt1tdgtd2kcs609dmtl2pil5pi53g
Apr 12 10:15:24 DEBIAN rclone[88980]: 2018/04/12 10:15:24 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/1204r646g3jbip7ephi7mafur9rgjto21of5c6oel0pvh3r89ci0
Apr 12 10:15:24 DEBIAN rclone[88980]: 2018/04/12 10:15:24 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: 'Crypt/qtven3b8rorak2le5h4htl55hg', cache ts: 2018-04-12 10:15:24.840588457 m=+995.442825758

This can be mitigated somewhat by using large cache expiry timeouts, but this only causes the issue to happen less often while causing other issues of its own.

I’ve run numerous tests using “rclone ls” to verify that some combination of flags wouldn’t ameliorate the issue to no avail. The two flags that had the most effect were “–fast-list” (positive) and “–dir-chunk-size” with values less than 1000 (negative).

time ( rclone ls GD:/Crypt/qtven3b8rorak2le5h4htl55hg --verbose=2 )
2018/04/12 10:55:23 DEBUG : rclone: Version "v1.40-065-ge82452ceβ" starting with parameters ["/usr/bin/rclone" "--verbose=2" "ls" "GD:/Crypt/qtven3b8rorak2le5h4htl55hg" "--verbose=2"]
2018/04/12 10:55:23 INFO  : Google drive root 'Crypt/qtven3b8rorak2le5h4htl55hg': Modify window is 1ms
     4166 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/1a2s21cj9phssceallhrcoslas41eilelh5m4jthc8kqcckbleolala3nqpv5gmkeutccrtmvieq2l5ktvoo02f55o6d0c84ptf53jo
   340039 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/ai2tdefgc62g8k47o1kocejbqo
     1012 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/h7j2a7n8rbr2lem75pqob0aei6r6ih8n30igbbdc8jph6m92mh6unre9v3mscjh555ig49b7d1t3pdc3u829hegjeviuv5hlvltrvhg
   340039 01tvcqqgqibipq27meg9cd3o39rfhro93l6sd064sctd6tpie5f0/mdraftr9ginhbjln4ecgkubcu8dcvso4qlfqhqatoh9b2fai4bd47nv6sar24oktvr9d49c7di97r1gbf99bejgiolj9ab5jcdihdgo
...
    69745 vvmclgoise9dt3097pumqn57uo0s3svndf9tb9l6t0ruj2122mog/k8pi3dnm3fjtkhkbp4mhjm8100q11v97masuhc0s33n42os1omb524msr1ca7750bhb7p6440d1p2m2pj8h8spjqfp7r0sc3kt8lurg
   504953 vvmclgoise9dt3097pumqn57uo0s3svndf9tb9l6t0ruj2122mog/nfsgi32ocqq19sucqst60gj9qo
2018/04/12 11:04:03 DEBUG : 18 go routines active
2018/04/12 11:04:03 DEBUG : rclone: Version "v1.40-065-ge82452ceβ" finishing with parameters ["/usr/bin/rclone" "--verbose=2" "ls" "GD:/Crypt/qtven3b8rorak2le5h4htl55hg" "--verbose=2"]
( rclone ls GD:/Crypt/qtven3b8rorak2le5h4htl55hg --verbose=2; )  2.77s user 1.15s system 0% cpu 7:05.92 total

With all that said, I don’t know if the bug lies with rclone, fuse, or the applications that handle the long directory listing times poorly. I was curious if anyone else had run into this and found a solutiont.

Thanks.


#2

So what it looks like is that when you did the list of the directory, cache took it upon itself to list all the subdirectories too.

Have you still got the log? Where exactly does the line with > ReadDirAll go?

Are you testing for performance with rclone ls? Note that --fast-list doesn’t work with google drive.

I think it is probably an rclone bug…


#3

What’s the mount and the commands you are trying to do when you pull the directory?

My biggest dir is only 1750 entries and it doesn’t work well if I do the first cache build by doing something like “ls -alR | wc -l” in there. If I let Plex scan the dir, it seems to step through in a nicer fashion as I think it does an ls at the top level and then goes through each folder.

I wonder if the problem is more exponential than linear if you cross a threshold.


#4

Mount command:

rclone mount "${MOUNT_NAME}:" "${MOUNT_DIR}" \
--allow-other \
--default-permissions \
--buffer-size=0 \
--vfs-cache-mode=writes \
--cache-tmp-upload-path="${TMP_UPLOAD_DIR}" \
--rc \
;

I used rclone ls only to test what kind of effect various parameters had on the speed of recursively listing a directory, but did not result in anything useful. As you’ve further pointed out, most variation was probably due to outside variables and not relevant.

I can produce the results reliably and have a verbose log saved. This is the part containing what was asked:

Apr 13 08:32:14 DEBIAN rclone[2478]: 2018/04/13 08:32:14 DEBUG : Google drive root 'Crypt': Checking for changes on remote
Apr 13 08:32:14 DEBIAN rclone[2478]: 2018/04/13 08:32:14 DEBUG : Google drive root 'Crypt': All changes were processed. Waiting for more.
...
Apr 13 08:36:30 DEBIAN rclone[2478]: 2018/04/13 08:36:30 DEBUG : LargeDir/: Attr:
Apr 13 08:36:30 DEBIAN rclone[2478]: 2018/04/13 08:36:30 DEBUG : LargeDir/: >Attr: attr=valid=1s ino=0 size=0 mode=drwxrwxr-x, err=<nil>
Apr 13 08:36:30 DEBIAN rclone[2478]: 2018/04/13 08:36:30 DEBUG : LargeDir/: ReadDirAll:
Apr 13 08:36:30 DEBIAN rclone[2478]: 2018/04/13 08:36:30 DEBUG : Cache remote gdcache:Crypt: list 'qtven3b8rorak2le5h4htl55hg'
Apr 13 08:36:30 DEBIAN rclone[2478]: 2018/04/13 08:36:30 DEBUG : qtven3b8rorak2le5h4htl55hg: list: empty listing
Apr 13 08:36:30 DEBIAN rclone[2478]: 2018/04/13 08:36:30 DEBUG : qtven3b8rorak2le5h4htl55hg: list: read 0 from temp fs
Apr 13 08:36:30 DEBIAN rclone[2478]: 2018/04/13 08:36:30 DEBUG : qtven3b8rorak2le5h4htl55hg: list: temp fs entries: []
Apr 13 08:37:00 DEBIAN rclone[2478]: 2018/04/13 08:37:00 DEBUG : qtven3b8rorak2le5h4htl55hg: list: read 3459 from source
Apr 13 08:37:00 DEBIAN rclone[2478]: 2018/04/13 08:37:00 DEBUG : qtven3b8rorak2le5h4htl55hg: list: source entries: [qtven3b8rorak2le5h4htl55hg/b6vn27vu5urb7au8tomnoc64imkuakarig76j2ddpdfuptl9r2069ig5kg5tr0hmna48jou55s6g6 ... (all 3459 child objects) ... qtven3b8rorak2le5h4htl55hg/1204r646g3jbip7ephi7mafur9rgjto21of5c6oel0pvh3r89ci0]
Apr 13 08:37:00 DEBIAN rclone[2478]: 2018/04/13 08:37:00 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/b6vn27vu5urb7au8tomnoc64imkuakarig76j2ddpdfuptl9r2069ig5kg5tr0hmna48jou55s6g6
Apr 13 08:37:00 DEBIAN rclone[2478]: 2018/04/13 08:37:00 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/mv1ulr1ufefmptnbqqmkdeflvt1ipvt993jvpq0o594468vepl80
Apr 13 08:37:00 DEBIAN rclone[2478]: 2018/04/13 08:37:00 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/mvq7nt41712f80n36b7d5ubf208gnf0kvcipgtmpemu98l1ad1eg
Apr 13 08:37:00 DEBIAN rclone[2478]: 2018/04/13 08:37:00 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/pmavqc3r65287kljfnr0jruprq9aao88m0p4u28v62m47ms2m8lo1n7ro1t8stpbe6q2ek75038fo
Apr 13 08:37:00 DEBIAN rclone[2478]: 2018/04/13 08:37:00 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/0k05d82ui3tbna419ifu9jpa9c
...
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/hb3f880eoshg3qfvbdptueplvuisrqe012t3274s26s92tm14avg
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/r19ahrq1pp4si5trvsb6scktt1tdgtd2kcs609dmtl2pil5pi53g
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: qtven3b8rorak2le5h4htl55hg/1204r646g3jbip7ephi7mafur9rgjto21of5c6oel0pvh3r89ci0
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : qtven3b8rorak2le5h4htl55hg: list: cached dir: 'Crypt/qtven3b8rorak2le5h4htl55hg', cache ts: 2018-04-13 08:38:04.765211873 -0700 PDT m=+352.418024454
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : LargeDir/: >ReadDirAll: item=3459, err=<nil>
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : /: Attr:
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : /: >Attr: attr=valid=1s ino=0 size=0 mode=drwxrwxr-x, err=<nil>
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : /: Lookup: name="LargeDir"
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : /: >Lookup: node=LargeDir/, err=<nil>
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : LargeDir/: Attr:
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : LargeDir/: >Attr: attr=valid=1s ino=0 size=0 mode=drwxrwxr-x, err=<nil>
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : /: Lookup: name="LargeDir"
Apr 13 08:38:04 DEBIAN rclone[2478]: 2018/04/13 08:38:04 DEBUG : /: >Lookup: node=LargeDir/, err=<nil>

Of note: immediately before performing this directory listing, I had rm -r the entire rclone cache dir rather than forcing a cache flush through rclone itself.

Comparing the two log outputs shows possibly that an “empty listing” completes much faster than a “cold listing.”


#5

Thanks. From that log (the directory listings inside the ReadDirAll and the >ReadDirAll it is very clear that this a problem with the cache backend recursing (or maybe looking only 1 more directory deep) on the directory read.

I had a look at the code, but I could see anything obviously wrong… I think it is probably best if you summarise this in a new issue on github and we can get @remus.bunduc to take a look at it.


#6

I should really be thanking you and @remus.bunduc.

GitHub issue: https://github.com/ncw/rclone/issues/2255