Rclone is stuck when a large number of files are transferred

wangzejian123 · September 16, 2019, 6:34am

I use the following command:

Source (10.49.1.236) information: there are 518 files in one small directory and 10,000,000 files in each of the other eight directories. The small directory transfers normally, but when transferring the eight large directories rclone seems to be stuck. The rclone log does not change except for the timestamp. (In an earlier test with 1 million files, rclone detected 10,000 files and then started transferring.) It looks like it keeps reading the directory and never stops. Tracing the threads shows that lstat and getdents64 are still called every minute, but most of the time the threads are blocked in futex. A packet capture showed that it was still reading the directory, but it had been reading for more than ten hours, and the number of files detected should already have exceeded 10,000.

The version I use: rclone-1.45-1.el7.x86_64
I tried setting --max-backlog to 1, but it still got stuck.
Does anyone know where the problem might be and how to track it down? Thank you.

Animosity022 · September 16, 2019, 10:15am

Can you update to the latest and try again as your version is a bit old?

Can you run the command with -vv and share the output?

ncw · September 16, 2019, 10:41am

rclone will be listing the 10,000,000 file directory. This will undoubtedly take some time especially over NFS.

Rclone won't start transferring anything from the directory until it has read it all.

Note also that rclone uses getdents64 whereas libc uses getdents; I've seen kernel bugs in getdents64 because it doesn't get used as often (e.g. in cifs and a Go bug report).

How long does rclone size /path/to/10000000filedirectory take?

@Animosity022's advice to use a newer rclone is good, as you'll get a newer Go runtime too.

wangzejian123 · September 16, 2019, 12:38pm

Oh, so reading directories and selecting files are two separate steps. --max-backlog only applies in the file selection stage, but rclone still needs to read the directories first. This takes up memory, so swap is used to make sure there is enough.
Does rclone need to read all directories in advance, or just one directory of 10 million files at a time?

Running rclone size directly on the NFS server that stores the 10,000,000-file directory took less than three minutes. But running rclone size on the NFS client that mounts this directory had still not finished after 150 minutes.

I'll test the speed of the latest version 1.49.3 next.

ncw · September 16, 2019, 2:49pm

No, it doesn't need to read all the directories in advance.

However, it needs to read a whole directory before acting on it.

This is likely the problem. Actually reading the directories over NFS is very slow. You can tweak NFS to increase the speed, I think; see the link above.

wangzejian123 · September 17, 2019, 9:18am

NFS does slow down directory reads, but there may be other problems. Running 'rclone size' on the NFS client took 1 minute 4 seconds for a directory of 100,000 files. In the same environment it took 10 minutes for a directory of one million files, and for a directory of ten million files it ran for nearly a day without finishing. The NFS server uses xfs, and the NFS client is a virtual machine with 4 GB of memory. This may be related to indirect inodes or swap, but it's still strange.
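For reference, the arithmetic behind "still strange" can be sketched as below. It only uses the two completed timings quoted above (64 seconds and 10 minutes) and assumes, purely for illustration, that listing cost grows linearly with the number of files; the variable names are made up for this sketch.

```python
# Completed 'rclone size' timings from the post: file count -> seconds.
measurements = {
    100_000: 64,      # 1 minute 4 seconds
    1_000_000: 600,   # 10 minutes
}

# Throughput for each completed run, in files per second.
rates = {n: n / seconds for n, seconds in measurements.items()}

# Assumption: if cost scaled linearly, ten million files would proceed at
# roughly the same rate. Use the slower observed rate to be pessimistic.
slowest_rate = min(rates.values())
expected_10m_seconds = 10_000_000 / slowest_rate

for n, rate in rates.items():
    print(f"{n:>10,} files: {rate:,.1f} files/s")
print(f"linear estimate for 10,000,000 files: "
      f"{expected_10m_seconds / 60:.0f} minutes")
```

Under that linear assumption the ten-million-file run would be expected to finish in under two hours, so a run lasting nearly a day points to cost that grows faster than linearly.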

ncw · September 17, 2019, 9:41am

What happens if you do the same tests with du -hs (which does pretty much the same thing as rclone size)?

Is rclone swapping when you do the 10 million files?
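If a third data point besides rclone size and du is useful, a minimal Python sketch of the same kind of measurement is shown below. It is not how rclone or du is implemented; it assumes a flat directory (no recursion) and the function names dir_size and timed_dir_size are made up for this example.

```python
import os
import time


def dir_size(path):
    """Return (file_count, total_bytes) for regular files directly in path."""
    count = 0
    total = 0
    # os.scandir yields entries lazily, so the full listing is not
    # held in memory at once.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                count += 1
                total += entry.stat(follow_symlinks=False).st_size
    return count, total


def timed_dir_size(path):
    """Run dir_size and also return the elapsed wall-clock seconds."""
    start = time.monotonic()
    count, total = dir_size(path)
    return count, total, time.monotonic() - start
```

Calling timed_dir_size on the NFS mount point for the 100,000, one million and ten million file directories would show whether the slowdown also appears outside rclone and du.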

wangzejian123 · September 17, 2019, 3:52pm

I tested it with 100,000 files. 'du -hs' is about 30% slower than 'rclone size', probably because getdents64 performs better.
Indeed, I ran 'du -hs' for nearly ten hours and it still had not finished. I found that 'du -hs' reads 100,000 files at a time, starting from scratch every time! And each getdents call reads 50 directory entries at a time, but the cookie a getdents call starts from is not the last cookie returned by the previous getdents call. These two problems make things slower and slower, so getdents takes up more and more time as the listing goes on. That should be the reason.
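The effect described above can be illustrated with a toy cost model. This is not a measurement of NFS or getdents; it assumes that a batched read which does not resume from the previous cookie has to walk the directory again from the beginning, and the function name entries_scanned is made up for this sketch.

```python
def entries_scanned(total, batch, resume_from_cookie):
    """Count directory entries walked while listing `total` entries in batches.

    resume_from_cookie=True  : each batch continues where the last one ended.
    resume_from_cookie=False : each batch walks from the start of the
                               directory to reach its offset (assumption).
    """
    scanned = 0
    offset = 0
    while offset < total:
        n = min(batch, total - offset)
        if resume_from_cookie:
            scanned += n            # only the new entries are touched
        else:
            scanned += offset + n   # earlier entries are walked again
        offset += n
    return scanned


linear = entries_scanned(1_000, 50, resume_from_cookie=True)
restarting = entries_scanned(1_000, 50, resume_from_cookie=False)
print(linear, restarting)  # 1000 10500
```

In this model the work with restarts grows roughly with the square of the directory size: going from 1,000 to 10,000 entries raises the count from 10,500 to 1,005,000, about a hundredfold for ten times the entries, which is the same shape as the timings in this thread.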

Swapping did occur while rclone was transferring the 10 million files; the last picture in the first post shows the memory usage at that time.

ncw · September 18, 2019, 8:52am

So this isn't specifically an rclone problem if du takes a similar length of time.

Did you try increasing rsize and wsize on the NFS mount? That can help with performance.

Can you try on a machine with a bit more memory so rclone does not swap?
