High load average using rclone mount + storj

What is the problem you are having with rclone?

I have the same setup on 4 different servers, but 2 of them for some reason get a higher and higher load average. I have tried looking into iotop, vnstat, vmstat 1 and iostat -x 5, and as far as I understand the numbers, the CPUs are mostly idle and waiting, network activity is not stressing the machine as it's just random data coming in and out, and running sync does not take more than a split second to finish. The system remains very usable and fast to respond, other than the unusually high load average (in the 400s after several minutes).
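
One thing I haven't been able to rule out yet: tasks stuck in uninterruptible sleep (D state), which as far as I know count towards the load average even while the CPUs sit idle. Would something like this be a sensible way to check for that? (Just a sketch of what I have in mind, not something I've confirmed on these hosts; the exact ps flags may differ per system.)

# list tasks in uninterruptible sleep (D state); these inflate the
# load average without using any CPU
ps -eo state,pid,comm | awk '$1 ~ /^D/'

# watch how the count evolves over a few minutes
watch -n 5 "ps -eo state | grep -c '^D'"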

I have also tried adjusting tpslimit, workers, rps, chunk size and cache time on the servers where the load goes crazy, but couldn't get any conclusive results.
Are there any OS settings I should be looking at that could be causing such weird behavior?
The hosts that don't have the high load are on v1.52.1, but downgrading did not help the ones that do.
Any other information I can provide to help debug this?

What is your rclone version (output from rclone version)

rclone v1.52.2

  • os/arch: linux/amd64
  • go version: go1.14.4
and

rclone v1.52.1
  • os/arch: linux/amd64
  • go version: go1.14.4

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Arch Linux, and another host with Ubuntu

Which cloud storage system are you using? (eg Google Drive)

Google Drive, with a cache remote on top of it

The command you were trying to run (eg rclone copy /tmp remote:tmp)

/usr/bin/rclone mount \
    --log-level=INFO --log-file=/var/log/rclone.log \
    --config=~/.config/rclone/rclone.conf \
    --allow-other \
    --poll-interval 15s \
    --timeout 1m \
    --cache-chunk-path=/caches/remote/chunks \
    --cache-dir=/caches/remote/vfs \
    --cache-db-path=/caches/remote/db \
    --drive-use-trash=false \
    --user-agent storjbox \
    --gid=1000 --uid=1000 \
    --tpslimit 6 --transfers 4 \
    --vfs-read-chunk-size 16M --vfs-read-chunk-size-limit 50G \
    --vfs-cache-mode writes \
    --dir-cache-time=86400m --cache-info-age=7d \
    --cache-rps=10 \
    cached_remote:/ /local_mount

The rclone config contents with secrets removed.

[direct-remote]
type = drive
client_id = xxxxxxxxxxxxxxxxxxxxxxxxx.apps.googleusercontent.com
client_secret = -xxxxxxxxxxxx
scope = drive
root_folder_id = xxxxxxxxxxxxxxxx
token = {"access_token":"xxxxxxxxxx","token_type":"Bearer","refresh_token":"1//xxxxxxxxxx","expiry":"2020-07-21T01:17:49.478592788+02:00"}
rps = 50
tpslimit = 4

[cached_remote]
type = cache
remote = direct-remote:/foldder-inside-remote
info_age = 5d
chunk_total_size = 20G
workers = 2
rps = 10

A log from the command with the -vv flag

https://pastebin.com/Y2iwqQAj

hello,
what makes you think rclone is at fault?
what does storj have to do with this, as the mount is using gdrive?

are you sure you need the cache backend?
it is the source of many problems, as seen in the number of forum posts.
https://rclone.org/cache/#status

you wrote "sync does not take more than a split second to finish"
are you using rclone mount, rclone sync, some combination of the two or some other kind of sync?

I'm not sure rclone is at fault, but I thought I'd ask for some help on what could be causing such load on a few of the systems, and whether it could be related to rclone at all; i.e. as I don't know what's causing it, I also don't know what's not causing it...

My use case is running storj and using a mounted remote via rclone. I'm using the cache backend to help keep the number of API calls under "control".

Sorry for not being clear about sync. I meant simply running the sync command to flush buffers to disk, not rclone sync.

My best wild guess at this point is something between storj and rclone. As the only thing I can really adjust is rclone, I decided to ask for help here. I know I didn't specify exactly what numbers I've tried for the different settings, but as none of my changes made any difference, I wasn't sure those details would help others. I've reduced the number of workers and rps and increased the cache time, for example. The only small difference it made was that instead of reaching loads in the 400s within 10 minutes, it would take longer, but it would still happen.

As I mentioned I have other systems with the same setup and they are fine.

i would remove that cache.
as per the link i shared.
"There are many docs online describing the use of the cache backend to minimize API hits and by-and-large these are out of date and the cache backend isn't needed in those scenarios any more."


Thanks for your reply. I was going over the cache-related bugs and this one caught my eye:

I'll give it a try without the cache backend, and report back
