Rclone keeps restarting

Indeed, that gives me

max locked memory           (kbytes, -l) 3049767
max memory size             (kbytes, -m) unlimited

So that is where the limit comes from. But where is it set?
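
One way to see the limits a running process actually has (a quick check, assuming rclone is already running and pgrep is available) is to read them from /proc:

cat /proc/$(pgrep -x rclone)/limits

That lists the soft and hard value for each resource, including "Max locked memory", regardless of where it was configured.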

chatGPT tells me

The default resource limits are usually defined in the /etc/security/limits.conf file. This file contains default limits for different user and group classes. (...)

In addition to the default limits set in limits.conf, there are also system-wide limits set in the kernel that apply to all processes on the system. These limits can be viewed and changed using the sysctl command or by modifying the kernel parameters in the /etc/sysctl.conf file.

But both of these files contain nothing but commented lines...

Correct.

That's the spot.

Mine as an example only contains:

# End of file

root     soft   nofile  65535
root     hard   nofile  65535
*     soft   nofile  65535
*     hard   nofile  65535

and

root@gemini:/etc/security# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 127356
max locked memory           (kbytes, -l) 4091400
max memory size             (kbytes, -m) unlimited
open files                          (-n) 65535
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 127356
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

I'm on Ubuntu though, so the defaults are slightly higher. You can adjust that file, add the entries, and reboot.

I think it's something like:

* soft memlock unlimited
* hard memlock unlimited
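
As far as I know, limits.conf is applied by pam_limits at session start, so the change only takes effect after a re-login (or reboot). It can then be verified with:

ulimit -l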

My point was: if the limits aren't set in these files, where do they come from?

So you're saying that each OS has its built-in defaults?

I'm getting into quite deep low-level settings here. So deep that I can't even manage to temporarily increase the limits for the user running rclone, just to test things out. ulimit seems to be no ordinary command...
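
ulimit is indeed a shell builtin rather than a binary: it only affects the current shell and whatever you launch from it, and raising a hard limit needs root. As a sketch (assuming util-linux's prlimit is installed), the limit can also be changed for an already-running process:

# raise the memlock limits of the running rclone as root (soft:hard)
sudo prlimit --pid "$(pgrep -x rclone)" --memlock=unlimited:unlimited

# or raise it in a root shell and launch rclone from there
ulimit -l unlimited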

Not sure I want to set it to unlimited. I assume there is a reason for those default settings. I was thinking that 6GB should be fine for rclone to handle 4-5m files, no?

Locked memory is used to stop memory being paged out to the swap file.

Rclone doesn't use it, so I'm afraid this is probably not the problem.

In general processes only lock small amounts of memory (like the keys for crypto) so I don't think you should change the default.

Your ulimits look fine so I think we have to look elsewhere for why rclone gets killed when it approaches 4G of memory use.

You are running 64 bit kernel with 64 bit rclone so there should be no reason it stops at 4GB.

Could systemd be limiting it? I don't see a memory limit in your systemd config file but maybe there is a place to put it I don't know about.
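
If rclone is started from a systemd unit, something like this should show whether a limit is configured (assuming the unit is called rclone.service; MemoryMax= is the current option, MemoryLimit= the deprecated one):

systemctl show rclone.service -p MemoryMax -p MemoryLimit
systemctl cat rclone.service | grep -i memory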

Any ideas @Animosity022 ?

They are OS limits.

Yes.

Set it to whatever you want as it's up to you. Ubuntu defaults to unlimited.

My understanding is that the limits for locked memory are not relevant for rclone, because rclone doesn't use locked memory.

Last thing I tried was running

strace -e setrlimit nohup rclone serve sftp pcloud:Backup/ --addr :2022 --user ******* --pass **********t --log-file=/zfs/NAS/config/rclone/rclone.log --vfs-cache-mode writes --rc &

which gave me a lot of

--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=982083, si_uid=1000} ---

[identical lines removed]

setrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=982083, si_uid=1000} ---

[many identical lines removed]

+++ killed by SIGKILL +++
nohup: ignoring input and appending output to 'nohup.out'
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=982083, si_uid=1000} ---

[identical lines removed]

setrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=982083, si_uid=1000} ---

[many identical lines removed]

+++ killed by SIGKILL +++

I have no idea what to do with this, but what puzzles me is that the command was killed but then apparently restarted somehow (not by me).

Note how there are many SIGURG lines before each SIGKILL. Is it possible that the problem is not memory but whatever SIGURG stands for? When there are too many of those, it gets killed.

Sorry, not locked memory:

max memory size             (kbytes, -m) unlimited

I already have that:

Look at what dmesg just gave me:

[1039514.240832] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1554.scope,task=rclone,pid=982083,uid=1000
[1039514.240863] Out of memory: Killed process 982083 (rclone) total-vm:4709980kB, anon-rss:3977840kB, file-rss:40kB, shmem-rss:0kB, UID:1000 pgtables:7992kB oom_score_adj:0

What does that tell us (except for that memory is the problem)?

SIGURG is used internally by the Go runtime for pre-emptive multitasking, so you can ignore those.

Well, it tells us that the out-of-memory killer killed rclone, which is new info I think.

The OOM Killer is configurable (see /proc/PID/oom_score_adj), and also a PITA. It often kills the wrong process in my experience.

From the above, rclone is using 3.9 GB of RAM, which doesn't seem unreasonable. You could adjust rclone's oom_score_adj so it is less likely to be killed.
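
A convenient wrapper for this (a sketch, assuming a reasonably recent util-linux) is choom, which reads and writes the same /proc files:

# set the oom_score_adj of the running rclone
sudo choom -p "$(pgrep -x rclone)" -n -500

# or launch a command with the adjustment already applied
sudo choom -n -500 rclone serve sftp ...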

I would guess you've got some temporary task which is using a lot of memory, and the OOM killer is choosing to kill rclone at that point rather than the temporary task (which is just the sort of stupid thing the OOM killer does - read the ∞ tales on the internet of the OOM killer killing people's production databases).

You can also cause the OOM Killer to run rampant if you've tweaked vm.overcommit_memory or vm.overcommit_ratio.
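
You can check the current values (the usual defaults are 0 and 50):

sysctl vm.overcommit_memory vm.overcommit_ratio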

I spent a long time battling the OOM killer on my previous laptop, which didn't quite have enough RAM to do what I wanted to do and run slack/teams/etc. It would invariably kill the wrong process. I got the best fix by enabling ZRAM, which is compressed swap in RAM, and I got another 9 months' use out of the laptop before I had to upgrade to a model which could take more RAM.
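
For reference, on Debian/Ubuntu I believe enabling it is roughly this (assuming the zram-tools package; size and algorithm live in /etc/default/zramswap):

sudo apt install zram-tools
sudo systemctl restart zramswap   # apply after editing /etc/default/zramswap
swapon --show                     # should now list a /dev/zram device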

Sometimes I'm really puzzled why certain things don't get fixed.

The difference from the case at hand is, of course, that we don't have any RAM shortage...

I have done no such thing.

I've now launched it with

nohup nice -n -10 rclone serve sftp pcloud:Backup/ --addr :2022 --user ********h --pass ******** t --log-file=/zfs/NAS/config/rclone/rclone.log --vfs-cache-mode writes --rc &

I guess I can remove the --rc now. Will do next time.

Let's see if this helps. If not should I go to -15 or even lower?

I'm not sure adjusting the nice value of the process will change the OOM killer's behaviour, unless you were launching it with a positive nice value.

Try finding the running rclone's PID and doing this as root

echo -10 > /proc/PID/oom_score_adj

That should make rclone about 1/1024 less likely to be killed by the OOM killer. Maybe!

OK, thanks. I guess chatGPT was hallucinating once again when it told me to adjust the nice value.

I'm not sure whether one per mille will make a difference here.

Also, I'm wondering whether I can set the oom_score at launch and chatGPT suggested this

echo "-500" > /proc/self/oom_score_adj && rclone [your rclone options and arguments here]

is that correct?

I did a bit of research - I think that is correct - oom_score_adj is inherited across forks.

So

echo "-500" > /proc/self/oom_score_adj

Will change the score of the shell you are using and then when you come to launch rclone it will start with the same score.

rclone ...

You can cat /proc/PID/oom_score_adj after rclone has started to check. Note that the lower the value, the less chance of being killed; -1000 is the lowest value.
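
For example (a check along these lines, assuming a single rclone process):

cat /proc/$(pgrep -x rclone)/oom_score_adj   # should print the value you set
cat /proc/$(pgrep -x rclone)/oom_score       # the live score the killer actually uses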

Yes, rclone is now running with oom_score_adj = -500. However, what we're setting with that is not the actual oom_score; it's just the amount by which the oom_score is adjusted. I have no idea how the actual oom_score (before adjustment) is calculated, but I suppose that is part of the OOM mystery you mentioned. From what I can see, we cannot set the absolute OOM score, only adjust whatever his majesty the OOM comes up with. And my understanding is that the OOM score for every process is continuously being updated based on divine insights.

I believe that one somewhat more transparent factor that goes into the OOM score is the process hierarchy, i.e. child processes are more likely to be killed than parents. I suppose this means that I may actually get different results when I run rclone from the terminal compared to running it as a cron job. Or even when I run it with nohup or without. Which can be quite disconcerting when you're troubleshooting and believe you're doing the exact same thing...

Anyway, I am no longer surprised that rclone kept being killed, because before I made any adjustments it had an OOM score of 667 or something like that. Yesterday, when I started rclone with the -500 adjustment, it got an OOM score of 334. So without the adjustment, the OOM was basically saying: you do not deserve to live. And when I checked this morning, it had even gone up to 440. So I increased the adjustment to -800. It's crazy.

Then again, all OOM scores need to be seen relative to the OOM scores of all other processes. I'm not sure what is the right way of getting a list of all OOM scores, but I do see a list of processes with their oom_score_adj values in dmesg, and there I see that all docker-proxy processes (which presumably are each somehow associated with one docker container) have an adjustment value of -500, and the containerd-shim processes (which I assume stand for one docker container each) even have -998.

So someone (probably docker itself) is holding a protective hand over all of my containers (although the processes running inside the containers don't have any adjustment), so that rclone, which for some reason I didn't put into a container, has a weak standing when the OOM comes by (most apps on my server run in containers).

Maybe I could have saved myself all this trouble if I had installed rclone in a container. I did try, though, but something didn't work (maybe it was the problem with the rclone container not being accessible on the host itself?), so I went the supposedly easy way...

Anyway, I'll let this test instance of rclone run for now, and I hope it will just continue, but eventually I want to automate it again so that it starts on reboot. Running it as a system service seems the best way to go. I saw that systemd has a setting called OOMScoreAdjust which comes in handy...

Is there a recommended way of doing this (i.e. a config file)?

Try this for listing OOM scores with the largest ones last (I nicked most of this from Stack Overflow)

while read -r pid comm; do printf '%d\t%d\t%s\n' "$pid" "$(cat /proc/$pid/oom_score)" "$comm"; done < <(ps -e -o pid= -o comm=) | sort -k2 -n

Makes for interesting reading.

These processes on my machine obviously take care that they won't be killed:

1195	1	containerd
1101	67	dbus-daemon
1662519	67	postgres
43878	67	snapd
2080	334	dockerd
2470674	500	systemd-journal

The highest scoring 70 processes on my machine are all chrome which probably tells you something about how I use chrome :wink:

I use systemd. I didn't know about the OOMScoreAdjust option - that seems ideal.
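
For reference, a minimal unit sketch (unit name and paths are placeholders, not a tested config):

# /etc/systemd/system/rclone-sftp.service (hypothetical)
[Unit]
Description=rclone serve sftp
After=network-online.target

[Service]
ExecStart=/usr/bin/rclone serve sftp pcloud:Backup/ --addr :2022 --vfs-cache-mode writes
OOMScoreAdjust=-500
Restart=on-failure

[Install]
WantedBy=multi-user.target

After editing, systemctl daemon-reload and systemctl restart rclone-sftp should pick it up.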
