Remote data logger uploading to GDrive over unreliable satellite internet

I have a remote data logger running on a Raspberry Pi, battery operated, with a satellite internet connection. Every 15 minutes a new data sample is taken and a single 5 KB file is sent to Google Drive using rclone copy srcdir destdir. No other options are specified on the command line or in the config file. rclone v1.53.4.

This works fine, but periodically the upload either fails or rclone takes maybe 20 minutes to complete the transfer. This is more than likely due to the limited bandwidth and spotty nature of the satellite internet connection. I haven't yet been able to capture these failures with -vv.

I'm not trying to optimize speed; I just want to reduce the number of failed attempts and prevent a transfer from blocking other system activities by taking too long to complete.

What I'd like rclone to do is keep attempting to send the file up to some timeout limit. If the timeout is reached the rclone command is terminated and I'll reschedule it at a later time.
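
Something like the wrapper below is what I have in mind - just a rough sketch, with the source directory, remote path, and 10-minute limit as placeholders:

import subprocess

CMD = ["rclone", "copy", "/data/outgoing", "gdrive:WaterLogger/data"]

def upload(limit_seconds=600):
    # Kill rclone if it hasn't finished within the limit; the caller can
    # then reschedule the upload for a later slot.
    try:
        return subprocess.run(CMD, timeout=limit_seconds).returncode == 0
    except subprocess.TimeoutExpired:
        return False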

From my reading I believe the following options are part of the solution
--retries
--retries-sleep
--low-level-retries
--timeout
--contimeout
--expect-continue-timeout
--max-duration

It's unclear from what I've read how they might help address the spotty/unreliable internet problem and how they interact. Any recommendations or advice?

In theory --timeout (default 5 minutes) controls how long each connection can be idle before rclone cuts it off. In practice I have my doubts as to how well it works! I may re-implement it soon.

--contimeout is the timeout on the initial connection attempt.

Assuming --timeout kicks in (or some network breakage) then the next thing that will happen is a --low-level-retry - there are 10 of these by default.

The final thing is --retries (default 3) which retries the whole sync up to 3 times.

So if you are just sending 1 file, it might get tried up to 30 times!
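
As a back-of-the-envelope upper bound (assuming every attempt stalls for the full --timeout, which is the worst case, not the typical one):

RETRIES = 3             # --retries default
LOW_LEVEL_RETRIES = 10  # --low-level-retries default
TIMEOUT_MIN = 5         # --timeout default, in minutes

attempts = RETRIES * LOW_LEVEL_RETRIES   # up to 30 attempts for a single file
worst_case_min = attempts * TIMEOUT_MIN  # up to 150 minutes if every attempt stalls
print(attempts, worst_case_min)          # 30 150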

What I would do is have an outgoing directory of files to copy, then use rclone move to move them. It then doesn't matter if a single rclone move fails as the next one will pick it up.
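
For example, something along these lines (directory and remote names are just placeholders):

import pathlib
import shutil
import subprocess

OUTGOING = pathlib.Path("/data/outgoing")

def queue_sample(sample_file):
    # Drop each new sample into the outgoing directory...
    shutil.move(sample_file, OUTGOING / pathlib.Path(sample_file).name)

def flush_outgoing():
    # ...and let rclone move sweep up whatever has accumulated. Files that
    # fail to upload stay behind and get picked up by the next run.
    subprocess.run(["rclone", "move", str(OUTGOING), "gdrive:WaterLogger/data"])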

Probably what you want to do is tweak down --low-level-retries and --retries.

You could also set --max-duration which will stop rclone when it reaches the cutoff time.
If you can get a log with -vv from rclone, you can see exactly what goes wrong - for example whether it is doing low-level retries.

Very useful explanation. I am currently trying to capture a failure with -vv. I have a few further clarifying questions and confirmations.

So the first timeout is --contimeout to get the initial connection. Is it true that if this times out then the entire copy operation is terminated, and neither --low-level-retries nor --retries apply?

If a connection is established then --timeout starts as the actual transfer is occurring. If this times out, or some other network failure occurs (due to a dropped or super slow connection), it will retry --low-level-retries times, and it will repeat this process up to --retries times. Does --timeout apply to each of the --low-level-retries attempts? If so, then with the default value of --timeout=5m and 30 possible low-level attempts it could be 150m before it fails? How does --expect-continue-timeout apply to these attempts?

And finally --max-duration applies to the entire process including the initial connection attempt through actually performing the copy?

If --contimeout, --retries, or --max-duration is reached, is the process terminated with a non-zero exit code?

The --contimeout is the first connection yes. Depending on exactly what error this returns it will be low-level retried or not. So if it returns a timeout error, then it will be low level retried, but if it returns out of memory then it won't. That kind of thing. That is true for all errors. --retries will apply regardless.

Potentially, yes.

Potentially yes. However the timeout only kicks in if no data has been moved for that time which is probably unlikely.

Expect continue is more of a low-level HTTP thing. It is how long rclone waits for a response to the expect header before continuing. It probably isn't relevant here.

Yes, max duration applies to the whole sync. I think it can be retried though with --retries

The process will exit with non-0 if rclone gave up retrying stuff and couldn't continue. So there may be --contimeout errors or retries along the way, but if the data was transferred OK in the end, then rclone will exit with 0. If the --max-duration limit is reached rclone will exit with a non-zero exit code.
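
So from a scheduling script's point of view the exit code is the only thing to check - for example (a sketch; the command and paths are whatever you already use):

import subprocess

rc = subprocess.run(["rclone", "copy", "/data/outgoing", "gdrive:WaterLogger/data"]).returncode
if rc != 0:
    # rclone gave up (or hit --max-duration) - leave the file in place and
    # let the next scheduled run try again.
    print("upload failed, will retry on the next cycle")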

Again, thanks for the clarification on the options. I was able to capture a number of failed rclone logs with -vv. The entries in the attached log from 2/13/2021 11:30 through 13:15 show an extended period of spotty to no internet, then a slow connection, then back to normal. The log was captured using default values with a simple rclone copy srcdir gdrive:destdir.

2021-02-12-FailedExcerpt.log (94.2 KB)

These are the options I've been experimenting with.

First attempts at adjusted internet-related rclone options for a slow/spotty internet connection:

OPT_INTERNET = " ".join((
    "--retries=5",                    # retry the whole copy up to 5 times (default 3)
    "--retries-sleep=10s",            # wait 10s between those retries
    "--low-level-retries=5",          # fewer low-level retries than the default 10
    "--timeout=5m",                   # IO idle timeout (the default)
    "--contimeout=2m",                # allow 2m to establish the initial connection
    "--expect-continue-timeout=60s",  # how long to wait for a response to the expect header
    "--max-duration=5m",              # hard cap on how long the transfer may run
))

These are my best guesses at what makes my low-MIPS, low-memory RPi with slow internet work better.

OPT_RPI = " ".join((
    "--checkers=1",              # a single checker to keep CPU/memory use down
    "--multi-thread-streams=0",  # disable multi-thread downloads
    "--no-traverse",             # don't list the destination before copying
    "--transfers=1",             # one file transfer at a time
    "--check-first",             # do all checks before starting any transfers
    "--use-mmap",                # use mmap for buffer allocation
    "--buffer-size=64M",         # per-transfer buffer (rclone default is 16M)
))
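
For reference, this is roughly how the option strings get stitched into the actual command (paths and remote name are placeholders):

import shlex
import subprocess

cmd = " ".join((
    "rclone copy /data/outgoing gdrive:WaterLogger/data",
    OPT_INTERNET,
    OPT_RPI,
))
subprocess.run(shlex.split(cmd))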

The log looks about how I might expect, except rclone isn't retrying the "network is unreachable" error from the initial oauth connection, and it probably should be.

These look ok to me.

I wouldn't use --no-traverse unless you only have a few things in the source. Google drive in particular hates --no-traverse!

Note that the default buffer is 16M and you can use 0 and things will work fine.

--check-first may use more memory as it builds the sync in memory first, but it will decrease load by not interleaving checks and transfers. For small syncs it is probably a win.

I was able to capture a really bad connection - it took over an hour to complete a transfer to GDrive that normally takes about 1 minute. Again this was with a basic rclone copy srcdir destgdrive: with no other params. Check the transfer starting at 18:15.

Other than using --max-duration to kill this and try again later when the connection is really slow/bad, is there anything that can be learned from this log about how other parameters might be applied to make things more robust, or to have rclone determine via the retry and timeout params that things are just slow and it's time to give up?

It looks like both --retries and --low-level-retries are being used here, based on the fact that I see the same file list being changed multiple times.
2021-02-17-181500-GDriveOver1Hr.txt (345.3 KB)

The --low-level-retries and --retries are tuned so that rclone transfers are successful if possible.

Though quite a lot of your transfers fail immediately, as the request made when creating the filesystem fails. This is looking up the root_folder_id. You could put this in the config to save this transaction - that will allow everything to be retried. Rclone doesn't retry failures to create the backend at the moment.

Google drive root 'WaterLogger/data': root_folder_id = "0AEGVUEE5jRR3Uk9PVA" - save this in the config to speed up startup
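
For example, if the remote is called gdrive in your rclone.conf, adding that line would look something like this (the remote name is an assumption; your existing keys such as scope and token stay as they are):

[gdrive]
type = drive
root_folder_id = 0AEGVUEE5jRR3Uk9PVA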

I also note that rclone isn't retrying the "io/timeout" error which it really should be I think.

For some reason this works fine with v1.54 but not with v1.53 - there haven't been any rclone changes here and there is nothing obvious in the go changelog, so I suggest giving v1.54 a go here if you can.

v1.54 will certainly help here.

If I had one piece of advice it would be to change your workflow to use rclone move rather than rclone copy. That will mean rclone doesn't have to iterate all the existing stuff in the transfer directory which will speed things up.

Also setting root_folder_id will help a little bit.
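
Putting those two suggestions together, the scheduled job ends up as something like this (source directory and log path are placeholders, with root_folder_id already saved in the config):

rclone move /data/outgoing gdrive:WaterLogger/data --max-duration=5m -vv --log-file=/var/log/rclone-upload.log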
