Altering low-level retries to resolve transient errors

Hey @ncw — happy new year! I previously asked some questions about some transient copying errors we were seeing and I'm trying to figure out if I can tweak our settings to cut down on them.

What is the problem you are having with rclone?

I run a daily rclone copy from Swift (OVH) to S3 (R2). On about 50% of days, a single file fails to copy. This isn't a huge deal, and on the subsequent copy the file gets copied over. I'm trying to figure out whether I can make the operation more resilient to these copy errors, so that I can rely on rclone's exit code to tell me whether there was a larger issue with the copy command. Right now we get an email alert if rclone exits with a nonzero status, but so far that has only ever indicated the above problem (which we don't take any action on).
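For context, here is a minimal sketch of the kind of exit-code handling I have in mind for the alerting side. The numeric codes are the ones listed in the "Exit code" section of the rclone docs; the function name and the policy of which codes count as noise are my own assumptions. I also haven't confirmed which code this particular failure produces, so that would need checking first.

```shell
#!/bin/sh
# classify_rclone_exit CODE
# Map an rclone exit code to an alerting decision.
# The codes come from the rclone docs; the policy is an assumption.
classify_rclone_exit() {
  case "$1" in
    0) echo "ok" ;;             # success
    5) echo "transient" ;;      # temporary error that more retries might fix
    8) echo "limit-reached" ;;  # --max-transfer limit reached
    *) echo "alert" ;;          # usage, uncategorised, fatal, and anything else
  esac
}
```

The idea would be to call `classify_rclone_exit $?` right after the copy and only send the email when it prints `alert`.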

Unfortunately I'm unable to reproduce this error reliably with a small copy and more verbose logs. I've also set --retries 1 on the command to avoid the entire copy being retried on error.

I'd like to make sure that I understand exactly what is going on here, so I'll state a few of my assumptions:

  • I haven't altered --low-level-retries, so I presume it's at its default of 10, and that this error only surfaces once a single file has been retried 10 times and failed every time
  • Similarly, I assume that other files could be erroring out fewer than ten times, and that would simply proceed silently as successful copies once they succeed
  • I believe with my settings, even if a file fails (10x) early in the copy procedure, the entire copy will continue, so I end up with a successful copy minus the single failing file

It seems to me like 10 retries might be just about enough given that I see between 0 and 1 failed file copy a day; I'm wondering if simply altering this value to be higher (say, 20 low level retries) might be sufficient to result in this command succeeding almost all the time. I'd love your thoughts!
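Concretely, the change I'm considering looks like the sketch below. It is the same copy with only `--low-level-retries` added; the `run_copy` wrapper and the `RCLONE` override are just for illustration and aren't part of my current setup. Whether a higher value helps at all depends on whether the failing request is actually being retried at the low level, which I can't tell from the logs.

```shell
#!/bin/sh
# RCLONE can be overridden (e.g. RCLONE=echo) to preview the command
# without running it. This override is an illustrative assumption.
RCLONE="${RCLONE:-rclone}"

# run_copy [N]
# Same daily copy as before, with --low-level-retries set to N (default 20).
run_copy() {
  "$RCLONE" copy swift-source: s3-dest:prod-bucket \
    --stats-one-line \
    --filter-from /app/filter1.txt \
    --filter-from /app/filter2.txt \
    --filter-from /app/filter3.txt \
    --filter-from /app/filter4.txt \
    --max-transfer 15G \
    --retries 1 \
    --low-level-retries "${1:-20}" \
    -v
}
```

Running `run_copy` would use 20 low-level retries, and `run_copy 30` would try a higher value.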

Run the command 'rclone version' and share the full output of the command.

rclone v1.63.1
- os/version: ubuntu 22.04 (64 bit)
- os/kernel: 5.15.0-1050-aws (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.20.6
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Source: Swift (OVH)
Destination: S3 (Cloudflare R2)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy --stats-one-line swift-source: s3-dest:prod-bucket --filter-from /app/filter1.txt --filter-from /app/filter2.txt --filter-from /app/filter3.txt --filter-from /app/filter4.txt --max-transfer 15G --retries 1 -v

Please run 'rclone config redacted' and share the full output. If you get command not found, please make sure to update rclone.

[swift-source]
type = swift
env_auth = true

[s3-dest]
type = s3
provider = Cloudflare
env_auth = true
endpoint = https://XXXXXXXXXXXXXXXXXXXX.r2.cloudflarestorage.com/

A log from the command that you were trying to run with the -vv flag

Unfortunately I'm not able to run this command with -vv due to the number of files, and running a more limited set with -vv doesn't typically cause the error. Here's a redacted and condensed set of logs — the removed log lines are simply files copying over correctly and the stats printing.

2024/01/02 16:00:55 INFO  : XXXXXXXXXXXXX: Copied (new)
2024/01/02 16:00:55 INFO  : XXXXXXXXXXXXX: Copied (new)
2024/01/02 16:01:49 INFO  :           0 B / 0 B, -, 0 B/s, ETA - (chk#2843/12858)
2024/01/02 16:02:49 INFO  :           0 B / 0 B, -, 0 B/s, ETA - (chk#5876/15890)
2024/01/02 16:03:49 INFO  :           0 B / 0 B, -, 0 B/s, ETA - (chk#8971/18985)
2024/01/02 16:04:49 INFO  :           0 B / 0 B, -, 0 B/s, ETA - (chk#12039/22052)
2024/01/02 16:05:49 INFO  :           0 B / 0 B, -, 0 B/s, ETA - (chk#15060/25073)
2024/01/02 16:06:49 INFO  :           0 B / 0 B, -, 0 B/s, ETA - (chk#17737/27751)
2024/01/02 16:12:06 INFO  : XXXXXXXXXXXXX: Updated modification time in destination
2024/01/02 16:14:46 ERROR : XXXXXXXXXXXXX: Failed to copy: expected element type <Error> but have <html>
2024/01/02 17:38:49 INFO  :     2.122 GiB / 2.122 GiB, 100%, 0 B/s, ETA - (chk#305846/305919)
2024/01/02 17:38:50 ERROR : Attempt 1/1 failed with 1 errors and: expected element type <Error> but have <html>
2024/01/02 17:38:50 INFO  :     2.122 GiB / 2.122 GiB, 100%, 0 B/s, ETA -
2024/01/02 17:38:50 Failed to copy: expected element type <Error> but have <html>
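Since the full -v output is too long to post, this is roughly how the failures can be pulled out of it. It's a sketch that assumes the output was saved to a file (for example via rclone's `--log-file` flag) and that the lines follow the format shown above; the function names are made up for illustration.

```shell
#!/bin/sh
# list_errors LOGFILE
# Print only the ERROR-level lines from a saved rclone log.
list_errors() {
  grep ' ERROR : ' "$1"
}

# count_failed_copies LOGFILE
# Count per-file "Failed to copy" errors, ignoring the end-of-run
# "Attempt N/M failed" summary line.
count_failed_copies() {
  grep -c ' ERROR : .*: Failed to copy: ' "$1"
}
```

On the excerpt above, the per-file count would be a single failed file, which matches what I see on the bad days.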
