Running rclone rcd in a Kubernetes cluster in load-balanced mode for HA

What is the problem you are having with rclone?

We are running rclone rcd pods in our k8s deployment behind an internal load balancer. From other pods in the cluster we call the rc HTTP API to start copy/sync jobs in async mode.

Each caller that starts a copy/sync request then makes subsequent HTTP calls every 30 seconds to check the job status.
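For reference, the start-then-poll flow described above can be sketched roughly as follows. It uses the rc endpoints `sync/copy` (with `_async`) and `job/status`; the service hostname `rclone-rcd`, the helper names `rc_call` and `start_and_wait`, and the absence of authentication are assumptions made for illustration, not our actual client code.

```python
import json
import time
import urllib.request


def rc_call(base_url, path, params):
    """POST a JSON body to an rc endpoint and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/{path}",
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def start_and_wait(src, dst, base_url="http://rclone-rcd:5572", call=rc_call,
                   interval=30, sleep=time.sleep):
    """Start an async copy and poll job/status until the job reports finished.

    base_url is a hypothetical in-cluster service name; 5572 is the default
    rc port. Authentication is left out of this sketch.
    """
    # _async makes rcd reply with {"jobid": N} right away instead of blocking.
    started = call(base_url, "sync/copy",
                   {"srcFs": src, "dstFs": dst, "_async": True})
    jobid = started["jobid"]
    while True:
        status = call(base_url, "job/status", {"jobid": jobid})
        if status.get("finished"):
            return status
        sleep(interval)
```

The `call` and `sleep` parameters are only there so the loop can be exercised without a live rcd server.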

This works great if we have only one rclone pod. However, if we scale to multiple rclone pods, the internal load balancer routes the subsequent status check requests to pods that are not running that specific job ID!
The output we get for the missing job is, of course:

{"duration":0.0000065,"endTime":"2024-02-14T04:51:11.082819972Z","error":"job not found","finished":true,"group":"job/1","id":1,"output":{},"startTime":"2024-02-14T04:51:11.082813072Z","success":false}

I think I understand the problem quite well. Every rcd server instance is independent and is not aware of the other instances.

Question:

Do you have any specific recommendations for dealing with this scenario? The only potential solution I can think of at the moment is to look for the message "job not found" and keep rechecking until we get a success or a genuine error message for a failure.
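A minimal sketch of that recheck idea, assuming a hypothetical `fetch_status` callable that performs one `job/status` request through the load balancer and returns the decoded JSON body. The retry delay and the miss limit are arbitrary illustrative values. Note that the "job not found" reply shown above has `"finished": true`, so the error has to be checked before the finished flag.

```python
import time

JOB_NOT_FOUND = "job not found"


def poll_until_done(fetch_status, interval=30, retry_delay=1, max_misses=20,
                    sleep=time.sleep):
    """Poll a job, treating "job not found" as "wrong pod, ask again".

    fetch_status is a hypothetical callable returning one decoded
    job/status reply. Gives up after max_misses consecutive misses.
    """
    misses = 0
    while True:
        status = fetch_status()
        # Check the error first: a "job not found" reply also has
        # finished=True, so it would otherwise look like a failed job.
        if status.get("error") == JOB_NOT_FOUND:
            misses += 1
            if misses >= max_misses:
                raise RuntimeError("job not found on any pod we reached")
            sleep(retry_delay)  # retry soon, hoping to land on another pod
            continue
        misses = 0
        if status.get("finished"):
            return status
        sleep(interval)
```

This only papers over the routing problem. It cannot tell a pod that never had the job apart from a pod that restarted and lost it, and it does nothing about two pods using the same job ID for different jobs.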

Is there some way we can have the job state "shared" between pods? Is the information written to a disk location on the rcd servers that we could potentially put on a common mount? Would that cause any strange locking issues?

Other Thoughts:

I saw some posts about this issue where the comments say that rclone is designed to be stateless, which is true, but now that job tracking has been introduced, it's not exactly stateless in that regard.

As a future feature, it might be really helpful to have a "Cluster Mode" for rcd servers that somehow makes them aware of each other (maybe via a shared file system).

Run the command 'rclone version' and share the full output of the command.

rclone v1.65.2
- os/version: alpine 3.19.0 (64 bit)
- os/kernel: 5.15.0-1053-azure (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.21.6
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Azure blob storage

The command you were trying to run (eg rclone copy /tmp remote:tmp)

not relevant.

Please run 'rclone config redacted' and share the full output. If you get command not found, please make sure to update rclone.

not relevant.

A log from the command that you were trying to run with the -vv flag

not relevant.

There's another potential issue with running multiple load balanced instances of rclone rcd.

Every rcd instance tracks its own job IDs, and these are just incrementing integers. So we can potentially have a conflict when two jobs have the same ID on different rcd instances and our job status check requests get cross-connected.

Maybe switching to GUIDs for job IDs would be a better idea?
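To make the collision concrete, here is a toy model in Python. It is not rclone code: `Instance` is a made-up stand-in for one rcd pod with a per-instance counter, and `start_tracked` shows a client-side workaround that keys each job by a UUID mapped to an (instance name, local job ID) pair. The pod names are hypothetical.

```python
import itertools
import uuid


class Instance:
    """Stand-in for one rcd pod: job IDs come from a per-instance counter."""

    def __init__(self, name):
        self.name = name
        self._next_id = itertools.count(1)
        self.jobs = {}

    def start(self, description):
        jobid = next(self._next_id)
        self.jobs[jobid] = description
        return jobid


def start_tracked(instance, description, registry):
    """Start a job and record a client-side UUID for it.

    The registry maps the UUID to (instance name, local job ID), so two
    jobs with the same integer ID on different pods stay distinguishable.
    """
    token = str(uuid.uuid4())
    registry[token] = (instance.name, instance.start(description))
    return token


if __name__ == "__main__":
    a, b = Instance("rcd-0"), Instance("rcd-1")
    registry = {}
    t1 = start_tracked(a, "copy A", registry)
    t2 = start_tracked(b, "copy B", registry)
    # Both local IDs are 1, but the registry entries differ by instance.
    print(registry[t1], registry[t2])
```

A mapping like this only helps if the client can also learn which pod accepted the job and send status requests to that pod directly, which the load-balanced setup described above does not provide on its own.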
