Request for comments - `seq` union policy

I am trying to implement a union policy that tries to fill each upstream in order until one succeeds, as described here: Union storage: how to fill upstream sequentially?
I'd love to get people's thoughts on it.

The idea is this: every policy type (create, action, search) is broken down into attempts, where each attempt is a list of upstreams.
For all the existing policies, there's just one attempt containing all the upstreams.
For the sequential policy, each upstream is its own attempt.

e.g. given a union of upstreamA: and upstreamB:, the attempts for every existing policy would simply be:

[
    [upstreamA, upstreamB]
]

For the sequential policy, the attempts would be:

[
    [upstreamA],
    [upstreamB]
]

The union backend then executes the user's command one attempt at a time; once an attempt succeeds, the remaining attempts are skipped.
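
To make this concrete, here's a rough Go sketch of that loop (illustrative only: these names are made up for this post and aren't the actual code from my branch):

import (
	"context"
	"errors"

	"github.com/rclone/rclone/backend/union/upstream"
)

// runAttempts runs the operation against each attempt in order and
// stops at the first attempt that succeeds. Existing policies produce
// a single attempt, so they behave exactly as before; the sequential
// policy produces one attempt per upstream.
func runAttempts(ctx context.Context, attempts [][]*upstream.Fs, do func(context.Context, []*upstream.Fs) error) error {
	var errs []error
	for _, ups := range attempts {
		err := do(ctx, ups)
		if err == nil {
			return nil // this attempt succeeded - skip the rest
		}
		errs = append(errs, err) // remember the error, fall through to the next attempt
	}
	return errors.Join(errs...) // every attempt failed - return all the errors
}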

I have a working implementation over at GitHub: ctrl-q/rclone, branch feature/union-add-seq.
One issue I'm finding with it so far is with retries: the retry logic only starts once all attempts have failed, that is, in the sequential policy, once both upstreamA and upstreamB have failed. Instead, it should probably retry upstreamA, and only once its retries are exhausted should it move on to the second attempt and try upstreamB.

Any thoughts on this would be much appreciated, thanks!

The difference here is all about the error handling really! I think what you are saying is that instead of uploading in parallel to all possible upstreams, upload sequentially.

What happens if the first upload gives an error? Should that error be returned, or should it go on to the second upload? Should the first error be returned at the end? I'd argue that it should, otherwise this is effectively a new policy, "upload to one remote", rather than "upload to all remotes", which is how it is specified at the moment.

Only the low level retry logic is in the backends, and for uploads, not even that. Keeping state between the backend and the thing that does the retries would require some extra mechanism in rclone that doesn't exist at the moment... The backend could potentially keep a cache of most recent uploads and attach state to those - that would work.

Aside: One of the problems with the union backend in general is that when rclone does a retry, it uses the objects that it has already calculated, so when the union backend gives back an object from an underlying backend, the retries never move on to something different.

Thanks for your reply! I'll try to address some of your points.

Correct, the goal here is to upload to the first successful upstream, rather than to all upstreams. It's useful when the user has a preference for upstreamA but would like to fall back to upstreamB.

One example of this is with crypt remotes, like in the following config:

[cloud]
...
[upstreamA]
type = crypt
remote = cloud:crypt-standard
filename_encryption = standard

[upstreamB]
type = crypt
remote = cloud:crypt-obfuscate
filename_encryption = obfuscate

You would prefer to use upstreamA, but in case it generates paths that are too long for cloud:, you'd like to fall back to upstreamB.
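
The union remote tying these together might then look something like this (hypothetical config: seq is just the policy name I'm proposing here; the other keys are the union backend's existing options):

[union]
type = union
upstreams = upstreamA: upstreamB:
create_policy = seq
action_policy = seq
search_policy = seq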

Since the purpose of this union would be to try the upstreams one by one until one succeeds, my opinion is it should log the error of the first upstream and move on to the next upstream. If and only if all upstreams fail should this policy return the errors.

Interesting. So the policy could keep track of how many times it has tried to upload the file to each upstream.
This way, it could check whether upstreamA is on its last retry and only then move on to upstreamB.

Is there a way for a backend to know if it is currently in the last retry? Alternatively, is there a way to get the value of the --retries setting?

Perhaps this policy is really ff (first found) but with the retries fixed so they go to a different backend? (At the moment all the retries go to the same backend.)

Yes it could, and that would fit in with this being a fixed ff policy.

Not currently, but it would be possible to add it in the context.

That is readable from the config which is passed in the context to the backend.
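
For example, something like this inside a backend (a sketch: fs.GetConfig does read the global options from the context, but exactly which retry fields are exposed there is the part to verify):

import (
	"context"

	"github.com/rclone/rclone/fs"
)

// lowLevelRetries reads a retry setting from the global options stored
// in the context. The high-level retries value is assumed to be
// readable the same way; the exact field name is the bit to check.
func lowLevelRetries(ctx context.Context) int {
	ci := fs.GetConfig(ctx)   // global options travel in the context
	return ci.LowLevelRetries // the --low-level-retries value
}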

I think it's a little different from ff, because ff picks the upstream first, and then runs the operation against that upstream only, whereas with seq, we just naively try each one with no preselection of the upstream.
In other words, in the best case, seq will have tried only one upstream, and in the worst case, it will have tried all upstreams. ff always tries exactly one upstream.

It could be done this way, but I'm thinking the user would probably prefer that we retry the first upstream, then move on to the next and retry that one, and so on, though your idea also makes sense. This would mean that the max number of retries would actually be --retries * len(upstreams).
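
In terms of my earlier sketch, that would move the retry loop inside the attempt loop, roughly (again illustrative, not code from the branch):

// Each upstream gets the full retry budget before we fall through to
// the next one, so the worst case is retries * len(upstreams) tries.
func runAttemptsWithRetries(ctx context.Context, attempts [][]*upstream.Fs, retries int, do func(context.Context, []*upstream.Fs) error) (err error) {
	for _, ups := range attempts {
		for try := 1; try <= retries; try++ {
			if err = do(ctx, ups); err == nil {
				return nil // success - skip remaining retries and attempts
			}
		}
		// retries exhausted for this attempt - move on to the next one
	}
	return err // the last error seen if everything failed
}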

That is true, but what people expect is that if one upstream errors then rclone moves on to the next, which isn't happening at the moment.

So people would like an upstream to be marked as in error somehow and be taken out of the ff rotation for a while. This seems quite similar to what you are trying to achieve.

Exactly how many retries each upstream gets, and what exactly "marked in error" means are tricky implementation questions!

Ah, I see your point. So then there are two independent but similar features here:

  1. The seq policy originally discussed in this thread

People should use this policy for errors that are expected, for example, the "long path" issue I described earlier, or when upstreamA is almost full, so large files would fail, but we'd still like to try it with small files.
If upstreamA is reliable and isn't restrictive on file sizes or paths (i.e., if you never expect it to fail), you shouldn't use this policy.

The main thing left for me to do here is to properly handle errors and retries.

  2. Filtering out "bad" upstreams from the rotation

I think this is different enough from #1 because it's meant to handle unexpected errors, like downtime, expired credentials, etc.

I know you were speaking in the context of the ff policy, but this seems like it could be useful for all policies. What do you think?

Like you said, the challenge is in determining what makes an upstream "bad" and for how long. This would also require storing state, to make sure we keep track of errors from one rclone run to the next.

Here's an idea: let the user control this via flags (sketched below), for example:
--union-backoff-max-errors: the number of consecutive errors before an upstream is filtered out
--union-backoff-duration: how long before the upstream is tried again
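
A sketch of the bookkeeping those flags could drive, kept in memory inside the union backend (all names here are made up for this post; nothing is persisted, which ties into the state problem above):

import (
	"sync"
	"time"
)

// backoff tracks consecutive errors for one upstream.
type backoff struct {
	mu       sync.Mutex
	errors   int       // consecutive errors seen so far
	lastSeen time.Time // when the most recent error happened
}

// record updates the counter after an operation on the upstream.
func (b *backoff) record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.errors++
		b.lastSeen = time.Now()
	} else {
		b.errors = 0 // any success resets the count
	}
}

// usable reports whether the upstream should stay in the rotation,
// given the proposed --union-backoff-max-errors and
// --union-backoff-duration values.
func (b *backoff) usable(maxErrors int, duration time.Duration) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.errors < maxErrors {
		return true
	}
	if time.Since(b.lastSeen) > duration {
		b.errors = 0 // cooled down - put it back in the rotation
		return true
	}
	return false
}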

One challenge I see here is properly counting errors across multiple concurrent rclone processes (e.g. if the user is running two rclone copy commands simultaneously).

As you've sketched the implementation, this isn't a policy as policies are currently implemented... The policies just return a list of backends to do any given operation on. I'd describe seq as changing the upload strategy from all to first-success. This would apply whenever we are uploading to more than one backend. It could be made a policy, but that would require re-working the policy interface.

It is undoubtedly useful for all policies, yes!

In general rclone doesn't store state, so you'd have to discover the bad backends each time it ran.

That sounds like a good start, yes!
