Union "Fail Over"

I have three drives that have a lot of overlap. If I set up the union mount with the "ff" policy (or just use the default) it will get the file from the first mount point it finds. My question is: suppose that mount point returns an error for some reason (quota, or whatever). Is it possible to have it try the next mount point?

For example, suppose:
Mount 1
File A

Mount 2
File A

Union of Mount 1 and Mount 2 with the ff policy.

If I try to read A it would read from Mount 1. But suppose Mount 1 gives a quota error. Is it possible to fail over and try to read from Mount 2 instead of just ... dying?
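In rclone terms, the setup described above might look something like this (a minimal rclone.conf sketch; the remote name and paths are hypothetical, and `ff` is the default search policy anyway):

```ini
# rclone.conf sketch - remote name and paths are hypothetical
[myunion]
type = union
upstreams = /mnt/mount1 /mnt/mount2
search_policy = ff
```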

I use the union with a random search policy; it helps, but I'm still not sure that it fails over to another remote in the union in the case of quota issues.

Also, if you are having issues you can't trust the INFO log, as it is not displaying 403 errors correctly... only the debug log shows them.

I guess that may work a little better, but I'm still thinking more about fail over if the API returns an error.

I agree this would be a nice feature. It would be nice to also mark a union member as dead so we don't bother it any more.

This would enable more RAID like use of union.

That's what I was thinking. Also make the error messages more useful so you know which element in the union was failing.

I support this. When using the union it would be perfect to see which remote caused the error at INFO level, and with debug it would be nice to see which remote is being used for each file... if that doesn't add lots of overhead, of course.

I'm looking at providing at least read-level "failover" in mergerfs for that particular use case. It's been in my backlog for a while but recently more people have asked for it. rclone has greater access to the backend and knowledge of the situation, but my solution should help in more heterogeneous filesystem situations. I have a working prototype but I need to confirm the filesystem errors rclone returns when these situations occur.


When stuff goes wrong rclone is almost certain to return EIO - rclone maps most Go errors into that.

OK. I currently catch EIO and ENOTCONN, which also covers a FUSE filesystem disappearing. While the feature is optional, I don't want to apply the behavior too broadly, at least at first.
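The idea of only failing over on a whitelist of errnos could be sketched like this (plain Python for illustration, not mergerfs's actual C++ implementation; `branches` and `read_with_failover` are hypothetical names):

```python
import errno

# Errnos on which it is considered safe to retry on another branch:
# EIO is the generic I/O error rclone typically returns, and
# ENOTCONN covers a FUSE mount behind a branch disappearing.
FAILOVER_ERRNOS = {errno.EIO, errno.ENOTCONN}

def read_with_failover(branches, path):
    """Try each branch in order; fail over only on whitelisted errors."""
    last_err = None
    for branch in branches:
        try:
            return branch.read(path)
        except OSError as e:
            if e.errno not in FAILOVER_ERRNOS:
                raise  # e.g. ENOENT or EACCES: don't mask real errors
            last_err = e
    raise last_err
```

Keeping the whitelist narrow means a genuine "file not found" on the first branch still surfaces immediately instead of being papered over.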

Just to be clear... rclone will return EIO when a read / transfer quota is hit? Because there is EDQUOT as well.

We don't return EDQUOT - it's very hard getting specific error messages from all the backends!

Does this mean my dreams are coming true?

I have always dreamed of a union fail over between multiple remotes, mostly because of quota issues, but any other issue as well.

All my remotes contain identical copies of the files

Whilst quotas are not a problem for me at the moment, I would love to do the same with multiple (each with its own encryption) remotes, all mostly identical in content, in a merged/union'd remote to create a semi-redundant array of sorts.

Outside of the issue of marking a 'faulty' remote and the timings around it (rc/health!), it would seem prudent to be able to list the underlying remotes in some order of try-priority, whilst also opting to make only one underlying remote writable, for obvious reasons...

Good to dream :slight_smile:


I would certainly like to do this to make the union more like a raid1 mirror so if one is down it seamlessly uses another backend.

@Max-Sum how hard do you think this would be?

I just had this issue happen to me with a specific file. I have 9 remotes in my union; not all are Google Drive.

Of the 9, 3 were Google Drive, and those 3 were having issues with quota. I use the random search policy.

I tried to open the file like 20 times with no success. Shouldn't the search policy make each open try a different remote, so that eventually the file would open?

Yes it should, but it doesn't work like that yet as the object is resolved once and retries will use the same object :frowning:

But if I abort the ffmpeg command and run it again, shouldn't it use another random remote?

Or does it have some kind of cache?

It does yes, the VFS cache. If you used vfs/forget to forget that one file then it would choose another.

What we really need to do is make the union object refresh itself on errors. I don't think that would be too hard actually...

In production I don't use any vfs cache on my union yet...

I think the union should have built-in failover yeah...otherwise what's the point?

It should work like this: a file is opened and the union policy is followed (in my case random), so it randomly picks a remote and tries to open the file; if that fails for whatever reason, it should try another random remote, and only give up if all remotes fail...
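That behaviour could be sketched like this (plain Python for illustration, not rclone's actual union code; `remotes` and `open_on` are hypothetical names):

```python
import random

def open_with_random_failover(remotes, path, open_on):
    """Pick remotes in random order; on failure move to the next one,
    giving up only when every remote has failed."""
    candidates = list(remotes)
    random.shuffle(candidates)   # the "random" search policy
    errors = []
    for remote in candidates:
        try:
            return open_on(remote, path)
        except OSError as e:
            errors.append((remote, e))  # remember which remote failed
    # all remotes failed: report every failure, not just the last one
    raise OSError("all remotes failed: %r" % errors)
```

Collecting the per-remote errors before giving up would also address the earlier point about error messages telling you which element of the union failed.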

It would be sensible to mark failed files in the remotes as "dead" for 24 hours. I'm still having lots of 403 errors, mostly because rclone is not load-balancing between my remotes, so I would like to be able to test any branch and quickly tell whether it's working as intended or not...
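Marking a remote (or a file on a remote) as dead for a while could be as simple as a map of revival timestamps; a minimal sketch in Python, assuming a 24-hour cooldown (all names are hypothetical, not part of rclone):

```python
import time

DEAD_FOR = 24 * 60 * 60  # seconds a failed remote is skipped

class DeadList:
    def __init__(self, clock=time.time):
        self._clock = clock      # injectable clock, handy for testing
        self._dead_until = {}    # remote name -> time it comes back

    def mark_dead(self, remote):
        """Skip this remote until the cooldown expires."""
        self._dead_until[remote] = self._clock() + DEAD_FOR

    def alive(self, remotes):
        """Return only the remotes not currently marked dead."""
        now = self._clock()
        return [r for r in remotes if self._dead_until.get(r, 0) <= now]
```

A search policy could then draw only from `alive(...)`, so a quota-limited remote stops being hammered until its window resets.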

At the moment this selection is done at the moment of listing. Doing it at the moment of open would fix this and also the problems unioning a local and remote to make it work like mergerfs...

Hmm, will think more.
