Following with great interest the multiwrite union work that the rclone team is doing.
I installed the 3782 test beta and set up a union drive. Curious if there are any specific settings or flags required to enable multiwrite on a union remote? [I fully understand that it is an early test.]
The configuration is very similar to mergerFS and uses a lot of the same terms, so if you have any experience with that it should be a fairly natural transition.
If not, then read carefully and do some tests.
Basically, for multiwrite to make sense you have to assign a type to each union member (see the docs). This is how you actually control what role each union member has. Which roles are appropriate depends entirely on the setup you are trying to build.
You also need to select a policy (the rules that decide where data gets written).
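To make that concrete, a union config would look something like this (the remote names are made up for illustration, and the exact option names should be checked against the docs for the beta you are on):

```
[myunion]
type = union
# upstream members; suffixes like :ro (read-only) or :nc (no-create)
# assign a role to an individual member
upstreams = gdrive1: gdrive2: localdisk:/mnt/data:ro
# which member handles moves/renames/deletes
action_policy = epall
# which member receives new files
create_policy = epmfs
# which member answers reads/listings
search_policy = ff
```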
Let me know if I can clarify anything. I am not an expert, but I think I have the gist of it. Anyone with mergerFS experience (like Animosity) will be able to help a lot too, since the types and policies are basically identical to mergerFS.
Everything so far seems very solid, but I must be honest and say that I have barely had time to play with it yet, much less truly stress-test it. I've had my hands full with projects and requests from others that I had to prioritize (nearing the end of a previous project though, so I hope to have more time soon). Might be an idea to advertise it a little on the forum. Have an official multiwrite union tester thread to share feedback and experiences, perhaps? It helps to consolidate everything in one place to get it rolling.
Cookbooks - yes. The config is a little more complex by necessity. Having a few ready-made examples for the most commonly requested setup types would be useful. We can't cover everything though, because there is basically no limit to the possible permutations - but that is exactly what is so great about this.
Once I am up to speed on this, perhaps I could assist with that - or maybe even better, collaborate on it with other testers in an official thread like I said. Having a few people OK it helps weed out mistakes.
^^ Indeed not. And thank you! My searches for 'multiwrite' and variants didn't yield that blob. Although that's why I created this question - now anyone searching can find the instructions.
Regarding a cookbook: That would be brilliant. Starting with one or two simple examples of basic cases with rclone config show unionremote output would make it more accessible.
I'll read through the doc and do some testing.
Q: In my initial test I left all policies as default. Should this allow writing to relative folders on all remotes?
Q: Some of the policies check 'free space' and 'used space'. Does this mean rclone would check and periodically update used/free space for remotes like Google My Drive/Shared Drives?
Q: What happens for storage that does not have 'space' limits but is limited by file count (e.g. Google Shared drives, limited to 400k file/folder count)? I'm guessing this hasn't been checked yet, but asking in case it's already handled. If not, perhaps a policy based on least-file-count or variants could be useful?
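As far as I know no such policy exists yet, so purely as an illustration of what a least-file-count policy might decide (the function name and the pair-based input format are invented for this sketch; the counts would have to come from something like rclone size and be cached, since Shared Drives don't report them via about):

```python
def least_file_count(upstreams):
    """Pick the upstream with the fewest files.

    `upstreams` is a list of (remote_name, file_count) pairs.
    Returns the name of the remote with the lowest count, which is
    the member a hypothetical least-file-count create policy would
    send new files to.
    """
    if not upstreams:
        raise ValueError("no upstreams to choose from")
    return min(upstreams, key=lambda u: u[1])[0]
```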
Thanks again. Multiwrite unions will be a fabulous addition!!
That is decided by the defaults as you can see here:
Policy to choose upstream on ACTION class.
Env Var: RCLONE_UNION_ACTION_POLICY
Policy to choose upstream on CREATE class.
Env Var: RCLONE_UNION_CREATE_POLICY
Policy to choose upstream on SEARCH class.
Env Var: RCLONE_UNION_SEARCH_POLICY
Sorry I will explain what that entails in edit... one sec
For the action category - EPALL: Basically this setup means - search and read from the first copy of the file you find (i.e. if there are multiple copies across the union members, it will take the first one it finds, which is sensible and the fastest to do). Importantly, this will mirror any file moving or renaming to all the remotes - so if you do it in the union, the change happens on every remote that has a copy of the affected files/folders. This is really useful and typically the kind of behavior that is desired in most setups.
For the create category (meaning the writing of files) - EPMFS: This makes writes go to whichever union member has the most free space (among those where the path already exists). Warning! Not all cloud drives report size accurately. If your OS reports a sensible max capacity when you mount them (all of them), then this can be used to distribute the load. Otherwise you may want a different create policy. It is also not the best policy if you want, for example, a "master" that contains all data while the others are just supposed to be backups of the master - because, as I said, this will try to spread out the data as evenly as it can.
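To illustrate the EPMFS idea (existing path, most free space), here is a sketch of the selection logic in Python. This is not rclone's code; the dict format is invented for the example:

```python
def epmfs(upstreams, rel_dir):
    """Existing path, most free space.

    `upstreams` is a list of dicts like
    {"name": "remote:", "dirs": {"media", ...}, "free": bytes_free}.
    Among the upstreams that already contain `rel_dir`, return the
    name of the one reporting the most free space.
    """
    candidates = [u for u in upstreams if rel_dir in u["dirs"]]
    if not candidates:
        raise LookupError("no upstream has the path " + rel_dir)
    return max(candidates, key=lambda u: u["free"])["name"]
```

Note how an upstream with lots of free space but without the path is never picked - that is the "existing path" part, and why EPALL/EPMFS behave surprisingly when folders don't exist everywhere yet.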
For the search category - ff: This should be fine. This is also a "whoever answers me first" policy for listings.
So TL;DR: You may want to consider the create policy especially. This is often the most relevant part for cloud usage. EPMFS makes a lot of sense for a home server with a bunch of disks, but not necessarily for all cloud storage setups.
It will write to the one with the most free space (not copy to all). If you wanted mirrored copies on all of them, I think "all" would be the policy to use. It really depends on what you want out of it.
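For a mirroring setup that might look something like this (remote names are hypothetical; verify the policy names against the docs):

```
[mirror_union]
type = union
upstreams = gdrive1: gdrive2:
# "all" writes every new file to every member, giving mirrored copies
create_policy = all
action_policy = epall
search_policy = ff
```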
I haven't looked at the code for this, but I would assume it asks for the current free/max space of the drive(s) when a transfer happens - not periodically. I also assume there is some sort of reasonable timer so that it does not have to do this for literally every file, but at most once every few minutes as needed. If you want the exact implementation you will have to look at the code on GitHub or ask Max-Sum. This is just what I assume would make the most sense and is thus fairly likely.
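Just to make the assumption concrete - this is a sketch of the kind of time-based cache I mean, not rclone's actual implementation (the class and the `fetch` callback are invented for the example):

```python
import time

class CachedUsage:
    """Re-query free/used space at most once per `ttl` seconds;
    otherwise serve the cached value. `fetch` stands in for whatever
    call actually asks the backend for its usage figures."""

    def __init__(self, fetch, ttl=120.0):
        self.fetch = fetch
        self.ttl = ttl
        self._value = None
        self._stamp = float("-inf")  # force a fetch on first use

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self._stamp >= self.ttl:
            self._value = self.fetch()
            self._stamp = now
        return self._value
```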
You can basically see what happens here if you mount the drive. Whatever your OS reports as used/free space is almost certainly what the multiwrite union will also see. An unlimited Gdrive, for example, will typically report as 1 PB (petabyte, or 1000 TB). This is exactly why "upload to most free space" might not make sense, because such a drive would always appear "almost empty" and thus always be chosen as the upload target. (1 PB is what Windows reports at least, but that might just be the maximum display value or something - Linux might report a different exact value, but in any case it will be "very, very big", which is the point.)
I hope this helped. I am by no means an expert on this (yet), so take what I say here with a few grains of salt. I will no doubt get a better grasp on things as I do more testing myself.
Clearly explained, as always. Thank you. What you describe makes sense, and is what I would expect as well.
It will be interesting to see how it works with GSuite accounts.
As you guessed, some systems seem to report 1 PB while others report 1EB as available space.
My Drive can report back used space pretty quickly using rclone about
Shared Drives (formerly Team Drives) do not report space used, free space, or file counts via about ... size and tree will do so, but take quite a while for sizeable drives.
It will be interesting to see how it works in practice. Perhaps for Shared Drives there could be some kind of cache that retains last calculated size and file count along with a refresh interval? Something that would allow using the existing flags.
It might be helpful if there were some kind of failover logic: if writing to remote1: fails X times, then try writing to remote2: in the union. [Not sure if Max-Sum may have already included this logic?]
I think this limitation may be fixable for teamdrives in theory... I think we discussed it at some point but can't quite remember what we concluded about it
Anyway, for unlimited drives it's probably far more interesting to use a policy that mirrors, as "balancing space" seems a bit redundant when the space is unlimited anyway.
BTW, looking at it now, I think "all" would be the correct create policy for mirroring, not "EPALL". Sorry for creating confusion. I think EPALL will only write files for which the folder paths already exist (which of course they often won't, if the intent is to mirror).
(will edit the above response to reflect this).
I'll play with the various options when I get a few minutes. See how it works.
Multiwrite is potentially very useful for Shared Drives exactly because of the 400k limitation. Where people are storing millions of small files they can be spread across multiple Shared Drives and/or mirrored as needed.
Re file count: Is there any existing mechanism to store / cache statistics related to a remote: with rclone? I haven't stumbled across it while using vfs or cache remotes. And fully understand there are risks/issues with static drive stats for remotes.
[[ I have a python script that reads/writes drive/folder stats to a gsheet and/or text file. But that's not ideal for this kind of implementation. ]]
Does not exist, currently at least. So far rclone has very little persistence from session to session. I heard from NCW he is working on a lot of improvements to the VFS layer though, including some persistence to "files we need to upload" in the cache and such. I haven't heard of any file-stats stuff like this - but once a foundation for persistence exists this is the sort of thing that might be feasible to add at some point actually... maybe you should mention that as an idea to Nick
Copying and moving test files from the local drive to a_union: (the local union) works as expected
rclone copy /opt/tmp/test a_union:test <= Success
Copying and moving test files from the local drive to a_union2: failed. The error appears to be "this usage field is not supported".
Tried a number of variations of copying - files to root, files to folders, folders to folders, as well as rclone touch. All of these tests resulted in the same "this usage field is not supported" error.
2020/02/21 15:24:43 DEBUG : rclone: Version "v1.51.0-028-g9b533853-pr-3782-union-beta" starting with parameters ["rclone" "copy" "/opt/tmp/test3" "a_union2:" "-vvvP"]
2020/02/21 15:24:43 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"
2020-02-21 15:24:45 INFO : union root '': Waiting for checks to finish
2020-02-21 15:24:45 INFO : union root '': Waiting for transfers to finish
2020-02-21 15:24:46 DEBUG : Google drive root 'test1': read info from team drive "zz_transfer"
2020-02-21 15:24:46 ERROR : 333.csv: Failed to copy: this usage field is not supported
2020-02-21 15:24:46 ERROR : sub1/111.csv: Failed to copy: this usage field is not supported
2020-02-21 15:24:46 ERROR : Attempt 1/3 failed with 2 errors and: this usage field is not supported
2020-02-21 15:24:46 DEBUG : Google drive root 'test1': read info from team drive "zz_transfer"
2020-02-21 15:24:47 INFO : union root '': Waiting for checks to finish
2020-02-21 15:24:47 INFO : union root '': Waiting for transfers to finish
2020-02-21 15:24:48 ERROR : sub1/111.csv: Failed to copy: this usage field is not supported
2020-02-21 15:24:48 ERROR : 333.csv: Failed to copy: this usage field is not supported
Continuing tests. Wanted to share this early result.
That error message isn't in rclone so I guess it is coming from Google Drive.
Can you run that test again with --dump responses? It will produce an enormous amount of debug, but hopefully you'll see the phrase "this usage field is not supported" in there. I'm interested in that HTTP RESPONSE and, if you can find it, the corresponding HTTP REQUEST (you'll have to match up the hex numbers to find it).
Assuming you tried it with a Business GSuite account, that's expected, since even those don't report anything for free space - they only report the used space, which would still have failed the same checks.