Rate-limiting the VFS cache speed to prevent the local disk from filling up

banneridealistdeviat · April 6, 2021, 4:33am

Hi there, I use the VFS cache feature (writes mode) with rclone mount to get access to cloud storage such as Google Drive and Backblaze B2 with rclone's encryption as a file system mount on my Linux computers. I do this to run borg with its repositories on the rclone mount for my daily manual offsite backups, and it works wonderfully. The only problem is that my upload speed (20Mbps) is vastly slower than my IO speed (2000Mbps), so the VFS cache fills up very quickly. The problem is at its biggest when I do the initial archive creation, where I back up all my data from the start without any deduplication (I effectively need to temporarily store on disk a copy of everything that is going to cloud storage). There is --vfs-cache-max-age which I can set to 0 seconds and --vfs-cache-max-size which I can set to 0 bytes so that a file is only stored on disk while it's still uploading, but this doesn't solve the problem: the files stay open while they're uploading, and the disk is so much faster than the upload that the VFS cache fills up with open files that can't be deleted yet. Even apart from the initial backup, I still think this should be addressed because I might want to back up a large amount one day (maybe I took a lot of photos or downloaded a lot of Linux ISOs that day).

I would like to suggest a feature to solve this problem. I think if the write speed to the VFS cache can be rate-limited by rclone to match the upload speed (similar to --bwlimit), this would solve the problem a bit. This could probably be done without rclone in Linux by limiting disk speed with schedulers or cgroups, but it's still difficult to do because it can most likely only be done on disks themselves and not directories, so if you have only one disk connected to the system you would slow down the whole system, maybe unless you do something like make a loop block device. It would be more practical IMO to have this feature in rclone if possible.

Let's say there is now a feature to do this (e.g. --vfs-bwlimit write_speed:read_speed) and we set write_speed to my upload speed (20Mbps). It will be okay until the upload speed decreases (e.g. something else on the network is using bandwidth, so my actual upload speed drops to 10Mbps), or the internet cuts out for a minute, making the mount freeze (though maybe the current write would still pass through the VFS and therefore the VFS cache; I'm not sure). In the former case, the problem will occur but on a much smaller scale (not ideal, but vastly better than now). In the latter case (assuming current writes pass through the VFS when there is no internet), it would be worse (the cache grows at 20Mbps, but that is still not as bad as 2000Mbps without limiting).

So, I think this could be solved if rclone can calculate the current upload speed periodically (e.g. with a poll interval set by --poll-upload-speed; maybe 1 or 5 seconds is a good default) and then dynamically adjust the hypothetical --vfs-bwlimit option while the program is running. Rclone can already limit bandwidth, so it seems possible to implement. In the meantime, maybe there's some way to do the loop block device thingy and write a script that detects rclone's upload speed and changes the IO speed of the disk or block device, but as I said earlier, it would be more practical IMO to implement this in rclone itself.
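To make the proposal concrete, here is a minimal Python sketch of a token bucket whose rate can be updated after each upload-speed poll. It is only an illustration of the idea, not rclone code (rclone is written in Go); the class name, method names, and the burst size are all assumptions, and the --vfs-bwlimit and --poll-upload-speed options remain hypothetical.

```python
import time


class AdjustableTokenBucket:
    """Token bucket whose refill rate can be changed while it is in use."""

    def __init__(self, rate_bytes_per_s, burst_bytes, clock=time.monotonic):
        self.rate = float(rate_bytes_per_s)
        self.burst = float(burst_bytes)
        self.tokens = float(burst_bytes)   # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, up to the burst size.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def set_rate(self, rate_bytes_per_s):
        """Apply a newly measured upload speed (called once per poll)."""
        self._refill()                     # settle tokens at the old rate first
        self.rate = float(rate_bytes_per_s)

    def delay_for(self, nbytes):
        """Consume nbytes; return how long the writer should wait, in seconds."""
        self._refill()
        self.tokens -= nbytes              # the bucket may go into debt
        if self.tokens >= 0:
            return 0.0
        if self.rate <= 0:
            return float("inf")            # upload stalled: wait for the next poll
        return -self.tokens / self.rate


# 20Mbps is 2,500,000 bytes/s; allow a 1 MB burst (assumed value).
bucket = AdjustableTokenBucket(2_500_000, 1_000_000)
print(bucket.delay_for(1_000_000))   # fits in the burst, so no wait
```

A writer would sleep for the returned delay before each cache write, and a poller would call `set_rate` with the measured upload speed, so a drop from 20Mbps to 10Mbps would be reflected at the next poll.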

Maybe there's another way to solve this problem that I haven't thought of; if so, please say so in the comments. I'm looking forward to your thoughts and a possible implementation in rclone if it seems like a good idea.

ncw · April 6, 2021, 4:42pm

What happens at this point? What errors do you get? Disk full?

The sizing of the VFS cache is a bit ~~sloppy~~ best effort at the moment.

In a perfect world rclone would know exactly how much stuff there was in the vfs cache at any moment and pause the writes until there was space for them.

I think that would be the most elegant solution to your problem.

This would need a bit of re-architecting the VFS so it does know exactly how much space is used at any moment, rather than only when it does the scan. I might actually have to find my computer scientist hat and write a proper LRU cache.

I don't think this can ever be perfect, as when there are lots of concurrent operations there is an opportunity for rclone to let two operations proceed when it shouldn't have done. However, rclone could get within a few MB of the total, I'm sure.

@leoluan - what do you think of this idea?

banneridealistdeviat · April 7, 2021, 12:42am

Well, I haven't actually used those options exactly yet (last time I used rclone I did max-size=4G) but after doing research it seems like that would make the files only cached while they're still open (for uploading). So I'm not sure if there's some error that was expected.

Anyway, because the disk where the VFS cache dir resides is a lot faster than the upload speed, the VFS cache gradually grows beyond max-size. I don't get any errors, nor does the disk get full (but that's only because I have enough free space at the moment; if I had only 4GB left and I set max-size to 4GB, then it would have filled up past the 4GB and the disk would have been full, probably causing errors).

Let's say I created a borg archive and added 100GB in the process. The speed of the disk where the VFS cache resides is 2Gbps and the upload speed is 20Mbps. By the time borg is done, the VFS cache would be somewhere around 80GB in size (much higher than the 4GB max-size). It's 80GB and not 100GB because some of that data has been uploaded already, but not all of it yet. That would mean that borg is writing at 2Gbps because the write speed is determined by the VFS cache speed. But I still have to wait for rclone to upload everything before I stop the rclone mount, otherwise the data hasn't actually reached the cloud storage yet; it's still in the VFS cache (it'll slowly decrease back down to 0 bytes as the data gets uploaded). If the VFS cache speed was the same as the upload speed, then borg would be writing at 20Mbps and by the time borg is done, the VFS cache would be empty (depending on your settings).
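As a back-of-the-envelope check of this scenario, the Python sketch below computes the peak cache size and the time needed to drain it. It assumes borg really writes at the full disk speed for the whole run and that the upload rate is constant; the function name is made up for illustration. Under those idealized assumptions the backlog comes out even larger than the rough 80GB estimate, because only about 1GB can be uploaded in the 400 seconds it takes to write 100GB.

```python
def cache_backlog(total_gb, write_mbps, upload_mbps):
    """Return (peak cache size in GB, seconds needed to drain it afterwards)."""
    total_mbit = total_gb * 8000                 # 1 GB = 8000 Mbit (decimal units)
    write_seconds = total_mbit / write_mbps      # how long the writer takes
    uploaded_mbit = upload_mbps * write_seconds  # uploaded while still writing
    backlog_mbit = total_mbit - uploaded_mbit    # left in the cache at the end
    return backlog_mbit / 8000, backlog_mbit / upload_mbps


peak_gb, drain_s = cache_backlog(100, 2000, 20)
print(f"peak cache: {peak_gb:.0f} GB, drain time: {drain_s / 3600:.0f} h")
# prints "peak cache: 99 GB, drain time: 11 h"
```

With the write speed limited to the upload speed, `cache_backlog(100, 20, 20)` gives a backlog of zero, which is the situation described in the last sentence above.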

That seems like another good solution. Wouldn't the mount freeze though when you try to write to it until it's done waiting for the existing data in the VFS cache to upload? Probably unless max-size is very small

It doesn't seem like there would be an absolutely perfect solution: the upload speed could slow down right after it is polled, making the limit inaccurate until the next poll. But you would have to be very unlucky for that to be an issue, so it's still much better than the cache size growing to 80GB.

leoluan · April 7, 2021, 10:17am

When the cache writes outpace the uploads, I guess eventually the vfscache code will see ENOSPC errors and pass them to the applications. Rclone currently handles the ENOSPC error by discarding (resetting) cached items for files that are not dirty. This allows read IOs to continue. The cache reset logic is not applied to write IOs by design, because allowing writes to continue when the cache space is being depleted is dangerous in that further writes without closes on written files can cause the mount to become unreadable too (because there will be no space for read cache either).

Maybe a good solution to address the scenario of fast write and slow upload is to retry with exponential backoff on the ENOSPC error during cache writes. This would be like a hard NFS mount that does not return errors on network or server problems. The write IO will return when the cache purge thread is able to make cache space after finishing uploading some files.
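The retry idea can be sketched as follows. This is a minimal Python illustration, not the rclone implementation (which is in Go); the function name, the retry count, and the delay values are assumptions chosen for the example.

```python
import errno
import time


def write_with_enospc_retry(write, data, retries=10, first_delay=0.01,
                            max_delay=60.0, sleep=time.sleep):
    """Call write(data), retrying with exponential backoff while it raises ENOSPC."""
    delay = first_delay
    for attempt in range(retries + 1):
        try:
            return write(data)
        except OSError as err:
            # Only a full cache is worth waiting for; other errors propagate,
            # and so does ENOSPC once the retries are used up.
            if err.errno != errno.ENOSPC or attempt == retries:
                raise
        sleep(delay)                       # give the uploader time to free space
        delay = min(delay * 2, max_delay)  # double the wait, up to a cap
```

From the application's point of view the write simply takes longer, much like a hard NFS mount, and an error is only returned once the retries are exhausted.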

This solution would not be bulletproof against a large concurrent working set of written files with total space larger than the available space in the cache's file system. If none of these files is closed and ready for upload, we can have a deadlock. This condition is more likely when concurrent processes write very large files (relative to the total cache space). But it should be rare in general.

Rclone can potentially generate warning/error messages when retrying upon ENOSPC errors and recommend that the user take action to increase the cache's file system size/quota, etc.

ncw · April 7, 2021, 1:08pm

It would freeze until a file had been uploaded which could then be deleted to free space, yes.

That is an interesting and relatively simple idea. It would only need to be implemented in writeAt I think.

Indeed.

I still like the idea of doing it properly but that would need quite a lot of changes internally to the VFS:

- change cache polling into a continuously up to date LRU cache
- update sizes when writing data (care needed with accounting)
- check sizes when writing data (so we don't exceed the maximum size).
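The three changes listed above can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not the rclone VFS: the class and method names are hypothetical, there is no locking, and "dirty" simply means "not yet uploaded".

```python
from collections import OrderedDict


class SizedLRUCache:
    """Tracks cached file sizes continuously instead of by periodic scan."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.items = OrderedDict()   # name -> [size, dirty], least recent first

    def touch(self, name):
        """Record a read so the file becomes the most recently used."""
        self.items.move_to_end(name)

    def mark_clean(self, name):
        """Call once the file has been uploaded; it may now be evicted."""
        self.items[name][1] = False

    def _evict_clean(self, needed):
        # Drop uploaded files, oldest first, until `needed` bytes would fit.
        for name in list(self.items):
            if self.used + needed <= self.max_bytes:
                break
            size, dirty = self.items[name]
            if not dirty:
                del self.items[name]
                self.used -= size

    def try_write(self, name, nbytes):
        """Account for nbytes more data in name; False means the writer must wait."""
        entry = self.items.setdefault(name, [0, True])
        entry[1] = True                  # written data is dirty until uploaded
        self.items.move_to_end(name)
        self._evict_clean(nbytes)
        if self.used + nbytes > self.max_bytes:
            return False                 # only dirty data is left in the cache
        entry[0] += nbytes
        self.used += nbytes
        return True
```

When `try_write` returns False the caller would pause and retry later. If every item is dirty and still open, no amount of waiting frees space in this model, which is exactly the deadlock concern raised in this thread.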

I don't like the idea of deadlocks though - if the cache is 100% full of dirty data that is open what should rclone do?

leoluan · April 9, 2021, 12:45am

Can Rclone assume the remote has enough storage as an overflow disk space? (If not, it would be a legit reason to return ENOSPC to the application anyway.) If so, maybe a cache item can be implemented to be parts of equal-sized (say, 128 MB) temp subfiles, and some temp directory on the remote can be used as temporary space to store these subfiles when the cache space is 100% filled with dirty data. These temporary subfiles can be deleted after the dirty data is uploaded upon the close of the parent file. Rclone needs to ensure that there is at least 128 MB cache space remaining (plus some large safety margin), but that's a more tractable problem.

Obviously, this would incur an overhead if/when the temporary subfiles are uploaded to the remote before being removed from the local cache (to make space) and read back for upload when the parent file is closed. But it's a safety net when a user has a small cache space while running a large (possibly concurrent) write workload. When cache space is adequately configured, the dirty data would not need to be uploaded twice and the overhead can be avoided.

ncw · April 9, 2021, 2:54pm

Hmm, interesting idea.

Saving temp stuff on the remote will be slow though. And queuing theory says no matter how big you make the queue, you'll overflow it eventually if you are writing faster than you are uploading.

We could integrate the VFS with the chunker backend to allow partial uploads of big files - that would be another way of relieving the pressure.

I think slowing the writer is probably the only thing we can do which will avoid too many complications.

leoluan · April 9, 2021, 5:03pm

Slowing the writers by itself may not be enough. A bunch of writers, each dirtying a separate file, can still exhaust the cache space before any of the dirtied files can be uploaded and deleted from the cache to make space.

Leveraging the chunker code/feature to allow partial file uploads, combined with rate-limiting or retry on ENOSPC, would solve the problem. But this would cause chunk files to be seen in the remote storage. Not sure whether this is acceptable in all key use cases, e.g., maybe users want to be able to read the files without going through Rclone and do not use chunker. (That's why I thought about the temporary directory/space approach.) Maybe the priority should be on making Rclone more efficient and robust if the other access paths are rarely required? In that case, autonomic chunking makes sense.

ncw · April 10, 2021, 7:55am

That is a possibility.

I think it would have to be a different mode (upload with chunks) to be acceptable.

I want to avoid adding too much complexity to the VFS layer. It is already horrendously complex so adding corner cases to do uploads on disk full sounds like that is going in the wrong direction.

I think your idea to retry with exponential backoff on ENOSPC is a good one. It is fairly brutal in terms of user experience, but it should work eventually. A slightly harder fix would be to use the cache being full, rather than ENOSPC, as the signal.

leoluan · April 26, 2021, 8:07am

@ncw I am beginning to work on adding retries with an exponential backoff upon ENOSPC errors in item.WriteAt. I tried copying /usr (6GB) to an S3 remote and it worked without running into EIO from an out-of-space condition. The files were written to the cache and waited their turn for upload (4 concurrent transfers) without causing errors in the "cp -R" command (other than the symlinks).

Two decisions to be made from here.

(1) Number of exponential retries - currently we are using LowLevelRetries (default value: 10) as the number of retries in ReadAt. I think it might make sense to add a new parameter --vfs-persistent-retry=true to allow Rclone to retry forever in the ENOSPC condition? Maybe we can also allow the user to change --vfs-persistent-retry to false at runtime? Can Rclone's remote control be used to do this?
(2) Cap of exponential backoff retry interval - It does not make sense to have a very large retry interval. Maybe we can cap it at 60 seconds?

ncw · April 26, 2021, 10:09am

Nice one - well done

Always a tricky one...

For filesystems you don't ever want them to give up really.

However in practical terms if it takes longer than 15 minutes the user is likely to think it is broken and do something else.

What I normally do is arrange for the total timeout for a default setting of low-level-retries to be about 10-15 minutes.

I'd like to avoid another parameter if we can as we've got too many already!

Parameters can be set via the rc quite easily.

Capping it at some value seems sensible to me.

What happens on each retry? Is it just checking internal state? Ie no API calls required?

leoluan · April 27, 2021, 1:47am

I agree this is a concern.

If the cache space is large, say, many GBs, then a large set of files may be written to the cache first to fill the cache, and then the files are uploaded slowly and it can potentially take more than an hour. In this case, the cache space helps to stage a large portion of the write set. Maybe we should not fail the copy command/procedure on the later files that cannot be written into the cache while waiting for cache space to become available. Maybe we can provide a progress update? Is there a good way of reporting some progress number in this case and providing a remote control command for the user to direct Rclone to return errors instead, if desired?

ncw · April 27, 2021, 2:25pm

The rclone log will show transfers ongoing as will the rc, so that should be a hint.

There isn't really a feedback mechanism from a mount to the user to say "slow down" other than by not returning from API calls.

Note that on macOS we have a hard deadline of 10 minutes to return from a call otherwise rclone will be killed by the kernel.

I think if we are going to have a parameter it should measure the maximum pause time. We could hard code it for the moment while experimenting.

leoluan · April 27, 2021, 10:15pm

@ncw The multi-hour hold example I mentioned would be a corner case where a large/huge file takes a long time to upload, i.e. if a large file depletes most of the cache space, then subsequent files will be starved until the large-file upload completes. Otherwise, the cache cleaning process should be able to self-regulate the rate of new writes through the ENOSPC retries in most cases.

So I will add ENOSPC retries with an exponential backoff and a maximum cumulative delay of 10 minutes before passing the ENOSPC error to the user. Thanks for the discussion.
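For illustration, the delay schedule implied by this decision could look like the Python sketch below. The 60-second cap and the 10-minute total come from this thread; the starting delay, the doubling factor, and the function name are assumptions, and this is not the code that went into rclone.

```python
def backoff_delays(first=0.01, factor=2.0, cap=60.0, budget=600.0):
    """Yield retry delays that grow up to `cap` and sum to at most `budget` seconds."""
    delay, total = first, 0.0
    while total < budget:
        # Never wait longer than the cap or than what is left of the budget.
        step = min(delay, cap, budget - total)
        yield step
        total += step
        delay *= factor


# Starting from 1 second to keep the numbers readable.
print(list(backoff_delays(first=1.0)))
```

Once the generator is exhausted, the caller would stop retrying and pass the ENOSPC error on to the application.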

ncw · April 28, 2021, 7:41am

Sounds good. We can revisit later if it turns out to be a problem.
