I am replacing rsync for a client who was previously doing off-site backup to a dedicated server and is now backing up to GCS.
I would like to use, and currently am using, rclone to do this; however, I have yet to see rclone complete the backup in a satisfactory fashion. Where the rsync job had previously taken 12-15 hours, rclone has continually run past that mark without apparent good reason.
One particularly frustrating thing I have noticed is that, regardless of configuration, rclone appears to re-upload files during the same session. This is, put simply, infuriating. Testing a fifteen-plus hour backup (which is only one of many) that is mandatory for legal compliance, as one might imagine, already has me seeing red.
To give you an idea of what we are trying to accomplish here, rclone is configured with the GCS backend, and a cache backend for this bucket. The cache has a 2d (48h) lifetime, 10MB chunks, and 10GB of chunks.
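For reference, the two remotes look roughly like the sketch below. The remote names gcs and gcs-cache and the bucket name example-bucket are placeholders rather than our real names, and the credential settings are left out; only the three cache values are the ones described above.

```ini
# Sketch of rclone.conf; remote and bucket names are placeholders.
[gcs]
type = google cloud storage
# project and credential settings omitted

[gcs-cache]
type = cache
# Wrap the GCS remote (example-bucket is a hypothetical bucket name)
remote = gcs:example-bucket
# Cache lifetime: 2d (48h)
info_age = 2d
# 10MB chunks
chunk_size = 10M
# 10GB of chunks in total
chunk_total_size = 10G
```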
Our source directory is (and this is where things get crazy) a live /home directory structure for an organization. It contains at least six million files (in one-hundred-and-fifty thousand or so directories) totaling 752GB. Among these six million files are the users' Dovecot mdboxes. It is statistically improbable that these will not be modified during a copy. I do not know if such modifications are the problem, but I do know that at least 400GB of that 752GB is email.
The expectation is that rclone will run every night at 10PM and upload modified files (with GCS implicitly managing a 7-object revision history).
Unfortunately, I have yet to observe rclone exit. To try and troubleshoot this, I have been watching the backup throughout the day and noticed cache expiry messages for files that I am certain have already been uploaded. Perhaps I had run it once prior and the file got updated, but I doubt it. Some files appear to have eight revisions (the maximum), with as many as five having been generated today alone, and the times at which those objects were generated appear to fall within the same rclone session. I therefore suspect that this is a fault of rclone and not the operator.
As of writing, it is 15:18 CDT, and rclone has been running 6h46m0s (thus, since 08:32 CDT). At least one file/object has at least two revisions that fall within that time frame.
Put simply, how do I ensure that rclone only updates files once per session, even when the source has files that might change every second?
I am using rclone copy. I have yet to find any literature suggesting that this will “revisit” files as it appears to be doing. I get a very large number of cache expiry log messages during the run, but the cache is less than seven hours old (and the lifetime is forty-eight). I am under the impression that these are just files that have yet to be processed, and are therefore cache misses. I am running rclone with --no-update-modtime (because, as it stands, it is already possible to rack up a $50/day bill for storage.object.list operations alone with the nature of this mess; it’s why I’m using cache); however, discontinuing use of --no-update-modtime really shouldn’t be the solution here.