Rclone re-uploads files in the same session (despite having been told not to)

I am replacing rsync for a client who was previously doing off-site backup to a dedicated server and is now backing up to GCS.

I would like to use, and currently am using, rclone to do this; however, I have yet to see rclone complete the backup in a satisfactory fashion. Where the rsync job had previously taken 12-15 hours, rclone has continually run past that mark without apparent good reason.

One particularly frustrating thing that I have noticed is that, regardless of configuration, rclone appears to re-upload files during the same session. This is, put simply, infuriating. Testing a fifteen-plus-hour backup (which is only one of many) that is mandatory for legal compliance already has me seeing red, as one might imagine.

To give you an idea of what we are trying to accomplish here, rclone is configured with the GCS backend and a cache backend wrapping that bucket. The cache has a 2d (48h) info lifetime, a 10MB chunk size, and a 10GB total chunk size.
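In rclone.conf terms, the setup looks roughly like this (the remote and bucket names here are placeholders, and credentials are omitted):

    [gcs]
    type = google cloud storage
    # project / service account credentials omitted

    [gcs-cache]
    type = cache
    remote = gcs:backup-bucket
    info_age = 48h
    chunk_size = 10M
    chunk_total_size = 10G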

Our source directory is (and this is where things get crazy) a live /home directory structure for an organization. It contains at least six million files (in one hundred and fifty thousand or so directories) totaling 752GB. Among these six million files are the users' Dovecot mdboxes. It is statistically improbable that these will not be modified during a copy. I do not know if such modifications are the problem, but I do know that at least 400GB of that 752GB is email.

The expectation is that rclone will run every night at 10PM and update modified files (with GCS implicitly managing a 7-object revision history).

Unfortunately, I have yet to observe rclone exit. To try to troubleshoot this, I have been watching the backup throughout the day and noticed cache expiry messages for files that I am certain have already been uploaded. Perhaps I had run it once prior and the file got updated, but I doubt it. Considering that some files appear to have eight revisions (the maximum), with as many as five having been generated today alone, I suspect that this is a fault of rclone and not the operator, as the times at which those objects were generated all appear to fall within the same rclone session.

As of writing, it is 15:18 CDT and rclone has been running for 6h46m0s (thus, since 08:32 CDT). At least one file/object has at least two revisions that fall within that time frame.

Put simply: how do I ensure that rclone only updates files once per session, even when the source has files that might change every second?

I am using rclone copy. I have yet to find any literature suggesting that this will “revisit” files as it appears to be doing. I get a very large number of cache expiry log messages during the run, but the cache is fewer than seven hours old (and the lifetime is forty-eight). I am under the impression that these are just files that have yet to be processed and are therefore cache misses. I am running rclone with --no-update-modtime (because, as it stands, it is already possible to rack up a $50/day bill for storage.object.list operations alone given the nature of this mess - it’s why I’m using the cache backend); however, discontinuing use of --no-update-modtime really shouldn’t be the solution here.
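For reference, the invocation amounts to something like this (the path and remote name are illustrative):

    rclone copy /home gcs-cache:home --no-update-modtime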

rclone won’t revisit files unless it does a retry. That is controlled by the --retries flag, which is 3 by default. So if there was an error in the sync, rclone will retry the whole thing.

You might want to set --retries 1 so rclone just does one try. (The flag should really be called tries, but that is a historical accident!)

You can see rclone doing retries in the log if you grep for Attempt.*failed
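For example, something along these lines (remote and log paths are placeholders):

    # single pass - do not re-run the whole sync on error
    rclone copy /home gcs-cache:home --retries 1 -v --log-file /var/log/rclone-home.log

    # look for whole-sync retries in the log
    grep 'Attempt.*failed' /var/log/rclone-home.log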

If you have enough memory, running rclone with --fast-list will do the minimum number of storage.object.list operations. That won’t work with cache though. I’d be tempted to try using --fast-list without cache.
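That would mean pointing rclone straight at the GCS remote rather than the cache remote, something like:

    # bypass the cache backend; list the bucket in as few calls as possible
    rclone copy /home gcs:backup-bucket/home --fast-list --retries 1 -v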

My suggestion would be to use a volume snapshot (VSS on Windows or LVM on Linux) to first create a snapshot of the active filesystem. Then mount that snapshot and rclone it to your cloud destination. Once that operation is complete you would drop the volume snapshot. You do have to have a certain level of technical skill to implement something like this successfully.
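A rough sketch of that workflow on Linux, assuming the data lives on an LVM logical volume (the volume group, volume, mount point, and remote names are only examples):

    # create a snapshot of the logical volume holding /home
    lvcreate --snapshot --size 20G --name home_snap /dev/vg0/home

    # mount the frozen view read-only and back it up from there
    mkdir -p /mnt/home_snap
    mount -o ro /dev/vg0/home_snap /mnt/home_snap
    rclone copy /mnt/home_snap gcs-cache:home

    # drop the snapshot once the copy is done
    umount /mnt/home_snap
    lvremove -f /dev/vg0/home_snap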

An easier way out would be to use a tool like Crashplan for business, Carbonite, Backblaze, etc. that solves this exact problem easily and you accomplish the same thing - backups stored in a public cloud - without the effort.

That’s what I’m thinking. The particular structure of the /home directory here seems to upset the cache backend. On the other hand, we have another directory (it’s a document dump) where the cache backend works well enough (granted the initial run took six hours just to build the cache).

I’m not familiar with the logic that rclone uses to traverse the directory structure, but my guess is that the long pauses that occur during copies of this particular directory (with either caching or --fast-list) are further indexing.

I would love to use snapshots, but unfortunately, when this machine was set up (it predates me), it hadn’t crossed anyone’s mind to do things in a sensible or modern fashion. Thus, the closest we have to volume snapshots are XFS dumps, which take a horribly long time to do anything with. While I could certainly induct the volume into a new group, I suspect that the proposal to do so would frighten some parties.

For similar reasons, a commercial backup solution would not curry much favor (because you have to pay for it).

What we do have is a complete local mirror of the system, which is updated every night. I may run the backups off the mirror so as to avoid competing for disk IO, which would have the added benefit of not being “live” (in the sense that it’s not in use).

If your filesystem is XFS you can still take volume snapshots using LVM. XFS just doesn’t natively support snapshots the way ZFS and other modern filesystems do, but that really isn’t a problem if you have a kernel built in the last 6 or 7 years. There are issues with mounting the snapshot, however, because the snapshot carries the same UUID as the original XFS filesystem. There are workarounds if you google, so this issue is just a little speed bump and not a show stopper.
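In practice the workaround is just an extra mount option; with the example names from the sketch above, something like:

    # the snapshot carries the same UUID as the original XFS filesystem,
    # so tell the kernel not to reject it on that basis
    mount -o ro,nouuid /dev/vg0/home_snap /mnt/home_snap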

Now, since you have a mirror copy of the data, it sounds like you already have a solution to this problem. If it is acceptable to have 1 day of data loss (since it is updated each night) then this is a good solution. I am curious about how you create the mirror. Do you xfsdump/xfsrestore, or are you using some other means to accomplish this?

It’s done over the local network with rsync, which doesn’t seem to experience performance issues (where gsutil rsync and rclone do). rsync was previously in use as the means to mirror to the offsite backup which is being replaced.
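The mirroring job itself is nothing exotic; it amounts to something along these lines (the host and paths are placeholders):

    # nightly mirror of /home to the standby machine over the LAN
    rsync -aHAX --delete /home/ mirror-host:/srv/home-mirror/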

I would personally have these set up with LVM, but these are legacy servers and it would take quite some persuading to do so now.

As an aside, after some observation today, I have to wonder if the (inadequate) internet connection is actually the bottleneck when retrieving a remote listing, as the document-dump cache took six hours to generate initially, but all subsequent copy operations have completed in about an hour. As for the integrity of the cache, I am not concerned, as this system is the only one storing any data.

If cache generation for the home directories takes an excessive amount of time, I will not hesitate to use the mirror system.

For syncing, rclone will sync each directory in a top-down recursive way. So it starts at the root, syncs that, then syncs all the subdirectories, etc. This is slightly complicated by the fact that rclone will sync up to --checkers directories concurrently.

When you use --fast-list, rclone fetches the entire listing from the backend (which is likely not in the correct order) and re-orders it into that top-down recursive format in memory. Then syncing proceeds as before.

I’m not sure what could be causing long pauses exactly - are you using --checksum? That can cause long pauses on big files which get checksummed.

If you run rclone with -v it will print stats every minute, which should tell you what it is doing. Or you can use the -P or --progress flag to see an interactive stats block.
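For example (paths and remote names are placeholders):

    # verbose run - prints a stats block once a minute by default
    rclone copy /home gcs-cache:home -v --stats 1m

    # or watch an interactive progress display instead
    rclone copy /home gcs-cache:home -P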

Nope, --checksum is not in use.

While I would hesitate to write it off, we do have a rather lackluster connection (the SLA/contract was signed back when 20x10 was considered acceptable for a business connection).

Incidentally, I came in to find that one of the rclone copy operations using cache had been running for nine hours (without doing anything), where it had typically run for one hour.

Hmm, can you check network activity when rclone appears to be stuck to see if it is actually transferring data, but the stats are wrong?

I am re-running the job that had previously idled for nine hours. I’ll keep an eye on it.

At the moment, I’m watching strace output and there is certainly network activity. That having been said, I would hope that the listing isn’t taking that long to fetch.

It ran without issue. No clue what was going on.

There was another report of strange things with GCS (which I can’t find now). I wonder if it was a temporary glitch with GCS networking…

Who knows. I’m still apprehensive about backup integrity at the moment because backups that should take far longer are completing in record time. Unfortunately, whatever backup/cloning tool we use is the sole measure of file status/freshness. Because we are successfully using rclone in other applications with absolutely no issue, I have to believe that it is working.

If you want some assurance that things are as they should be, then run rclone check.
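Something like this, for example (paths and remote names are placeholders; --one-way only checks that what is on the source exists and matches on the destination):

    rclone check /home gcs-cache:home --one-way -v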