Preventing corrupted files, best practices for syncing

Hey,

Just wanted to get your input on this topic. How do you personally make sure that your backups do not get corrupted?

I’m sure there are many strategies for syncing to a cloud provider with rclone, for instance:

A) The simplest one, writing directly to a mount point.

B) Writing to a local cache first, then periodically either

  1. syncing
  2. cloning
  3. or moving it to a remote.
  Both 1 and 2 require separate cleanup mechanisms, but they allow for retries and keep the files available locally for longer if you need that, e.g. with Plex.
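
For reference, option B 3 (cache locally, move periodically) can be sketched as a small shell function. This is only a sketch under assumptions: the function name flush_cache, the cache path and the remote name in the usage line are made up for illustration, and the 15 minute age threshold is an arbitrary choice so that files still being written are left alone.

```shell
# flush_cache CACHE_DIR REMOTE_PATH
# Moves everything from the local cache to the remote.
# --checksum compares checksums (where the remote supports them) instead of
# only size/modtime; --min-age 15m skips files modified in the last 15
# minutes so half-written files stay in the cache until the next run.
flush_cache() (
  cache_dir=$1
  remote=$2
  rclone move "$cache_dir" "$remote" --checksum --min-age 15m -v
)
```

Called from cron as something like `flush_cache /data/cache remote:Videos` (both arguments hypothetical), a failed transfer simply leaves the file in the cache for the next run.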

C) Copying all local files to two remotes, so that any corruption that occurs can most likely be fixed manually. While the local files still exist, parity checking could be implemented to fix corruption automatically.

D) Straight cloning to three remotes to be able to automatically repair corruption via parity checking.

I’ll stop here, but you probably have your own way of doing it and I’d love to hear about it.

I am asking in part because I seem to have quite a few corrupted files in my ACD. I often aborted writes to a remote and switched between copy, move and sync. I even wrote to a remote A from two separate sources simultaneously while transferring that remote A to a remote B on a third connection at the same time. Is corruption to be expected in that scenario? If so, what are the main points to keep in mind to stay clear of corruption in the future, and how do you handle it?

Regards

Hmm, weird. So far I have not encountered any corrupted files. However, I strictly build a media/video library, and all my files are moved to ACD with rclone move -c --delete-after, i.e. once rclone has moved a file and made sure the checksum is the same, the local version is deleted.

The thing is, there may be some corruption, but it’s hard to know with video files unless it is so bad that the file will not play at all (I have not had that yet).

When syncing with Amazon, it will only check that the file sizes match. It won’t check checksums, per the rclone docs. That being said, I’ve never had a corrupt file that I am aware of. If a sync aborts in the middle, a re-run will replace that file because the file sizes are different.

If you want to REALLY check for corruption on ACD, you could mount the ACD drive and use rsync to check MD5s and replace files where necessary. Be aware that this will take a long time, depending on the amount of data.

# Mount it
rclone mount robacd-crypt:/Videos/ /data/Media2/Videos/

# Check it (dry run: -n only lists the files whose checksums differ)
rsync --inplace -rlvxi --checksum -n --out-format="%n" /data/Media1/Videos/ /data/Media2/Videos/ | grep -Ev '/$|^sent|^total|^sending' | tee different.txt

# Then run this with that file as input:
rclone --include-from different.txt copy -v "/data/Media1/Videos/" "robacd-crypt:/Videos/" --ignore-times
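
If rsync is not at hand, the same "compare MD5s through the mount" idea can be sketched in plain shell. This is a rough sketch, not a replacement for the commands above: list_mismatched is a made-up helper name, it assumes GNU md5sum is installed, and it will not cope with file names that contain newlines.

```shell
# list_mismatched SRC DST
# Prints the path (relative to SRC) of every regular file that is missing
# in DST or whose MD5 differs. Every byte on both sides is read, so this
# is slow against a mounted remote.
list_mismatched() (
  src=${1%/}
  dst=${2%/}
  (cd "$src" && find . -type f) | while IFS= read -r rel; do
    rel=${rel#./}
    if [ ! -f "$dst/$rel" ]; then
      printf '%s\n' "$rel"
      continue
    fi
    a=$(md5sum < "$src/$rel")
    b=$(md5sum < "$dst/$rel")
    [ "$a" = "$b" ] || printf '%s\n' "$rel"
  done
)
```

Something like `list_mismatched /data/Media1/Videos /data/Media2/Videos > different.txt` would then produce a list of relative paths to re-upload.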

Hmm? Amazon not using checksums?

From the docs:

Amazon Drive doesn’t allow modification times to be changed via the API so these won’t be accurate or used for syncing.

It does store MD5SUMs so for a more accurate sync, you can use the --checksum flag.

Maybe I am wrong, but doesn’t this do exactly that - using the md5 checksums from ACD to know if a sync is needed?

Sorry, I meant that crypt doesn’t use checksums. I do the above for crypt on ACD for certain file types (like xml). Thanks for keeping me in line! More coffee pls.

I haven’t seen corrupted files - I’d be interested if you can replicate that… None of the above should cause corrupted files - ACD only stores whole files not partial updates so in theory the file got there or not at all.

As for uploading, don’t upload via a FUSE mount - rclone sync/copy/move will be more reliable and will check the MD5 checksum after upload (even when using crypt).
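
A minimal sketch of "upload with rclone itself, then double-check" could look like the following. The function name upload_and_verify and the paths in the usage line are only illustrative. Note that on a crypt remote, rclone check may only be able to compare sizes, since crypt does not expose checksums; newer rclone versions have a separate cryptcheck command for that case.

```shell
# upload_and_verify SRC DST
# Copies SRC to DST with rclone (no FUSE mount involved), then runs
# rclone check, which compares sizes and, where both sides support it,
# hashes. The check only runs if the copy succeeded.
upload_and_verify() (
  src=$1
  dst=$2
  rclone copy "$src" "$dst" -v && rclone check "$src" "$dst"
)
```

Usage would be along the lines of `upload_and_verify /data/Media1/Videos remote:Videos`, with a non-zero exit status if either step fails.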

I know lots of people use your B) 3 method.

Thanks for all the answers!

To ncw:
You are probably right, my gauge for “corrupted” was simply “not playable by Plex”. But when I download the file and play it locally, it works, so it’s an issue with Plex, not rclone. If I do find any corruption I’ll be sure to report it.
I am a little amazed that rclone can handle all my abuse, thanks for this tool! I can saturate a duplex 1 Gbps link when backing up from one cloud drive to another.

Since I don’t actually have any verifiable corruption I’ll stick with the proven way of caching locally and moving to a remote periodically.

@peatnik In Plex, make sure your video was analyzed before playing it. The easiest way to tell whether Plex has analyzed a video is to check the video info below the picture (e.g. 1080p, DTS etc.).
If you don’t see that info, Plex will always return an error when playing the item.
You can click on the 3 dots on the right side and run analyze for the selected video.

P.S. Set up a scheduled task in Plex to analyze all videos.

You can set up Plex to analyze your data periodically during maintenance as a scheduled task, sure, but this is extremely annoying and wastes a ton of bandwidth if you have a large library. A better solution is to make sure your files don’t need to be analyzed manually or by a scheduled task in the first place.

One common mistake people make is to cancel a library refresh in progress, or to unmount their remote storage during a refresh. This soft-bricks any files which are new and requires that they be manually analyzed.

Refreshing your Plex library over rclone is slow because the storage is remote: you’re limited by your internet speed (probably much slower than an HDD), and folder seeking is slow. If you patiently allow a scan to complete, you shouldn’t run into any problems.

Yup, I messed up yesterday and unmounted the drive while Plex was scanning. But something weird is happening now: I manually analyzed some videos (and made sure the analyze jobs run from 1am until 8pm today), and yet videos that I manually analyzed yesterday are not analyzed today.

However, I also made a ramdisk yesterday for the temporary Plex transcoder files, but that should not be the problem.
(It never ran out of space, and those files only exist while Plex is transcoding.)

Plex should really handle the case where someone presses play and the video is not analyzed yet: it should analyze it before playing instead of just returning an error and leaving the user high and dry. (Analysis takes like 2-3 seconds per file, but my library is over 11K video files, so yeah, still 8h or so.)

Files need to be re-analyzed every time you remount with rclone because the mod times of files (especially if you use --no-modtime) and of folders change to the current date. Plex thinks that the files were modified in some way and re-analyzes them.

Also, it may be easier for you to simply set the Plex transcode directory to /dev/shm, which by default lets you fill up to half of your total RAM. This is better because, unlike a ramdisk, it leaves unused RAM available to other processes. You can also resize /dev/shm with some work if half your memory is not enough.
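
To see how much room /dev/shm actually offers before pointing the transcoder at it, a small helper along these lines may do. The name shm_free_kb is made up for this sketch; the remount line in the comment needs root and the 12G figure is only an example value.

```shell
# shm_free_kb [PATH]
# Prints the free space, in KiB, of the filesystem holding PATH
# (default: /dev/shm). df -P guarantees one output line per filesystem,
# so the value is always the 4th field of the 2nd line.
shm_free_kb() {
  df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $4 }'
}

# To resize /dev/shm until the next reboot (as root, size is an example):
#   mount -o remount,size=12G /dev/shm
```

Running `shm_free_kb` every minute from cron and logging the result gives the same kind of usage history you would keep for a ramdisk.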

@jpat0000 you are a life saver, I did not realize that I had --no-modtime set.
As for the ramdisk, I was thinking about /dev/shm, but with a ramdisk I can make sure the amount is reserved, so if the server for some reason starts eating a lot of RAM, Plex will still have its share.
I am using 12GB at the moment (out of the 32GB the server has), and so far at most 8.9GB was used. I am checking and logging usage every minute.

@ncw it would be great if you could implement preserving timestamps as soon as possible, because library rescans take 8h+ now, and having that each time after remounting is quite nuts lol. I will think twice before updating my beta version (so far I have always tested the latest one).