So I have a Google Drive folder called "clients" that I use for archival purposes (old projects, etc.). These projects are incredibly large, as I work in video production. I recently started using Rclone to back them up, and I noticed the other day that some of them seem to have duplicate files appearing in them. Since this is video, we're talking about 250 GB projects turning into 500 GB projects, so this could quickly become an issue if I don't manage it.
So the obvious solution is to use `rclone dedupe`, but I want to be incredibly careful about setting a destructive process loose on my client archive!
What I'm wondering is how exactly dedupe works when dealing with very large directory trees. Will it take into consideration dupes it finds anywhere within that structure? There are a few projects where the same files were used intentionally, so deleting those would be a serious problem. At the same time, I need a solution that doesn't involve regularly managing the dedupe checks by hand; I want to set it and forget it with a cron job.
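To be concrete, this is roughly what I'd like to end up with. The remote name `gdrive:` and the paths are placeholders, and I'm going off the flags I found in the docs (`--dry-run`, `--dedupe-mode`, `--log-file`), so treat this as a sketch rather than something I've actually run against the archive:

```shell
# Preview first: show what dedupe would do without deleting anything
rclone dedupe --dry-run gdrive:clients

# The eventual crontab entry, e.g. weekly on Sunday at 3am.
# The --dedupe-mode value is just a guess at a sensible choice;
# the default interactive mode obviously won't work unattended.
0 3 * * 0  rclone dedupe --dedupe-mode newest gdrive:clients --log-file=/home/me/rclone-dedupe.log
```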
I have a bad feeling I may be stuck between a rock and a hard place here: if I go set-and-forget, I'll lose important files, and if I carefully manage the deduping, it will be quite time-consuming.
I’d love to be proved wrong right now
I'm also wondering what exactly is compared to decide that two files are identical. Is it name and size? The help file mentions that they need to have the same md5sum, but I'm not quite sure what that constitutes. It's quite possible I'll have folders with camera files that have identical names but different file sizes, so I want to make sure those aren't deleted.
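If it helps show what I'm trying to verify: I gather `rclone md5sum` can list content hashes for a remote, so I could spot-check for true content duplicates myself with standard text tools. The `gdrive:clients` path is a placeholder, and the here-doc below is simulated output just to illustrate the pipeline:

```shell
# The real command would be something like:
#   rclone md5sum gdrive:clients > checksums.txt
# Simulated output in the same "<md5>  <path>" format:
cat > checksums.txt <<'EOF'
0cc175b9c0f1b6a831c399e269772661  projectA/MVI_0001.MP4
0cc175b9c0f1b6a831c399e269772661  projectB/MVI_0001.MP4
92eb5ffee6ae2fec3ad71c777531578f  projectC/MVI_0001.MP4
EOF

# A hash appearing more than once means byte-identical content,
# even when same-named files (like camera clips) merely look alike:
awk '{print $1}' checksums.txt | sort | uniq -d
# prints 0cc175b9c0f1b6a831c399e269772661
```

That would at least tell me whether the "duplicates" I'm seeing are byte-identical copies or just same-named camera files with different content.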