When syncing files to OneDrive, I see intermittent warnings about duplicate files.
The directory being synced is a macOS sparse bundle containing several thousand "band" files with short hexadecimal filenames. The issue affects pairs of files whose filenames differ only by a 0, for example d06b and d6b.
When the issue occurs, rclone logs a duplicate-object notice for one filename and copies the other as a new file. File timestamps indicate that neither file has changed since the last sync.
2021/03/24 00:51:32 NOTICE: bands/d06b: Duplicate object found in destination - ignoring
2021/03/24 00:51:36 INFO : bands/d6b: Copied (new)
I believe this is a OneDrive API issue relating to how filenames are sorted and paginated. It appears that filenames like d06b and d6b are both normalized to the same value for sorting purposes, resulting in an unstable sort order. When the list of filenames is retrieved through a paginated API, this could occasionally lead to one item being duplicated across a page boundary while the other item is skipped.
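To make that concrete, here is a toy Python sketch of how a tie in the sort key plus pagination can return one name twice and drop its twin. The normalise() rule, the extra filenames, and the page size are my own guesses for illustration, not OneDrive's actual behaviour:

def normalise(name):
    return name.replace("0", "")  # assumed normalisation: "d06b" and "d6b" both map to "d6b"

files = ["d05a", "d06b", "d6b", "d07c"]  # made-up band names; "d06b" and "d6b" tie under normalise()
page_size = 2

# Page 1 is served while the tied pair happens to sit in one order...
page1 = sorted(files, key=normalise)[:page_size]
# ...page 2 is served after the tie has flipped the other way (unstable sort).
flipped = sorted(files, key=lambda n: (normalise(n), n == "d06b"))
page2 = flipped[page_size:]

print(page1 + page2)  # ['d05a', 'd06b', 'd06b', 'd07c'] - d06b appears twice, d6b never appears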
Although I don't think this is an rclone bug, I wanted to report my observations and see if anyone has ideas on how to mitigate this issue when using rclone.
Rclone sorts the files itself, so it isn't relying on OneDrive's sort order. However, it does rely on OneDrive not returning the same object twice across paginated directory listings, so this could be either a OneDrive or an rclone bug.
Could the directory be being modified elsewhere? That could potentially cause duplicates.
Rclone normally asks for directory listings in batches of 1000 items at a time.
Can you do an rclone lsf on the bands directory and see if there are any duplicates in it? Something like rclone lsf onedrive:bands | sort | uniq -d to show the duplicates.
Maybe run it several times?
If you see duplicates can you run with rclone lsf ... -vv --dump bodies --log-file rclone.log and post the log file?
I wouldn't be surprised if this is a OneDrive bug - I've reported two OneDrive bugs in the last couple of months!
I was able to reproduce this with rclone lsf. A file called d06b was returned twice while d6b was omitted. The request logs show the duplicate object was returned as the last result of one paginated request, and the first result of the next, as predicted by my hypothesis.
However, the bug did not affect another pair of files with the same sort value spanning a page boundary (8c05 / 8c5), so it is somewhat unpredictable.
The full log file is 80M so I can't upload it in full, but here are the relevant requests: rclone-8c05.log (2.8 MB) rclone-d06b.log (2.8 MB)
There's definitely no concurrent modification going on.
The workaround solves the duplication issue but not the skipped-file issue, which could cause files to be incorrectly deleted, incorrectly left in place, or unnecessarily re-transferred, depending on the circumstances.
Yeah, in general there's no way for rclone to know which file was skipped. For my specific use case, since the range of possible filenames is very restricted, I can generate a list of potentially skipped files by adding/removing 0s from the duplicate filenames. Then I can re-sync using --files-from and --no-traverse to circumvent the listing issue.
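In case it's useful to anyone else, here is a rough Python sketch of that candidate generation (the function name is just for illustration); the output, prefixed with the directory name, is what goes into the --files-from list:

def candidates(name):
    # Guess which filenames may have been skipped: the same hex name with a
    # single 0 removed, or a single 0 inserted at each position.
    guesses = set()
    for i, ch in enumerate(name):
        if ch == "0":
            guesses.add(name[:i] + name[i + 1:])
    for i in range(len(name) + 1):
        guesses.add(name[:i] + "0" + name[i:])
    guesses.discard(name)
    return sorted(guesses)

print(candidates("d06b"))  # ['0d06b', 'd006b', 'd060b', 'd06b0', 'd6b'] - includes the skipped d6b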
One easy thing we could do is make the maximum number of items per directory listing request configurable. Then you could sync once with one value and again with a different value. Set them both to prime numbers < 1000!
This parameter is already available in some other backends:
--azureblob-list-chunk int Size of blob list. (default 5000)
--drive-list-chunk int Size of listing chunk 100-1000. 0 to disable. (default 1000)
--s3-list-chunk int Size of listing chunk (response list for each ListObject S3 request). (default 1000)
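For example, if the new option were spelled --onedrive-list-chunk (a hypothetical name - it doesn't exist yet), you could sync once with --onedrive-list-chunk 997 and again with --onedrive-list-chunk 991, so any page-boundary glitch would land on different filenames each run.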