Unexpected duplicates on OneDrive with 0s in filenames

What is the problem you are having with rclone?

When syncing files to OneDrive, I see intermittent warnings about duplicate files.

The directory being synced is a macOS sparse bundle containing several thousand "band" files with short hexadecimal filenames. The issue affects pairs of files whose names differ by a 0, for example d06b and d6b.

When the issue occurs, rclone logs a duplicate warning for one filename and copies the other as if it were a new file. File timestamps indicate that neither file has changed since the last sync.

2021/03/24 00:51:32 NOTICE: bands/d06b: Duplicate object found in destination - ignoring
2021/03/24 00:51:36 INFO  : bands/d6b: Copied (new)

I believe this is a OneDrive API issue relating to how filenames are sorted and paginated. It appears that filenames like d06b and d6b are both normalized to the same sort key, resulting in an unstable sort order. When listing a directory through a paginated API, this could occasionally cause one item of such a pair to be duplicated across a page boundary while the other is skipped.
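To make the failure mode concrete, here is a small self-contained Go simulation of that hypothesis. This is not real API code: the collation rule and the per-request tie-breaking are my assumptions standing in for whatever OneDrive actually does.

  // Simulation of the hypothesized server behaviour: names are sorted under
  // a collation where "d06b" and "d6b" compare equal, the server re-sorts
  // for every page request, and it breaks the tie differently each time.
  // The item at the page boundary is then returned twice while its partner
  // is never returned at all.
  package main

  import (
      "fmt"
      "sort"
      "strings"
  )

  // collate mimics a sort key that normalizes zeros away, so "d06b" == "d6b".
  func collate(s string) string { return strings.ReplaceAll(s, "0", "") }

  // page returns items [off, off+n) of names sorted by collate; flip stands
  // in for an unstable server sort that breaks ties differently per request.
  func page(names []string, off, n int, flip bool) []string {
      sorted := append([]string(nil), names...)
      sort.SliceStable(sorted, func(i, j int) bool {
          ci, cj := collate(sorted[i]), collate(sorted[j])
          if ci == cj {
              return (sorted[i] < sorted[j]) != flip // tie broken per request
          }
          return ci < cj
      })
      if off+n > len(sorted) {
          n = len(sorted) - off
      }
      return sorted[off : off+n]
  }

  func main() {
      names := []string{"d05a", "d06b", "d6b", "d07c"}
      // Two-item pages; the d06b/d6b tie straddles the page boundary.
      p1 := page(names, 0, 2, false) // the server sorts one way...
      p2 := page(names, 2, 2, true)  // ...and the other way on the next request
      fmt.Println(p1, p2)            // [d05a d06b] [d06b d07c]: d06b twice, d6b never
  }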

Although I don't think this is an rclone bug, I wanted to report my observations and see if anyone has ideas on how to mitigate this issue when using rclone.

What is your rclone version?

rclone v1.54.1

  • os/arch: darwin/amd64
  • go version: go1.15.8

Which OS you are using and how many bits?

macOS Mojave 10.14.6, 64 bit

Which cloud storage system are you using?

OneDrive

The command you were trying to run

rclone sync -c -v /.../backup.sparsebundle remote:backup.sparsebundle

The rclone config contents with secrets removed.

[remote]
type = onedrive
token = ...
drive_id = ...
drive_type = personal

Good theory...

Rclone sorts the files itself, so it isn't relying on OneDrive's sort order. However, it does rely on OneDrive not returning the same object twice in a paginated directory listing, so this could be either a OneDrive or an rclone bug.

Could the directory be being modified elsewhere? That could potentially cause duplicates.

Rclone normally asks for directory listings in batches of 1000 at a time.

Can you do an rclone lsf on the bands directory and see if there are any duplicates in it? Something like rclone lsf onedrive:bands | sort | uniq -d to show the duplicates.

Maybe run it several times?

If you see duplicates can you run with rclone lsf ... -vv --dump bodies --log-file rclone.log and post the log file?

I wouldn't be surprised if this is a OneDrive bug - I've reported two OneDrive bugs in the last couple of months!

I was able to reproduce this with rclone lsf. A file called d06b was returned twice while d6b was omitted. The request logs show the duplicate object was returned as the last result of one paginated request, and the first result of the next, as predicted by my hypothesis.

However, the bug did not affect another pair of files with the same sort key spanning a page boundary, 8c05 / 8c5, so it is somewhat unpredictable.

The full log file is 80M so I can't upload it in full, but here are the relevant requests:
rclone-8c05.log (2.8 MB)
rclone-d06b.log (2.8 MB)

Great - I can confirm that d06b was the last item on one page and the first item on the next, just as you said.

Just to confirm - there is no chance the directory was being modified concurrently elsewhere?

I'd suggest (since you are a technical person) you report it as a bug here: Issues · OneDrive/onedrive-api-docs · GitHub

For reference, here are the bugs I've reported in the past - you can see they are quite responsive and do fix bugs.

I think the rclone-d06b.log should be enough to convince them that there is a problem and show them where it is.

If you do report it as a bug, then please link it here and I'll follow along too.

In the meantime, try this workaround, which skips a duplicated entry at the start of a listing page:

v1.55.0-beta.5352.6071db565.fix-onedrive-listing on branch fix-onedrive-listing (uploaded in 15-30 mins)
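Roughly, the idea is this (a simplified Go sketch, not the actual patch; the types and names here are made up):

  package main

  import "fmt"

  // item stands in for a OneDrive drive item; only the ID matters here.
  type item struct{ ID, Name string }

  // dedupePages flattens paginated listing results, dropping an item that
  // reappears at the start of a page as the same item that ended the
  // previous page - the duplication pattern seen in the logs above.
  func dedupePages(pages [][]item) []item {
      var out []item
      lastID := ""
      for _, page := range pages {
          for i, it := range page {
              if i == 0 && it.ID == lastID {
                  fmt.Printf("skipping duplicate %q at page boundary\n", it.Name)
                  continue
              }
              out = append(out, it)
          }
          if len(page) > 0 {
              lastID = page[len(page)-1].ID
          }
      }
      return out
  }

  func main() {
      pages := [][]item{
          {{"1", "d05a"}, {"2", "d06b"}},
          {{"2", "d06b"}, {"3", "d07c"}}, // d06b duplicated across the boundary
      }
      for _, it := range dedupePages(pages) {
          fmt.Println(it.Name)
      }
  }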

There's definitely no concurrent modification going on.

The workaround solves the duplication issue but not the skipped-file issue, which, depending on the circumstances, could cause files to be incorrectly deleted, left undeleted, or unnecessarily re-transferred.

I filed a bug with OneDrive, hopefully it'll get some attention: Duplicate/skipped items in paginated directory queries · Issue #1472 · OneDrive/onedrive-api-docs · GitHub

Ah, I forgot about the skipped files. Not much I can do about that :frowning:

Hopefully they will get on to fixing it. I subscribed to the issue and I'll chime in if I think I can be helpful.

Yeah, in general, there's no way for rclone to know what file was skipped. For my specific use case, since the range of possible filenames is very restricted, I can generate a list of potentially skipped files by adding/removing 0s from the duplicated filenames. Then I can re-sync using --files-from and --no-traverse to sidestep the listing issue.
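For example, something along these lines - a rough Go sketch; the variants helper is mine, not part of rclone:

  package main

  import "fmt"

  // variants returns name plus every string reachable from it by deleting
  // one "0" or inserting one "0" at any position - the candidates that the
  // listing bug could have hidden behind a given duplicate.
  func variants(name string) []string {
      seen := map[string]bool{name: true}
      out := []string{name}
      add := func(s string) {
          if !seen[s] {
              seen[s] = true
              out = append(out, s)
          }
      }
      for i := 0; i < len(name); i++ {
          if name[i] == '0' { // delete this zero
              add(name[:i] + name[i+1:])
          }
      }
      for i := 0; i <= len(name); i++ { // insert a zero at each position
          add(name[:i] + "0" + name[i:])
      }
      return out
  }

  func main() {
      // One candidate per line, in a format suitable for --files-from.
      for _, v := range variants("d06b") {
          fmt.Println("bands/" + v)
      }
  }

Writing that output to a file and syncing with --files-from and --no-traverse then re-checks just those paths without listing the directory at all.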

One easy thing we could do is make the max number of items in a directory listing configurable. Then you could sync once with one value and again with a different value. Set them both to be prime numbers < 1000! (See the example after the flag list below.)

This parameter is already available in some other backends:

  --azureblob-list-chunk int   Size of blob list. (default 5000)
  --drive-list-chunk int       Size of listing chunk 100-1000. 0 to disable. (default 1000)
  --s3-list-chunk int          Size of listing chunk (response list for each ListObject S3 request). (default 1000)
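Assuming the new flag ends up named --onedrive-list-chunk to match the flags above (that name is my assumption until it's implemented), the double-sync trick would look something like:

  rclone sync -c --onedrive-list-chunk 997 /.../backup.sparsebundle remote:backup.sparsebundle
  rclone sync -c --onedrive-list-chunk 991 /.../backup.sparsebundle remote:backup.sparsebundle

With two different prime page sizes, the page boundaries only coincide every 997 × 991 = 988,027 items, so a pair that straddles a boundary in one run can't do so in the other.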

Do you want to have a go at this?

Sure, might as well.

I merged that - thank you :slight_smile:

How does it work as a workaround?
