Spurious mkdir errors on mounted fileshare (+ size check errors)

1: no difference (btw: the data I used was already a different set of files from the one I originally detected the problem with. Also, I originally used crypt, but reduced the problem as far as possible).
2: it only copied one file in total, presumably because it (at least most of the time) only copies one file per folder. If I delete all files from the root folder, it correctly creates sub-folders, each one containing exactly one file (but no further sub-sub-folders, even though those exist).
3: still the same problem. Logs:

Thanks, I think we can safely rule out the source.

I also think we can rule out rclone/Go file operations on smb drives, otherwise we would have heard of this before. I don't think it is that uncommon as source or target.

So it has something to do with your specific smb drive(r). The very different types of random behavior indicate a fundamental issue where rclone/Go calls some underlying OS/driver functionality which is incredibly unstable/untested.

I doubt we will be able to further pinpoint the exact cause without plenty of time and a source level debugger attached to the smb driver on the debian server.

I would therefore say it is time to think about alternatives that don't use the Debian smb server, or to fix the bug I suspect in it.

@Jupp56 @asdffdsa What do you think?

all very strange, different results, using ip, using hostname, etc...

i have had success using webmin to create samba shares, that always worked.
maybe that is coincidence or something about webmin.

Ok, thanks anyway for your and @asdffdsa's help!
I'll try a different server/protocol then. Maybe have a deeper look into samba's debug logs.
I could of course use another tool for transferring, but rclone seems so appealing with its ease of use and its ability to easily encrypt and still verify, which is what brought me to it.

rclone has good support for sftp.
rclone can emulate an sftp server using rclone serve sftp

It would be worth giving the latest beta a go if you haven't already.

This error seems similar to a go standard library bug which got fixed, but I can't remember when it got fixed!

The latest beta did not help.

I did, however, do some debugging and found at least part of the reason why this fails: some of the stat() results in MkdirAll() seem to be lies. Someone somewhere along the line (Go, Windows, the Samba server, Debian/the Linux kernel) reports the folders as being files, for which mkdir_windows.go returns an os.PathError in line 26, which in and of itself is sensible of course. (Or, unlikely: they exist as files for a while, but how and why? I didn't find that in the source code.)
Which would also explain the folders with only one file in them - the first stat call for a file's parent of course returns an "error: not found", not an incorrect file type, so folder creation works at least once.
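
The failure mode described above can be illustrated with a small Python sketch. This is not rclone's or Go's code: mkdir_all is a hypothetical, simplified stand-in for a stat-then-create MkdirAll, and make_lying_stat is a hypothetical stub that simulates a client reporting an existing directory as a regular file.

```python
import errno
import os
import stat
import tempfile
import types


def mkdir_all(path, stat_fn=os.stat):
    """Simplified stat-then-create logic (hypothetical, not rclone's code)."""
    try:
        st = stat_fn(path)
    except FileNotFoundError:
        # Path is missing: make sure the parent exists, then create it.
        parent = os.path.dirname(path)
        if parent and parent != path:
            mkdir_all(parent, stat_fn)
        os.mkdir(path)
        return
    if stat.S_ISDIR(st.st_mode):
        return  # already a directory, nothing to do
    # stat claims a non-directory occupies the path, so give up.
    raise NotADirectoryError(errno.ENOTDIR, "not a directory", path)


def make_lying_stat(victim):
    """Return a stat function that misreports `victim` as a regular file."""
    def lying_stat(path):
        st = os.stat(path)  # still raises FileNotFoundError for missing paths
        if os.path.normpath(path) == os.path.normpath(victim):
            return types.SimpleNamespace(st_mode=stat.S_IFREG | 0o644)
        return st
    return lying_stat


def demo():
    with tempfile.TemporaryDirectory() as root:
        folder = os.path.join(root, "a")
        mkdir_all(folder)  # honest "not found": the folder gets created
        first_ok = os.path.isdir(folder)
        try:
            # The parent now exists, but stat misreports it as a file.
            mkdir_all(os.path.join(folder, "b"), make_lying_stat(folder))
            second_failed = False
        except NotADirectoryError:
            second_failed = True
        return first_ok, second_failed
```

Here demo() returns (True, True): the first creation succeeds because the initial stat honestly reports "not found", while the later attempt to create a sub-folder fails as soon as the existing parent is misreported as a file.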

I found this (not very reliable source) of someone stating a similar issue, but with Python: Wrong path type and size on mounted samba share on Windows 10

If it is like they state there, that the bug does not exist in Windows 7, it is likely I have found a bug in Windows 10 (and not go/samba/linux/the network connection/whatever).
If so, hallelujah and good luck to me getting anyone at MS to seriously look at that. I will try it out on different Windows 10 and Linux machines and on Windows 7. That would be quite serious, though, as it clearly violates data integrity and can lead to silent data corruption/data loss.

on my ubuntu server,
--- created a samba share
--- sync and check a local dir to that samba share

no errors

source dir info

rclone sync d:\data\l\go\user01.go\go\pkg\mod\google.golang.org \\\test --transfers=256 --checkers=256 --progress
Transferred:      289.027 MiB / 289.027 MiB, 100%, 3.420 MiB/s, ETA 0s
Transferred:         3576 / 3576, 100%
Elapsed time:      1m36.5s

rclone check d:\data\l\go\user01.go\go\pkg\mod\google.golang.org \\\test --checkers=256 --progress
2022-06-22 19:24:52 NOTICE: Local file system at //?/UNC/ 0 differences found
2022-06-22 19:24:52 NOTICE: Local file system at //?/UNC/ 3576 matching files
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Checks:              3576 / 3576, 100%
Elapsed time:        26.4s

@Jupp56 Good work and very interesting!

I agree it seems possible to reproduce and locate the issue more precisely, though there is still a way to go as illustrated by @asdffdsa

Perhaps you can capture the SMB packets with Wireshark to find out whether the wrong stat results are due to a bug in the Samba server (Debian) or the SMB client layers (Windows+Go).

You may have to downgrade the communication to SMB 2 to work around the SMB packet encryption in SMB 3. This may make the issue disappear if it was indeed introduced with SMB 3.1.1 in Windows 10.

If the issue lies in the SMB client layers, then it may be possible to make a simple proof of concept using a small Windows PowerShell script inspired by the Python script you found, and then use this script to reproduce the issue against a frequently used SMB server, preferably a Windows Server.

It is certainly possible to get Microsoft’s attention and a quick response. This forum thread contains a relatively new example. The key ingredients are a simple proof of concept using Microsoft software only and an example showing high probability of data loss for a high number of Microsoft customers (preferably governments or large businesses).

Very interesting result.

In the thread you linked it said

It looks like the samba client on Windows 10 bungles when too many stat calls are issued in a short time interval.

I'd read that as there is a concurrency problem.

The standard way of reducing the concurrency in rclone would be to use --checkers 1 --transfers 1 which I see you've tried already. This still does directory traversals and transfers at the same time though, so if you added --check-first it will stop doing that - that might be worth a try.

If this is the same issue as provoked by the Python script then it might be possible to provoke it just by performing rapid stats from a single thread. That is by something like this:

rclone check \\smb-01\ \\smb-01\ --size-only --checkers=1

or a multi-threaded stress test:

rclone check \\smb-01\ \\smb-01\ --size-only
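
Outside rclone, the same single-threaded rapid-stat idea could be sketched in Python roughly as follows. This is only an illustrative sketch, not the script from the linked thread: snapshot and find_inconsistent_stats are hypothetical helper names, and it assumes the tree is not modified while the test runs.

```python
import os
import stat


def snapshot(root):
    """Map every path under root to (is_dir, size); size is None for dirs."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            is_dir = stat.S_ISDIR(st.st_mode)
            result[path] = (is_dir, None if is_dir else st.st_size)
    return result


def find_inconsistent_stats(root, rounds=5):
    """Re-stat every path `rounds` times from one thread and collect
    entries whose type or size differs from the first snapshot."""
    baseline = snapshot(root)
    mismatches = []
    for round_no in range(rounds):
        for path, expected in baseline.items():
            st = os.stat(path)
            is_dir = stat.S_ISDIR(st.st_mode)
            got = (is_dir, None if is_dir else st.st_size)
            if got != expected:
                mismatches.append((round_no, path, expected, got))
    return mismatches
```

Calling find_inconsistent_stats on the mounted share (with whatever path it is mounted under) should return an empty list on a healthy client; any entry in the result would point to a path whose reported type or size changed between stat calls.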

and as another test, ran the same rclone sync command as up above,
but this time over tailscale vpn.
again, no issues

rclone sync d:\data\l\go\user01.go \\\test --transfers=256 --checkers=256 --progress
Transferred:      775.337 MiB / 775.337 MiB, 100%, 674 B/s, ETA 0s
Transferred:        34955 / 34955, 100%
Elapsed time:     31m14.5s

rclone check d:\data\l\go\user01.go \\\test --checkers=256 --progress
2022-06-23 09:33:58 NOTICE: Local file system at //?/UNC/ 0 differences found
2022-06-23 09:33:58 NOTICE: Local file system at //?/UNC/ 34955 matching files
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Checks:             34955 / 34955, 100%
Elapsed time:      5m33.8s

i know this is basic stuff but maybe your system(s) need to be updated.

for many years, i always use webmin to install samba, and then just a few clicks in webmin to create the samba share.
always worked for me.

here is my setup

root@ubuntu:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:        22.04
Codename:       jammy
root@ubuntu:~# smbstatus

Samba version 4.15.5-Ubuntu
PID     Username     Group        Machine                                   Protocol Version  Encryption           Signing
284454  nobody       nogroup (ipv4:  SMB3_11           -                    -
284454  nobody       nogroup (ipv4:  SMB3_11           -                    -

and i have attached the samba.conf as created by webmin.
note: for some reason, cannot upload .conf files so suffixed it with .txt
smb.conf.txt (8.8 KB)

and my windows os