Bisync Bugs and Feature Requests

I've been exploring bisync in depth over the past several months, and have come across a number of issues that I thought would be worth detailing here for the greater rclone community. Apologies for the length of this post -- rather than creating a separate post for each issue, I thought it might be more helpful to have everything in one place.

Let me start by saying that bisync (and rclone as a whole) is an amazing tool and I am very grateful for all of the thoughtful work that clearly went into its design. The following list is not meant as criticism but as a starting point for discussion about how to make the tool even better.

I've divided the list to try to distinguish between Suspected Bugs (things that are actually not functioning as the docs suggest they should) and Feature Requests (functioning as designed, but I wish they were different.)


Suspected Bugs

1. Dry runs are not completely dry

Consider the following scenario:

  1. User has a bisync job that runs hourly via cron and uses a --filters-file
  2. User makes changes to the --filters-file and then runs --resync --dry-run to test the new filters (--resync is required after filter change, as a safety feature)
  3. The results of the --dry-run are unexpected, so user decides to make more changes before proceeding with the 'wet' run
  4. Before user has time to finish making and testing the further changes, the hourly cron job runs. The user expects that this run will simply fail, as a result of the above safety feature when bisync detects modified filters. But instead, bisync does NOT detect the modified filters, and proceeds with the new filters that the user had not intended to commit, causing potential data loss.

Had the user not run the --dry-run, the safety feature would have worked as expected, preventing disaster. But because of the --dry-run, a new .md5 hash file was created -- essentially 'committing' the new filters without having ever run a non-dry --resync.

Notably, listing filenames get a -dry suffix during dry runs, to avoid them getting mixed up with the 'real' listings. But filter .md5 files have no such protection, and as a result, bisync can't tell the difference between a 'dry' .md5 file and the real thing.

To put this another way (in case the above was unclear):

If --resync is required to commit a filter change, --resync --dry-run should not be sufficient to commit a filter change (but it currently is.)
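
In the meantime, a defensive workaround -- untested sketch, assuming the default --workdir of ~/.cache/rclone/bisync and that your filter hash files live there -- is to snapshot the .md5 files before the dry run and restore them afterwards, so the dry run can't silently 'commit' the new filters:

mkdir -p /tmp/bisync-md5-backup && cp ~/.cache/rclone/bisync/*.md5 /tmp/bisync-md5-backup/
rclone bisync Path1 Path2 --filters-file /path/to/bisync-filters.txt --resync --dry-run
cp /tmp/bisync-md5-backup/*.md5 ~/.cache/rclone/bisync/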

2. --resync deletes data, contrary to docs

The documentation for --resync states:

This will effectively make both Path1 and Path2 filesystems contain a matching superset of all files. Path2 files that do not exist in Path1 will be copied to Path1, and the process will then sync the Path1 tree to Path2.

However, there are at least two undocumented exceptions to this: empty folders and duplicate files (on backends that allow them) will both be deleted from Path2 (not just ignored), if they do not exist on Path1.

The docs indicate that a newer version of a file on one path will overwrite the older version on the other path, which makes sense. But it does not suggest that anything would get deleted (as opposed to just overwritten). It is not truly a "superset" if anything gets deleted.

I'm aware that rclone is object-based and does not treat folders the same way as files. However, in other rclone commands, this usually just results in empty folders being ignored -- not deleted. Furthermore, since bisync (unlike copy and sync) does not support --create-empty-src-dirs, we cannot get around this by including the empty folders in the copy from Path2 to Path1.

Lastly, even if a user is aware of this issue and wants to find out whether they're at risk before running --resync, that's difficult because of the known issue concerning --resync dry runs (#1 above). Essentially: it's very hard to tell the difference between the deletes that can be safely ignored and those that can't. A user might think they're safe, and then find out the hard way that they were wrong. (Not that anyone I know would make a mistake like that... :innocent::sweat_smile:)

For now, a possible workaround is the following sequence:

  1. rclone dedupe --dedupe-mode interactive Path2
  2. rclone copy Path2 Path1 --create-empty-src-dirs --filter-from /path/to/bisync-filters.txt
  3. rclone bisync Path1 Path2 --filters-file /path/to/bisync-filters.txt --resync

Note that you would probably want to repeat this any time you edit your --filters-file.

3. --check-access doesn't always fail when it should

Consider the following scenario:

  1. User intends to bisync Path1/FolderA/FolderB with Path2/FolderA/FolderB
  2. User places access check files at Path1/FolderA/FolderB/RCLONE_TEST and Path2/FolderA/FolderB/RCLONE_TEST as a safety measure to ensure bisync won't run if it doesn't find matching check files in the same places.
Path1
└── FolderA
    └── FolderB
        └── RCLONE_TEST
Path2
└── FolderA
    └── FolderB
        └── RCLONE_TEST
  3. User runs the following command, accidentally mistyping one of the paths:
rclone bisync Path1/FolderA/FolderB Path2/FolderA --check-access --resync

The access test does not prevent this, and the transfer proceeds, even though the paths have been mistyped and check files are not in the same places.

  4. User runs a normal bisync, with the path still mistyped:
rclone bisync Path1/FolderA/FolderB Path2/FolderA --check-access

The access test still does not prevent this, and the transfer proceeds, even though the paths have been mistyped.

Why? Because access tests are not checked during --resync. Therefore, in step 3 above, bisync actually created two new RCLONE_TEST files, thereby helping to defeat its own safety switch in step 4. The mangled directory structure now looks like this:

Path1
└── FolderA
    └── FolderB
        ├── FolderB
        │   └── RCLONE_TEST
        └── RCLONE_TEST
Path2
└── FolderA
    ├── FolderB
    │   └── RCLONE_TEST
    └── RCLONE_TEST

Is this user error? Yes. But preventing accidental user error is one of the main reasons this feature exists. And I think many new users, even having read the docs, would reasonably expect that their inclusion of --check-access would prevent a mess such as this in Steps 3 and 4 above.

It's worth noting that the following also would have succeeded in step 3, even though there would be no RCLONE_TEST file to be found anywhere in the second tree:

rclone bisync Path1/FolderA/FolderB Path2/FolderA/FolderC --check-access --resync

To the extent that not checking access tests during --resync was an intentional design choice (as the docs sort of imply but never totally spell out), I'm not sure I understand the rationale. If --check-access is intended as a protection against data loss, couldn't data loss just as easily happen during a --resync? Why should that be exempt? If the idea was for --resync to help us set the check file on the other side, we have plenty of better options for this, including touch, copyto, or simply running --resync without --check-access (which, while still not preventing the scenario above, would at least not give the user a false sense of security.)
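
For example, to seed matching check files on both sides without leaning on --resync at all, something like this does the trick (sketch only -- adjust the paths to your own layout):

rclone touch Path1/FolderA/FolderB/RCLONE_TEST
rclone copyto Path1/FolderA/FolderB/RCLONE_TEST Path2/FolderA/FolderB/RCLONE_TEST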

Another simple example to show the logical inconsistency -- imagine that we have not created any check files on either side, and then we run:

rclone bisync Path1 Path2 --check-access --resync

This succeeds. But then we run a normal bisync:

rclone bisync Path1 Path2 --check-access

This fails. (As it should! But then it raises the question... why was the --resync allowed to succeed?)

Possible solutions:

  • Enforce --check-access during --resync (meaning the file must already exist on both sides)
  • Prevent --resync from running if --check-access has also been included

(Personally, I prefer the first one, so that I don't have to remember to actively remove --check-access from my normal bisync command when adding --resync. I love that adding --resync is currently all I have to do!)

4. --fast-list is forced when unwanted

The docs indicate that rclone does NOT use --fast-list by default, unless the user specifically includes the --fast-list flag. However, ListR appears to be hard-coded on this line, meaning that bisync uses it regardless of whether you asked it to or not. In my case, it's significantly faster without --fast-list, so I don't want to use it -- but there's no obvious way to disable it. I did eventually find that --disable ListR seems to do the trick, but this isn't really documented anywhere, and it's also inconsistent with behavior in the rest of rclone (where --fast-list is false by default).
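
For anyone else looking to turn it off, the workaround in practice is just adding the flag to your usual command (note the capitalization of ListR):

rclone bisync Path1 Path2 --disable ListR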

Another reason I prefer not to use --fast-list is because I have tons of empty directories (for intentional reasons described in more detail below), and as a result, I get lots of false positives as described in this issue when --fast-list is used.

5. Bisync reads files in excluded directories during delete operations

There seems to be an oversight in the fastDelete function which causes the file list it loops over to include every file in your entire remote. Not only is it not filtered down to the files queued for deletion, it's not even filtered to the eligible files specified by the user (in the --filters-file or otherwise). This means that even if you have a directory exclude rule, bisync will ignore it and loop through and evaluate every single file in that excluded directory. (I first noticed this because I use --drive-skip-gdocs, and with -vv I could see it skipping tons of individual gdocs in a folder that it wasn't supposed to be looking through. I also noticed that this happened only when there were deletions queued, and not when there were copies queued but no deletions.)

Unlike fastDelete, the fastCopy function right above it has code to (I think) filter for only the files we care about here. I am guessing that a similar filter was intended for fastDelete. I added it in an experimental fork I've been testing, and it seems to have solved the issue.
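
To illustrate the kind of narrowing I mean (a standalone sketch of the idea, not the actual bisync source), the delete loop only needs to see the names that were actually queued for deletion:

package main

import "fmt"

// filterByName keeps only the entries whose names appear in wanted -- the
// same kind of narrowing fastCopy already does before transferring, applied
// here to the deletion queue instead of the whole remote.
func filterByName(all []string, wanted map[string]bool) []string {
    kept := make([]string, 0, len(wanted))
    for _, name := range all {
        if wanted[name] {
            kept = append(kept, name)
        }
    }
    return kept
}

func main() {
    all := []string{"keep/a.txt", "excluded/b.gdoc", "keep/c.txt"} // everything in the listing
    queuedDeletes := map[string]bool{"keep/c.txt": true}           // the one file actually queued
    fmt.Println(filterByName(all, queuedDeletes))                  // [keep/c.txt]
}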

6. Deletes take several times longer than copies

The cause of this is the same as #5 above, but I'm including it separately as it's a distinct and not-insignificant symptom. I saw a massive performance improvement once the (ironically named) fastDelete function no longer had to loop through millions of irrelevant files to find one single deletion. :smiley:

7. Overridden config can cause bisync critical error requiring --resync

When rclone detects an overridden config, it adds a suffix like {ABCDE} on the fly to the internal name of the remote. Bisync follows suit by including this suffix in its listing filenames. So far, so good. The problem is that this suffix does not necessarily persist from run to run, especially if different flags are provided. So if next time the suffix assigned is {FGHIJ}, bisync will get confused, because it's looking for a listing file with {FGHIJ}, when the file it wants has {ABCDE}. As a result, it throws Bisync critical error: cannot find prior Path1 or Path2 listings, likely due to critical error on prior run and refuses to run again until the user runs a --resync.

FWIW: my use case for overriding the config is that I want to --drive-skip-gdocs for some rclone commands (like copy/sync/bisync) but not others (like ls). So I don't want to just hard-code it in the config file.

8. Documented empty directory workaround is incompatible with --filters-file

Bisync currently does not support copying of empty directories, and as a workaround, the docs suggest the following sequence:

rclone bisync PATH1 PATH2
rclone copy PATH1 PATH2 --filter "+ */" --filter "- **" --create-empty-src-dirs
rclone copy PATH2 PATH1 --filter "+ */" --filter "- **" --create-empty-src-dirs

However, this approach is fundamentally incompatible with using a bisync --filters-file. In other words, it's only really useful if you're bisyncing your entire remote. There is no warning about this in the docs, and if a new user were to try this recommended approach without scrutinizing it carefully, they could inadvertently create thousands of folders in directories they hadn't wanted to touch.


Feature Requests

(AKA: my subjective hopes and dreams for the future of Bisync.)

1. Identical files should be left alone, even if new/newer/changed on both sides

One of my biggest frustrations with bisync in its current form is that it will sometimes declare a change conflict where there actually isn't one, and then attempt to "fix" this by creating unnecessary duplicates, renaming both the duplicate and original in the process.

For example, say I add the same new foo.jpg file to both paths, and the size, modtime, and checksum are 100% identical on both sides. The next time I bisync, I will end up with two files: foo.jpg..path1 and foo.jpg..path2 on both sides. What I would propose instead is that when bisync encounters one of these so-called "unusual sync checks", it should first check if the files are identical. If they are, it should just skip them and move on.

To put this another way: if a file is currently identical on both sides, bisync should not care how the files became identical. It should not matter whether the files were synced via bisync vs. some other means. We should not demand that bisync be the only mechanism of syncing changes from one side to the other.

I implemented a basic version of this in my fork by doing a simple Equal check, and it seems to be working well. Among the problems it solves is the documented Renamed directories limitation -- I can now just rename the directory on both sides, without the need for a --resync, and bisync will naturally understand what happened on the next run. It is also now agnostic to how the files came to be identical, so I am free to go behind bisync's back and use copy, sync, or something else, if I want to.
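
In case it helps to make "check if the files are identical" concrete, here's the gist as a standalone sketch (my fork actually leans on rclone's existing Equal logic, but conceptually it boils down to comparing the metadata bisync already has):

package main

import (
    "fmt"
    "time"
)

// fileInfo stands in for the metadata bisync already keeps in its listings.
type fileInfo struct {
    Size    int64
    ModTime time.Time
    Hash    string
}

// identical reports whether both sides agree on size, modtime, and hash.
// If they do, there is no real conflict and the file can simply be skipped.
func identical(a, b fileInfo) bool {
    return a.Size == b.Size && a.ModTime.Equal(b.ModTime) && a.Hash == b.Hash
}

func main() {
    now := time.Now()
    path1 := fileInfo{Size: 100, ModTime: now, Hash: "abc123"}
    path2 := fileInfo{Size: 100, ModTime: now, Hash: "abc123"}
    if identical(path1, path2) {
        fmt.Println("foo.jpg: same on both sides -- skip it, no ..path1/..path2 renames")
    }
}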

2. Bisync should be more resilient to self-correctable errors

The way that bisync is currently designed, pretty much any issue it encounters, no matter how small, will cause it to throw up its hands and say "HELP!" (i.e. refuse to run again until the user runs a --resync.) It's being intentionally conservative for safety, which I appreciate, but in some cases it seems overly-cautious, and makes it more difficult to rely on bisync as a scheduled background process (as I would prefer to), since I have to keep checking up on it and manually intervening (as even one errored bisync run will prevent all future bisync runs).

In particular, there are a number of issues that could be resolved by simply doing another sync, instead of aborting. Probably 9 times out of 10 that bisync asks me to intervene, all I'm doing is running --resync, without having changed anything. In my opinion, it would be really useful to have a new flag to let users choose between the current behavior and essentially a "try harder" mode where bisync tries its best to recover and self-correct, and only requires --resync as a last resort when a human's involvement is absolutely necessary.

Some of the issues I'm talking about include:

What I'm looking for is something philosophically similar to the Dropbox or Google Drive desktop client, which mostly stays out of your way and does its thing in the background, and almost never requires user intervention. For my purposes, I have a fairly high tolerance for leaving the filesystem in an imperfect state temporarily at the end of a bisync run, with the hope of correcting any issues on the next scheduled run. But I also acknowledge that there could be other use cases that require an all-or-nothing standard on each sync, and so that's why I would propose a new flag for this (with the current behavior remaining the default.)

3. Bisync should create/delete empty directories as sync does, when --create-empty-src-dirs is passed

So, I'll be honest here -- what I really want is for rclone as a whole to treat folders the same way as files, and treat empty folders the same way as non-empty folders. (As rsync does.)

But, in the meantime, the way that rclone sync --create-empty-src-dirs currently works is usually good enough for my needs (lack of folder metadata support is a bummer, but at least empty folders are copied and deleted reliably.) And I wish that rclone bisync --create-empty-src-dirs would behave in the same way. (Currently, it doesn't.)

This is one of the changes I made in my fork, as I have quite a lot of empty folders, and they are important to me. Among other reasons, I do a lot of music and video production, and I use apps like Apple Logic Pro which creates a "project" for you with its own internal folder structure. A single Logic project can have hundreds or thousands of files and folders in it -- lots of moving pieces, and if it ever can't find one that it's expecting to be there, the project can become corrupted. I haven't actually tested whether missing empty folders will cause Logic problems (I'd rather not find out), but even if it doesn't, it seems plausible that some other app at some point will have a similar problem, and it just strikes me as asking for trouble to go around deleting folders that some program presumably put there for a reason. (For example, from what I understand, empty folders are essential to cryptomator.) So bisync's lack of native support for them was a non-starter for me (and as noted above, the recommended workaround was also a non-starter for me because it can't be used with a --filters-file.)

It should be further noted that even if you don't need to use a --filters-file, the documented workaround still won't propagate deletions of empty folders, and --remove-empty-dirs is also not a good solution, because it will simply delete all empty folders (even the ones you wanted to keep.)

TL;DR: empty folders are important to me, and bisync's treatment of them should match that of sync.

4. Listings should alternate between paths to minimize errors

When bisync builds the initial listings in the "checking for diffs" phase, it currently does a full scan of Path1, followed by a full scan of Path2. In other words, the order is this:

Path1/FileA.txt
Path1/FileB.txt
Path1/FileC.txt

Path2/FileA.txt
Path2/FileB.txt
Path2/FileC.txt

This might be fine for relatively small remotes, but it presents an inherent problem with larger ones, because it means a lot of time could pass between the time it checks a file on Path1 and the time it checks the corresponding file on Path2, and the file could have been edited or deleted in the meantime.

For example, suppose that Path1 has 500,000 files and that we accept the documented benchmark of listing 667 files per second (FWIW, I have not achieved anything close to this, but we'll accept it for the moment.) After bisync checks Path1/FileA.txt, it will be another 12.5 minutes before it checks Path2/FileA.txt. 12.5 minutes is plenty of time for files to change, even for a lightly used filesystem. Now imagine it's 1 million files -- that would be at least 25 minutes. 5 million -- over 2 hours. (And again, that's regardless of whether any files changed since the last run.)

What I would propose instead is that the checking order should be changed so as to alternate between the two paths, with the goal of minimizing the amount of time between checks of corresponding files on opposite sides. For example:

Path1/FileA.txt
Path2/FileA.txt
Path1/FileB.txt
Path2/FileB.txt
Path1/FileC.txt
Path2/FileC.txt

Or, perhaps per-directory instead of per-file, to minimize API calls:

Path1/FolderX/FileA.txt
Path1/FolderX/FileB.txt
Path2/FolderX/FileA.txt
Path2/FolderX/FileB.txt
Path1/FolderY/FileA.txt
Path1/FolderY/FileB.txt
Path2/FolderY/FileA.txt
Path2/FolderY/FileB.txt

This seems similar to the way that rclone check already behaves (I haven't dug into the code to confirm) -- perhaps whatever it's doing there could be reused?
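
As a rough sketch of the per-directory variant (purely illustrative -- not how bisync's scan is structured today), the walk would visit each directory on both sides back-to-back before moving on:

package main

import "fmt"

// listDir is a stand-in for a per-path directory listing call.
func listDir(path, dir string) {
    fmt.Printf("listing %s/%s\n", path, dir)
}

func main() {
    dirs := []string{"FolderX", "FolderY"}
    for _, dir := range dirs {
        // Same directory on both sides back-to-back, so very little time
        // passes between the Path1 and Path2 snapshots of that directory.
        listDir("Path1", dir)
        listDir("Path2", dir)
    }
}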

5. Final listings should be created from initial snapshot + deltas, not full re-scans, to avoid errors if files changed during sync

Given the number of files I'm syncing and the time that takes (as noted above), I very quickly realized that leaving out --check-sync=false would be impractical. Otherwise, the entire bisync would fail with a critical error if even one file was created/edited/deleted during the sync -- which could last minutes or hours. But, as a consequence, I've now introduced the possibility that files will be missed, and that Path1 and Path2 will become "out of sync". So to address that, I scripted a scheduled full-check with email notification to me if it detects any problems. But this is clunky, and I very much agree with the poster of this issue that it would be better to create the final listings by modifying the initial listings by the deltas, instead of doing a full re-scan of both paths.

Since the OP of that issue has not yet answered @ncw's question from last month, I will also add: no, this does not sort itself out if you run bisync again (but I agree that would be preferable.) The critical error returned is "path1 and path2 are out of sync, run --resync to recover", and from my understanding, this is by design.

Conceptually, what I'm proposing is similar to a "snapshot" model. We know what the state was at the start (the listings), and we know what we changed (the deltas). Anything new that happened after we took that listing snapshot, we don't care about right now -- we'll learn about it and deal with it on the next run, whenever that is.
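
In code terms, the idea is roughly this (a standalone sketch, not bisync's actual data structures): take the initial listing, apply the deltas we just acted on, and write the result out as the final listing -- no second scan of either path required:

package main

import "fmt"

// delta records one change bisync decided to make during this run.
type delta struct {
    name   string
    remove bool // true for a delete, false for a create/update
}

// applyDeltas builds the final listing from the initial snapshot plus the
// deltas, instead of re-scanning the path after the transfers.
func applyDeltas(initial map[string]bool, deltas []delta) map[string]bool {
    final := make(map[string]bool, len(initial))
    for name := range initial {
        final[name] = true
    }
    for _, d := range deltas {
        if d.remove {
            delete(final, d.name)
        } else {
            final[d.name] = true
        }
    }
    return final
}

func main() {
    initial := map[string]bool{"a.txt": true, "b.txt": true}
    changes := []delta{{name: "b.txt", remove: true}, {name: "c.txt", remove: false}}
    fmt.Println(applyDeltas(initial, changes)) // map[a.txt:true c.txt:true]
}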

6. --ignore-checksum should be split into two flags for separate purposes

Currently, --ignore-checksum controls BOTH of the following:

  1. whether checksums are considered when scanning for diffs*
  2. whether checksums are considered during the copy/sync operations that follow, if there ARE diffs

I would propose that these are different questions, and should be controlled with different flags. (Here's another user who seems to agree.) In my case, I would want to ignore checksums for #1, but not for #2 (because I have a lot of total files but very few of them change from sync to sync. But when they do, I want checksums used to ensure the integrity of the copy operations.)
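
To make that concrete (the flag name here is hypothetical, purely for illustration), my ideal invocation would be something like:

rclone bisync Path1 Path2 --ignore-listing-checksum

i.e. skip checksums when building the listings and deltas (#1), while still letting the underlying copy/sync operations use checksums for integrity (#2).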

*an aside about #1: my understanding is that currently, hashes do get computed and saved in the listings files, but they are not actually used when creating the deltas, as noted by the TODO in the code. Additionally, unless --ignore-checksum is passed, hashes do get computed for one path even when they are useless because there's no common hash with the other path -- for example, a crypt remote. Computing all those unnecessary hashes can take a lot of time (I was tempted to file this one in the "bugs" list).

7. Bisync should be as fast as sync

Bisync is (anecdotally) several times slower than its unidirectional cousin, sync. This seems to be mostly attributable to the process of building full listings twice (before and after the transfers), both single-threaded (Path1 must finish before Path2 can start). This becomes more and more noticeable the more files you have -- to the point where I sometimes find myself "cheating" by running sync instead of bisync when I know I've only changed one side and I want to sync it quickly. I then let bisync discover what I did later, at the normally scheduled time (which I can do only because of the change I described in #1 to avoid auto-renaming identical files.)

While I'm admittedly armchair quarterbacking a bit here, it seems that there's no fundamental reason that bisync would have to be slower than sync, by definition. After all, sync also must check the entire destination, in order to know what it needs to copy/delete (and as I posited above, I don't believe the second bisync listing is truly necessary, or desirable). I get that sync is stateless and bisync can't be, but I'm not sure why that should make much difference in terms of speed -- the step of loading the prior listing into memory takes relatively little time (the docs suggest ~30 sec for 1.96 million files) (and while we're on the subject, thought I'd mention that the other bullet points here still have some "XXX" placeholders.)

This is to say: while I don't have an easy solution to propose for this one, it seems at least theoretically possible to redesign bisync to be as fast as sync (or at least as fast as: load prior listings + sync + save new listings).

8. Bisync should have built-in full-check and notification features to help with headless use

While some kind of native email-notification-on-error feature would probably be a useful thing for rclone in general (not just bisync), there are two things that make bisync different than other rclone commands:

  1. It's only really useful if you run it more than once
  2. It's stateful, and an error in a single run causes all future runs to fail (absent user intervention)

I would also wager a guess that a large percentage of its users run it as a background process, via scheduled cron or the like (more so than for other rclone commands).

For these reasons, it's more important than usual to know if something went wrong, and harder than usual to tell. A lot of users will probably find themselves (as I did) hacking their own script together to do a regular full-check and notify them of any errors it finds. Otherwise, you have to keep checking the logs regularly (and who wants to do that?) or risk not knowing about a job that failed and therefore caused all subsequent jobs to fail. It would be great if such a feature were built in (as an optional flag), rather than requiring each user to reinvent the wheel themselves.

By "full-check", what I mean is essentially an rclone check (or cryptcheck, as the case may be), with the same src/dest and filters as bisync, for the purpose of detecting whether Path1 and Path2 are out of sync (especially important given how many users are probably using --check-sync=false, as described above.) This seems like essentially what --check-sync=only aspires to be, but it is insufficient in its current form (for me, at least) because it only compares files by name, and not by size, modtime, or checksum. (check is also multithreaded and has more robust output options.)
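
By way of illustration, the kind of command I currently script externally (and would love to see built in) is along these lines, using the same paths and filters as the bisync command:

rclone check Path1 Path2 --filter-from /path/to/bisync-filters.txt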

The notification doesn't necessarily have to be an email notification, but I propose email because it's platform-agnostic and pretty much everyone uses it, and probably checks it more frequently than their bisync logs folder. I realize the need for an SMTP server makes this tricky (new backend, maybe?), but by the same token, that's also what makes this a big ask for the casual user who just wants to use bisync in set-it-and-forget-it mode.

--

If you made it this far, thanks for reading, and I'd love to hear your thoughts!

--

Run the command 'rclone version' and share the full output of the command.

rclone v1.62.0-DEV
- os/version: darwin 13.3.1 (64 bit)
- os/kernel: 22.4.0 (arm64)
- os/type: darwin
- os/arch: arm64 (ARMv8 compatible)
- go/version: go1.20.1
- go/linking: dynamic
- go/tags: cmount

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone bisync /Users/redacted/Rclone/Drives/GDrive gdrive_redacted: -MPc --drive-skip-gdocs --check-access --max-delete 10 --filters-file /Users/redacted/Rclone/Filters/bisync_gdrive_filters.txt -v --check-sync=false --no-cleanup --ignore-checksum --disable ListR --checkers=16 --drive-pacer-min-sleep=10ms

The rclone config contents with secrets removed.

[gdrive_redacted]
type = drive
client_id = redacted
client_secret = redacted
scope = drive
export_formats = url
token = redacted
team_drive = 
skip_shortcuts = true

A log from the command with the -vv flag

I'm not sure if it's possible to provide a log that shows everything I talk about here, but here's a fairly typical one for me: https://pastebin.com/3tTvLbCS
(This is without my fix for the 'fastDelete' issue described above. Note that it took 1 hour and 5 minutes to make one single deletion!)
If there's something else specific you want to see, let me know and I will try to capture it for you.

I am irrevocably biased but I wrote syncrclone because I didn’t like how rclonesync-v2 worked and bisync was based on that code.

I wrote a good faith summary here which goes into some of the fundamental differences.

I think many of your requests are implicitly there. Most notably, the algorithm is fundamentally different such that there is no concept of resync. It only does one file listing, has the ability to call code at the end to notify, handles more of the edge cases, etc.

Bi-directional sync is fundamentally stateful but syncrclone is as stateless as possible to avoid that issue.

@nielash thank you for your thoughtful writeup.

Unfortunately we've lost our bisync maintainer so this code is effectively unmaintained at the moment.

Would you like to contribute to the maintenance of it? You appear to have a grip on the source which is great.

Perhaps @jwink3101 you'd like to help too - perhaps we should be heading towards the syncrclone way of doing things?

Either way, I don't really have time to get to grips with the bisync source myself so I'm looking for help.

This looks really interesting, thanks! I love the approach of "Compare current state and use previous to resolve conflicts", and I agree that seems cleaner than how bisync currently works. This is basically what I'm after with the "identical files should be left alone" concept.

If this issue is still up to date, it looks like syncing empty directories is not supported, and the recommended workaround is to make them not-empty by adding a hidden file to each one. So I think that will have me sticking with bisync for now (which, to be fair, didn't support this either, until I changed it in my fork.) But at first glance, I do think syncrclone looks like a compelling option for those who don't share my paranoia about empty folders. :slightly_smiling_face: Your comparison table is very helpful.

That sounds great and I am curious how you accomplished it -- does it basically generate new listings whenever it can't find old ones, and do a "superset" copy instead of a sync for that session only? (Kind of the equivalent of running --resync automatically when necessary?)

@ncw I'm happy to submit a PR for the handful of small fixes I made in my fork, if that's helpful. I'm not sure I can commit to doing everything on my long wishlist, but I could at least contribute a few improvements that should go a long way. Also, I'm relatively new to Go, so I'd definitely want you to check my work and tell me if there are better ways to do certain things.

I can also suggest some documentation edits to turn some of these "bugs" into "known issues".

Since I haven't tried it out, I don't really have an opinion on that at the moment. But if you do want to head in that direction, I think @jwink3101 would be the natural choice to take the lead.

If this issue is still up to date,

It is. I am not overly interested in moving empty dirs but not wholly opposed either. I'd have to think about how hard it is. An alternative, if you're willing, is to use the pre_sync_shell option to write a random temp file in empty dirs, then use post_sync_shell to delete. You'd want to set the file to have the same modtime but that isn't too hard. It's a workaround but not bad

for those who don't share my paranoia about empty folder

I am curious about this. Maybe it is that I am used to tools like git, Backblaze Personal, and tons of others that don't care about empty dirs, but if I am concerned about a directory, I do just put a .empty file in. What are you paranoid about? No judgment. Just curious

That sounds great and I am curious how you accomplished it

First of all, I go into great detail here. The basic premise is that I first compare current state and remove all identical files. Or, as you say, "identical files should be left alone". Then if a file is missing on one side or another, I use the past list to determine if it's new or deleted. Finally, at that point, they must conflict. So I can resolve it with or without the past state.

Only then are moves tracked. A move means a file in the "new" pool can be matched to the "deleted" pool. I got the idea from one of @ncw's comment on my own questions here a long time ago.

By the way, you nailed it with "Final listings should be created from initial snapshot + deltas, not full re-scans, to avoid errors if files changed during sync". That is what I did but kept it off by default. While it's super tested, I still worried it would fail me. It hasn't but there are edge cases. It is now on by default (see avoid_relist option) but can be turned off.

Interesting (to me) aside, this is not how my other sync tool, PyFiSync works. PyFiSync was designed for SSH or local and via rsync. Since it used rsync, I could take advantage of moving and modified files. MUCH more complex algorithm. I actually added an rclone-remote to it but decided it was a waste. Something that used rclone first would be simpler (to be fair, I am not so sure I would still call it simple but it actually started out as "simplesync". My internal devel repo is actually still called that on my SSH server)

does it basically generate new listings whenever it can't find old ones, and do a "superset" copy instead of a sync for that session only

Yes but it is 100% implicit. That is the beauty. It's not an edge case; it is just the same code path. Matching "new" files are removed in the first step. Conflicting "new" files are handled as conflicts either way. And if it's not found in any old list, it's considered new. Again, that is the beauty. It is also why you can change filters. They may suddenly appear "deleted" but that is the expected behavior.

I am honored. Here is the issue. Programming for me is nearly all hobby. On the rare occasions my wife takes the kids out of town, I programmed 18+ hrs a day (meet dfb. Not yet posted here but early beta). But it is all Python. I was able to use Python for work for a long time (I did my PhD with Matlab but got tired of it so I moved to Python when I started working). Even for work, I do little programming work now.

So to learn golang enough to even start to help would be tough. I may be able to do a (vacation from work) day of live assist in the algorithm and the nuance but to actually be set free with code would be tough.

If we were to do that, I suspect we'd create an amalgamation of bisync and syncrclone since the latter works on a config file. It could be flag-based I guess. It would have to deal with how to handle new combinations of remotes. We'd likely want to borrow that and the where-to-store the DB from bisync.

If you do want to try to pair-program "bisync2", I am willing to take a vacation day from work but we have to plan it out. I am in Albuquerque, New Mexico, USA. We'd have to make the times work.

Yes PRs to fix bugs and add documentation would be most appreciated.

bisync has an extensive set of tests so if you find a bug you'd like to fix, if possible create a new test case first that fails and then fix that testcase.

I would review all code submissions as a matter of course, and I've mentored many new Go contributors :slight_smile:

My problem is that rclone is a big project now. There are over 250k lines of source! This is way too much to fit in my head, which is why I haven't engaged with the bisync source very much (despite it being only 3k lines). I left that to ivandeex who did a great job and is a very talented developer. Unfortunately due to the situation in Russia at the moment he can't contribute to rclone any more, and in fact I haven't heard anything from him for over a year :frowning:

So I'm really looking for help with bisync. It needs someone to really understand the nuances of the source code. Clearly you two @nielash and @jwink3101 are my best bet so far and I'd love to get you both involved in contributing code.

I know exactly where you are coming from. I started the rclone project at a time when I was doing 100% management at work as a busy CTO and I needed a bit of relief from that by writing some code. Now I do rclone more or less full time.

I hear you. I think you'll find Go easy to learn though. If you've ever programmed C it will be immediately familiar (it's like C without the hard parts and the type declarations backwards). Any experience with a statically typed language will help. Even if Python is your first and only language I don't think you'll have a problem learning it. The main thing you'll find missing in Go is inheritance and the main thing you'll have to learn is error handling! Otherwise Go really is quite a simple language. I recommend all experienced coders start from the Go tour

I'm open to suggestions. We could evolve bisync. In the short term that's probably the best thing to do.

We could also start afresh with bisync2 or maybe even port syncrclone into Go.

As far as rclone users are concerned, bisync is still in beta. As long as we provide a reasonable migration path (ie run these commands to migrate from bisync to bisync2) then I don't mind a fresh approach.

I have no particular feelings on flag based vs config file - whatever is easiest for the user is my guiding principle.

The DB part of bisync is based on a very robust key value store that ivandeex wrote a little wrapper around to make multi-process. I think that is a very useful bit of technology and if a DB is needed, that would certainly be my preference.

That is a generous offer. I'm based in the UK so currently on UTC+1 so that is 7 hours difference.

I suggest what we do is carry on generating ideas for the moment and see if we can come up with a plan!

Sure thing. Sorry if this is more detail than you ever wanted about empty folders (lol) but I'm laying this all out here to really make the case for why I consider them important and am reluctant to use a tool that doesn't. (And I totally acknowledge that others may disagree, including Git and Backblaze.)

Firstly, to reiterate my main concern:

A few other points worth noting:

  • There's an open issue for this, with multiple users reporting that bisync causes issues with Cryptomator (I'm not a Cryptomator user myself)

  • I'm primarily a Mac user, and macOS has a concept of "packages" where certain directories are essentially considered a "file" in Finder and for most other Mac purposes, but still considered a "folder" by rclone. For example, the Logic Pro projects mentioned above are packages using the .logicx extension (slightly off-topic, but: this is one reason I really want directory metadata support -- rclone currently cannot sync the modtime of Logic projects, even though macOS essentially treats them as files. Add me to the growing list of users willing to chip in to sponsor this feature!) Actually, the way I discovered Bug #2 in my original post was from doing a --resync and seeing some 'packages' get deleted, because rclone considered them to be empty directories.

  • There are some scenarios in which rclone itself requires the existence of empty folders. For example, the mount point for rclone mount (unless using --allow-non-empty which is not supported on Windows and usually a bad idea anyway), and bisync itself, which will accept an empty folder as a root path (on --resync only) but error out if it doesn't already exist (unlike sync which will create it for you on the fly.) So it strikes me as somewhat contradictory for rclone to take a position of "empty folders don't truly exist, but also they're so crucially important that we'll sometimes error without them". Obviously these particular examples are easily correctable and not a big deal, but my real point is that if rclone itself sometimes errors when it can't find a folder it expects to be there, other applications could too. This is why I'm not so quick to assume there won't be any consequences if I go about deleting willy-nilly the thousands of empty folders in my drive created by other applications over the course of decades. Will all those applications work just fine without them? Maybe. But how can I be sure? It's also kind of an impossible thing to test for, given 1.) how many different apps I use; and 2.) how many different operations I'd have to test in each app (what if the issue it causes is not immediately apparent?) And even if I could somehow test this and definitively prove that no harm is caused, any one of these apps could change their behavior in a future version, and by relying on a tool that discards empty folders, I could be doing damage for a while before I discover the issue.

  • rclone sync already supports --create-empty-src-dirs. In my opinion, the eventual goal should be for bisync to be a bidirectional version of sync, with full feature parity.

  • rsync supports empty directories (and directory metadata, for that matter).

  • Aside from my fears about breaking apps, I sometimes also use empty folders in my workflow as placeholders that will be filled with files later, but which I want to create earlier so that 1.) I can apply naming conventions to all the folders at the same time with a mass-rename, and 2.) so that it will be apparent later if I missed one. For example, at the end of working on an audio project, I like to export each individual audio track in several different formats that might be needed later (since I can't assume that the DAW and every plugin I used will still be around and usable 20 years from now). Let's say it's an album with 20 songs, each song has 30 tracks, and I want to save everything in 5 formats. I would first create a hierarchy of empty folders like this:

/
├── 01 Song1_Name
│   ├── 01 Song1_Name - Format1
│   ├── 01 Song1_Name - Format2
│   ├── 01 Song1_Name - Format3
│   ├── 01 Song1_Name - Format4
│   └── 01 Song1_Name - Format5
├── 02 Song2_Name
│   ├── 02 Song2_Name - Format1
│   ├── 02 Song2_Name - Format2
│   ├── 02 Song2_Name - Format3
│   ├── 02 Song2_Name - Format4
│   └── 02 Song2_Name - Format5
└── 03 Song3_Name
    ├── 03 Song3_Name - Format1
    ├── 03 Song3_Name - Format2
    ├── 03 Song3_Name - Format3
    ├── 03 Song3_Name - Format4
    └── 03 Song3_Name - Format5

(etc. to Song20)

And then I would go about filling each folder with files (one at a time) by exporting from my DAW (a long process that I might split up over several days, bisyncing at various points in between). If I were to create each folder later at the time of use, 1.) it would take longer (for lack of mass-rename), 2.) it would increase likelihood of making typos or other inconsistencies in my naming conventions, and 3.) I might not notice if I missed one (for example, if I forgot to export format #3 for song #16.) Whereas having the empty placeholder folders there serves as a virtual "checklist" -- it's immediately obvious if I missed one. (I typically zip up each folder when I'm done, and so a remaining folder or abnormally small zip file would stick out like a sore thumb.)

This is all to say: empty folders sometimes have a role in my workflow, and so I don't want to use a data sync tool that will consider them worthless and ignore/delete them. In my view, it's not truly a mirror unless my empty folders from Path1 exist on Path2 (and vice versa).

I appreciate the idea, but this seems less clean to me than my current solution. Some possible concerns:

  • If the process were to get interrupted, it would leave all the temp files there.
  • It could add a lot of time and API calls to each sync if doing this on a non-local remote (I have tens of thousands of empty folders in Google Drive, maybe more)
  • I'm a bit wary of writing temp files into application folders for the same reason I'm wary of deleting them -- how can I be sure it won't break something in some app I might use at some point? Basically my first principle with all of this is: if an app put the files and folders there itself, I'd better not assume that I can go in and mess with it without consequences.
  • It would look quite noisy on the Google Drive side to have thousands of temp files created and deleted constantly -- it would make it difficult to see which files are actually changing.
  • I'm also not sure how possible it would be to perfectly clean up after myself (restore to original state) on either side (for example, I'd imagine that the act of creating and deleting the temp file could cause some of the parent folder metadata to change.)
  • Potential for conflicts if bisyncing multiple machines at the same time (I currently bisync my Google Drive with my desktop and my laptop, for example.)

By contrast, my current solution essentially just makes the existing copyEmptySrcDirs parameter controllable by the user instead of hard-coded to false.

There's something I'm still having trouble wrapping my head around, and maybe it will make more sense to me once I actually try it out, but: if bidirectional syncing is inherently stateful, how can syncrclone sometimes still operate (safely) without knowing the state?

I think I understand the part about considering everything "new" when we don't know the prior state, and I get why that would be fine for the first run -- because the user has control over when that happens, and presumably they would not be running that first sync at a time when that would create a mess. But what about a random future run when the user isn't around and it can't find (or can't trust) the prior listings for whatever reason (like an error/interruption on the prior run)? If everything is considered "new" in this situation, how does it avoid erroneously merging the two sides together and causing deleted files to re-appear?

Again, entirely possible that I'm just missing something and that my questions will answer themselves by just trying it out (which I plan to do!)

Very sorry to hear that. Hope he's ok. :frowning_face:

My sense is that it would not take all that much time/effort to patch a few bugs and get the current bisync into a more usable state, and so that is probably worth doing regardless. Whereas developing bisync2 or porting syncrclone would be a much bigger lift, and so maybe that should be more of a Phase 2 of this project.

I'm also thinking that a good next step for me would be to install syncrclone and try it out, so that I have a better understanding of how it works and I'm not wasting either of your time without having done my homework. :grinning: I'm sure it will also give me a better sense of the pros and cons between bisync's philosophy and syncrclone's.

I'm in the middle, on USA Eastern Time (UTC -4), but I'm very often up and working at odd hours (like right now, haha), so I'm sure I can accommodate whatever's best for both of you. :grin: In the meantime, I will work on getting that PR together for you, and getting myself up to speed on syncrclone.

This is a good question. It is not that it doesn't use the state. It is that the past state is secondary. bisync uses the past state to decide what has been changed and then propagates and resolves those changes. syncrclone uses the current state to see where they disagree, then the past state to resolve. Theoretically, they are the same but in practice, you end up with all kinds of issues.

Interruptions are a beast and honestly syncrclone could be improved on this account (and may be in the future). I do go into some detail here on different cases I've considered. Everything is only considered "new" if there is no prior list at all. If a run is interrupted, the original prior list is still kept. It isn't updated until the end. But it is possible to have previously deleted files return as opposed to being deleted. My philosophy is (a) make undesired outcomes as safe as possible (better to restore by accident than delete) and (b) backups are built in. On the latter point, rclone has --backup-dir but it has to be specified. syncrclone can be told not to back up but that is not the default. I don't think bisync has it at all but I could be wrong.

I mentioned that syncrclone could be improved. One of the issues is that syncrclone doesn't track the remote state until it is done. I have the same issue in one of my other tools, rirb. However, in my newer tool, dfb, I use the rclone rc interface to do each transfer and can update state one-at-a-time. Something to consider for syncrclone.

Empty Dirs

Consider me convinced that there is a need and it is worth thinking about.

I spent some time (not paying attention to my real job....) thinking about it and how to implement. I went through a lot of ideas but they introduce complicated edge cases (e.g. directory deletes, empty directories that are now filled, etc).

What I came up with, but still a rough idea, is to simulate something like my idea I propose as a workaround but for all directories. Every directory, empty or not, gets a simulated "file" that has size, mtime, and name based on the directories. Then the sync logic stays the same (the file will never be modified; just added or deleted but that is fine). And then I filter it out.

Anyway, this is just the idea. I will have some free time in two weeks I may test it out.

By the way, setting copyEmptySrcDirs (or equiv) won't make a difference since the sync logic is in syncrclone, not rclone.

Anyway, I am going to reopen #10 to track this and work on it when I get the chance

@ncw I just submitted the Pull Request, with proposed fixes for many (but not all) of the issues from my original post. Looking forward to hearing your thoughts, whenever you have a chance.

A brief summary:

Additionally, here's what it does and does not address in relation to my original post:

Suspected Bugs

  1. Dry runs are not completely dry

Fixed

  2. --resync deletes data, contrary to docs

Fixed

  3. --check-access doesn't always fail when it should

Fixed

  4. --fast-list is forced when unwanted

Not fixed, but workaround documented

  5. Bisync reads files in excluded directories during delete operations

Fixed

  6. Deletes take several times longer than copies

Fixed

  7. Overridden config can cause bisync critical error requiring --resync

Not fixed (documented as known limitation with recommended workaround)

  8. Documented empty directory workaround is incompatible with --filters-file

Fixed (by adding support for --create-empty-src-dirs)

Feature Requests

  1. Identical files should be left alone, even if new/newer/changed on both sides

Addressed

  2. Bisync should be more resilient to self-correctable errors

Experimental beta version added (disabled by default)

  3. Bisync should create/delete empty directories as sync does, when --create-empty-src-dirs is passed

Addressed (this fixes #6109)

  4. Listings should alternate between paths to minimize errors

Not addressed (but March sure looks interesting...)

  5. Final listings should be created from initial snapshot + deltas, not full re-scans, to avoid errors if files changed during sync

Not addressed

  6. --ignore-checksum should be split into two flags for separate purposes

Addressed

6a. Hashes should be used (not just stored)

Not addressed (but documented as limitation)

6b. Hashes should be ignored when they're useless (because no common hash)

Not addressed (but documented as limitation). Also I may have been hasty in calling them "useless", as I wonder if they are still useful for comparing to prior listing of the same side (not opposite side). Want to give that more thought.

  7. Bisync should be as fast as sync

Not addressed (but significantly improved)

  8. Bisync should have built-in full-check and notification features to help with headless use

Not addressed (but included an example check command in the doc)

Ok, I think I'm following now. The part about keeping the prior list is cool, and something I think bisync could benefit from. (I'm assuming you not only keep it but also use it, yes? Kind of like choosing an earlier historical snapshot to diff from? Bisync allows keeping the prior list for debugging purposes, but won't currently use it.)

The "better to restore by accident than delete" part worries me a little. It makes sense to me on the file-level but strikes me as a bit dangerous on the filesystem-level, because if you merge two entire repos together it could be very hard to untangle. (For example, I'm attempting to imagine what would happen if you had different versions of an application's source code on either side, and then ran the equivalent of --resync... probably kind of a mess, right? Yes, it saves the maximum number of individual files from deletion, but at what cost to the repo?)

Bisync's approach to this problem is not totally ideal either... it prioritizes safety, but at the cost of convenience:

Certain bisync critical errors, such as file copy/move failing, will result in a bisync lockout of following runs. The lockout is asserted because the sync status and history of the Path1 and Path2 filesystems cannot be trusted, so it is safer to block any further changes until someone checks things out. The recovery is to do a --resync again.

So I think this just kind of comes down to a difference of philosophy. Would you say the following is an accurate assessment?

When the prior state is unknown or untrusted:

  • syncrclone says "if we don't know, then we assume everything is new."
  • bisync says "if we don't know, then we don't know, and we'd better not guess."

I think that generally if I have to choose between resilience and safety, I probably come down on the side of safety (at least as far as data tools are concerned.) But I definitely see pros and cons on both sides, and bisync is certainly a far cry from the Dropbox / Google Drive desktop clients which seem to somehow accomplish both (despite their many other shortcomings).

That's quite an interesting idea.

Awesome, I appreciate that.

I think, if I'm understanding correctly, that this would still have a lot of the same concerns I mentioned in the previous post. But I may be misunderstanding what you mean by "simulated" -- is the idea that we're tricking syncrclone into thinking that a file is there that really isn't? Like some kind of virtual file? Or is it a real file that's really there, but it's "simulated" in the sense that it's a temp file that we will delete later? (To ask this another way: if I looked at Google Drive activity history after the sync, would I see that thousands of files were created and deleted? Or would I see nothing?)

For the moment, I'm pretty happy with the way --create-empty-src-dirs is implemented in my pull request (I'm curious what you think, if you have a chance to check it out.) But I definitely still look forward to exploring syncrclone in more detail (I downloaded it but haven't had time yet to really dive in), especially the avoid_relist option, which is probably the biggest thing bisync is lacking at the moment.

I’m on mobile so this’ll be short

I'm assuming you not only keep it but also use it, yes?

Yes. It’s used to determine if a file is new on one side or deleted on the other

Depends on how it ends up but you shouldn’t have to untangle much. And there is also the “tag” option to keep both present.

Also, what’s the alternative? I don’t know of a full-file system atomic approach. So any tool, if interrupted, will leave things in a messy state.

I guess that’s true. It’s not really how I think about it but functionally, yes.

It’s not a guess per se. It’s the trade off of a robust algorithm that is implicit. It’s also pretty reasonable. If there is no previous file, why assume anything else. And for what it’s worth, resync does the same thing; just it needs to be specified.

I did hit this with my rirb tool. There, I do have a flag to tell it to know it’s new. It’s not strictly needed but makes it explicit. And that tool also suffers from needing to have a “perfect” transfer.

It’s NOT there. In the code, I add a file in the internal listing for every directory but it’s never created. I have this part coded up but I may abandon the idea. It’s starting to feel too hacky. I added it and then had to add so many tweaks to pass tests again that it feels too invasive. Instead, I’ll do it “right” and sync them. But that’ll take much longer so it’ll take me having the free time.

Yeah, this was a big one for me. It came much later because I was bothered by the extra listing. But as previously noted, there are some edge cases it breaks.


If I ever move syncrclone to the rc interface, I actually expect performance to be on par with a first-party tool. But it will also break some existing patterns so I am more wary. I can reuse a lot of code (the advantage of having written 5 different rclone-wrapping tools) but I am not sure I want to invest so much into it when the current one works well enough.

Sounds great, I'm excited to check it out.

My concern is not so much about the messy state left by the interruption -- it's more about what the tool decides to do next, and whether the attempt to recover actually makes things worse.

As a thought experiment: imagine that we wrote a very simple script that runs bisync over and over, and if it ever aborts with a critical error, it then automatically runs a --resync to recover, then continues as normal. This could be done fairly easily, but is it a good idea? I would suggest that it is not -- because you'll eventually end up with directories that are all mixed up.

Just for fun, I tried it just now by running a --resync using the source code for rclone v1.60.0 as Path1 and rclone v1.62.2 as Path2. Here's a log of what it did: https://pastebin.com/Juq7wVRv
And here's a diff of the result vs. the original v1.62.2: https://pastebin.com/2tGxc7yF

As for the alternative, one idea that comes to mind is some kind of "restore from snapshot" model where it can basically recall the state after the last known successful sync and start retrying from there, instead of from the newer untrusted state. This would basically require it to keep one extra listing at all times (until a newer one is confirmed to be successful), as sort of a rolling backup that it can revert to when something goes wrong with the newest one. There may be logical issues with this -- I'd need to give it more thought -- but it's the first idea that comes to mind.

Totally fair. But I would suggest that the "needs to be specified" part makes a lot of difference -- because the user has control over when that happens. They have the opportunity to clean the filesystem up first, do a dry-run, etc., and possibly elect not to go through with it if it's about to cause chaos. The trade-off, though, is that it's not as robust and autonomous. I agree that's a real downside. (The --resilient mode I proposed in the pull request sort of attempts to split the difference on this.)

Got it -- that makes sense.

That makes sense too.
