I'm trying to figure out how rclone manages to avoid fragmentation when doing multiple write streams to a filesystem which does not support sparse files/fallocate.
I'm copying large files, ranging from 1 to 20 GB each, from a Dropbox remote to a local ZFS dataset. ZFS is copy-on-write and does not properly support sparse files; compression is also enabled, so preallocating with zeros won't work either.
When doing rclone sync with 8 write streams, I can see multiple files being written to the ZFS dataset. When the transfer is complete, checking those files shows each one is stored as a single contiguous segment.
And just to clarify: this is not an actual problem. I just find it surprising that it works this way, and I would like to understand why fragmentation does not happen.
If I understand correctly, rclone uses fallocate with FALLOC_FL_KEEP_SIZE to preallocate space. ZFS has pseudo-support for this:
Since ZFS does COW and snapshows, preallocating blocks for a file
cannot guarantee that writes to the file will not run out of space.
Instead, make a best-effort attempt to check that at least enough
space is currently available in the pool (12% margin), then create
a sparse file of the requested size and continue on with life.
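For reference, on Linux this kind of preallocation boils down to a fallocate(2) call with FALLOC_FL_KEEP_SIZE. A minimal sketch of what that looks like in Go (not rclone's actual code, and assuming the golang.org/x/sys/unix package):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// preallocate asks the filesystem to reserve size bytes for f without
// changing the apparent file size (FALLOC_FL_KEEP_SIZE). On ZFS this is
// only the best-effort space check plus sparse file described in the
// quoted comment above.
func preallocate(f *os.File, size int64) error {
	return unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_KEEP_SIZE, 0, size)
}

func main() {
	f, err := os.Create("bigfile.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Try to reserve 20 GB up front; just log if the filesystem
	// doesn't support it.
	if err := preallocate(f, 20<<30); err != nil {
		log.Printf("fallocate failed or unsupported: %v", err)
	}
}
```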
I guess my question is more ZFS-specific than rclone-specific.
When doing a multi-threaded copy/sync/move, rclone does a fallocate on the local FS, then writes to the allocated space in multiple threads, making sure every thread writes to the correct position within the allocated space? This should never work with ZFS: because of CoW's nature, all writes should go to the beginning of free space rather than overwrite the preallocated space, yet that sure does not look like what is happening. I wonder what I'm missing here.
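To make the write pattern concrete, here is a rough sketch (hypothetical, not rclone's code) of what a multi-stream download looks like from the filesystem's point of view: several goroutines writing their own chunk of the same file at fixed offsets with WriteAt.

```go
package main

import (
	"log"
	"os"
	"sync"
)

func main() {
	const chunkSize = 8 << 20 // 8 MiB per stream, made-up value
	const streams = 8

	f, err := os.Create("multistream.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var wg sync.WaitGroup
	for i := 0; i < streams; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// Stand-in for data pulled from the remote.
			buf := make([]byte, chunkSize)
			// Each stream writes at its own logical offset; where the
			// blocks land physically is entirely up to the filesystem.
			if _, err := f.WriteAt(buf, int64(n)*chunkSize); err != nil {
				log.Printf("stream %d: %v", n, err)
			}
		}(i)
	}
	wg.Wait()
}
```

The offsets here are purely logical; on a CoW filesystem they say nothing about physical placement, which is the heart of the question.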
It is an interesting question, but I think it requires some deep ZFS expertise to answer properly. I do not think rclone does anything special here, as it is filesystem agnostic.
I would speculate that, since ZFS batches up chunks of data to be written to disk, even multiple streams can be nicely combined into a single, bigger transaction group - especially with sync=disabled, which is most likely the default.
sync=standard is the default, so applications can control sync/async writes themselves. ZFS aggregates async writes in memory and flushes them to disk in transaction groups every 5 seconds by default (the ZIL only comes into play for synchronous writes). In my case I'm pulling data from Dropbox at ~110 MB/s in 8 threads. I can see multiple files growing in size on the ZFS side; during the transfer they seem to be fragmented, but once the transfer is done every file contains a single segment.
I don't know anything about zfs, but I do know that if you don't use sparse files with rclone, then the OS will zero fill them when rclone seeks beyond the end of the written data.
This isn't very efficient, as the files get written twice, but the files won't be fragmented.
It may be that zfs is very efficient at the zero writing, though.
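A quick way to see the difference between a zero-filled gap and a sparse hole is to compare a file's apparent size with the blocks actually allocated on disk. A small sketch of that check (assuming Linux and golang.org/x/sys/unix):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Create("gap.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write a few bytes 1 MiB past the start: the gap before the offset
	// reads back as zeros either way, but whether it occupies real blocks
	// (zero fill) or a hole (sparse) depends on the filesystem.
	if _, err := f.WriteAt([]byte("end"), 1<<20); err != nil {
		log.Fatal(err)
	}

	var st unix.Stat_t
	if err := unix.Stat("gap.bin", &st); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("apparent size: %d bytes, allocated on disk: %d bytes\n",
		st.Size, st.Blocks*512)
}
```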