Copy/Sync portion of a local file only

I'm using rclone to copy a set of append-only files to cloud storage and want to teach rclone to copy these files only up to the length they had at a chosen moment in time, which may be before rclone itself is ever run. This must also be persistent so that rclone calls can be retried. The behaviour with --no-check-updated is very close to what I need, but it is not persistent and only captures the size of the file when the file is opened.

I'd like to add a new option, or extend an existing one, to rclone to allow the user to specify what portion of a file to copy from the source, something like:

rclone sync --include file1,100 --include file2,150 src dst

which would copy only the first 100 bytes of file1 and the first 150 bytes of file2.

Locally I have achieved a satisfactory solution for my needs: I add an extended attribute directly to the files in question, and if it is present rclone uses it instead of the file's actual length.
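
For illustration, here's a minimal sketch of how the size gets recorded, using the github.com/pkg/xattr package (the attribute name matches what my patch further down reads; namespace handling may need adjusting per platform):

package main

import (
	"fmt"
	"os"
	"strconv"

	"github.com/pkg/xattr"
)

// recordSize stores the file's current length in an extended attribute
// so that a patched rclone can later treat it as the logical file size.
func recordSize(path string) error {
	fi, err := os.Stat(path)
	if err != nil {
		return err
	}
	// Attribute name matches what the patch reads ("rclone_size").
	return xattr.Set(path, "rclone_size", []byte(strconv.FormatInt(fi.Size(), 10)))
}

func main() {
	for _, path := range os.Args[1:] {
		if err := recordSize(path); err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", path, err)
		}
	}
}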

I've started this topic to get the community's view on what an acceptable API for this feature would be.

That won’t be possible as the cloud providers don’t support byte level copying in general.

I think you've misunderstood my post, which is likely the fault of my description (I can say it is certainly possible, as I'm already doing it).

Imagine a file on disk, of size X. Rclone today will copy bytes 0-X to your chosen backend. If that file then grows (to size Y), the next time you do rclone sync (or copy, etc), rclone will copy bytes 0-Y to your chosen backend.

All I'd like to do is override what rclone thinks the file size is.

My situation is this: I have a set of files (that are only ever appended to) that I'd like to copy to S3. I'd like to first record the sizes of all these files, and then ask rclone to copy only that part of them, so that even if rclone takes minutes, or hours, to copy these files, I still have, in S3, the files as they were at the moment I recorded their sizes.

I can get close to the effect I want with --no-check-updated if I set --concurrency to the number of files in the set. All the file lengths are captured (in memory) by rclone, and rclone stops reading each source file at that point.

All I'm trying to achieve is persistence (in case rclone fails partway through the upload of this set of files, say).

Hopefully it's clear that the idea here has nothing to do with whether any cloud provider supports byte level copying.

I'm still not quite following, so perhaps a use case might help.

If you copy a file at 12:00, regardless of the size, it goes to the backend.

If the file is changed and you copy the file again at 12:01, it will copy the new file and replace the old file (certain flags aside, but generally that's the flow).

If a file is being actively written to, rclone will abort the copy and, depending on some flags, maybe try again or fail out and produce an error.

I am not sure what you are requesting that's different from the flow that exists now.

Can you share an example of what you'd want to happen in terms of steps and results?

"If if a file is being actively written to, rclone will abort the copy and based on some flags, maybe try again or fail out and produce an error." -- that isn't always true, the --no-check-updated option exists specifically to allow rclone to complete the successful upload of a file that's being appended to during the upload. It causes rclone to record the size of the file when it opened it and suppresses the final "did the file size change?" check at the end.

My case is this: I'm backing up the shards of an Apache CouchDB database. These files are append-only, so any prefix of one of these files represents an earlier state of the database. To make a useful backup I need to record the size of each shard file of a given database and only upload that much of each file to my object storage backend. These files can be large, and we may apply an upload bandwidth limit too, so the upload can take a while. Should there be any issue with the upload (beyond rclone's retry limit), I want to be able to try again and copy the same prefixes of those files. The hosting server might crash too, and I'd like to be able to try again after it comes back. Hence the need to inform rclone, through some means, which prefix of each file I want to copy.

Another way to do this, which might help illustrate my intention, would be to take an LVM snapshot and then tell rclone to copy the files from the mounted snapshot volume. LVM takes care of hiding the fact that the files are growing, and it's this same notion of a logical file size, instead of the actual file size, that I'm looking to add as an option. I'd like to achieve this without LVM (or an equivalent).

Here's exactly how I'm doing this successfully today, but I suspect extended attributes are a bit too weird for general users, and there are cross-platform issues (that don't affect me personally):

diff --git a/backend/local/local.go b/backend/local/local.go
index 3c00d5389..37f8a6ab5 100644
--- a/backend/local/local.go
+++ b/backend/local/local.go
@@ -1296,6 +1298,14 @@ func (o *Object) setMetadata(info os.FileInfo) {
 			o.size = int64(len(linkdst))
 		}
 	}
+	rclone_size, err := xattr.Get(o.path, "rclone_size")
+	if err == nil {
+		size, err := strconv.ParseInt(string(rclone_size), 10, 64)
+		if err == nil {
+			o.size = size
+			o.fs.opt.NoCheckUpdated = true
+		}
+	}
 }

This is clearly hackish and not suitable to merge, but it shows how I'm achieving my goal. I opened the topic here to see if there's general interest in the feature and, if so, what form it should take in terms of command line arguments, etc.

That's --local-no-check-updated, sorry.

Right. If you use a flag to turn off that behaviour, then of course it no longer applies; I was describing the general flow, much like rclone won't retry if I disable retries. I didn't account for every flag, as we'd spend pages writing out every flow and exception to the process.

The best spot would be to start a pull request with your code; comments generally flow better there than here, in my experience. Your use case seems better off just using a file system that allows snapshots, though, as I don't generally see a huge number of folks doing what you are describing, but I've been wrong before and I'm sure if I try hard enough, I'll be wrong again :slight_smile:

You'll get more attention from the development folks on a pull request.

Thanks. The contribution guidelines on GitHub directed me to open a forum topic first, hence the above, but I'm happy to move to a PR instead.


Yep, and we fleshed out some confusion (mine) to help you get a better start there, so my work here is done :slight_smile:

It has its own limitations, but could you use rcat with individual rclone calls?

$ head -c <number-of-bytes> <myfile> | rclone rcat s3:<bucket>/path/to/file/dump

If your remote doesn't support streaming uploads, it is no better than first copying out the leading <number-of-bytes>, but since S3 supports it, it's not a problem.
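
If it helps, here's a rough sketch of driving that per file from Go (the file paths, sizes and bucket layout are made up for illustration; rclone rcat reads the object body from stdin):

package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

// uploadPrefix streams only the first n bytes of path to "rclone rcat",
// which reads the object body from stdin (the same effect as head -c).
func uploadPrefix(path string, n int64, remote string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	cmd := exec.Command("rclone", "rcat", remote)
	cmd.Stdin = io.LimitReader(f, n) // stop reading after n bytes
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical set of files and the sizes recorded earlier.
	files := map[string]int64{
		"/data/shard1.couch": 1 << 20,
		"/data/shard2.couch": 2 << 20,
	}
	for path, size := range files {
		if err := uploadPrefix(path, size, "s3:bucket/backup"+path); err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", path, err)
		}
	}
}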

Speaking of which, are the files so big that you can't copy them to a new location (and use some tool to copy only some of the bytes)?

Hi Robert,

I think it would be even better to open a GitHub Issue first, to be able to discuss design, flag naming, syntax/parsing, unit tests, etc. before making the actual code and PR.

Not necessarily; I can see both Issues and PRs stalling due to lack of time/interest/focus from the (core) development folks.

We, like many others, have a lot more ideas/wishes than we have development time/competencies available.

Agreed, this is probably the road where you have the best chance of quickly getting a production-quality solution with the lowest effort (or, alternatively, the approach proposed by @jwink3101 above).

Though I may be wrong too :slight_smile:

That would work, but it's not appealing for a number of other reasons; it's a thoughtful suggestion though, thank you. I still want to benefit from rclone's management of multiple uploads, the rate limiting, being able to run it as a daemon and query job status, and so on; otherwise I would just write a separate (non-rclone) script to do the upload instead.

Yes, the files are too big to copy locally before uploading.

I'm keen to avoid retrofitting existing systems with a snapshot facility (LVM or otherwise). If I were starting from scratch, sure.

I'll make an Issue for it then, thank you.

