Question about the behavior of `rclone copy`

Can I get a brief summary of how rclone copy works?

  1. Let's say I have files I want to copy over from S3 to Azure Blob, and I do this every single day as a cron job. Will rclone skip files that already exist in Azure Blob?

  2. Assuming 1 is true, how does rclone determine whether a file has already been copied over? With a hash? Is this hash calculated on the client running rclone, or do files in S3 come with a hash? I want to know whether rclone still downloads all the files onto the client machine before determining if they are to be copied over.

Hello and welcome to the forum,

this page explains how rclone copy works:
https://rclone.org/commands/rclone_copy/

  1. rclone will not re-copy files that are unchanged; changed source files will over-write the copies in the destination, but that can be tweaked
  2. rclone by default uses mod-time and size, and perhaps checksum; it depends on the flags

keep in mind that s3 and azure do not necessarily use the same hash types
https://rclone.org/overview/#features

still downloads all the files

no, rclone will try to compare mod-time and/or file size; again, it depends on what flags you use.

Hi, and welcome!

Let's say I have files I want to copy over from S3 to Azure Blob, and I do this every single day as a cron job. Will rclone skip files that already exist in Azure Blob?

True.

how does rclone determine whether the file is copied over?

From the docs: "Doesn't transfer unchanged files, testing by size and modification time or MD5SUM."

Now what does this actually mean? I find it a bit confusing myself, but this post sums up the comparisons rclone performs to decide whether a copy/upload should happen, along with some configurable options:

There are 3 main syncing methods

* no flags - (size, modtime)
* --size-only (size)
* --checksum (size, checksum)

Then there are the modifiers

* --ignore-size makes all of the above skip the size check
* --ignore-times - uploads unconditionally (no checks) 
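The three methods and two modifiers above can be sketched as a small decision table. This is a rough model written purely for illustration, not rclone's actual code; the function and its dict-free parameter style are invented here, though the parameter names mirror the real CLI flags:

```python
# Illustrative model (not rclone source) of which attributes each
# flag combination compares before deciding whether to transfer.
def compared_attributes(size_only=False, checksum=False,
                        ignore_size=False, ignore_times=False):
    """Return the set of attributes rclone would compare."""
    if ignore_times:
        return set()                      # --ignore-times: transfer unconditionally
    if size_only:
        attrs = {"size"}                  # --size-only
    elif checksum:
        attrs = {"size", "checksum"}      # --checksum
    else:
        attrs = {"size", "modtime"}       # default: size + modification time
    if ignore_size:
        attrs.discard("size")             # --ignore-size removes the size check
    return attrs

print(sorted(compared_attributes()))      # ['modtime', 'size']
```

So, for instance, `compared_attributes(checksum=True)` yields size plus checksum, matching the `--checksum` bullet above.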

Is this hash calculated on the client running rclone or do files in S3 come with a hash?

All of this information is retrieved from the source and destination backends, so rclone uses the file size, timestamp and hash information as reported by S3 and Azure Blob in your case. If both S3 and Azure keep file hashes of the same type (and judging from this, both use MD5), that hash will be used whenever rclone wants to compare hashes. If there is no common hash type, rclone will not be able to compare hashes during copy/sync (but see the note about the check command below).
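In other words, rclone can only compare checksums when the two backends report a hash type in common. A tiny sketch of that negotiation (the helper is invented for this example; the hash sets are simplified illustrations, with S3 and Azure Blob both reporting MD5 as described above):

```python
# Illustrative sketch: checksum comparison needs a hash type that
# both backends support. Hash sets are simplified examples.
def common_hashes(src_hashes, dst_hashes):
    """Hash types usable for source/destination comparison."""
    return src_hashes & dst_hashes

s3 = {"md5"}
azureblob = {"md5"}
otherbackend = {"sha1"}          # hypothetical backend with no MD5

print(common_hashes(s3, azureblob))     # {'md5'} -> checksums can be compared
print(common_hashes(s3, otherbackend))  # set()   -> no hash comparison possible
```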

I want to know if rclone still downloads all the files onto the client machine before determining if they are to be copied over.

No, it does not. However, there is a check command which you can use to compare without copying anything, and it has a --download option that does exactly this: download files from both remotes and compare them on the client.
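Conceptually, `rclone check --download` fetches the content from both sides and compares it on the client instead of trusting backend-reported sizes or hashes. A minimal sketch of that kind of client-side comparison (the helper is invented; in-memory streams stand in for the two remotes):

```python
# Sketch of client-side content comparison, as `check --download`
# does conceptually: read both streams chunk by chunk and compare.
import io

def streams_identical(a, b, chunk_size=64 * 1024):
    """Compare two binary streams chunk by chunk."""
    while True:
        ca, cb = a.read(chunk_size), b.read(chunk_size)
        if ca != cb:
            return False     # content (or length) differs
        if not ca:
            return True      # both streams exhausted at the same point

src = io.BytesIO(b"hello world")   # stands in for the S3 object
dst = io.BytesIO(b"hello world")   # stands in for the Azure blob
print(streams_identical(src, dst))  # True
```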


Another great post with good points to merge into the docs!

:slight_smile:


Thanks. As I am sure you can agree, writing documentation for something like this is hard work. Before even starting to write, it takes time to fully understand and get a clear picture of how it actually works.

...and I must admit I am still a bit confused about if and when the checksum is used. For instance, I found the following two posts stating different things. Edit: at first look they do seem to give different answers to whether the checksum is considered by default, but on second thought they are just written from different perspectives (see my own clarification in a later post):

Yup, always the same: trawling through reams of contradictory or vague statements! Two good things about rclone, though, are the rich seam of expertise in this forum, and that it is relatively easy to set up tests with a couple of remotes.

I find the following two posts stating different things

I was thinking you had stumbled on one of those areas.
:slight_smile:


To try to clear up the confusion (hopefully not add to it) I created myself, regarding the default mode and checksum handling:

  • Rclone will use file size and modification time, and only those, to decide if a file is different.
    • First it compares sizes.
    • If the sizes are equal, it compares timestamps.
  • If a file is deemed equal (size and timestamp both equal), the file will be skipped.
  • If a file is deemed different (different size and/or different timestamp), rclone will use the checksum to decide what to do with it.
    • If the checksums are equal, it will just update the timestamp of the destination to match the source, and not actually copy the file.
    • If the checksums are different, it will copy the entire file, replacing the existing one.

Edit: This means that, in default mode, checksums will not be compared unless either the file size or the modification time is found to be different first. (But there is a --checksum option that changes this, and there is also a separate check command that can be run after the copy to verify all checksums.)
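The default flow above can be sketched as a small function. This is an illustrative model of the steps as described, not rclone's actual source; the dict layout and function name are invented:

```python
# Illustrative model of the default (no flags) sync decision described
# above. src/dst are dicts with 'size', 'modtime' and 'checksum' keys.
def default_sync_decision(src, dst, common_hash=True):
    """Return 'skip', 'update-modtime' or 'copy' for one file pair."""
    # 1. Compare size, then timestamp.
    if src["size"] == dst["size"] and src["modtime"] == dst["modtime"]:
        return "skip"                   # deemed equal
    # 2. File looks different: consult checksums if a common hash exists.
    if common_hash and src["checksum"] == dst["checksum"]:
        return "update-modtime"         # same content, only fix the timestamp
    return "copy"                       # transfer and replace the file

a = {"size": 10, "modtime": 1000, "checksum": "aa"}
b = {"size": 10, "modtime": 2000, "checksum": "aa"}
print(default_sync_decision(a, b))      # update-modtime
```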

The use of checksums requires the source and destination to support the same algorithm. Updating timestamps is not supported by all backends, so rclone may have to re-upload a file even if the checksums are equal. And then there are all the options that change the default behaviour: --checksum, --size-only, --ignore-size, --ignore-times, ... I guess much of the confusion lies here.


That was a great explanation @albertony :smile:

Indeed. For everything rclone does there is an option to turn it off. It didn't start out that way but the public gets what the public wants!

By and large the syncing in rclone works exactly the same as it does in rsync, down to the naming of the flags.

The core algorithm with no flags is exactly how you described it.

Maybe we should write out the algorithms for the main modifier flags:

  • no flags
  • --checksum
  • --size-only

Maybe a diagram with flow charts or similar?

In general, don't use --ignore-size or --ignore-times; those are for advanced uses.

Actually, the most useful thing would be to tell new users which flags to ignore, or to make a hierarchy from most "useful flags" to "flags to avoid" when syncing. E.g.

  • --checksum
  • --size-only
  • ...
  • --ignore-size
  • --ignore-checksum
  • --ignore-times

I wonder whether rclone should have a separate sync page with this info on and description of the sync modifier flags?

(where sync is shorthand for sync/copy/move/copyto/moveto)


Sounds like a good idea to me!
To have a new page describing the main "sync" process in depth, alongside the already existing pages for "Installation", "Usage", "Filtering" etc. (It would be good to find a new term for this sync/copy/upload/whatever process, other than "sync", so that it is not so easy to assume it is limited to the sync command only... oh no, naming things..)

Rclone copy, sync and move are all about 'transferring' files. At the most basic level copy behaves like rsync, sync is like rsync with the --delete flag, and move is like rsync with the --remove-source-files flag. They would fit nicely together on one page. I did have a go at tackling that (without the nourishing capabilities of the flag soup) in the basic Linux syntax examples on the rclone Wikipedia page. https://en.m.wikipedia.org/wiki/Rclone


Rclone is commonly a front-end for media servers such as Plex,[9] Emby or Jellyfin[10] to stream encrypted content direct from consumer file storage services.[9]

This may be too much publicity of the wrong kind, since it encourages both abuse of cloud-storage services and piracy, because I don't think most of the folks using these kinds of setups have obtained the content legally.

A fair point. I have no idea whether the content is illegal or not and have to give the benefit of the doubt, though I do occasionally wonder why it is encrypted.

As an objective Wikipedia article I do think it relevant to list this significant use alongside the others.

One thought I have been toying with for a while is deleting 'encrypted' in the sentence.

The Jellyfin reference is important because it is a non-rclone.org one. I had quite enough trouble proving rclone was noteworthy!

I have avoided tackling the potential abuse of cloud service limits and the encryption of possibly pirated material, though there are probably enough balanced sources to nearly do it now.

Wikipedia is open to all to edit (with caveats for those with a financial etc. interest) and I would be pleased if you wanted to make a change to the page or its talk page. Alternatively any other suggestions here would be useful. It was something on my mind.

Ed.

Do you (plural) @albertony and/or @Edward_Barker want to work on a "sync" page? I don't want your very nice explanation to get lost @albertony!

I wouldn't want to stand in front of @albertony with this! I might be better continuing to root around on some of the less chatty stuff and there will eventually no doubt be some alleys to follow once my errors have been expunged from the 'filtering' page.

A sync page definitely has my total support.

Ed.


I might have a go at this, but I don't know when. I would not at all mind if @Edward_Barker or someone else starts on it first. I created an issue (feature request) for it; I think it is less likely to be forgotten there, and it is also a better place for further discussion of any details of the solution.

Thanks for making the issue. We can discuss more there.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.