What is the problem you are having with rclone?
This is documentation related. The features I need are already present in rclone, but their behaviour was not clear to me until now.
I was struggling to understand why rclone was re-copying files that had not changed.
After a lot of testing I realised it was because the source and target use the same remote, which enables server-side copy. Server-side copy sounds like a good idea, but in my case it actually performs worse than downloading/uploading the objects.
I think the documentation could be made clearer to indicate how server-side copy relates to metadata. Suggestion below.
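For anyone hitting the same confusion, a quick way to see whether a backend advertises server-side copy at all is to print its feature list. A small sketch — `afs1:` is the remote name from my config below; substitute your own:

```shell
# Print the optional features the backend advertises for this remote.
# "Copy": true means rclone can ask the backend to copy objects
# server-side, which it will use whenever the source and destination
# point at the same remote.
rclone backend features afs1:
```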
Run the command 'rclone version' and share the full output of the command.
rclone v1.64.2
- os/version: amazon 2 (64 bit)
- os/kernel: 4.14.322-246.539.amzn2.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.21.3
- go/linking: static
- go/tags: none
Which cloud storage system are you using? (eg Google Drive)
s3
I am copying data from an s3 bucket to an s3 bucket via an access point.
The command you were trying to run (eg rclone copy /tmp remote:tmp)
Example commands and stats below. The order is important: performance varies depending on whether the destination is empty or full. The source data, source bucket, and destination access point are identical. The destination prefix changes between commands 2 and 3.
# Command 1: Implicit server-side copy, empty destination.
time rclone copy afs1:<source bucket>/<source prefix> afs1:<dest access point>/test_same --transfers 96 --checkers 96 --stats 10000h --stats-log-level NOTICE
2023/11/08 09:34:06 NOTICE:
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Transferred: 156795 / 156795, 100%
Server Side Copies:156795 @ 7.305 GiB
Elapsed time: 5m20.3s
real 5m20.442s
user 2m29.040s
sys 0m25.132s
# Command 2: Implicit server-side copy, full destination after #1.
time rclone copy afs1:<source bucket> afs1:<dest access point>/test_same --transfers 96 --checkers 96 --stats 10000h --stats-log-level NOTICE
2023/11/08 09:39:55 NOTICE:
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 156795 / 156795, 100%
Transferred: 156799 / 156799, 100%
Server Side Copies:156799 @ 7.307 GiB
Elapsed time: 5m5.1s
real 5m5.244s
user 4m16.633s
sys 0m39.394s
Notice how the second command takes around the same time, and re-copies everything.
In commands 1 & 2 no user metadata is added to the destination objects.
# Command 3: No server-side copy, empty destination
time rclone copy afs1:<source bucket>/<source prefix> afs1-2:<dest access point>/test_diff --transfers 96 --checkers 96 --stats 10000h --stats-log-level NOTICE
2023/11/08 09:48:27 NOTICE:
Transferred: 7.307 GiB / 7.307 GiB, 100%, 33.643 MiB/s, ETA 0s
Transferred: 156800 / 156800, 100%
Elapsed time: 6m16.4s
real 6m16.502s
user 3m38.027s
sys 1m17.418s
# Command 4: No server-side copy, full destination after #3
time rclone copy afs1:<source bucket>/<source prefix> afs1-2:<dest access point>/test_diff --transfers 96 --checkers 96 --stats 10000h --stats-log-level NOTICE
2023/11/08 09:51:29 NOTICE:
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 156800 / 156800, 100%
Elapsed time: 2m31.9s
real 2m32.020s
user 2m3.796s
sys 0m18.942s
Notice how command 3 is a bit slower than command 1, but command 4 is much faster than command 2.
In practice, if rclone is used to replicate an "incrementally changing" dataset, then the destination is likely to be empty only on the first run, and subsequent runs would be much faster if the behaviour of commands 3 & 4 were used when only a few files change.
This is probably only true when there are many tiny files (per above), as I understand larger files would take longer to download and upload.
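As a side note for anyone debugging similar behaviour: a dry run at debug verbosity shows, per file, why rclone decides to transfer or skip. The paths here are the same placeholders used in the commands above:

```shell
# --dry-run makes no changes; -vv logs the per-file decision
# (e.g. "Unchanged skipping" vs. messages about differing sizes,
# hashes, or modification times).
rclone copy afs1:<source bucket>/<source prefix> afs1:<dest access point>/test_same \
    --dry-run -vv --checkers 96
```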
The rclone config contents with secrets removed.
rclone config show
[afs1]
type = s3
provider = AWS
env_auth = true
region = af-south-1
location_constraint = af-south-1
no_check_bucket = true
server_side_encryption = aws:kms
[afs1-2]
type = s3
provider = AWS
env_auth = true
region = af-south-1
location_constraint = af-south-1
no_check_bucket = true
server_side_encryption = aws:kms
A log from the command with the -vv flag
(not included)
Suggestion
I think the docs regarding server side copy could be improved as follows:
1. Explicitly state that when a server-side copy is done, the source object's metadata is copied as-is: rclone will not add further user metadata such as mtime or checksum to the destination objects, so subsequent copies might not perform better even if objects have not changed. If server-side copy is disabled, objects will be re-copied at least once by rclone (possibly more, if I understand correctly how metadata is maintained on objects). But assuming the remotes are unchanged when rclone runs again, server-side copy will happen, and at least for S3 it does not seem to detect matching objects as efficiently as rclone itself could.
There is probably a better way to word this, and server-side copy probably works fine when files are larger than in my case.
Worth mentioning: my buckets are encrypted with different KMS keys, causing different ETags - maybe the s3 backend does not determine a match in that case.
2. I think the docs on server-side copy should also call out how to disable it, for anyone who wants to avoid the aforementioned issue:
--disable copy
At first I thought it wasn't possible, because I couldn't find a global flag for it. The forum helped.
Below is an example showing that performance on the second run is better with this flag.
# Command 5: Server-side copy disabled, empty destination
time rclone copy afs1:<source bucket>/<source prefix> afs1:<dest access point>/test_same2 --transfers 96 --checkers 96 --stats 10000h --stats-log-level NOTICE --disable copy
2023/11/08 09:59:15 NOTICE:
Transferred: 7.307 GiB / 7.307 GiB, 100%, 33.033 MiB/s, ETA 0s
Transferred: 156800 / 156800, 100%
Elapsed time: 6m8.3s
real 6m8.445s
user 3m36.750s
sys 1m20.406s
# Command 6: Server-side copy disabled, full destination
time rclone copy afs1:<source bucket>/<source prefix> afs1:<dest access point>/test_same2 --transfers 96 --checkers 96 --stats 10000h --stats-log-level NOTICE --disable copy
2023/11/08 10:17:23 NOTICE:
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 156800 / 156800, 100%
Elapsed time: 3m42.7s
real 3m42.775s
user 2m3.230s
sys 0m20.286s
As expected, similar performance to #3 and #4, but without the need for a duplicate remote in the config.
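Finally, to verify the metadata difference behind suggestion 1, the objects themselves can be inspected. The paths are the same placeholders as above, and `lsjson --metadata` needs a reasonably recent rclone (metadata support is from v1.59):

```shell
# List destination objects with their full metadata; after a server-side
# copy I would expect no rclone-added mtime entry, whereas after a
# download/upload copy rclone stores the source modification time.
rclone lsjson --metadata afs1:<dest access point>/test_same | head

# Compare source and destination; with buckets encrypted under different
# KMS keys the ETag-derived checksums may not be usable, which could
# explain why matching objects are not detected.
rclone check afs1:<source bucket>/<source prefix> afs1:<dest access point>/test_same
```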