Copying data of a live service, would like to re-evaluate checksums if they fail

What is the problem you are having with rclone?

I am attempting to copy the entirety of one bucket into a newly-created subfolder in another. The source bucket is being used by a live service and I'm attempting to back up its contents in a best-effort manner. Since the source bucket is in use, some files change during the transfer and do not appear in the destination bucket after the copy has completed. It's worth noting the source bucket has many files (~2 million), so the transfer can take some time, which is part of why files change during it.

I would like to keep the files that fail the checksum, particularly after all low-level retries have completed, while still performing the checksum rather than turning it off, so that files aren't corrupted by the transfer itself. Ideally, having rclone re-evaluate the source hash to check whether it has changed would solve this.

Run the command 'rclone version' and share the full output of the command.

rclone v1.64.0

  • os/version: ubuntu 22.04 (64 bit)
  • os/kernel: 5.15.0-1049-aws (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.21.1
  • go/linking: static
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

AWS, GCP and Azure

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy --config=<config_file> object-remote:<source_bucket> object-remote:<dest_bucket>

Please run 'rclone config redacted' and share the full output. If you get command not found, please make sure to update rclone.

[object-remote]
type = s3
provider = AWS
env_auth = true
acl = private
list_version = 2
no_check_bucket = true
bwlimit = 5M
transfers = 128
checkers = 128
retries = 16
checksum = true

The 16 retries are an attempt to get a successful backup, or to catch files that were missed, but it's very wasteful.

I don't think a log would be particularly helpful, since I'm describing an odd use case that may not be supported rather than trouble getting the tool to work.

Trying to copy live (constantly changing) data is never a good idea. There is no really good way to ensure data integrity using only rclone, IMO.

I think a better option for copying data of a live service would be to use S3 bucket versioning. Enable versioning and create lifecycle rules to keep versions for a limited time. Then you can run your copy operation on the state of your bucket as it was, e.g., 1h ago.
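
As a rough sketch (bucket names, the retention window and the 1h offset are placeholders, and the lifecycle rule for expiring old versions would be set up separately), this could look something like:

aws s3api put-bucket-versioning \
    --bucket <source_bucket> \
    --versioning-configuration Status=Enabled

# the connection-string form below scopes version_at (the s3 backend's
# --s3-version-at option, available in recent rclone releases) to the
# source remote only, so the destination stays writable
rclone copy --config=<config_file> \
    object-remote,version_at=1h:<source_bucket> \
    object-remote:<dest_bucket>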

I think this is a really good suggestion; I'll see how I might be able to get that to work.

welcome to the forum,

not sure if you are doing a one-time transfer or something to be repeated on a schedule/loop?

to reduce the number of api calls, might want to check out
--max-age, --fast-list, --no-traverse, --no-check-dest and https://rclone.org/s3/#reducing-costs

and these are global flags, as such, do nothing in the config file.
--bwlimit, --transfers, --checkers, --retries, --checksum
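
for example, something along these lines, with the globals moved onto the command line - the values are just illustrative, and --max-age/--no-traverse only make sense for a repeated top-up copy rather than a one-off:

rclone copy object-remote:<source_bucket> object-remote:<dest_bucket> \
    --config=<config_file> \
    --checksum --fast-list \
    --transfers 128 --checkers 128 \
    --bwlimit 5M --retries 16 \
    --max-age 24h --no-traverse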

Thanks for the tip about those flags! I had no idea there was a difference, so I'll be sure to fix that. I'm doing this on a schedule to have up-to-date backups for this service. Ideally it would be a daily backup or faster.
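
Roughly what I have in mind for the schedule is something like the crontab entry below (paths and times are placeholders, and it borrows the version_at idea from above):

# daily backup at 02:00, reading the source bucket as it was 1h ago
0 2 * * * rclone copy --config=<config_file> --checksum --fast-list object-remote,version_at=1h:<source_bucket> object-remote:<dest_bucket> --log-file=/var/log/rclone-backup.log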