Keeping 2 buckets sort of in sync

Hi guys, I'm writing the same data to 2 buckets: one has a retention policy of 90 days, the other a year... I can't just sync them; I need anything lost on one write to be filled back in from the other. Any suggestions? I plan on running a cron once a day to sort it out. It's a few petabytes of data, so any kind of scan is pretty slow. Thanks to the community for being so helpful, you've helped me through a few hurdles so far :o)

I don't quite understand what you are trying to achieve.

Is it

  • when an object is missing from the 90 day bucket
  • you copy it from the 1 year bucket?

How do you determine when an object is missing? Why not set the retention policy of both to 1 year? I've probably misunderstood something!

Hey Nick, so we have an outside collection unit posting videos to these redundant buckets. Each large bucket has 1 petabyte of data, and the small one has 90 days of data (still a lot of data). I'm trying to figure out how to rsync between them without removing anything from either the large or the small buckets. Thanks for the suggestions!

Is this the workflow

  • outside collection unit posts videos to 90-day bucket and 1-year bucket
  • you want to make sure that two copies of each video that they posted exist (maybe one of the uploads failed)
  • you don't want to copy all the data from the 1-year bucket into the 90-day bucket

Is that it?

Can you describe also how the files are arranged? Are they in dated folders for instance or arranged some other way?

Hey Nick, yes, that is the flow. The files are stored with a timestamp plus a bit of other info. It should be very rare that we can't write to both, but we have to cover that case just in case... Is it possible to rsync on a date parameter? I could just go back 1 or 2 days and run it on a cron. Just throwing some ideas around, I appreciate any suggestions. Thanks

All files are just stored in the root - there's no concept of directories in S3, just metadata - so we just make sure the names are all unique. Here is an example: 23332_1004-1-1-1-6-1-1_1575241148.579.ts

Do the files always have the same names in both buckets? If so, isn't it just a matter of limiting the age of the files checked?

You could just do a copy in each direction (both ways, a to b and b to a), ignoring the actual file dates. You could compare with a checksum (--checksum), which would be really expensive, or with file size and name (--ignore-times). You could also limit the copy to a window shorter than your lower retention of 90 days (--max-age). Maybe 60 days back? Or even less.
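
For example, something like this, where s3:a and s3:b are just placeholder remote paths for your two buckets:

rclone copy --checksum --max-age 60d s3:a s3:b
rclone copy --checksum --max-age 60d s3:b s3:a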

S3 has no concept of directories, but rclone can use the directory separators to break syncs down into smaller pieces.

OK!

That would be possible...

Are the files uploaded with rclone? I'll assume not

Let's say you've got the two buckets called little and large

What you could do is something like this

rclone copy --use-server-modtime --max-age 2d --checksum --no-traverse s3:little s3:large

What that will do is find files uploaded in the last 2d (--max-age and --use-server-modtime) and copy them from little to large if the --checksum does not match. --no-traverse will ensure it doesn't list the destination - it will only do HEAD requests to check the objects are there.

You'd then run it again reversing little and large.
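
For completeness, the reverse direction is the same command with the source and destination swapped:

rclone copy --use-server-modtime --max-age 2d --checksum --no-traverse s3:large s3:little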

This will, alas, have to list the large bucket - I can't see a way around that without writing a specialised script which knows how your file names work.

If you want to see what files would be checked then do

rclone ls --use-server-modtime --max-age 2d s3:little

If you are using two buckets in the same region then rclone will use a server-side copy. However, if you have two different remotes, say s3east: and s3west:, then rclone will use your bandwidth to do the copying.
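
For reference, two separate S3 remotes in different regions would look roughly like this in rclone.conf - the remote names, regions and auth method here are just placeholders:

[s3east]
type = s3
provider = AWS
env_auth = true
region = us-east-1

[s3west]
type = s3
provider = AWS
env_auth = true
region = us-west-2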

I hope that helps!

Thinking about it more, I'd probably use --size-only instead of --checksum above as it will be slightly more efficient, and you are only really interested in whether the file is present or not.
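
That is, the earlier command would become something like:

rclone copy --use-server-modtime --max-age 2d --size-only --no-traverse s3:little s3:large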

In fact what would you want to happen if the file was present in both places but different?

Thanks guys! The files should be the same in both since they come from the same source. Let me digest this and try some things out. I've been uploading files with rclone for 4 months; production will start writing them in a bit, so it will be a mix of both for a while.

much appreciated :o)

Good luck! Let us know if you need any more help.

Hi guys, the copy works great, and I'm glad I can specify how far to look back. I have one last question... If I have the same file name in both buckets but the sizes are different (we had issues writing the whole video to one bucket), is there a way to only overwrite a file if the source is bigger?

I've been testing with this scenario, both names are the same:

Small bucket - xxxx.ts = 17.75 MB

Large bucket - xxxx.ts = 15.6 MB

Hopefully something easy here. In my tests I'm clobbering the bigger file with the smaller one.

Thanks!

There isn't an option to keep the bigger file. The source will overwrite the destination if it's different. You could specify a min/max size but that isn't what you're looking for.

Thanks, I was going through all the flags and was wondering if I just missed it. I'll see what I can do with min/max, thanks for the suggestions.

Might be best to script that though. You could identify differing sizes using --dry-run and parsing the output.

# rclone copy h g --size-only --dry-run -vv 2>&1 | grep "Sizes differ"
2020/04/14 13:46:17 DEBUG : 1: Sizes differ (src 2 vs dst 7)

Then perhaps compare those sizes and transfer just the right one to the right location depending on that check. That output gives both sizes, so you can process it based on that (omit the file from the copy and adjust it manually).
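
A very rough sketch of that idea, using the bucket names from the examples above and the log format shown here - untested, so try it on a small set first:

#!/bin/bash
# list files whose sizes differ, then keep whichever copy is bigger
rclone copy s3:little s3:large --size-only --max-age 2d --dry-run -vv 2>&1 |
  grep "Sizes differ" |
  while read -r line; do
    # example line: 2020/04/14 13:46:17 DEBUG : 1234.ts: Sizes differ (src 2 vs dst 7)
    name=$(echo "$line" | sed 's/.*DEBUG : \(.*\): Sizes differ.*/\1/')
    src=$(echo "$line" | sed 's/.*(src \([0-9]*\) vs dst \([0-9]*\)).*/\1/')
    dst=$(echo "$line" | sed 's/.*(src \([0-9]*\) vs dst \([0-9]*\)).*/\2/')
    if [ "$src" -gt "$dst" ]; then
      # the copy in little is bigger - overwrite the one in large
      rclone copyto "s3:little/$name" "s3:large/$name"
    else
      # the copy in large is bigger - push it back the other way
      rclone copyto "s3:large/$name" "s3:little/$name"
    fi
  done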

Better yet, use check? But you won't get the file sizes in the output, so dry-run with copy might be better!

# rclone check h g -vv 2>&1 | grep "Sizes differ"
2020/04/14 13:50:16 ERROR : 1: Sizes differ

You could do your syncs as above with --ignore-existing which will sort out all the missing files first but leave the differing sizes alone.

Then doing the rclone copy -vv --dry-run little: large: and grepping out the Sizes differ as @calisro suggested would be one way to go.
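
So a daily pass might look roughly like this - the flag combination is just a suggestion based on the commands above:

rclone copy --use-server-modtime --max-age 2d --no-traverse --ignore-existing s3:little s3:large
rclone copy --use-server-modtime --max-age 2d --no-traverse --ignore-existing s3:large s3:little
rclone copy --dry-run -vv --size-only --max-age 2d s3:little s3:large 2>&1 | grep "Sizes differ"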

Thanks Nick, I like the --ignore-existing approach. This should be a rare situation, but at least if one copy was good I wouldn't clobber it, and I could recover it if needed.

Appreciate the suggestions guys!

The compare works great. I think I'll just run this on a cron and send an email to investigate if there ever are any mismatches. Thanks again guys!
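
In case it helps anyone else, the kind of crontab entry I have in mind looks like this - the schedule, remote names, temp file and mail command are all placeholders:

0 3 * * * rclone check s3:little s3:large --size-only --max-age 3d 2>&1 | grep "Sizes differ" > /tmp/bucket-mismatch.txt; [ -s /tmp/bucket-mismatch.txt ] && mail -s "bucket size mismatches" you@example.com < /tmp/bucket-mismatch.txt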
