Reverse Chunker

with many cloud providers being stingy with IO/s, backing up a server with several hundred thousand 4kb files can take days. if they could be compiled into 100mb chunks transparently by rclone, it would cut the transfer time down by a few days. i'm not sure how this would work technically, but i guess that's why i'm not on the rclone team and you awesome people are :slight_smile: cheers on the wonderful product

jared

Is this a one time sync or does it need to be kept updated?

If it is a one-time sync then you can use tar and pipe it into rclone rcat.
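Something like this, for example (the source path, remote name and archive name are just placeholders):

tar czf - /path/to/source | rclone rcat remote:backup/source-backup.tar.gz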

Updating big files in place is impossible on most cloud providers, so lots of small files which you wish to update regularly are very hard to pack into bigger chunks.

However restic, which is a backup program that can use rclone as a backend, can do this in a very clever way, combining lots of small files into larger chunks (optionally with compression).
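For example, restic can use an rclone remote directly as its repository (the remote name and repo path here are just placeholders):

restic -r rclone:remote:restic-repo init
restic -r rclone:remote:restic-repo backup /path/to/source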

The disadvantage is that you can only read the data back with restic, unlike a nice portable tar file or the 1:1 object mapping rclone normally does.

hey there :slight_smile:
this is a daily occurrence, as we use rclone as the backup for our production servers. and the biggest reason is as you mentioned: the 1:1 object mapping that rclone does. i'm hesitant to use a backup tool that holds exclusive read access to all our data :confused:

i s'pose this would require some kind of differential sync; only updating certain blocks of the file. this would also be a nice feature :innocent: but i can see now how this would be difficult. as it stands, it would require re-uploading the entire chunk just because a 4kb file changed. i guess my plan isn't as cool as i thought :unamused:

i'll def look into the rcat for archiving though. thanks!

jared

If you like that 1:1 mapping, then there are lots of tweaks you can do to make rclone go faster.

Which provider are you syncing to?

What flags are you using for the sync?

really? "lots" of things? :slight_smile: cool
we use microsoft sharepoint and onedrive personal. both are used as crypt remotes, and from there it's extremely nooby:

rclone sync --verbose --progress --transfers=lotsiffilesaresmall path/to/source remote:destination/path

the only thing i've ever looked into is the --max-age flag. but as it only works with copy, doesn't delete deleted files, and also doesn't upload renamed files (windows doesn't count renaming as modifying), i've abandoned it.

so that's where we are. start the syncs at 9-10 P.M. and they're usually done by morning :smiley:

jared

Both of those have a lot of rate limiting. There are some tips about setting your own user agent which might help with that.
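For example, something like this - the exact string is just illustrative, check Microsoft's guidance on decorating traffic for what they like to see:

rclone sync --user-agent "ISV|YourCompany|YourApp/1.0" /path/to/source remote:destination/path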

Using the --max-age flag with copy is a technique I call a top-up sync. It doesn't catch deletes or renames, but it does mean that a recent copy of your data is kept safe. So you could do a top-up sync once an hour and a full sync once a day, or something like that!

You should find a top-up sync with rclone copy --max-age runs much quicker. It may run quicker still if you add --no-traverse (not sure about onedrive/sharepoint - you'll have to try it).
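So roughly something like this (the ages and paths are only examples):

rclone copy --max-age 1h --no-traverse /path/to/source remote:destination/path   # hourly top-up
rclone sync /path/to/source remote:destination/path                              # nightly full sync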

What is taking the time in your syncs? Is it the initial checking phase or is it the transferring phase? Rclone normally runs them both concurrently, so it can be difficult to tell. You can use the --check-first flag to make rclone run them sequentially, which is a good idea if your servers have HDDs rather than SSDs.
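For example (just an illustrative invocation):

rclone sync --check-first --verbose /path/to/source remote:destination/path

With --check-first the log will show the whole checking phase finish before any transfers start, so you can see which one is eating the time.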

at first run this does not seem to make much of a difference, regretfully. (i'll test it some more) does microsoft have a form we can fill out to allow infinite requests as well as a free 10% share in their cloud business? i'd be ok with the classic limit of "only 1 applicant per household" on it.

it basically goes by requests. you can upload/transfer a terabyte or two and it won't make a fuss if the files are few and large. but if the files are many and small, or you're just scanning the remote for changes, they are quick to swat you. they'll allow you between 20,000 - 30,000 checks before they throttle you back to increments of ~2,000 checks every bunch of minutes. we have approximately 2 million files that we sync every night, so it's the multitude of requests that takes the time.

(the onedrive personal accounts are not nearly as stingy. they'll allow us probably half a million requests before they even think about throttling. that's why we switched to them for our daily syncs, working around the 1tb limit with rclone's union feature (it's awesome - rough config sketch below), and use sharepoint more for archiving.)

but many of these files are users' personal photo archives with hundreds of thousands of small phone pictures that hardly get touched more than once a year. that's why i was thinking the reverse chunker would still be beneficial for many of them, even now that i realize that (with a 250mb chunk) one 3mb photo changing would mean re-uploading the whole 250mb. it's also why i've been such a gigantic nag, periodically checking with the other users on the forum whether anybody's accumulated interest in running a local database for tracking changes. that would reduce the sync time to seconds. just some thoughts.
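for reference, the union remote looks roughly like this in our rclone.conf (names changed, and depending on your rclone version the key might be remotes instead of upstreams):

[od-union]
type = union
upstreams = od1:backup od2:backup od3:backup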

jared

If the remotes don't get changed outside your backup routine then something like the cache backend is what you want. This keeps a record of what is in the backend so you don't have to keep re-reading it.
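A minimal sketch of the config (the wrapped remote name and the info age are just examples):

[cached]
type = cache
remote = onedrive-crypt:backup
info_age = 48h

You'd then point your sync at cached: instead of onedrive-crypt: so the directory listings come out of the local database.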

You'll note, however, the cache backend is deprecated. I'm currently making a plan for turning the VFS cache into a new cache backend which will solve this problem in a maintainable way.

they don't. that's the only thing those remotes get used for.

this is a feature i've asked numerous guys on the forum about; nobody told me it's already in existence. what the fiddle. after running a few tests it seems to do exactly what i'm after. i just scanned a 400,000 file directory 7 times in one minute, where it took hours and hours before. in fact, this is so good, i'm getting suspicious now......

:ok_hand::+1::heart:

p.s. one question. what is a "chunk" in this case? i'm not very familiar with how databases work..

The cache backend works well for what it does.

However it was an evolutionary dead end as far as rclone development goes, as what people wanted was for it to be more tightly integrated with rclone mount.

The cache backend can also store data from the files themselves, which it does in chunks. I don't think you need this feature though; you should just store the metadata and none of the data. I can't remember how you configure this though!

all is good. as long as i know what it is so i don't misconfigure it and make it take longer than it should. might as well get greedy now.

the metadata is everything i wanna do with it anyway; we're not using rclone mount on our storage servers. yet :innocent: thank you for taking the time to listen to my ranting and finding the solution :pray:

jared

