How to automatically(?) put folders starting with a certain letter in a Google Drive for that letter? 26 drives

DataHoarder · March 9, 2020, 4:19pm

The title was hard to phrase, but long story short I have a few million files. I was uploading them encrypted into a single Google Drive through Rclone Browser. However, in this process it's dawned on me how... inefficient this is. I contacted Google support and they told me that a Shared Drive has a limit of 400,000 files, and that I should split my files across several Shared Drives to stay under that limit.

The only way this can work is if I create 27 Shared Drives; each drive represents a letter, 1 drive represents other characters (for example: "-"). I have almost 10,000 folders. Each folder's content is updated periodically so sorting by date to upload isn't an option.

What I need to happen is to use a command (or I guess 27 commands), that will clone the folders that start with "A" to the Google Shared Drive for "A", then one for "B", etc. All the files must be uploaded encrypted.

How would I go about this? Can I automate this into a single action? I was using Rclone Browser for the ease/convenience so I was hoping something with ease. I'm not a programmer.

Thank you

thestigma · March 10, 2020, 3:26am

It would be pretty easy to do what you ask by using the rclone filter system:
https://rclone.org/filtering/

For example, to move all files starting with "a" a command should look something like this:
rclone move C:\somefolder GdriveA:\ --include a*
(this is a Windows example; Linux will be pretty much the same, but put the pattern in quotes there so the shell does not expand it)
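Note that a bare `a*` pattern matches file names. Since the goal in this thread is top-level folders starting with a letter, a rooted pattern such as `/a*/**` is probably closer to what is wanted. Here is a minimal sketch in Python that only builds such a command and prints it, without running anything. The remote naming scheme (`GdriveA:`, `GdriveB:`, ...) and the source path are assumptions for illustration, and it uses `copy` rather than `move` so the local files stay in place. If the files must be encrypted, the destination would be a crypt remote wrapping the shared drive.

```python
def build_copy_command(source, letter, remote_prefix="Gdrive"):
    """Build an rclone argument list for top-level folders starting with `letter`.

    The remote name (remote_prefix + LETTER + ":") is a hypothetical naming
    scheme; use whatever names exist in your rclone config.
    """
    remote = f"{remote_prefix}{letter.upper()}:"
    # Leading "/" anchors the pattern at the source root; "/**" takes
    # everything below the matching folders.
    pattern = f"/{letter}*/**"
    # rclone filters are case sensitive unless --ignore-case is given.
    return ["rclone", "copy", source, remote,
            "--include", pattern, "--ignore-case"]


cmd = build_copy_command(r"C:\somefolder", "a")
print(" ".join(cmd))
```

Adding `--dry-run` to the list is a cheap way to see what rclone would transfer before committing to it.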

So to automate this, all you would need is really a script that runs all these commands in sequence. A relatively easy task. You can then set that to run automatically on a schedule.
If you are a filthy casual I can probably help you make that if needed. Linux or Windows? I am the most familiar with scripting in batch but this is not hard to do in bash either.

It is also simple to make this filter on for example a+b+c+d if you don't want to use more shared drives than you really need.
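A rough sketch of what such a script could look like, written in Python and only generating the commands rather than running them. The remote names and the letter groups are made up for illustration. It assumes the split is by top-level folder name, and it sends everything not covered by a letter group (digits, "-", and so on) to one extra remote.

```python
def plan_commands(source, groups, other_remote):
    """Return one rclone argument list per remote.

    groups maps a remote name to the letters it should hold, e.g.
    {"GdriveAD:": "abcd"}. All names here are hypothetical.
    """
    commands = []
    for remote, letters in groups.items():
        # "[abcd]" is a character class: any one of these first letters.
        commands.append(["rclone", "copy", source, remote,
                         "--include", f"/[{letters}]*/**", "--ignore-case"])
    # Everything whose top-level folder does not start with a listed
    # letter goes to the catch-all remote.
    all_letters = "".join(groups.values())
    commands.append(["rclone", "copy", source, other_remote,
                     "--exclude", f"/[{all_letters}]*/**", "--ignore-case"])
    return commands


groups = {"GdriveAD:": "abcd", "GdriveEH:": "efgh"}
for cmd in plan_commands(r"C:\somefolder", groups, "GdriveOther:"):
    print(" ".join(cmd))
```

To actually run the plan, each list could be passed to `subprocess.run(cmd, check=True)` in sequence, and the script itself scheduled with the OS task scheduler.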

It is correct that there is a 400.000 limit documented. I am not sure if this is a "hard" limit or just a recommendation by Google as I have never had a reason to surpass it yet.

If it is possible for you to divulge some more details about your exact use-case that would be helpful though, because even if this is easy enough to do what you ask - it seems inefficient.
Remember that Gdrive can only start transfers of files at a rate of about 2-3/second, so tons of really small files will always perform slowly.
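To put that rate in perspective, here is a back-of-the-envelope estimate in Python. The 3 million file count is only an assumed stand-in for "a few million", and the calculation ignores bandwidth, retries and daily upload quotas.

```python
def estimated_days(file_count, files_per_second):
    """Days needed if the only limit is how fast new transfers can start."""
    seconds = file_count / files_per_second
    return seconds / 86400  # seconds per day


for rate in (2, 3):
    days = estimated_days(3_000_000, rate)
    print(f"{rate} files/s: about {days:.1f} days")
```

Under those assumptions the initial upload alone would take somewhere around 11.6 to 17.4 days of continuous running.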

Is archiving small files into a larger file perhaps an option, or do they frequently change randomly?
If the data size is small but the amount of files is huge then it might make more sense to script making a zip and uploading that. Even if that results in having to re-upload more data when something changes than what might be strictly necessary, it may still perform significantly better if the data is small but you would be bottlenecked by the new-transfers-per-second limit rather than bandwidth. What solution is viable really depends on the details, so I would like to know more specifics if you are ok with sharing that. That might result in me being able to suggest a superior alternative.

thestigma · March 10, 2020, 3:49am

Oh, and it is also worth mentioning that the multiwrite-union remote (currently in beta testing) would allow you to combine multiple drives and spread the files across them. This would probably be the most user-friendly solution at the cost of a bit more advanced setup (and also the fact that it is not yet tested thoroughly enough to be put into a mainline release).

You can think of this solution as the rclone equivalent of JBOD aka "spanning". If you are unfamiliar with these terms I can go into more depth about it, but a google search will also explain the gist of it.

ncw · March 10, 2020, 10:35am

I'm pretty sure it is a hard limit as we've had people reporting the error on the forum.

asdffdsa · March 12, 2020, 7:16pm

you do not have to use a drive letter for each mount, as there is a hard limit for that.
but you can create as many mounts as you want
rclone mount remote: b:\mount\a

thestigma · March 13, 2020, 3:24am

Yes, good point. If you actually needed to make that many mounts it would be much cleaner to organize as folders.

Please do note that if you use Windows (and therefore WinFSP to support mounting) you need a fairly new release version of WinFSP to support folder mounting as this was added relatively recently. Maybe close to half a year ago now? Not sure. If you use Linux (FUSE) then this is not a concern.

DataHoarder · March 13, 2020, 1:28pm

I realized that some folders start with numbers, too. Creating that many drives seems like such a pain, especially with the 'crypt wrapper' and 'cache wrapper' for each drive. Maybe you can help me think of something.

The files are very small, averaging about 200kb. The problem is that the folders update randomly. It's an archival project, so the copy being backed up is just a backup, not the 'live' copy. However, right now this live copy is my only copy. Being an archival project it would be a bit ironic not backing up.

I would consider making one large, or a few zipped files, but I looked into that and that becomes a bit..cumbersome. No zipping format (I looked namely at tar and gzip) can compress the image and video files, so the compressed file just ends up being a massive file (which I don't have that much storage for as the current live folder is about 2TB). It also takes a very, very long time to even process a few thousand of the files, let alone a few million. Then I would have to do this fairly regularly. Hence why uploading the files separately adds future potential so I don't have to hog the bandwidth limits of the network every time I want to update the backup.

Right now my solution is just to have a folder on Google Drive (not a shared drive). I mounted the gdrive to a letter using RClone Browser. Then I just used the "copy" functionality in Windows file browser to copy groups of folders into the mount. I'm not sure what future errors this will give, or how updating will go.

I genuinely wonder what businesses use to back up their stuff to the cloud? Or are their servers hosted internally?

asdffdsa · March 13, 2020, 1:36pm

if you are doing a simple copy, you might not need the mount command, which is slow.

and if you must use the mount command, you might not need

cache
vfs-cache-mode

thestigma · March 13, 2020, 9:43pm

Enterprises use stuff like Google Cloud Storage (or similar services). Those have very few limitations of any sort - but you also pay-per-use. This may actually be a valid option for you here if you really want fast performance and no limits - like if you wanted to run a business based on this data. The main cost of Gcloud is in egress (when you fetch data from the server), so if this happens in a very limited fashion (not often and/or small total data amount) then it's not actually very costly. It's not at all affordable for regular people who want to store tons of data though (and that's where Gsuite is the affordable option for most of us data-storage enthusiasts).

But anyway.... from your use-case, I agree it's probably not practical to use archiving (due to frequent random changes). Not in any simple-to-make system at least.

I can fairly easily script this backup-system for you and just automatically distribute the files to several shared-drives like you originally suggested. This is not really a problem, even if there is a large character-set in use (i.e. many possibilities for what the first letter/number/special character is). rclone has all the filtering capabilities that are needed for this task, and the rest that is required is just knowing how to use rclone + general scripting knowledge. The only thing you would need to decide is how many 400.000 file shared drives you would want to distribute it across and consider what you may need to cover your current + immediate future needs. (I have actually done something very similar to this before, although for different end-goals).

The second solution is to use the new multiwrite union, but this is not out of testing yet - and unless you have some experience with mergerFS or similar systems it is fairly complex to understand the options you need to set to achieve what you want.
While this is a very nice and flexible/scalable method with many advantages - I'm not sure I would really recommend that you use it for some important backup task at this point. It probably needs at least a few more months to iron out the last few bugs and kinks. There is a good reason it is not yet included in the standard rclone release. (also, it would be perfectly possible to upgrade to such a solution later).

One question I have is: How do you need to be able to access the data?
Is this purely backup - and you can afford to copy back files manually if there was an emergency? (makes it pretty easy)
Or are there any programs that need to be able to directly access the backed-up files on the cloud - and if so, do they need just read, just write, or both? (this would require a fair bit more work).

If you need help with this you can send me a PM and we can talk about the specific details. Otherwise if you just want tips on how you can DIY it I will be happy to give advice, but I think it mostly will come down to the nitty-gritty details of the project here.

DataHoarder · March 14, 2020, 1:06am

You're a real life saver!

Do you know if this method would use the 'crypt' interface? I'm hoping the files are encrypted.

I don't really need access to the files, to be honest. It's a pure backup. I will only access it in the event my live HDD crashes, etc. I would be manually uploading the content every few weeks or so. All the files are organized in folders, so recovery wise I would just download the folder(s) and get rclone to decrypt it. Whether that be through the web Google Drive interface, rClone, or Google Drive File Stream.

I'm not actually sure how to go about the separating into drive parts. Do you know how I could see how many files are in " a* "? It dawned on me that a* might be more than 400,000 files. I doubt it, but maybe!

thestigma · March 14, 2020, 2:04am

This would not really matter.
It would be just as easy to encrypt the files in this case as to not do it. I don't see it being relevant to the problem. Choose as you wish. Do be aware that if we store encrypted then we won't be able to use --track-renames however. This is a smart-function that saves you from having to re-upload a lot of files just because the location (on the cloud) moved. Instead it can identify the files by hash and just server-side move them instead. This is great to have, but on tiny files would be pretty minimal anyway. And besides - your use-case seems unlikely to have the backup files being moved around manually very often.

So TLDR: Encrypt if you wish.

This makes the job easiest to script. Ideal.

That's fine. You can just run the script manually - although it would not be hard to automate this on a schedule either.

Using rclone encryption you would just decrypt-on-the-fly. No need to do it in 2 steps. Via rclone (and the correct crypt remote) the files will be visible (and accessible) as normal even though they are stored on the server in encrypted form. You certainly could download and decrypt locally, but I don't see much point in that.

You should definitely run some tests to get a rough idea...

rclone size C:\path\to\files
will tell you the data-size and number of files you have (or just use your OS tools)

rclone size C:\path\to\files --include a*
will tell you data-size and number of files that start with a

So check a few letters and see what numbers you come up with.
(you certainly can split it up even more if needed, simply by filtering on multiple letters rather than one)
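If checking letter by letter gets tedious, the same numbers can be gathered in one pass with a small script. This is only a sketch in Python, and it assumes the split is by the first character of each top-level folder (as in the original question) rather than by file name. Folders starting with anything other than an ASCII letter are lumped into an "other" bucket, and loose files sitting directly in the root are ignored.

```python
import os
from collections import Counter


def count_files_by_initial(root):
    """Count files under each top-level folder, grouped by the folder's first character."""
    counts = Counter()
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir():
                continue  # skip loose files in the root
            initial = entry.name[0].lower()
            is_letter = initial.isascii() and initial.isalpha()
            bucket = initial if is_letter else "other"
            # Walk the whole folder and add up every file below it.
            for _dirpath, _dirnames, filenames in os.walk(entry.path):
                counts[bucket] += len(filenames)
    return counts
```

Calling `count_files_by_initial(r"C:\path\to\files")` returns a Counter, so `.most_common()` on the result lists the heaviest letters first.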

One problem you might see is that your filenames aren't well distributed among the alphabet. For example you might have 200.000 "a" files but just 20.000 "b" files. Filenames don't really tend to be random after all. You'd have to provision based on the worst-case to avoid problems in the future if the archive might grow.

There would be ways to normalize this distribution but that would certainly add a lot of extra complexity to what would otherwise be a fairly simple task.
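One possible way to even things out, shown here only as a sketch and not as a recommendation, is to stop insisting on one drive per letter and instead pack letters into drives by file count. The Python below uses a simple first-fit-decreasing heuristic. The "a" and "b" counts are the example figures above, "c" and "d" are made-up, and the 300,000 capacity is an assumed value chosen to leave headroom under the 400,000 limit.

```python
def pack_letters(counts, capacity):
    """Group letters into drives so that no drive exceeds `capacity` files.

    First-fit decreasing: take the biggest letters first and put each one
    into the first drive that still has room.
    """
    drives = []  # each drive: {"letters": [...], "total": int}
    for letter, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if n > capacity:
            raise ValueError(f"'{letter}' alone exceeds the capacity; split it further")
        for drive in drives:
            if drive["total"] + n <= capacity:
                drive["letters"].append(letter)
                drive["total"] += n
                break
        else:
            # No existing drive had room, so open a new one.
            drives.append({"letters": [letter], "total": n})
    return drives


example = {"a": 200_000, "b": 20_000, "c": 150_000, "d": 90_000}
for drive in pack_letters(example, capacity=300_000):
    print("".join(drive["letters"]), drive["total"])
```

The resulting letter groups could then be used as multi-letter filters, one per shared drive. The catch is that the grouping has to stay fixed once files are uploaded, otherwise a later run would place the same folder on a different drive.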
