Idea: reverse of chunker to limit the number of remote files

Hi,

Before bothering to raise a GitHub feature request I thought I'd raise it here - I notice that as gdrive team drives become more popular, more people are hitting the 4k-file limit.

My suggestion is a feature that is basically the reverse of the new chunker.
A remote could be populated through an abstraction layer which creates a few very large "files" (containers?) on the remote. Each of these containers would hold an allocation table (allocation files?) mapping the actual content being queried to byte positions within those huge "files".

This would give the ability to store a significantly larger number of files than the artificial limit allows.

The concerns, of course, are that this is probably excessively expensive: pre-populating the large files/containers is costly, and so is retrieving data (having to look into an allocation section, then reference the correct range in another object).

Rough thoughts on layout:

revchunker-container-remote:
revchunked-container-file-001.FAT
revchunked-container-file-001.DATA
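
To make that a bit more concrete, I imagine an allocation-table entry holding something roughly like this (a Go-ish sketch, all names and fields made up):

```go
package revchunker

// FATEntry is a hypothetical allocation-table record stored in a .FAT
// file. Each entry maps a logical path on the wrapped remote to a byte
// range inside one of the big .DATA container files.
type FATEntry struct {
	Path      string // logical path, e.g. "/README.txt"
	Container string // e.g. "revchunked-container-file-001.DATA"
	Offset    int64  // byte offset of this file's content in the container
	Length    int64  // number of bytes belonging to this file
}
```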

Read request for /README.txt:

  • open revchunker-container-remote:revchunked-container-file-*.FAT
  • search it for the container filename and block positions holding /README.txt
  • open revchunked-container-file-001.DATA and retrieve the data from those blocks
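
In rough code that read path might look something like this, building on the FATEntry sketch above (the Remote interface is just a stand-in for whatever backend holds the containers, and a ranged open is assumed):

```go
package revchunker

import (
	"io"
	"io/fs"
)

// Remote is a stand-in for the backend the containers live on; the read
// path only needs a ranged open on the big .DATA objects.
type Remote interface {
	OpenRange(name string, offset, length int64) (io.ReadCloser, error)
}

// ReadFile resolves a logical path through the allocation table and then
// streams just that byte range out of the container, so /README.txt
// never exists as its own object on the remote.
func ReadFile(r Remote, fat map[string]FATEntry, path string) (io.ReadCloser, error) {
	entry, ok := fat[path]
	if !ok {
		return nil, fs.ErrNotExist
	}
	return r.OpenRange(entry.Container, entry.Offset, entry.Length)
}
```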

Write for /README.txt:

  • open revchunker-container-remote:revchunked-container-file-*.FAT
  • search it for the container filename and block positions holding /README.txt
  • can the new size fit in the existing blocks? If not, create a new revchunked-container-file and write there
  • update the file allocation table with the associated blocks
  • clear the original blocks if the file was re-allocated to a different container file because of its larger/smaller size
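
And the write path, equally hand-wavy - it extends the Remote interface from the read sketch and assumes the backend can rewrite a byte range in place and append to an existing object, neither of which may actually be true:

```go
package revchunker

import "fmt"

// WriterRemote adds the two operations the write path would need; both
// are assumptions about what the underlying backend can actually do.
type WriterRemote interface {
	Remote
	WriteRange(name string, offset int64, data []byte) error   // rewrite a range in place
	Append(name string, data []byte) (offset int64, err error) // grow a container file
}

// WriteFile follows the bullets above: reuse the existing blocks if the
// new data fits, otherwise append to the given container and repoint the
// allocation table at the new range (freeing the old blocks is left out).
func WriteFile(r WriterRemote, fat map[string]FATEntry, path, container string, data []byte) error {
	entry, ok := fat[path]
	if ok && int64(len(data)) <= entry.Length {
		if err := r.WriteRange(entry.Container, entry.Offset, data); err != nil {
			return fmt.Errorf("rewrite in place: %w", err)
		}
		entry.Length = int64(len(data))
	} else {
		offset, err := r.Append(container, data)
		if err != nil {
			return fmt.Errorf("append to container: %w", err)
		}
		entry = FATEntry{Path: path, Container: container, Offset: offset, Length: int64(len(data))}
	}
	fat[path] = entry // the updated table then has to be written back to the .FAT object
	return nil
}
```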

Thoughts? Or is it slightly too crazy and expensive?

You could make a read-only file system like this quite easily...

However, most cloud storage systems don't let you update existing files, which will cause a real problem for this type of scheme.

So the answer to this is to chunk the data... However, you then have to trade off the chunk size:

  • bigger chunks mean fewer of them, and sequentially reading lots of data will be fast
  • smaller chunks mean less data to read for updates

I was thinking chunks of about 64k to 1M are probably the right size.
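
For a sense of scale, a 10GB file would be roughly 160,000 chunks at 64k but only about 10,000 at 1M, while a small update means fetching and re-uploading a 64k chunk in the first case and a full 1M one in the second.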

It would be possible to add a bit more to the chunker backend to make something like this happen.

I did make a prototype of this idea. You could create an ext4 filing system on it and write data like that. It was really, really slow though!

Thanks, interesting, I did not realise updating a file is such a pain.

I was leaning towards a solution that chunks/containers the data on the remote into very large files (~100GB?) to essentially get rid of those file-count limits. I suspect 64k-1M 'chunks' will quickly hit the 4K file limit for most uses.

The latter item about ext4 is an interesting idea - lots of ext4 images on a remote, somehow loopback mounted and striped - sounds like dangerous and slow fun :slight_smile:

PS: Just re-read - it's a 400k file limit for gdrive team drives, still an issue but not as constrained.

Just to clarify: it's 400,000 files and folders, not 4,000.

Hmm, yes... 400k files * 1MB is 400GB, so big but not massive. Increasing the chunk size is possible of course, but if you are running a file system (ext4 say) on it then you'll spend a lot of time downloading 1MB chunks to update 512 bytes in them before uploading them again.
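
The update is essentially a read-modify-write of the whole chunk - roughly this sketch, where ChunkRemote stands in for a backend that can only get and put whole objects:

```go
package sketch

import "fmt"

// ChunkRemote is a stand-in for a backend that can only get and put
// whole objects, which is the usual case for cloud storage.
type ChunkRemote interface {
	Download(name string) ([]byte, error)
	Upload(name string, data []byte) error
}

// patchChunk shows why small writes hurt: to change len(patch) bytes we
// fetch the whole chunk, patch it in memory, and upload the whole thing
// again - a full 1MB each way for a 512-byte change.
func patchChunk(r ChunkRemote, name string, offset int64, patch []byte) error {
	chunk, err := r.Download(name)
	if err != nil {
		return fmt.Errorf("download %s: %w", name, err)
	}
	copy(chunk[offset:], patch) // the small in-place change
	return r.Upload(name, chunk)
}
```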

Not being able to update existing files is a pain! You could use a log-structured file system to work around it, but then you'd need a compaction phase at some point, which would be pretty expensive in terms of network operations.
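
Very roughly, the log-structured idea looks like this: writes only ever append a new record, the index is repointed at it, and compaction periodically copies the records still in use into a fresh container (all names made up for illustration):

```go
package sketch

// logRecord is one appended entry in a log-structured layout: every
// write adds a new record to the end of the current container and the
// index is repointed at it, so nothing is ever rewritten in place.
type logRecord struct {
	Path string
	Data []byte // nil acts as a tombstone for a deleted file
}

// compact keeps only the records the index still points at. On a real
// remote this means downloading all the live data and uploading it into
// a fresh container, which is the expensive part.
func compact(index map[string]logRecord) []logRecord {
	live := make([]logRecord, 0, len(index))
	for _, rec := range index {
		if rec.Data != nil {
			live = append(live, rec)
		}
	}
	return live
}
```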