Optimize Google Drive storage by not creating directories

Because of how Google Drive stores files, it's very inefficient to look up a file by path: resolving foo/bar/baz requires searching for folders named foo to find their ID, then searching for folders named bar that have that ID as a parent, and finally searching for files named baz with the ID returned for bar as a parent.
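
To make the cost concrete, here's a rough Go sketch of those three chained lookups against the Drive v3 API. The credentials file name is a placeholder and error handling is minimal:

```go
// One files.list round trip per path component: root -> foo -> bar -> baz.
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/api/drive/v3"
	"google.golang.org/api/option"
)

// findChild returns the ID of the entry named `name` under `parentID`.
func findChild(srv *drive.Service, parentID, name string) (string, error) {
	q := fmt.Sprintf("name = '%s' and '%s' in parents and trashed = false", name, parentID)
	r, err := srv.Files.List().Q(q).Fields("files(id)").Do()
	if err != nil {
		return "", err
	}
	if len(r.Files) == 0 {
		return "", fmt.Errorf("%q not found under %s", name, parentID)
	}
	return r.Files[0].Id, nil
}

func main() {
	ctx := context.Background()
	// "credentials.json" is a hypothetical service-account file.
	srv, err := drive.NewService(ctx, option.WithCredentialsFile("credentials.json"))
	if err != nil {
		log.Fatal(err)
	}
	id := "root"
	for _, part := range []string{"foo", "bar", "baz"} {
		if id, err = findChild(srv, id, part); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("ID of foo/bar/baz:", id)
}
```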

I was wondering if you'd consider adding an optional optimization (behind a flag) where rclone simply never stores directories upstream? Specifically, store everything in a single (root) directory and put the whole path in the file name. The only disadvantage would be that empty folders wouldn't be stored, but I think that's not a big deal in most cases, and the performance gains could be significant.
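
For illustration, a minimal sketch of the flat naming idea. The escaping character here is arbitrary; a real implementation would need an encoding that Drive accepts and that round-trips unambiguously:

```go
// Keep every object in one root folder and encode the full path in the name.
package main

import (
	"fmt"
	"strings"
)

// flatten encodes a path like "foo/bar/baz" into a single file name.
// The separator is illustrative: any character that cannot occur in a
// path segment would do.
func flatten(path string) string {
	return strings.ReplaceAll(path, "/", "\u2044") // U+2044 FRACTION SLASH
}

// unflatten reverses the encoding when listing the flat root folder.
func unflatten(name string) string {
	return strings.ReplaceAll(name, "\u2044", "/")
}

func main() {
	fmt.Println(flatten("foo/bar/baz"))    // one object, no folder lookups
	fmt.Println(unflatten(flatten("a/b"))) // "a/b"
}
```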

Yes, you are right.

Quite a lot of the cloud backends work like this. Rclone has a module called dircache which caches the name-to-ID lookups to try to help with this.
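
Roughly, the idea behind such a cache looks like this (an illustrative sketch, not rclone's actual dircache code):

```go
// Memoize directory-path -> ID resolutions so each chain of folder
// queries is paid for only once.
package main

import (
	"fmt"
	"sync"
)

type DirCache struct {
	mu     sync.Mutex
	ids    map[string]string                 // path -> ID
	lookup func(path string) (string, error) // remote resolution, e.g. the chained queries above
}

func NewDirCache(lookup func(string) (string, error)) *DirCache {
	return &DirCache{ids: map[string]string{"": "root"}, lookup: lookup}
}

// FindDir returns the cached ID for path, resolving and caching it on a miss.
func (dc *DirCache) FindDir(path string) (string, error) {
	dc.mu.Lock()
	defer dc.mu.Unlock()
	if id, ok := dc.ids[path]; ok {
		return id, nil
	}
	id, err := dc.lookup(path)
	if err != nil {
		return "", err
	}
	dc.ids[path] = id
	return id, nil
}

func main() {
	calls := 0
	dc := NewDirCache(func(path string) (string, error) {
		calls++ // stand-in for the real remote lookups
		return "id-of-" + path, nil
	})
	dc.FindDir("foo/bar")
	dc.FindDir("foo/bar")                  // second call is served from the cache
	fmt.Println("remote lookups:", calls) // 1
}
```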

Interesting idea...

Listing the directory would be very time-consuming though - are you thinking that rclone should cache it locally?

This could be implemented as an overlay backend, as I don't think it is Google Drive specific.

Quite a lot of the cloud backends work like this.

Interesting, I thought this was Google Drive specific. In that case, a separate overlay backend does make more sense. I also realized that this is actually the default behavior of bucket-based remotes, according to the rclone mount docs:

The bucket based remotes (eg Swift, S3, Google Compute Storage, B2, Hubic) do not support the concept of empty directories, so empty directories will have a tendency to disappear once they fall out of the directory cache.

So I guess this could just be a generic way to treat a remote as bucket-based even if it isn't one by nature.

Listing the directory would be very time-consuming though - are you thinking that rclone should cache it locally?

Yes, but such a cache can be maintained very efficiently. Currently, rclone needs at least <number of directories> requests to build a full file tree. No directories would mean needing only <number of files> / 1000 requests (since 1000 is the maximum number of files returned per request, IIRC), which would already probably be very useful for rclone sync.
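
Sketching that arithmetic in Go: a flat root folder can be listed in ceil(N / 1000) paginated calls. This assumes an authenticated *drive.Service; the field selection is illustrative:

```go
package flatlist

import (
	"fmt"

	"google.golang.org/api/drive/v3"
)

// ListAll fetches every file directly under rootID, one page of up to
// 1000 entries per request.
func ListAll(srv *drive.Service, rootID string) ([]*drive.File, error) {
	var all []*drive.File
	call := srv.Files.List().
		Q(fmt.Sprintf("'%s' in parents and trashed = false", rootID)).
		PageSize(1000). // the maximum page size for files.list
		Fields("nextPageToken, files(id, name, modifiedTime)")
	for {
		r, err := call.Do()
		if err != nil {
			return nil, err
		}
		all = append(all, r.Files...)
		if r.NextPageToken == "" {
			return all, nil
		}
		call.PageToken(r.NextPageToken)
	}
}
```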

Also, updating the cache subsequently can be done in a single request: if you sort by modification date when you request the list of files, you only need to read entries until you reach ones older than the last update. With this, the cache could potentially be very long-lived without needing a full rebuild. The only ill effect that comes to mind is that deleted files would still appear in listings (but they wouldn't be readable anyway).

Interesting insight...

I'll just note that if you are doing a recursive traversal, rclone will use a fancy listing algorithm on Google Drive which typically does about 1/10th of the requests: the so-called --fast-list.
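
The trick that makes a cheap recursive listing possible on drive is that several parent IDs can be combined into a single files.list query, so whole levels of the tree are fetched in one request rather than one request per directory. A rough illustration of the idea behind --fast-list (rclone's real implementation differs in detail):

```go
package flatlist

import (
	"strings"

	"google.golang.org/api/drive/v3"
)

// ListLevel lists the children of many directories in one query by
// OR-ing the parent clauses together.
func ListLevel(srv *drive.Service, parentIDs []string) ([]*drive.File, error) {
	// Build "'id1' in parents or 'id2' in parents or ..."
	terms := make([]string, len(parentIDs))
	for i, id := range parentIDs {
		terms[i] = "'" + id + "' in parents"
	}
	q := "(" + strings.Join(terms, " or ") + ") and trashed = false"
	var all []*drive.File
	call := srv.Files.List().Q(q).PageSize(1000).
		Fields("nextPageToken, files(id, name, parents, mimeType)")
	for {
		r, err := call.Do()
		if err != nil {
			return nil, err
		}
		all = append(all, r.Files...)
		if r.NextPageToken == "" {
			return all, nil
		}
		call.PageToken(r.NextPageToken)
	}
}
```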

Another great idea! The modified date can be set by the client, so it would need to be the creation date, I think. I just checked the docs (https://developers.google.com/drive/api/v3/reference/files/list) and it looks possible, though the docs have a caveat about 1,000,000 files...
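
A sketch of that incremental query using createdTime, which (unlike modifiedTime) the client can't set. The lastSync bookkeeping is hypothetical; one page suffices unless more than 1000 files appeared since the last update:

```go
package flatlist

import (
	"fmt"
	"time"

	"google.golang.org/api/drive/v3"
)

// NewSince returns files created under rootID after lastSync, newest first.
func NewSince(srv *drive.Service, rootID string, lastSync time.Time) ([]*drive.File, error) {
	q := fmt.Sprintf("'%s' in parents and createdTime > '%s' and trashed = false",
		rootID, lastSync.UTC().Format(time.RFC3339))
	r, err := srv.Files.List().
		Q(q).
		OrderBy("createdTime desc").
		PageSize(1000).
		Fields("nextPageToken, files(id, name, createdTime)").
		Do()
	if err != nil {
		return nil, err
	}
	return r.Files, nil
}
```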

Note that drive has the changes API for receiving updates, which works quite well.
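
A minimal sketch of polling the changes API: fetch a start page token once, then list changes from it. Again this assumes an authenticated service, and error handling is trimmed:

```go
package flatlist

import (
	"fmt"

	"google.golang.org/api/drive/v3"
)

// StartToken returns the token marking "now" in the change log.
func StartToken(srv *drive.Service) (string, error) {
	r, err := srv.Changes.GetStartPageToken().Do()
	if err != nil {
		return "", err
	}
	return r.StartPageToken, nil
}

// Poll prints all changes since token and returns the token to use next time.
func Poll(srv *drive.Service, token string) (string, error) {
	for {
		r, err := srv.Changes.List(token).
			Fields("nextPageToken, newStartPageToken, changes(fileId, removed)").
			Do()
		if err != nil {
			return "", err
		}
		for _, c := range r.Changes {
			fmt.Println(c.FileId, "removed:", c.Removed)
		}
		if r.NewStartPageToken != "" {
			return r.NewStartPageToken, nil // caught up; save for the next poll
		}
		token = r.NextPageToken
	}
}
```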

If you are using a mount, you can load the listing into memory with the vfs/refresh API call (which uses fast list), and it is cached from there on.
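
For example, the refresh can be triggered over the rc HTTP API. This sketch assumes a mount started with the rc server enabled (rclone mount --rc ...) on the default localhost:5572 with no rc authentication configured; it is equivalent to running `rclone rc vfs/refresh recursive=true`:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// POST the vfs/refresh command to the rc server with its parameters as JSON.
	resp, err := http.Post(
		"http://localhost:5572/vfs/refresh",
		"application/json",
		strings.NewReader(`{"recursive": "true"}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // the rc server replies with a JSON status
}
```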

Soon you'll be able to run normal sync commands through this cache too...
