Optimizing parallel random reads on a mounted Google Drive

What is the problem you are having with rclone?

I need to optimize reads of a large 50GB file on Google Drive mounted as a drive. The read is done in parallel at 64 random locations in the file, with just 8-16K read at each location, totaling <1MB of data.
At first, just mounting the drive crashed; the issue was --timeout, so I set it to 1000s and now it works, but the time needed is still not great: around 30s to complete the read.

I tried --drive-chunk-size 256K and it did not help. Looking at the other settings, I cannot think of any that would help.

I know that Google Drive allows parallel reads of big files, and that seems to work, but I guess the issue is the minimal chunk of 256K? I only need to read 8-16K per location, at up to 64 locations at once, or as fast as possible.
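To make the access pattern concrete, here is a rough sketch of what the application is doing: 64 concurrent 16K reads at random byte offsets. The file path and sizes are illustrative only (a local temp file stands in for the 50GB file on the mount), and this assumes GNU dd for the iflag=skip_bytes option:

```shell
#!/usr/bin/env bash
# Sketch of the described read pattern: 64 parallel 16K reads at random offsets.
# A small local temp file stands in for the big file on the Google Drive mount.
FILE=$(mktemp)
dd if=/dev/zero of="$FILE" bs=1M count=64 status=none  # stand-in for the real file

for i in $(seq 64); do
  # iflag=skip_bytes makes "skip" a byte offset instead of a count of 16K blocks;
  # offsets are aligned to 16K here just to stay inside the 64MB stand-in file.
  dd if="$FILE" of=/dev/null bs=16K count=1 \
     skip=$(( (RANDOM % 4096) * 16384 )) iflag=skip_bytes status=none &
done
wait
echo "64 reads done"
rm -f "$FILE"
```

Against a cloud-backed mount, each of those dd invocations is a separate open/seek/read/close cycle, which is where the latency adds up.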

What is your rclone version (output from rclone version)


latest fuse driver

Which OS you are using and how many bits (eg Windows 7, 64 bit)

ubuntu 20.04 64bit

Which cloud storage system are you using? (eg Google Drive)

Google drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)


rclone mount gog1: /gog1 --daemon --timeout 3600

--drive-chunk-size is only used for uploads.

I don't think there is any magic here: if you are randomly seeking in a file, you have to deal with cloud latency.

If you post a debug log, we can see what's going on and if anything is not working properly.

Here is my log; it's quite large. I don't think there is an issue with rclone, I just want to optimize if possible. During my test, which lasts ~1 minute 16s, the network hits 80MB/s. The issue is that the data that actually needs to be read is at most ~5MB, in small chunks of 8-16KB, as if it were being read from an HDD; but I know Google Drive is not the same, and I guess the file is chunked and accessed in 256K minimum units.

rclon.log (1.2 MB)

From the log, I can see your reads are opening and closing the file a lot.

grep 'Flush: err=<nil>' rclon.log | wc -l


grep OpenReadOnly rclon.log | wc -l

You may want to check out vfs-cache-mode full, as it uses sparse files to cache what you read from the file. If you are only reading chunks, it won't use much data locally, and it should help your use case.
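For reference, a mount along those lines might look like the following. This is a sketch only: the remote name and paths are taken from this thread, and the cache limits are illustrative values you would tune to your own disk budget (--vfs-cache-max-size and --vfs-cache-max-age are the rclone flags that cap the cache):

```shell
# Hedged example: same mount as above, with the full VFS cache enabled.
# "gog1:" /gog1 and the size/age limits are illustrative.
rclone mount gog1: /gog1 \
  --daemon \
  --timeout 3600s \
  --vfs-cache-mode full \
  --vfs-cache-max-size 750G \
  --vfs-cache-max-age 24h
```

Because the cache files are sparse, only the byte ranges actually read get stored on disk, which is why the apparent size of the cache can far exceed its real usage.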

So in my case, I limit the cache size to 750GB, and you can see the difference between the reported (apparent) size and the actual use:

root@gemini:/cache# du -sh
674G	.
root@gemini:/cache# du -sh --apparent-size
1.1T	.

Yes, the test is run 5x at once, as I cannot run just 1 test. Each test reads a small 8-16K chunk at random 64-72 times; that's why the file is read 198 times. In the real use case it would only be read 64-72 times, once every 4-5h, but each time at a different place in the file. I think only the header of the file is read, up to 8 times; the remaining 64 reads are random within the ~100GB file.

vfs-cache-mode full
I'm not sure if it would help, as it will never read the same chunks from the file... maybe only the file header, but I still don't know what part/size of the header that is (1MB or more, no idea at the moment).

Each file open and close deals with latency, as that's the challenge with any cloud-based storage. If you aren't ever reading the same piece of data, you'll always pay that time to open/seek/read/close, and repeat the process many times, if that's how it is supposed to work.

There aren't any flags to deal with that as you can't really tune latency.

Yeah, I suppose it is like that. Still, if the read is done in parallel, does the file need to be opened and closed each time?
Maybe the application I use can be optimized, but they say it was made for the Backblaze B2 cloud, and I wanted to try it with Google Drive.

The application being used would decide if it opens or closes the file as that's not rclone.

gdrive has high latency for random access reads.
Perhaps read this post, which discusses random access for gdrive compared to wasabi.

Yeah, whatever I try is the same or worse than the default mount each time...
I kinda wish they had made a mistake so I could make it better :grin:

Perhaps you can get a free trial at wasabi and try comparing that to gdrive.

but as Animosity022 wrote,
The application being used would decide if it opens or closes the file as that's not rclone.

Hello Reborn,

Are you able to tune the read performance? I am having the same limitation, but I am using OneDrive instead of Google Drive. The only way I can get it very fast is with an Azure file share, which has very low-latency IO.

That would be my last option, as the Azure file share is not cost-efficient for my large files.

I did not; I accepted that gdrive is what it is... I might test more later when I get time.

Alright, if I have any new findings on improvements, I shall post them here. Thanks.