Reading files in S3 via rclone mount

What is the problem you are having with rclone?

I'm only seeing rclone mount use a single connection to S3 when reading files on the mounted drive.

What is your rclone version (output from rclone version)

1.54.1

Which OS you are using and how many bits (eg Windows 7, 64 bit)

64 bit Windows Server 2019

Which cloud storage system are you using? (eg Google Drive)

S3

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone mount s3:genie-pse-inbound-stg z: --vfs-cache-mode full --no-modtime --network-mode --multi-thread-cutoff=1 --multi-thread-streams=4 --transfers=4 -vv

The rclone config contents with secrets removed.

[s3]
type = s3
provider = AWS
env_auth = true
region = us-west-2
server_side_encryption = aws:kms
sse_kms_key_id = REDACTED

A log from the command with the -vv flag

2021/03/23 22:08:46 DEBUG : rclone: Version "v1.54.1" starting with parameters ["c:\\rclone\\rclone.exe" "mount" "s3:genie-pse-inbound-stg" "z:" "--vfs-cache-mode" "full" "--no-modtime" "--network-mode" "--multi-thread-cutoff=1" "--multi-thread-streams=4" "--transfers=4" "-vv"]
2021/03/23 22:08:46 DEBUG : Creating backend with remote "s3:genie-pse-inbound-stg"
2021/03/23 22:08:46 DEBUG : Using config file from "C:\\Users\\Administrator\\.config\\rclone\\rclone.conf"
2021/03/23 22:08:46 INFO  : S3 bucket genie-pse-inbound-stg: poll-interval is not supported by this remote
2021/03/23 22:08:46 DEBUG : vfs cache: root is "\\\\?\\C:\\Users\\Administrator\\AppData\\Local\\rclone\\vfs\\s3\\genie-pse-inbound-stg"
2021/03/23 22:08:46 DEBUG : vfs cache: metadata root is "\\\\?\\C:\\Users\\Administrator\\AppData\\Local\\rclone\\vfs\\s3\\genie-pse-inbound-stg"
2021/03/23 22:08:46 DEBUG : Creating backend with remote "\\\\?\\C:\\Users\\Administrator\\AppData\\Local\\rclone\\vfs\\s3\\genie-pse-inbound-stg"
2021/03/23 22:08:46 DEBUG : fs cache: renaming cache item "\\\\?\\C:\\Users\\Administrator\\AppData\\Local\\rclone\\vfs\\s3\\genie-pse-inbound-stg" to be canonical "//?/C:/Users/Administrator/AppData/Local/rclone/vfs/s3/genie-pse-inbound-stg"
2021/03/23 22:08:46 DEBUG : fs cache: switching user supplied name "\\\\?\\C:\\Users\\Administrator\\AppData\\Local\\rclone\\vfs\\s3\\genie-pse-inbound-stg" for canonical name "//?/C:/Users/Administrator/AppData/Local/rclone/vfs/s3/genie-pse-inbound-stg"
2021/03/23 22:08:46 DEBUG : vfs cache: looking for range={Pos:0 Size:421} in [{Pos:0 Size:421}] - present true
2021/03/23 22:08:46 DEBUG : Network mode mounting is enabled
2021/03/23 22:08:46 DEBUG : Mounting on "z:" ("\\server\\s3 genie-pse-inbound-stg")
2021/03/23 22:08:46 DEBUG : S3 bucket genie-pse-inbound-stg: Mounting with options: ["-o" "attr_timeout=1" "-o" "uid=-1" "-o" "gid=-1" "--FileSystemName=rclone" "--VolumePrefix=\\server\\s3 genie-pse-inbound-stg"]
2021/03/23 22:08:46 INFO  : vfs cache: cleaned: objects 5 (was 5) in use 1, to upload 1, uploading 0, total size 50.882M (was 50.882M)
2021/03/23 22:08:46 DEBUG : S3 bucket genie-pse-inbound-stg: Init:
2021/03/23 22:08:46 DEBUG : S3 bucket genie-pse-inbound-stg: >Init:
2021/03/23 22:08:46 DEBUG : /: Statfs:
2021/03/23 22:08:46 DEBUG : /: >Statfs: stat={Bsize:4096 Frsize:4096 Blocks:274877906944 Bfree:274877906944 Bavail:274877906944 Files:1000000000 Ffree:1000000000 Favail:0 Fsid:0 Flag:0 Namemax:255}, errc=0
2021/03/23 22:08:46 DEBUG : /: Getattr: fh=0xFFFFFFFFFFFFFFFF
2021/03/23 22:08:46 DEBUG : /: >Getattr: errc=0
2021/03/23 22:08:46 DEBUG : /: Readlink:
2021/03/23 22:08:46 DEBUG : /: >Readlink: linkPath="", errc=-40
The service rclone has been started.

I'm trying to adapt a legacy Windows app to the cloud, and the media I need to read is on S3. As long as I use a network mount, it seems to work fine. However, performance isn't great, as it only uses one connection to S3. I've tried many combinations of arguments but can't ever seem to get it to use more than one connection at a time.

I'm running this in AWS EC2, so I know I can get very high performance reading from S3 over multiple connections. The instance has a burstable 10 Gbps NIC, and I'd ideally like to get at least 1 Gbps aggregate. Unfortunately, with one connection I can only get about 300 Mbps reliably.

I can do very fast multipart downloads with the AWS CLI, but in this case I can't trigger a download action; I really need to present a file system to the app so it can take whatever action it needs with the best storage performance I can muster. I'm hoping to arrive at an rclone mount command that will aggressively prefetch any requested file over several connections to S3, so the files are warm on disk for the app to read as quickly as it pleases.
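For reference, this is the sort of AWS CLI transfer that gets me full multipart speed (the bucket, key, and concurrency value here are just placeholders):

aws configure set default.s3.max_concurrent_requests 20
aws s3 cp s3://my-bucket/path/to/file.bin C:\scratch\file.bin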

Thanks in advance for your time!

Currently --vfs-cache-mode full uses as many connections as there are open file handles (handwaving), so if you open the file and read it sequentially it will only be read with one stream.

There is a project underway to change this which will hopefully be ready for 1.56 (not for 1.55).

You'll find that if you use rclone copy it too will use multiple connections - it benchmarks at least as fast as the AWS CLI :slight_smile:
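For example, something along these lines (bucket name and flag values are just placeholders to adapt) will download a single large object in several streams:

rclone copy s3:my-bucket/path/bigfile.bin C:\scratch --multi-thread-streams=8 --multi-thread-cutoff=250M --transfers=4 -P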

Do you have a list of files you want pre-loaded?

You can simulate multiple read streams to warm the file up in the cache. Say you have a 4G file: you'd set one process reading the first 1G, a second reading the next 1G, and so on. This will download the file in four streams.

You could do this very simply with dd:

dd if=/mnt/4Gfile of=/dev/null skip=0 count=1024 bs=1M &
dd if=/mnt/4Gfile of=/dev/null skip=1024 count=1024 bs=1M &
dd if=/mnt/4Gfile of=/dev/null skip=2048 count=1024 bs=1M &
dd if=/mnt/4Gfile of=/dev/null skip=3072 count=1024 bs=1M &

You can even do this while your app is using the file system...
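If you wanted to generalise that to any file size, a small shell sketch (the file path and stream count are assumptions you'd adjust; stat -c%s is the GNU form) can work out the chunk size automatically:

# Warm FILE into the VFS cache using N parallel range reads
FILE=/mnt/4Gfile
N=4
SIZE_MB=$(( $(stat -c%s "$FILE") / 1048576 ))
CHUNK=$(( (SIZE_MB + N - 1) / N ))   # ceil(size / N) in MiB
for i in $(seq 0 $((N - 1))); do
  dd if="$FILE" of=/dev/null bs=1M skip=$((i * CHUNK)) count=$CHUNK &
done
wait   # block until all streams have finished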

Thinking aloud - I could also add a vfs/prefetch rc (API) call which would prefetch a file using the above technique - you could say how many streams you want it downloaded in.
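Purely as a sketch - vfs/prefetch doesn't exist yet, so the command name and parameters here are hypothetical - it might be invoked something like:

rclone rc vfs/prefetch file=path/to/file streams=4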

Thanks so much for the response!

Yeah, the app seems to just read sequentially. Glad to hear some clever stuff is in the works. Looking forward to trying that!

The way the larger system is orchestrated, I don't ever know precisely where a task will execute, so I don't know which file I'll need before the app starts reading it. This is a distributed workflow engine with n worker nodes using a message queue to perform tasks, and due to legacy limitations there's no way (currently) to interrupt the flow to inject a command to download from S3 (or any other command, for that matter).

OK it sounds like the new project will be just the thing for you. Watch this space!

