Read ahead on S3

dprestegard · June 29, 2022, 6:11pm

This is related to a post I made a while back.

What is the problem you are having with rclone?

I'm using rclone to mount an S3 bucket to a Linux instance in AWS EC2 (to support an app that requires a POSIX mount), and specifically trying to read a large file (hundreds of GB). The app is a video encoder and is likely issuing a lot of relatively small read IOs.

I can get okay performance (approx 900 Mbps), but this is on a very big instance with a 25 Gbps network interface and a lot of cores. This amount of network traffic is only enough to keep about 8 vCPUs busy.
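For a sense of scale, here is a rough back-of-the-envelope calculation. The 300 GB file size is a hypothetical stand-in for "hundreds of GB"; the 900 Mbps and 25 Gbps figures are the ones quoted above. It uses integer arithmetic, so results are rounded down.

```shell
#!/bin/sh
# Seconds needed to read a file at a given sustained rate.
# $1 = file size in GB (decimal, 10^9 bytes), $2 = throughput in Mbps.
transfer_seconds() {
    echo $(( $1 * 8 * 1000 / $2 ))
}

# Percentage of a network link that a given throughput uses.
# $1 = throughput in Mbps, $2 = link speed in Gbps.
link_utilisation_pct() {
    echo $(( $1 * 100 / ($2 * 1000) ))
}

transfer_seconds 300 900      # hypothetical 300 GB file at ~900 Mbps -> 2666
link_utilisation_pct 900 25   # ~900 Mbps on a 25 Gbps interface    -> 3
```

In other words, at this rate a file of that size takes roughly 45 minutes to stream, while the interface sits at a few percent of its capacity.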

I think I'd like rclone to read ahead more aggressively via multi-part download (I've got 32 vCPUs to use).

When I use AWS EFS (basically a managed NFS server) I can use almost all my vCPUs and sustain several Gbps, so I'd like to match that if possible. I know S3 can go this fast as well since a simple multi-part download using the AWS CLI can sustain several Gbps.
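To illustrate what "multi-part download" means here, the sketch below splits an object into fixed-size byte ranges that could each be fetched by a separate ranged GET in parallel. The function name `byte_ranges` and the sizes are hypothetical and only for illustration; the AWS CLI does its own splitting internally, and this is not a description of how rclone schedules its reads.

```shell
#!/bin/sh
# Print one HTTP Range value per part for an object of $1 bytes,
# split into parts of at most $2 bytes each.
byte_ranges() {
    size=$1 part=$2 start=0
    while [ "$start" -lt "$size" ]; do
        end=$(( start + part - 1 ))
        # The last part is clamped to the end of the object.
        [ "$end" -ge "$size" ] && end=$(( size - 1 ))
        echo "bytes=${start}-${end}"
        start=$(( end + 1 ))
    done
}

# Toy example: a 1000-byte object in 400-byte parts.
byte_ranges 1000 400

# Each range could then be fetched concurrently, for example (assumed
# variable names, not run here):
#   aws s3api get-object --bucket "$BUCKET" --key "$KEY" \
#       --range "bytes=0-399" part.0 &
```

The point is that throughput scales with the number of ranges in flight, which is why a single sequential reader tends to fall well short of what S3 can deliver.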

Run the command 'rclone version' and share the full output of the command.

rclone v1.58.1

os/version: amazon 2 (64 bit)
os/kernel: 4.14.281-212.502.amzn2.x86_64 (x86_64)
os/type: linux
os/arch: amd64
go/version: go1.17.9
go/linking: static
go/tags: none

Which cloud storage system are you using? (eg Google Drive)

AWS S3

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

rclone mount aws_s3:${bucket} /s3/rclone/${bucket} --no-modtime --read-only --vfs-cache-mode full --vfs-read-ahead 256Mi &

The rclone config contents with secrets removed.

[aws_s3]
type = s3
provider = AWS
env_auth = true
region = ${REGION}
location_constraint = ${REGION}

A log from the command with the `-vv` flag

Not sure which logs would be useful

asdffdsa · June 29, 2022, 6:16pm

hi,

about that app, to process a video

does the app need to download 100% of each video, or what percentage of the file?

dprestegard · June 29, 2022, 6:18pm

No, it streams the data in as it needs it. It starts the encoding process right away. I think it just has a single thread doing IO, and probably using relatively small chunks like 64 KB or something. I have no idea how to find out tho
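One way to find out the read sizes, as a sketch: on a test machine with shell access, attach `strace` to the encoder and tally the sizes its read calls return. The helper name `summarise_reads`, the log path, and the `ENCODER_PID` variable are assumptions for illustration. The helper only counts calls that complete on a single log line, so calls that strace splits into "unfinished"/"resumed" pairs are skipped.

```shell
#!/bin/sh
# Summarise read sizes from an strace log on stdin.
# Prints "<bytes returned> <number of calls>" per distinct size, smallest first.
summarise_reads() {
    awk '/^([0-9]+ +)?p?read(64)?\(/ && $NF ~ /^[0-9]+$/ { n[$NF]++ }
         END { for (s in n) print s, n[s] }' | sort -n
}

# Capture the encoder's read calls for a while (stop with Ctrl-C):
#   strace -f -e trace=read,pread64 -p "$ENCODER_PID" -o /tmp/encoder-reads.log
# Then summarise the log:
#   summarise_reads < /tmp/encoder-reads.log
```

If most calls come back at something like 65536 bytes, that would support the guess of small 64 KB reads from a single IO thread.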

asdffdsa · June 29, 2022, 6:28pm

sometimes, if i need to process a set of files in an rclone mount,
then i pre-load the files into the vfs file cache,

something like
rclone md5sum /s3/rclone/${bucket}/file.ext

dprestegard · June 29, 2022, 6:38pm

Unfortunately I can't issue on-demand commands to the machine in question due to the control plane architecture. I can issue commands during boot-up (like, mounting the buckets I know I'll need), but once the system is running my only interaction with it is through a REST API. Long story.

Thanks for the suggestion tho

asdffdsa · June 29, 2022, 6:50pm

ok, understood.

rclone rcd: Run rclone listening to remote control commands only

rclone rc (remote control): Rclone implements a simple HTTP based protocol

edit: i just realized that you were the one that started that other topic
i should have read that topic in more detail before i posted.....

ncw · June 29, 2022, 9:59pm

I had a sponsorship deal to implement this, but unfortunately it fell through.

Maybe your company would be interested in picking it up?
