S3 ListObjects[V2] performance optimization for flat structures

Motivation

Imagine you have a bucket in S3 with millions of keys under the same prefix:

source-bucket/a/b/c/2020-01-01-10-00-00.log
source-bucket/a/b/c/2020-01-01-10-00-01.log
source-bucket/a/b/c/2020-01-01-10-00-02.log
... 
source-bucket/a/b/c/2022-12-31-10-00-00.log
...
source-bucket/a/b/c/2024-12-31-10-00-00.log

You want to copy all the objects for December 2022, so you need to filter keys by the prefix "2022-12-".

Rclone doesn't support specifying a key prefix as part of the path:

rclone copy my-s3-source-remote:source-bucket/a/b/c/2022-12- my-destination-remote:destination-bucket/d/e/f

A possible workaround is to use a file matching pattern:

rclone copy my-s3-remote:bucket/prefix my-destination-remote:destination-bucket/d/e/f --include "2022-12-*"

With this approach Rclone sends ListObjectsV2 with a/b/c as the prefix. As a result, Rclone lists all of the millions of keys under that prefix and only filters them afterwards on the client side.

The optimized way is to send the ListObjectsV2 request with a prefix equal to [directory]/[key-prefix], i.e. a/b/c/2022-12-. With this approach S3 returns only the keys starting with a/b/c/2022-12-, skipping the unnecessary keys entirely.
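
For illustration, here is a minimal sketch of the request this boils down to, written against the AWS SDK for Go v2 rather than rclone's internal S3 backend; the bucket name and prefix are taken from the example above:

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        cfg, err := config.LoadDefaultConfig(context.TODO())
        if err != nil {
            log.Fatal(err)
        }
        client := s3.NewFromConfig(cfg)

        // Server-side filtering: Prefix = [directory]/[key-prefix].
        // S3 returns only keys starting with "a/b/c/2022-12-",
        // instead of every key under "a/b/c/".
        input := &s3.ListObjectsV2Input{
            Bucket: aws.String("source-bucket"),
            Prefix: aws.String("a/b/c/2022-12-"),
        }

        paginator := s3.NewListObjectsV2Paginator(client, input)
        for paginator.HasMorePages() {
            page, err := paginator.NextPage(context.TODO())
            if err != nil {
                log.Fatal(err)
            }
            for _, obj := range page.Contents {
                fmt.Println(aws.ToString(obj.Key))
            }
        }
    }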

"--fast-list" flag doesn't help with flat structure.

Proposal

Add a new flag --s3-list-key-prefix that modifies the ListObjectsV2 (and ListObjects) request to include a server-side filter for keys. This significantly improves performance for flat structures.
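
With the proposed flag, the December 2022 copy from the motivation could then be expressed roughly like this (the exact syntax is part of the draft and may still change):

rclone copy my-s3-source-remote:source-bucket/a/b/c my-destination-remote:destination-bucket/d/e/f --s3-list-key-prefix "2022-12-"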

Implementation

[draft] Objects list performance optimization for a flat structures in S3 by alxfv · Pull Request #1 · alxfv/rclone · GitHub


I am actually in the process of fixing this using an out-of-memory sync routine for rclone, which should be seamless.

Check out this issue: Excess memory use when syncing millions of files in one directory · Issue #7974 · rclone/rclone · GitHub

I just posted a binary for people to try.

@ncw thank you for the quick reply! My case is similar in the nature of the data ("100 million files at the root. They are mostly small files < 1mb."), but instead of copying all of the files I need to copy <1% of them. To do this quickly I believe there is only one option: filtering keys in the ListObjects[V2] request. With this method we filter files by prefix almost instantly: it's an incredibly fast operation for S3/Ceph/...

I've created a draft implementation based on your branch as an example.

I've tested your branch on a bucket with millions of files in a single directory. Copying 5000 files takes ~10 seconds with filtering in ListObjectsV2 and minutes without it (99% of the time is spent on ListObjectsV2).