Rclone checks for unnecessary files and folders when using the copyto command

What is the problem you are having with rclone?

Hello! We have two S3 accounts. I want to copy 10 specific files, listed in the sync_list file, as quickly as possible. Each file is 10-50 KB. But rclone starts scanning the entire S3 bucket in the account before it starts copying the files it needs. This takes a huge amount of time, as we have millions of files in the bucket.

Run the command 'rclone version' and share the full output of the command.

rclone version
rclone v1.65.2
- os/version: debian 11.8 (64 bit)
- os/kernel: 5.10.0-26-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.21.6
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

ceph

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copyto config_old:bucket_docs config_new:bucket_docs -P -vv --fast-list --no-check-dest --retries 1 --transfers 1000 --files-from sync_list

The rclone config contents with secrets removed.

rclone config show
[config_old]
type = s3
provider = Other
env_auth = false
access_key_id = *****
secret_access_key = *****
endpoint = *****
acl = private

[config_new]
type = s3
provider = Other
env_auth = false
access_key_id = *****
secret_access_key = *****
endpoint = *****
acl = private

A log from the command with the -vv flag

2024/03/09 17:41:52 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "copyto" "--progress" "config_old:bucket_docs" "config_new:bucket_docs" "-P" "-vv" "--fast-list" "--no-check-dest" "--retries" "1" "--transfers" "1000" "--files-from" "sync_list"]
2024/03/09 17:41:52 DEBUG : Creating backend with remote "config_old:bucket_docs"
2024/03/09 17:41:52 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"
2024/03/09 17:41:52 DEBUG : Resolving service "s3" region "us-east-1"
2024/03/09 17:41:52 DEBUG : Creating backend with remote "config_new:bucket_docs"
2024/03/09 17:41:52 DEBUG : Resolving service "s3" region "us-east-1"
2024/03/09 17:43:33 DEBUG : 010223: Excluded
2024/03/09 17:43:33 DEBUG : 010323: Excluded
2024/03/09 17:43:33 DEBUG : 010423: Excluded
2024/03/09 17:43:33 DEBUG : 010523: Excluded
2024/03/09 17:43:33 DEBUG : 010623: Excluded
2024/03/09 17:43:33 DEBUG : 010723: Excluded
2024/03/09 17:43:33 DEBUG : 010823: Excluded
2024/03/09 17:43:33 DEBUG : 010923: Excluded
2024/03/09 17:43:33 DEBUG : 011023: Excluded
2024/03/09 17:43:33 DEBUG : 011123: Excluded
2024/03/09 17:43:33 DEBUG : 020123: Excluded
2024/03/09 17:43:33 DEBUG : 020223: Excluded
2024/03/09 17:43:33 DEBUG : 020323: Excluded
2024/03/09 17:43:33 DEBUG : 020423: Excluded
2024/03/09 17:43:33 DEBUG : 020523: Excluded
2024/03/09 17:43:33 DEBUG : 020623: Excluded
2024/03/09 17:43:33 DEBUG : 020723: Excluded
2024/03/09 17:43:33 DEBUG : 020823: Excluded
2024/03/09 17:43:33 DEBUG : 020923: Excluded
2024/03/09 17:43:33 DEBUG : 021023: Excluded
2024/03/09 17:43:33 DEBUG : 021123: Excluded
2024/03/09 17:43:33 DEBUG : 030123: Excluded
2024/03/09 17:43:33 DEBUG : 030223: Excluded
2024/03/09 17:43:33 DEBUG : 030323: Excluded
2024/03/09 17:43:33 DEBUG : 030423: Excluded
2024/03/09 17:43:33 DEBUG : 030523: Excluded
2024/03/09 17:43:33 DEBUG : 030623: Excluded
2024/03/09 17:43:33 DEBUG : 030723: Excluded
2024/03/09 17:43:33 DEBUG : 030823: Excluded
2024/03/09 17:43:33 DEBUG : 030923: Excluded
2024/03/09 17:43:33 DEBUG : 031023: Excluded
2024/03/09 17:43:33 DEBUG : 031123: Excluded
2024/03/09 17:43:33 DEBUG : 040123: Excluded
2024/03/09 17:43:33 DEBUG : 040223: Excluded
2024/03/09 17:43:33 DEBUG : 040423: Excluded
2024/03/09 17:43:33 DEBUG : 040523: Excluded
2024/03/09 17:43:33 DEBUG : 040623: Excluded
2024/03/09 17:43:33 DEBUG : 040723: Excluded
2024/03/09 17:43:33 DEBUG : 040823: Excluded
2024/03/09 17:43:33 DEBUG : 040923: Excluded
2024/03/09 17:43:33 DEBUG : 041023: Excluded
2024/03/09 17:43:33 DEBUG : 041123: Excluded
2024/03/09 17:43:33 DEBUG : 050123: Excluded
2024/03/09 17:43:33 DEBUG : 050223: Excluded
2024/03/09 17:43:33 DEBUG : 050423: Excluded
2024/03/09 17:43:33 DEBUG : 050523: Excluded
2024/03/09 17:43:33 DEBUG : 050623: Excluded
2024/03/09 17:43:33 DEBUG : 050723: Excluded
2024/03/09 17:43:33 DEBUG : 050823: Excluded
...
...
Transferred:             0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       46m2.0s

Our list of files

cat sync_list
210123/23333995/23333995.txt
050323/150145599/150145599.txt
010123/23941421/23941421.txt
060123/77107102/77107102.txt
040323/118053120/118053120.txt
200223/106163744/patch_1676922509.txt
110223/91373395/patch_1676086505.txt
070323/16677173/patch_1678179158.txt
110223/106851934/patch_1676081042.txt
290323/76198099/patch_1680086978.txt

Try --no-traverse - that should help.

Yes, with --no-traverse the 10 files copied very fast, without the checking phase.
But sync_list is only a test list of files.
My real list, sync_list_all, contains 12 million files.

When I start the command with the real list sync_list_all

rclone copyto config_old:bucket_docs config_new:bucket_docs -P -vv --fast-list --no-check-dest --retries 1 --transfers 1000 --no-traverse --files-from sync_list_all

I get very low speed (~100-200 files per minute).
Here is the log:

2024/03/09 21:01:52 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "copyto" "--progress" "config_old:bucket_docs" "config_new:bucket_docs" "-P" "-vv" "--fast-list" "--no-check-dest" "--retries" "1" "--transfers" "1000" "--no-traverse" "--files-from" "sync_list_config_09032024_new_98683000-end"]
2024/03/09 21:01:52 DEBUG : Creating backend with remote "config_old:bucket_docs"
2024/03/09 21:01:52 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"
2024/03/09 21:01:52 DEBUG : Resolving service "s3" region "us-east-1"
2024/03/09 21:01:52 DEBUG : Creating backend with remote "config_new:bucket_docs"
2024/03/09 21:01:52 DEBUG : Resolving service "s3" region "us-east-1"
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:        30.1s
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:        30.6s
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:        31.1s
...
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:       9m6.6s
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:       9m7.1s
...

My solution

But when I use my own Python script instead of the --files-from flag, which reads sync_list_all and runs an rclone copyto command for each line concurrently with 1000 parallel workers, I get about 27000 files per minute.

Here is my Python script:

# Imports needed by this snippet; argument parsing (args), the remote/bucket
# globals, checks(), get_logger() and threaded_start are defined elsewhere
# in the full script.
import concurrent.futures
import datetime
import time

from rclone_python import rclone  # assumed wrapper providing rclone.copyto(); the import is not shown in the original post


def rclone_copyto(line, line_number):
    # Copy a single object from the old remote to the new one.
    retries = args.r if args.r else 1

    try:
        # Flags reference: https://rclone.org/flags/
        rclone.copyto(f'{rclone_old_config}:{rclone_old_bucket}/{line}',
                      f'{rclone_new_config}:{rclone_new_bucket}/{line}',
                      show_progress=False,
                      args=['-P', '-vv', '--fast-list', '--no-check-dest',
                            f'--retries {retries}'])
        my_logger.info(f'line_number = {line_number} file {line}')
    except Exception:
        my_logger.exception(f'ERROR: Sync failed in line_number = {line_number}')


def main():
    checks()

    globals()['sync_file_path'] = args.f

    max_workers = args.w if args.w else 1000

    # Read the file list and optionally restrict it to the requested line range.
    with open(sync_file_path, 'r') as f:
        lines = [line.rstrip() for line in f]
        line_start = args.s if args.s else 1
        line_end = args.e if args.e else len(lines)
        lines = lines[(line_start - 1):line_end]

    globals()['my_logger'] = get_logger()

    # One rclone copyto per file, with up to max_workers running in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        executor.map(rclone_copyto, lines, range(line_start, line_end + 1))

    my_logger.info(f'Threaded time: {datetime.timedelta(seconds=(time.time() - threaded_start))}')
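
How the script is invoked, based on the flags visible above (-f file list, -w workers, -r retries, -s/-e optional line range); the script name here is just a placeholder:

python rclone_parallel_copyto.py -f sync_list_all -w 1000 -r 1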

Question

Why is rclone so slow with the --files-from flag, compared to my script?

Rclone spends a long time at the start checking those 12 million files actually exist. If you increase --checkers then it will check them quicker and start the transfers quicker. You could try --checkers 1000 to match your --transfers 1000.
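
For example, adding it to the command above (the values are just a starting point; tune --checkers and --transfers to what your Ceph endpoints can handle):

rclone copyto config_old:bucket_docs config_new:bucket_docs -P -vv --fast-list --no-check-dest --retries 1 --transfers 1000 --checkers 1000 --no-traverse --files-from sync_list_all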

Rclone checks all the files in the --files-from list before transferring any of them, which maybe it shouldn't.

Why is it checking files? I pass the --no-check-dest flag. It should skip checking, no?

I do a lot of interfacing between rclone and Python, and for my latest tool (dfb), I had to use copyto. Doing each copyto with its own CLI call was very slow and burned a ton of API calls.

I highly recommend using the rc interface as it will save you all of the authorization calls. It makes a gigantic difference.

You are welcome to rip out my rc interfacing code. It won't be too bad. You'd need to pull in a few utilities but it can be done. Then you can use threads or call_async_and_background to have many going on at once.

Or write your own. I am only a hobby developer.
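
For anyone who wants to try the rc route without pulling in dfb's code, here is a minimal sketch that drives a long-running rclone daemon over its HTTP remote-control API with the requests library. The daemon address, remote names, bucket and worker count are assumptions based on this thread, not something from dfb. Start the daemon once:

rclone rcd --rc-no-auth --rc-addr localhost:5572

and then submit one operations/copyfile call per file from Python:

import concurrent.futures
import requests

RC_URL = 'http://localhost:5572'  # default rcd address; --rc-no-auth assumed for brevity

def rc_copyfile(path):
    # operations/copyfile copies a single object from srcFs to dstFs using the
    # already-running daemon, so there is no per-file process startup or auth.
    resp = requests.post(f'{RC_URL}/operations/copyfile',
                         json={'srcFs': 'config_old:bucket_docs',
                               'srcRemote': path,
                               'dstFs': 'config_new:bucket_docs',
                               'dstRemote': path},
                         timeout=600)
    resp.raise_for_status()

with open('sync_list_all') as f:
    paths = [line.rstrip() for line in f if line.strip()]

# Many copies in flight at once against the single daemon.
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    list(executor.map(rc_copyfile, paths))

You can also add "_async": true to the JSON body to get a job id back instead of blocking, which is roughly what call_async_and_background does.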


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.