Chunker not deactivating for small files and wasting API calls

TL;DR: In some situations chunker stays active on small files, which wastes API calls and severely limits throughput on more restricted backends (e.g. Box). Chunker should automatically skip files smaller than the chunk size and fall back to the regular backend in those cases.

I am backing up some of our server's data to Box - around 58TB, with large files (>15GB) comprising ~95% of the data, and the remaining 5% made up of millions of small files. Because of the large files, I used chunker. The large files were a higher priority, so I pushed them first. I was able to finish those ~54TB in a matter of days. However, once it got to those millions of smaller files, my transfer rates plummeted. I was getting a throughput of 40 files per minute, which steadily dropped to 6 files per minute over the course of more than a month. Files would just sit in the queue at either 0% or 100% for minutes before going through. I've spent weeks tuning parameters to reduce API calls, but nothing really seemed to work.

Looking at the log, I noticed that the transactions for each file looked like this:

2023/07/... INFO  : my_file.rclone_chunk.001_znavdm: Moved (server-side) to: my_file
2023/07/... INFO  : my_file: Copied (new)

It seems chunker was making chunks even though these files are a few MB. I decided to restart the transfers keeping everything the same except for using box as the backend, instead of box-chunker. Just by doing that, my transfer rate jumped from 6 files/min to >200/min, so it's very likely that chunker was the culprit.

For reference, here's my rclone invocation, chunker configuration, and rclone version (the latest at the time of writing this). I think some combination of flags is preventing chunker from seeing the file sizes on the local machine.

Invocation

rclone copy --files-from="file_list.txt" \
   --fast-list --multi-thread-streams=20 \
   --ignore-size --ignore-checksum \
   --ignore-existing --log-file=logfile.log \
   --log-level=INFO -P --transfers=120 \
   --checkers=120 --skip-links \
   indir "box-chunker:outdir"

Chunker

[box-chunker]
type = chunker
remote = box:
chunk_size = 15G
hash_type = sha1

rclone version

rclone v1.63.1-DEV
- os/version: centos 7.9.2009 (64 bit)
- os/kernel: 3.10.0-1127.13.1.el7.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.20.6
- go/linking: static
- go/tags: none

Can you post your chunker configuration from your rclone.conf? I strongly suspect that it is your configuration and not a bug.

Thanks for the quick reply! I've added that info to the question.

Can you check your box: remote directly, to see what the small files look like?

They look exactly like they do on our server - same size and content, no appended suffixes, etc. Anything in particular I should look for?

This is ok.

Why do you use an old rclone? Not sure if it is related, but it is a waste of time to investigate ancient versions :)

Yeah, that's a good point :) I was wary of asking about this issue given the slightly older version, but I didn't see any mention of this in the changelog so I went ahead. I'm going to give it a try with a fresh version to see if the issue persists and come back here.

I do not think it will make a difference (but of course I am not sure), but IMHO it is just wrong to use an old version, especially for such a fast-moving project as rclone :)

Do not use your Linux repo version. Just install the latest one (uninstall the repo one first):

sudo -v ; curl https://rclone.org/install.sh | sudo bash

and later to update you can just run:

sudo rclone selfupdate --stable

rclone does not have any dependencies and the binary is self-contained, so there is no advantage in using an outdated repo version.

I can't reproduce this issue.

Could you please choose some folder with a few small files - you can even create a sample one.

then run:

rclone sync folderWithSmallFiles box-chunker:test -vv

and post output here

It's a good move, and I'm glad we did it for the sake of being thorough. I've just updated to the latest version, and I'm running into the same issue using box-chunker - the small files unnecessarily get split into chunks and then renamed. I changed that in the original post as well.

We need a log file, not only the story :) I think you are onto something here... we just need solid data.

This is just a fragment of the log file...

to be precise:

  1. post something showing the file sizes - so we know they are not "chunkable"

  2. the full log

  3. rclone ls box:test so we know what the result is

This is why I suggested creating a small example test folder in your source with some random small files. That way the log will be short but will contain all the info.

Thank you for the information. And apologies for making you ask twice - I didn't see your request to run a test case earlier. Here's the log on a test set of small files.

Input files

-rw-r--r--. 1 dricardo dricardo   83 Jul 19 14:22 files_test.txt
-rw-r--r--. 1 dricardo dricardo 3.9K Dec 21  2022 fix-onedrive-zip
-rw-r-----. 1 dricardo dricardo 2.8K Jul 19 14:23 log.txt
-rw-r--r--. 1 dricardo dricardo 304 May 16 17:54 R/iframe_pdfs.R
-rw-r--r--. 1 dricardo dricardo 452 May 17 13:24 R/rnotebook_setup.R

Log

2023/07/19 14:28:33 DEBUG : rclone: Version "v1.63.1-DEV" starting with parameters ["rclone" "copy" "--files-from=files_test.txt" "--fast-list" "--multi-thread-streams=20" "--ignore-size" "--ignore-checksum" "--ignore-existing" "--log-file=log.txt" "--log-level=DEBUG" "-P" "--transfers=120" "--checkers=120" "--skip-links" "/home/dricardo/scripts" "box-chunker:test_rclone"]
2023/07/19 14:28:33 DEBUG : Creating backend with remote "/home/dricardo/scripts"
2023/07/19 14:28:33 DEBUG : Using config file from "/home/dricardo/.config/rclone/rclone.conf"
2023/07/19 14:28:33 DEBUG : local: detected overridden config - adding "{HK82T}" suffix to name
2023/07/19 14:28:33 DEBUG : fs cache: renaming cache item "/home/dricardo/scripts" to be canonical "local{HK82T}:/home/dricardo/scripts"
2023/07/19 14:28:33 DEBUG : Creating backend with remote "box-chunker:test_rclone"
2023/07/19 14:28:33 DEBUG : Creating backend with remote "box:test_rclone"
2023/07/19 14:28:33 DEBUG : box: Loaded invalid token from config file - ignoring
2023/07/19 14:28:33 DEBUG : box root 'test_rclone': Token expired but no uploads in progress - doing nothing
2023/07/19 14:28:34 DEBUG : Saving config "token" in section "box" of the config file
2023/07/19 14:28:34 DEBUG : Keeping previous permissions for config file: -rw-r--r--
2023/07/19 14:28:34 DEBUG : box: Saved new token in config file
2023/07/19 14:28:36 DEBUG : Reset feature "ListR"
2023/07/19 14:28:36 DEBUG : files_test.txt: Need to transfer - File not found at Destination
2023/07/19 14:28:36 DEBUG : fix-onedrive-zip: Need to transfer - File not found at Destination
2023/07/19 14:28:36 DEBUG : R/.gitignore: Need to transfer - File not found at Destination
2023/07/19 14:28:36 DEBUG : R/iframe_pdfs.R: Need to transfer - File not found at Destination
2023/07/19 14:28:36 DEBUG : R/rnotebook_setup.R: Need to transfer - File not found at Destination
2023/07/19 14:28:36 DEBUG : Chunked 'box-chunker:test_rclone': Waiting for checks to finish
2023/07/19 14:28:36 DEBUG : Chunked 'box-chunker:test_rclone': Waiting for transfers to finish
2023/07/19 14:28:37 DEBUG : files_test.txt: skip slow SHA1 on source file, hashing in-transit
2023/07/19 14:28:39 DEBUG : fix-onedrive-zip: skip slow SHA1 on source file, hashing in-transit
2023/07/19 14:28:40 DEBUG : R/iframe_pdfs.R: skip slow SHA1 on source file, hashing in-transit
2023/07/19 14:28:40 DEBUG : R/rnotebook_setup.R: skip slow SHA1 on source file, hashing in-transit
2023/07/19 14:28:41 DEBUG : R/.gitignore: skip slow SHA1 on source file, hashing in-transit
2023/07/19 14:28:49 INFO  : files_test.txt.rclone_chunk.001_2bjbak: Moved (server-side) to: files_test.txt
2023/07/19 14:28:49 INFO  : files_test.txt: Copied (new)
2023/07/19 14:28:50 INFO  : R/iframe_pdfs.R.rclone_chunk.001_2bjen0: Moved (server-side) to: R/iframe_pdfs.R
2023/07/19 14:28:50 INFO  : R/iframe_pdfs.R: Copied (new)
2023/07/19 14:28:50 INFO  : R/rnotebook_setup.R.rclone_chunk.001_2bjepr: Moved (server-side) to: R/rnotebook_setup.R
2023/07/19 14:28:50 INFO  : R/rnotebook_setup.R: Copied (new)
2023/07/19 14:28:50 INFO  : fix-onedrive-zip.rclone_chunk.001_2bjdg7: Moved (server-side) to: fix-onedrive-zip
2023/07/19 14:28:50 INFO  : fix-onedrive-zip: Copied (new)
2023/07/19 14:28:50 INFO  : R/.gitignore.rclone_chunk.001_2bjfvb: Moved (server-side) to: R/.gitignore
2023/07/19 14:28:50 INFO  : R/.gitignore: Copied (new)
2023/07/19 14:28:50 INFO  :
Transferred:   	    9.324 KiB / 9.324 KiB, 100%, 367 B/s, ETA 0s
Checks:                 5 / 5, 100%
Renamed:                5
Transferred:            5 / 5, 100%
Elapsed time:        17.0s

2023/07/19 14:28:50 DEBUG : 9 go routines active

rclone ls

rclone ls box:test_rclone

       83 files_test.txt
     3932 fix-onedrive-zip
        3 R/.gitignore
      304 R/iframe_pdfs.R
      452 R/rnotebook_setup.R
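
For anyone who wants to check a larger log for the same pattern, here is a minimal Python sketch that picks out the "Moved (server-side)" lines shown above. The function name and the regular expression are my own, written against the INFO lines in this log only. The filter on the target name is an assumption: a rename whose target still carries a chunk suffix is taken to belong to a genuinely chunked file.

```python
import re

# Matches INFO lines of the shape seen in the log above, e.g.
#   ... INFO  : my_file.rclone_chunk.001_znavdm: Moved (server-side) to: my_file
CHUNK_MOVE = re.compile(
    r"INFO\s+: (?P<temp>.+\.rclone_chunk\.\d+_\w+): "
    r"Moved \(server-side\) to: (?P<final>.+)$"
)


def find_temp_chunk_renames(log_lines):
    """Return final names of files uploaded as a temporary chunk and then renamed."""
    renamed = []
    for line in log_lines:
        match = CHUNK_MOVE.search(line.rstrip("\n"))
        if not match:
            continue
        final = match.group("final")
        # Assumption: a target that still has a chunk suffix is a real chunk.
        if ".rclone_chunk." not in final:
            renamed.append(final)
    return renamed


sample = [
    "2023/07/19 14:28:49 INFO  : files_test.txt.rclone_chunk.001_2bjbak: "
    "Moved (server-side) to: files_test.txt",
    "2023/07/19 14:28:49 INFO  : files_test.txt: Copied (new)",
]
print(find_temp_chunk_renames(sample))
```

If the number of names returned is close to the number of small files transferred, nearly every small file took the extra rename step.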

Thank you very much. perfect

I can replicate it. It is maybe not a bug as such, but it is definitely undesirable behaviour of the chunker remote.

It writes temp files for every file, including ones below the chunk threshold.

When a file is not going to be written in multiple chunks, it is totally unnecessary to add an extra move (and of course extra API calls) that achieves nothing.

It will especially affect transfers of many small files.
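
To make the overhead concrete, here is a rough Python sketch of the operation count for files that fit in a single chunk. It is a hypothetical model, not chunker's actual code: the operation names are illustrative, and the real number of API calls behind each operation may be higher. The sizes are the five files from the rclone ls output earlier in the thread.

```python
GIB = 1024 ** 3
CHUNK_SIZE = 15 * GIB  # chunk_size = 15G from the config above, read here as GiB


def operations_for_small_file(size, skip_temp_for_small):
    """List the backend operations for one file that fits in a single chunk.

    skip_temp_for_small=False models the behaviour seen in the log above;
    skip_temp_for_small=True models the behaviour asked for in this thread.
    """
    if size > CHUNK_SIZE:
        raise ValueError("this sketch only models files that fit in one chunk")
    if skip_temp_for_small:
        return ["upload to final name"]
    return ["upload to temporary chunk name", "server-side move to final name"]


sizes = [83, 3932, 3, 304, 452]  # bytes, from the rclone ls output
observed = sum(len(operations_for_small_file(s, False)) for s in sizes)
desired = sum(len(operations_for_small_file(s, True)) for s in sizes)
print(observed, desired)  # prints: 10 5
```

The 5 extra operations match the "Renamed: 5" line in the transfer summary, so under this model the overhead grows linearly with the number of small files.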

Now @rdalban, would you maybe like to try to fix it? :)

The chunker remote is missing a maintainer at the moment - and even though it works, it definitely has some bits that could be polished.

Here is the list of outstanding issues recorded for chunker which need attention:

None is critical but overall chunker remote is not optimally implemented at the moment.

And thank you for finding this problem:)


Looks like the move to Box has put the spotlight on chunker again :)

I appreciate you being so responsive and looking at this so quickly - thank you!

I don't have experience with Go, but I'm happy to give it a go (pun intended) some time in the next couple of weeks. I'll also make a pull request to mention this in the chunker documentation so other folks are aware.

Did you find this issue only with Box, or is this a problem with every backend?


It is a chunker implementation problem - it will be the same with any other remote.


BTW, with your setup it is a perfectly good and clever workaround to transfer small files directly to box:. It won't affect accessing your chunker remote later.


Great - I'll keep that in mind and implement a two-stage backup plan for this kind of situation while this is still an issue. Thank you again!
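
For anyone wanting to try the same two-stage approach, here is a minimal Python sketch of how the file list could be split by size. The helper names and the output file names (small_files.txt, large_files.txt) are hypothetical. It assumes the list contains paths relative to the source directory, as --files-from expects, and uses the 15G chunk size from the config above as the threshold.

```python
import os

GIB = 1024 ** 3


def split_by_size(rel_paths, threshold_bytes, root="."):
    """Split relative paths into (small, large) by their size under root."""
    small, large = [], []
    for rel in rel_paths:
        size = os.path.getsize(os.path.join(root, rel))
        if size <= threshold_bytes:
            small.append(rel)
        else:
            large.append(rel)
    return small, large


def read_list(list_file):
    """Read one path per line, skipping blank lines."""
    with open(list_file) as fh:
        return [line.rstrip("\n") for line in fh if line.strip()]


def write_list(rel_paths, list_file):
    """Write one path per line, in the format --files-from reads."""
    with open(list_file, "w") as fh:
        for rel in rel_paths:
            fh.write(rel + "\n")


# Example usage (not run here; names follow the invocation in the first post):
#   paths = read_list("file_list.txt")
#   small, large = split_by_size(paths, 15 * GIB, root="indir")
#   write_list(small, "small_files.txt")
#   write_list(large, "large_files.txt")
# Then, with the remaining flags as in the original invocation:
#   rclone copy --files-from=small_files.txt indir box:outdir
#   rclone copy --files-from=large_files.txt indir box-chunker:outdir
```

Files at or below the threshold go straight to box:, so they skip the temporary chunk and rename step, while the large files still go through box-chunker.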