Weirdly slow listing of HTTP files on a public APT repository

Very slow listing (and syncing) of the HTTP folder /ubuntu/pool/dists/jammy/ on a public site.

It seems to never reply or return any data.
The weird thing is that listing and syncing some other folders of the same site, from the same computer, works fast.
Getting the folder listing in a web browser takes just a few seconds.

rclone version

rclone v1.68.1
- os/version: ubuntu 20.04 (64 bit)
- os/kernel: 5.15.153.1-microsoft-standard-WSL2 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.23.1
- go/linking: static
- go/tags: none
rclone v1.69.0-beta.8367.a19ddffe9
- os/version: ubuntu 20.04 (64 bit)
- os/kernel: 5.15.153.1-microsoft-standard-WSL2 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.23.2
- go/linking: static
- go/tags: none

Which cloud storage system are you using?

http

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone ls --config /dev/null --http-url https://r2u.stat.illinois.edu/ubuntu/pool/dists/jammy :http:

For comparison, here is a command that works fine against the same site:

time rclone ls --config /dev/null --http-url https://r2u.stat.illinois.edu/ubuntu/dists/ :http:

2s

There are fewer files there, but the other command takes forever.

config

not using any config

A log from the command that you were trying to run with the -vv flag

2024/10/16 16:09:22 DEBUG : rclone: Version "v1.69.0-beta.8367.a19ddffe9" starting with parameters ["/home/kforner/workspace/rclone-v1.69.0-beta.8367.a19ddffe9-linux-amd64/rclone" "ls" "--config" "/dev/null" "--http-url" "https://r2u.stat.illinois.edu/ubuntu/pool/dists/jammy" ":http:" "--log-file=debug.log" "-vv"]
2024/10/16 16:09:22 DEBUG : Creating backend with remote ":http:"
2024/10/16 16:09:22 DEBUG : Using config file from ""
2024/10/16 16:09:22 DEBUG : :http: detected overridden config - adding "{ETAaW}" suffix to name
2024/10/16 16:09:22 DEBUG : Root: https://r2u.stat.illinois.edu/ubuntu/pool/dists/jammy/
2024/10/16 16:09:22 DEBUG : fs cache: renaming cache item ":http:" to be canonical ":http{ETAaW}:"

You can try --http-no-head to speed up the listing.

To see what rclone is doing, use -vv --dump=headers
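As a sketch, the flag would slot into the original command like this (same URL as in the report; sizes may come back as unknown since the HEAD requests are skipped):

```shell
# --http-no-head skips the per-file HEAD request, so the listing returns
# much faster at the cost of less accurate size/type information.
rclone ls --config /dev/null \
  --http-url https://r2u.stat.illinois.edu/ubuntu/pool/dists/jammy \
  --http-no-head :http:
```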

https://r2u.stat.illinois.edu/ubuntu/pool/dists/jammy contains about 25k files, so I think it just takes some time.

The first link contains 8 files and takes 1.5s for me. Extrapolating from that, the second link would take many hours.

If you add --dump headers,responses you will see in detail what is going on: rclone is busy querying all of these files.

This indeed speeds up everything. A LOT. Any idea why rclone does not do this by default?

Normally rclone does a HEAD request for each potential file in a directory listing to:
- find its size
- check it really exists
- check to see if it is a directory
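To illustrate what one of those per-entry checks costs, the sketch below serves a scratch file locally (the port 8031, the /tmp/head-demo directory, and the availability of python3 and curl are all assumptions) and issues the same kind of HEAD request rclone would make, one per file in the listing:

```shell
# Serve a one-file directory locally, then send the same kind of HEAD
# request rclone makes for every entry in an HTTP directory listing.
mkdir -p /tmp/head-demo
printf 'hello' > /tmp/head-demo/file.txt
python3 -m http.server 8031 --directory /tmp/head-demo >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

# HEAD (-I) fetches headers only: the size, existence (200 vs 404), and
# type are all visible without downloading the body.
LEN=$(curl -sI http://localhost:8031/file.txt | tr -d '\r' \
      | awk 'tolower($1) == "content-length:" {print $2}')
echo "Content-Length: $LEN"   # prints: Content-Length: 5

kill $SERVER_PID
```

Each such round trip is cheap on its own, but with ~25k files done sequentially the latencies add up, which is why the listing appears to hang.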

Yep. RTFM myself...

As using --http-no-head has some drawbacks, another solution could be to speed everything up by massively increasing the number of checkers (they work in parallel):

time rclone ls --config /dev/null --http-url https://r2u.stat.illinois.edu/ubuntu/pool/dists/jammy :http: --checkers 256
      944 main/bioc-api-package_0.1.0-1.2204.1_all.deb
   295922 main/r-bioc-a4core_1.52.0-1.ca2204.1_all.deb
  1054580 main/r-bioc-affxparser_1.76.0-1.ca2204.1_amd64.deb
   ...
    19188 main/r-cran-ztype_0.1.0-1.ca2204.1_all.deb
   479532 main/r-cran-zvcv_2.1.2-1.ca2204.1_amd64.deb
    38452 main/r-cran-zyp_0.11-1-1.ca2204.1_all.deb
    49048 main/r-cran-zzlite_0.1.2-1.ca2204.1_all.deb

real	0m20.311s
user	0m6.017s
sys	0m4.347s

20s for 25k files is acceptable I think.

If not, then 1024 checkers finish the job in 8s for me :)

Thank you all.
That makes perfect sense!

Be warned that if you keep listing content like this all the time, your IP can be blacklisted by their systems. It of course depends on what they have set up, but such a massive flood of requests looks very similar to a DDoS of some sort :)

It would make sense to list all this content once and then use it locally. Maybe you can use rclone mount and increase --dir-cache-time to "forever", something like 9999h. At least that is how I would work with such a source.
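A minimal sketch of that mount, assuming an arbitrary, pre-created mount point /mnt/r2u (the flags shown are real rclone mount options):

```shell
# Mount the repository read-only and keep the directory listing cached
# effectively forever, so the expensive HEAD sweep happens only once.
rclone mount --config /dev/null \
  --http-url https://r2u.stat.illinois.edu/ubuntu/pool/dists/jammy \
  --read-only --dir-cache-time 9999h \
  :http: /mnt/r2u
```

After the first full listing, subsequent directory reads are served from the cache instead of hitting the remote server again.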

No worries, I saved you a step...

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.