HTTP not listing from "directory lister"


#1

Hi!
I am trying to list files from a website that uses “Directory Lister”,
but sadly rclone isn’t able to list them.

Would be really glad if anyone could help.

PS:
Using rclone version:
rclone v1.45

  • os/arch: windows/amd64
  • go version: go1.11

#2

Looking at the Directory Lister example:

http://demo.directorylister.com/?dir=code

This then links to files like

http://demo.directorylister.com/code/hello-world.c

This won’t work with the current scheme, as rclone expects the links to look like this:

http://demo.directorylister.com/code/
http://demo.directorylister.com/code/hello-world.c

If you extract the names from the directory listings, then rclone can fetch the files for you or mount them or whatever…

e.g. put this into files.txt:

hello-world.c
hello-world.css
hello-world.html
hello-world.java

The names can contain directories, but paths should start from the root.
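Extracting the names can be automated. Here is a rough sketch (not part of rclone) that pulls file names out of the anchor hrefs on a listing page with a regexp; it assumes the lister links straight to the files under a known prefix and does no HTML-aware parsing:

```go
package main

import (
	"fmt"
	"regexp"
)

// extractNames pulls file names out of anchor hrefs in a listing page.
// It assumes every file link starts with prefix, e.g. "/code/".
func extractNames(html, prefix string) []string {
	re := regexp.MustCompile(`href="` + regexp.QuoteMeta(prefix) + `([^"?]+)"`)
	var names []string
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		names = append(names, m[1]) // the captured part is the path relative to prefix
	}
	return names
}

func main() {
	page := `<a href="/code/hello-world.c">hello-world.c</a>
<a href="/code/hello-world.css">hello-world.css</a>`
	for _, n := range extractNames(page, "/code/") {
		fmt.Println(n) // one name per line, ready for files.txt
	}
}
```

Piping the output to a file gives you something you can pass straight to --files-from.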

Then

$ rclone --http-url http://demo.directorylister.com/code/ --files-from files.txt lsf :http:
hello-world.c
hello-world.css
hello-world.html
hello-world.java

$ rclone -v --http-url http://demo.directorylister.com/code/ --files-from files.txt copy :http: code-copy
2018/12/02 14:06:27 INFO  : Local file system at /tmp/code-copy: Waiting for checks to finish
2018/12/02 14:06:27 INFO  : Local file system at /tmp/code-copy: Waiting for transfers to finish
2018/12/02 14:06:27 INFO  : hello-world.css: Copied (new)
2018/12/02 14:06:27 INFO  : hello-world.c: Copied (new)
2018/12/02 14:06:27 INFO  : hello-world.java: Copied (new)
2018/12/02 14:06:27 INFO  : hello-world.html: Copied (new)
2018/12/02 14:06:27 INFO  : 
Transferred:   	       659 / 659 Bytes, 100%, 447 Bytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            4 / 4, 100%
Elapsed time:        1.4s

Hope that helps!


#3

Thank you so much for your reply.

I had another question.

Can we skip files that don’t exist in the HTTP remote but are present in the --files-from files.txt?
I’ve extracted all of the files into a text file, but some of them no longer exist on the remote.
So rclone doesn’t do anything and puts up this error:
Failed to lsf: Stat failed: failed to stat: HTTP Error 404: 404 Not Found

Can I suppress this error and continue with the files that are present both on the remote and in files.txt?


#4

I think rclone should be doing that already…

In fact it looks like a bug…

Try this

https://beta.rclone.org/branch/v1.45-022-gfc654a4c-fix-http-not-found-beta/ (uploaded in 15-30 mins)


#5

You are right, this does work.

Thanks a lot.


#6

Thanks for testing. I’ll merge that to the latest beta now - it will be there in 15-30 mins.


#7

Hi!

I’ve done just that: a Scrapy spider crawls the website, lists all the links, and stores them in a file for the remote I am trying to fetch with rclone.
Both the crawler and rclone run on a schedule, once every 6 hours,
but rclone is taking very long (about 2–4 hours) to start copying, depending on the number of entries in the file given to --files-from.

Without --files-from, rclone performs much better.
Is it supposed to take this long to start?
(The file has around 4k–5k entries.)

Rclone version I’ve tried on:
rclone: Version “v1.45-031-ge7684b7e-beta”


#8

Hmm, yes – rclone checks that each file in the --files-from list exists. However, it does this in a very inefficient way: one at a time! Rclone should really parallelise this across --checkers threads at once, like it does everything else.

This would be relatively easy to implement.

The code is here: https://github.com/ncw/rclone/blob/e7684b7ed5d2b325c2ce7790e8c0a663cc0a870b/fs/filter/filter.go#L509-L527
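In outline, the parallel version could look something like this. This is just a sketch of the worker-pool shape with a stubbed-out existence check, not the real filter code:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// checkExists filters names down to those for which exists() reports true,
// running up to `checkers` probes concurrently -- the shape suggested above.
func checkExists(names []string, checkers int, exists func(string) bool) []string {
	in := make(chan string)
	out := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < checkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range in {
				if exists(name) { // e.g. a HEAD request per name
					out <- name
				}
			}
		}()
	}

	go func() {
		for _, n := range names {
			in <- n
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	var found []string
	for name := range out {
		found = append(found, name)
	}
	sort.Strings(found) // restore a stable order after concurrent checks
	return found
}

func main() {
	names := []string{"a.txt", "missing.txt", "b.txt"}
	ok := checkExists(names, 4, func(n string) bool { return n != "missing.txt" })
	fmt.Println(ok)
}
```

With a few thousand entries, checking --checkers names at a time instead of one at a time should cut the startup delay roughly by that factor.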

Can you please make a new issue on github about that and we can have a go at fixing it. Maybe you’d like to help?


#9

I’ve created a new issue.

I don’t think I would be much help right now, as I don’t know Go at all.
I am currently learning the basics,
and will try to help once I understand what’s going on in the project.


#10

I’ve posted a beta in the issue for you to try :smile:

This issue isn’t a good one for people new to Go as anything involving concurrency is always difficult!