HTTP backend not listing files from “Directory Lister”


#1

Hi!
I am trying to list files from a website that uses “Directory Lister” to list files,
but sadly rclone isn’t able to list them.

Would be really glad if anyone could help.

PS:
Using rclone version:
rclone v1.45

  • os/arch: windows/amd64
  • go version: go1.11

#2

Looking at the Directory Lister example

http://demo.directorylister.com/?dir=code

This then links to files like

http://demo.directorylister.com/code/hello-world.c

This won’t work with the current scheme as rclone is expecting the links to look like this

http://demo.directorylister.com/code/
http://demo.directorylister.com/code/hello-world.c

If you extract the names from the directory listing, then rclone can fetch the files for you, mount them, or whatever…

e.g. put this into files.txt:

hello-world.c
hello-world.css
hello-world.html
hello-world.java

The names can contain directories, but should start from the root.
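One way to build files.txt automatically is to scrape the listing page. Here's a rough sketch using only the Python standard library; the HTML snippet below is hypothetical, so adjust the prefix and the filtering to match the real page (Directory Lister's navigation links use ?dir=… query strings, which is how the sketch tells directories apart from files):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def listing_to_files(html, prefix="/code/"):
    """Turn a listing page into --files-from entries.

    Keeps only direct file links under `prefix` (directory navigation
    links contain ?dir=... query strings, so they are skipped) and
    strips the prefix so the names start from the root of --http-url.
    """
    parser = LinkExtractor()
    parser.feed(html)
    return [link[len(prefix):] for link in parser.links
            if link.startswith(prefix) and "?" not in link]

# Hypothetical fragment of a listing page, for illustration:
page = """
<a href="?dir=code">code</a>
<a href="/code/hello-world.c">hello-world.c</a>
<a href="/code/hello-world.css">hello-world.css</a>
"""
print("\n".join(listing_to_files(page)))
```

Writing that output to files.txt gives you exactly the format the --files-from flag expects.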

Then

$ rclone --http-url http://demo.directorylister.com/code/ --files-from files.txt lsf :http:
hello-world.c
hello-world.css
hello-world.html
hello-world.java

$ rclone -v --http-url http://demo.directorylister.com/code/ --files-from files.txt copy :http: code-copy
2018/12/02 14:06:27 INFO  : Local file system at /tmp/code-copy: Waiting for checks to finish
2018/12/02 14:06:27 INFO  : Local file system at /tmp/code-copy: Waiting for transfers to finish
2018/12/02 14:06:27 INFO  : hello-world.css: Copied (new)
2018/12/02 14:06:27 INFO  : hello-world.c: Copied (new)
2018/12/02 14:06:27 INFO  : hello-world.java: Copied (new)
2018/12/02 14:06:27 INFO  : hello-world.html: Copied (new)
2018/12/02 14:06:27 INFO  : 
Transferred:   	       659 / 659 Bytes, 100%, 447 Bytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            4 / 4, 100%
Elapsed time:        1.4s

Hope that helps!


#3

Thank you so much for your reply.

I had another question.

Can we skip the files that don’t exist on the http remote but are present in the --files-from files.txt?
I’ve extracted all of the files into a text file, but some of the files no longer exist on the remote.
So rclone doesn’t do anything and stops with this error:
Failed to lsf: Stat failed: failed to stat: HTTP Error 404: 404 Not Found

Can I suppress this error and still continue with the files that are present both on the remote and in files.txt?


#4

I think rclone should be doing that already…

In fact it looks like a bug…

Try this

https://beta.rclone.org/branch/v1.45-022-gfc654a4c-fix-http-not-found-beta/ (uploaded in 15-30 mins)


#5

You are right,

This does work.

Thanks a lot.


#6

Thanks for testing. I’ll merge that to the latest beta now - it will be there in 15-30 mins.


#7

Hi!

I’ve done just that: a Scrapy spider crawls the website, lists all the links, and stores them in a file for the remote I am trying to fetch with rclone.
Both the crawler and rclone run on a schedule, once every 6 hours,
but rclone takes a very long time (about 2-4 hours) to start copying, depending on the number of entries in the file passed to --files-from.

Without --files-from rclone performs much better.
Is it supposed to take this long to start?
(The file has around 4k-5k entries.)

Rclone version I’ve tried on:
rclone: Version “v1.45-031-ge7684b7e-beta”


#8

Hmm, yes, rclone checks that each file in the --files-from list exists. However it does this in a very inefficient way: one at a time! Rclone should really be parallelising this using --checkers threads at once, like it does everything else.

This would be relatively easy to implement.

The code is here: https://github.com/ncw/rclone/blob/e7684b7ed5d2b325c2ce7790e8c0a663cc0a870b/fs/filter/filter.go#L509-L527

Can you please make a new issue on github about that and we can have a go at fixing it. Maybe you’d like to help?
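For illustration, the parallel check could look something like this sketch. It's in Python rather than rclone's Go, and the stat function here is a dummy standing in for the per-file HTTP existence check; the point is just that a pool of workers replaces the one-at-a-time loop:

```python
from concurrent.futures import ThreadPoolExecutor

def stat(name):
    """Dummy stand-in for the per-file existence check
    (in rclone this would be an HTTP request to the remote).
    Here we pretend only .c and .css files exist."""
    return name.endswith((".c", ".css"))

def existing_files(names, checkers=8):
    """Check all names concurrently with `checkers` workers,
    keeping only the ones that exist (order preserved)."""
    with ThreadPoolExecutor(max_workers=checkers) as pool:
        results = pool.map(stat, names)
    return [n for n, ok in zip(names, results) if ok]

names = ["hello-world.c", "hello-world.css", "gone.txt"]
print(existing_files(names))  # the missing file is skipped
```

With ~4k-5k entries, checking 8 or more at a time instead of serially is where the 2-4 hour start-up delay would shrink.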


#9

I’ve created a new issue.

I don’t think I would be much help right now as I don’t know Go at all.
Currently I am learning the basics of it,
and will try to help when I am able to understand what’s going on in the project.


#10

I’ve posted a beta in the issue for you to try :smile:

This issue isn’t a good one for people new to Go as anything involving concurrency is always difficult!


#11

Hi!

The latest beta is showing as:
rclone-v1.45-031-ge7684b7e-beta-windows-amd64

Isn’t it supposed to be:
rclone-v1.45-033-g5ee1816a-fix-2835-beta-windows-amd64
or something with version 033?

I might be mistaken.
I apologise if that’s the case.


#12

I’ve encountered a new problem that I really need your help with.

There is an http remote which has many files that I need to copy to a Google Drive remote.
I’ve extracted all the links and stored them in a file named files.txt.

files.txt is in this format:

dir1/file1.txt?jtoken=4ba0d4388ff1eafb3671473977a3a4ab
dir1/file2.txt?jtoken=81e79120f42a47ad3da1a2e7999a773b

Now the issue is that the server requires those jtokens to authenticate; if I pass the entire link, including the jtoken, to a downloader (IDM) then the file is downloaded.

Is there a way for rclone to tackle this problem?


#13

The numbers are commits since the release was made, so I wouldn’t expect the number to be lower.

The latest beta right now is

https://beta.rclone.org/v1.45-035-g9cb3a68c-beta/

I don’t think you can do it with the http backend, but you can use rclone copyurl


#14

Hi!

So do I need to call

rclone copyurl "https://example.com/dir1/file1.txt?jtoken=4ba0d4388ff1eafb3671473977a3a4ab" remote:dir1/file1.txt

for each individual file?

I actually have a lot of files (~10k), 200-500 MB each, that need to be transferred, and this process will be very inefficient.

When I use rclone copyurl, --dry-run wasn’t working and it copied the file anyway.

The speed was very low too:

Transferred: 28.406M / 210.926 MBytes, 13%, 339.306 kBytes/s, ETA 9m10s
Errors: 0
Checks: 0 / 0, -
Transferred: 0 / 1, 0%
Elapsed time: 1m25.7s

Using a downloader or a browser to download the file gives ~9-10 MBps; if I can achieve that, then transferring the files one by one might become feasible.

My rclone version is:

rclone v1.45-056-g95e52e1a-beta

  • os/arch: windows/amd64
  • go version: go1.11

PS: I really appreciate you taking the time to reply to my queries. Thanks for your help.


#15

Can you please make a new issue on github about this!

What you can do is use the rclone API to call the copyurl

So you’d run an rclone server with “rclone rcd --rc-no-auth”, then in another window issue

rclone rc operations/copyurl fs=drive: remote=path/to/file.txt url=https://example.com/file.txt

This will stop rclone having to make and remake the drive remote.

You can also do them in parallel if you supply “_async=true” to the command. (You might want to pace them a little otherwise it will do all of them at once!)
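Reading the links from files.txt, a pacing loop could be sketched in Python like this. It's only a sketch under assumptions from the thread: BASE is a stand-in for the real site the jtoken links point at, the destination remote is drive:, and the rc server is assumed to be the default localhost:5572 that “rclone rcd --rc-no-auth” listens on:

```python
import json
import time
import urllib.request

RC_URL = "http://localhost:5572/operations/copyurl"
BASE = "https://example.com/"  # hypothetical site the jtoken links point at

def payload(line):
    """Build the rc parameters for one files.txt line,
    e.g. 'dir1/file1.txt?jtoken=...'."""
    path = line.split("?", 1)[0]   # destination path, without the token
    return {"fs": "drive:", "remote": path,
            "url": BASE + line, "_async": True}

def run(lines, pause=1.0):
    """POST one copyurl job per line to the rc server, with pacing."""
    for line in lines:
        body = json.dumps(payload(line)).encode()
        req = urllib.request.Request(
            RC_URL, data=body,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # queues the async copy on the server
        time.sleep(pause)            # pace the jobs a little

print(payload("dir1/file1.txt?jtoken=4ba0d4388ff1eafb3671473977a3a4ab")["remote"])
```

You'd then call run() on the stripped lines of files.txt while the rcd server is running; the time.sleep is the pacing so all ~10k jobs don't fire at once.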


#16

I’ve created a new issue on github regarding this.

I am not sure I understand how to pace them.

Basically I have a text file with all the links, so do you mean I can run a Python script for this,
and the command is passed to the rclone server?


#17

Thanks

Just put a sleep 1 between each command or something like that.


#19

Thank you very much, it is working.

I have written a script that does just that.
A few issues I am facing:

  • Extremely slow download rate for each file (10~15 KBytes/s)
  • copyurl copies content even if it already exists in the destination (the rcd server gets interrupted at times, so the script ends up re-copying files that were already copied)

It turns out that the problem I was facing before still persists: rclone’s total bandwidth usage is around ~2 Mbits/s upload and ~2 Mbits/s download,
and this is with 10-20 jobs running asynchronously at a time.

Is there a way to increase the download rate for each rclone copyurl?

Also, is it possible to check whether a file exists in the destination for rclone copyurl?
Basically, can the --ignore-existing flag be used with this method?

EDIT:
I checked the logs: out of 200 files that the script tried copying using copyurl,
180 transfers terminated with this error after rclone sent a chunk of data:

2019/01/04 21:48:18 DEBUG : dir/file1.txt: Sending chunk 0 length 8388608
2019/01/04 21:48:18 ERROR : dir/file1.txt: Post request put error: Post https://www.googleapis.com/upload/drive/v3/files?alt=json&fields=id%2Cname%2Csize%2Cmd5Checksum%2Ctrashed%2CmodifiedTime%2CcreatedTime%2CmimeType%2Cparents%2CwebViewLink&uploadType=resumable&upload_id=AEnB2Uo8C69fQscv9amadv4xbnUhN-w0UIWKsWwxLDEQ91Tdi-dPnIZbhBknlWTHVLLqXIzJkJniC69cjwheB9-OkRhJpnjk0Q: stream error: stream ID 763; PROTOCOL_ERROR

Can this error be fixed?

With a run time of 6 hours only ~20 files were transferred, each transfer averaging 10~15 KBytes/sec. :sweat:
Is there anything I can do to improve this?

I really appreciate your help.


#20

Can you try using copyurl to the local disk - how fast does it run then?

How big are the files you are copying?

I think the expectation here should be that rclone does some sort of checking on the file, so if the remote file is the same length then it doesn’t copy it.

Implementing --ignore-existing is probably a good idea too.

That appears to be an HTTP2 error.

Can you run one copyurl (ideally with a small text file) which demonstrates the problem, with -vv --dump bodies?

It would be helpful to have a log to look at (with -vv) of the transfers.


#21

--ignore-existing doesn’t seem to work with rclone copyurl. :sos:
If it can work, I would be grateful if you could tell me how to send the --ignore-existing flag to the rcd. :raised_hands:t2:

About 200~300 MB each; some are larger, ~600 MB.

It has the same effect when saving to disk.
(The website seems to allow only ~150KB per connection, while IDM manages to download from 16 connections, fetching many parts and appending them later.)
(The ~150KB gets distributed among all the files downloading at a time. :cold_sweat:)

I tried with a few text files and the problem didn’t occur.
I think it only happens when I have scheduled too many async jobs to the rcd.

PS: I still appreciate you taking the time to reply. :innocent: