imvedere
(Victor)
March 21, 2019, 2:06pm
1
Hey, I’ve been using rclone to clone upwards of 30TB of files onto GDrive, and I donated 50USD as gratitude. This is the first time I’ve created an account to ask something, as I’m genuinely curious.
As the title says, how do I make sure rclone doesn’t ignore files that should be downloaded when they aren’t actually duplicates? I’m currently cloning an opendirectory hosted by my friend, and it returns more than 100 messages saying “Duplicate object found in source - ignoring”, but he states that he does not have a single duplicate file. I’m worried that rclone will miss 100+ files.
Thank you in advance.
rclone doesn’t support duplicate files on Google Drive. You probably want to dedupe your GD and your source.
https://rclone.org/commands/rclone_dedupe/
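For example (a sketch, assuming your Google Drive remote is named GD: as elsewhere in this thread), you could preview and then resolve duplicates like this:

```shell
# Preview what dedupe would do without changing anything
rclone dedupe --dry-run GD:

# Interactively decide what to do with each set of duplicates
rclone dedupe interactive GD:

# Or resolve automatically, keeping the newest copy in each set
rclone dedupe newest GD:
```

The mode names (interactive, newest, oldest, rename, etc.) are documented on the dedupe page linked above.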
imvedere
(Victor)
March 21, 2019, 2:30pm
3
What factor does rclone use to determine whether a file is a duplicate? My friend stated that he does not have duplicate files, so your statement doesn’t answer my question.
The error message is pretty specific saying there are duplicate files.
You can run the command with -vv and share the full command and log output and we can see what files are duplicates.
A duplicate file would be the same size/modification time.
rclone copy /etc/hosts GD: -vv
2019/03/21 10:39:32 DEBUG : rclone: Version "v1.46" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2019/03/21 10:39:32 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2019/03/21 10:39:33 DEBUG : hosts: Size and modification time the same (differ by -24.588µs, within tolerance 1ms)
2019/03/21 10:39:33 DEBUG : hosts: Unchanged skipping
2019/03/21 10:39:33 INFO :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors: 0
Checks: 1 / 1, 100%
Transferred: 0 / 0, -
Elapsed time: 400ms
2019/03/21 10:39:33 DEBUG : 4 go routines active
2019/03/21 10:39:33 DEBUG : rclone: Version "v1.46" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
1 Like
imvedere
(Victor)
March 21, 2019, 2:51pm
6
Since he is hosting past year papers, I doubt there would be any duplicate.
You didn’t include the full command/output, so I’m missing what you are running and which rclone version you are on, as I asked for both.
You should run rclone dedupe on the source and you can see what duplicates there are.
What that means is that you have two files with an identical name and path. This isn’t possible on a normal file system, but it is on Google Drive.
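One quick way to see which names are duplicated is to pipe a listing through sort and uniq. A minimal sketch (find_dupes is a hypothetical helper name; feed it any rclone lsf output, e.g. the source URL from this thread):

```shell
# find_dupes: print each name that appears more than once on stdin.
# Usage against the source in this thread:
#   rclone lsf -R --http-url "https://paperarchive.space" :http: | find_dupes
find_dupes() {
  sort | uniq -d
}

# Demo on a small inline listing:
printf 'a.pdf\na.pdf\nb.pdf\n' | find_dupes   # → a.pdf
```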
imvedere
(Victor)
March 21, 2019, 4:00pm
8
One question, is it possible to dedupe a read-only directory? The opendirectory is hosted on a website.
calisro
(Rob)
March 21, 2019, 4:22pm
9
You can check it, but it won’t be able to remove dupes.
Can you share the config you use to hit it, or is it private? There could be issues related to encoding, as I thought you were going from GD to GD as well.
imvedere
(Victor)
March 21, 2019, 6:26pm
11
rclone copy --http-url "https://paperarchive.space" :http: g10:papers --ignore-existing --transfers 10 --checkers 8 --tpslimit 15 --tpslimit-burst 10 --fast-list -P
There you go, it’s not exactly private as he posted it on reddit as well.
imvedere
(Victor)
March 21, 2019, 6:29pm
12
Also, may I know what commands I should use to limit rclone’s directory crawling? I’d like it to crawl a sub-directory, clone everything in it, and then proceed to the next sub-directory.
The reason is that the-eye.eu enforces a strict rate limit, and rclone returns a bunch of 429 errors.
ncw
(Nick Craig-Wood)
March 21, 2019, 9:03pm
13
Try with the latest beta - the http backend used to put duplicates in but that was fixed recently.
imvedere
(Victor)
March 21, 2019, 9:05pm
14
May I know what the issue with that opendirectory was? I’d like to hear the technical explanation.
ncw
(Nick Craig-Wood)
March 21, 2019, 9:10pm
15
Here is a listing with the latest beta
$ rclone lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
And here is the listing with 1.46 - note the duplicated file names. These are duplicated because there are duplicated links on the web page - the beta removes the duplicates.
$ rclone-v1.46 lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
imvedere
(Victor)
March 21, 2019, 9:23pm
17
Bumping my earlier question, thank you in advance: what commands should I use to limit rclone’s directory crawling? I’d like it to clone one sub-directory at a time, since the-eye.eu enforces a strict rate limit and rclone returns a bunch of 429 errors.
calisro
(Rob)
March 21, 2019, 9:35pm
18
You could probably use --tpslimit to slow it down, and maybe --max-depth for the depth of crawling. Not 100% sure whether those work with --http-url.
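If --max-depth isn’t enough, a per-directory loop is another option. A sketch, not a tested recipe, reusing the source URL and g10:papers destination from earlier in the thread (the --tpslimit value and sleep interval are guesses to tune):

```shell
# Copy one top-level sub-directory at a time instead of crawling the
# whole tree at once, pausing between directories to ease the 429s.
SRC_URL="https://paperarchive.space"
rclone lsf --dirs-only --http-url "$SRC_URL" :http: |
while IFS= read -r dir; do
  rclone copy --http-url "$SRC_URL" ":http:$dir" "g10:papers/$dir" \
    --tpslimit 5 --transfers 4 -P
  sleep 10   # brief pause between sub-directories
done
```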
imvedere
(Victor)
March 21, 2019, 10:18pm
19
And may I know whether “rclone copy” actually downloads files from the source before uploading them to the destination? I use rclone because I don’t have to worry about my local storage, and it works wonders.
calisro
(Rob)
March 22, 2019, 3:15am
20
It does download and then upload, unless it’s the same remote and the remote supports server-side copies.
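For illustration (GD: is an assumed remote name, source-folder/dest-folder are placeholders):

```shell
# Same remote on both sides: Google Drive can copy server-side, so the
# file data never travels through the machine running rclone.
rclone copy GD:source-folder GD:dest-folder -P

# Different remotes (e.g. an --http-url source to GD:): rclone streams
# the data through the machine running it, but doesn't write it to disk.
```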
1 Like
imvedere
(Victor)
March 22, 2019, 9:57pm
21
For issues with crawling some opendirectories, may I know if I should report them on GitHub?
I’m unable to scrape https://pastpapers.co/cie/ and https://pastpapers.papacambridge.com/?dir=Cambridge%20International%20Examinations%20(CIE)
Also, scraping https://papers.gceguide.com/ returns files as HTML instead of PDF.
Thank you in advance.