imvedere
(Victor)
March 21, 2019, 2:06pm
#1
Hey, I’ve been using rclone to copy upwards of 30TB of files onto GDrive and donated 50 USD out of gratitude, and this is the first time I’ve created an account to ask something, as I’m genuinely curious.
As the title says, how does rclone avoid ignoring files that should be downloaded when they are not actually duplicates? I’m currently copying an opendirectory hosted by my friend, and it returned more than 100 messages saying “Duplicate object found in source - ignoring”, but he stated that he does not have a single duplicate file. I’m worried that rclone will miss 100+ files.
Thank you in advance.
rclone doesn’t support duplicate files, which Google Drive allows. You probably want to dedupe your GD and your source.
https://rclone.org/commands/rclone_dedupe/
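For example, something like this should walk the remote and let you resolve duplicates interactively (just a sketch — GD: is assumed to be your Google Drive remote name, and interactive is the default mode anyway):
rclone dedupe --dedupe-mode interactive GD: -vv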
imvedere
(Victor)
March 21, 2019, 2:30pm
#3
What criteria does rclone use to determine whether a file is a duplicate? My friend stated that he does not have duplicate files, so your statement doesn’t answer my question.
The error message is pretty specific saying there are duplicate files.
You can run the command with -vv and share the full command and log output and we can see what files are duplicates.
A duplicate file would be the same size/modification time.
rclone copy /etc/hosts GD: -vv
2019/03/21 10:39:32 DEBUG : rclone: Version "v1.46" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2019/03/21 10:39:32 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2019/03/21 10:39:33 DEBUG : hosts: Size and modification time the same (differ by -24.588µs, within tolerance 1ms)
2019/03/21 10:39:33 DEBUG : hosts: Unchanged skipping
2019/03/21 10:39:33 INFO :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors: 0
Checks: 1 / 1, 100%
Transferred: 0 / 0, -
Elapsed time: 400ms
2019/03/21 10:39:33 DEBUG : 4 go routines active
2019/03/21 10:39:33 DEBUG : rclone: Version "v1.46" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
imvedere
(Victor)
March 21, 2019, 2:51pm
#6
Since he is hosting past papers, I doubt there would be any duplicates.
You didn’t include the full command/output, so I’m missing what you are running and which rclone version you are running, as I asked for that too.
You should run rclone dedupe on the source and you can see what duplicates there are.
What that means is that you have two files with an identical name and path. This isn’t possible on a normal file system, but it is on Google Drive.
imvedere
(Victor)
March 21, 2019, 4:00pm
#8
One question: is it possible to dedupe a read-only directory? The opendirectory is hosted on a website.
calisro
(Rob)
March 21, 2019, 4:22pm
#9
You can check it, but it won’t be able to remove dupes.
Can you share the config you use to hit it, or is it private? There could be issues related to encoding, as I thought you were going from GD to GD as well.
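If you only want to check the read-only source, one rough way is to list it recursively and look for repeated names — an untested sketch, with the URL as a placeholder for wherever the opendirectory lives, and it will take a while on a big tree:
rclone lsf -R --http-url "https://example.org" :http: | sort | uniq -d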
imvedere
(Victor)
March 21, 2019, 6:26pm
#11
rclone copy --http-url "https://paperarchive.space" :http: g10:papers --ignore-existing --transfers 10 --checkers 8 --tpslimit 15 --tpslimit-burst 10 --fast-list -P
There you go, it’s not exactly private, as he posted it on Reddit as well.
imvedere
(Victor)
March 21, 2019, 6:29pm
#12
Also, may I know which flags I should use to limit rclone’s directory crawling? I’d like it to crawl a sub-directory, copy everything in that sub-directory, and then proceed to crawl the next sub-directory.
The reason is that the-eye.eu enforces a strict rate limit, and it returns a bunch of 429 errors.
ncw
(Nick Craig-Wood)
March 21, 2019, 9:03pm
#13
Try with the latest beta - the http backend used to put duplicates into the listing, but that was fixed recently.
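In case it helps, the beta builds live at https://beta.rclone.org/ - if I remember right, the install script can fetch one directly (a sketch, assuming a Linux box with curl and sudo):
curl https://rclone.org/install.sh | sudo bash -s beta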
imvedere
(Victor)
March 21, 2019, 9:05pm
#14
May I know what the issue with that opendirectory is? I’d like to hear the technical explanation.
ncw
(Nick Craig-Wood)
March 21, 2019, 9:10pm
#15
Here is a listing with the latest beta
$ rclone lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
And here is the listing with 1.46 - note the duplicated file names. These are duplicated because there are duplicated links on the web page - the beta removes the duplicates.
$ rclone-v1.46 lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
imvedere
(Victor)
March 21, 2019, 9:23pm
#17
How about this? Thank you in advance.
Also, may I know which flags I should use to limit rclone’s directory crawling? I’d like it to crawl a sub-directory, copy everything in that sub-directory, and then proceed to crawl the next sub-directory.
The reason is that the-eye.eu enforces a strict rate limit, and it returns a bunch of 429 errors.
calisro
(Rob)
March 21, 2019, 9:35pm
#18
You could probably use --tpslimit to slow it down, and maybe --max-depth for the depth of crawling. Not 100% sure if those work with --http-url.
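Another way to get the one-sub-directory-at-a-time behaviour would be a plain shell loop - a rough, untested sketch, reusing the URL and remote names from your copy command above (swap in whichever site you’re actually crawling):
rclone lsf --dirs-only --http-url "https://paperarchive.space" :http: | while read -r dir; do
  rclone copy --http-url "https://paperarchive.space" ":http:$dir" "g10:papers/$dir" --tpslimit 4 --ignore-existing -P
done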
imvedere
(Victor)
March 21, 2019, 10:18pm
#19
And may I know if “rclone copy” actually downloads files from the source before uploading them to the destination? I use rclone because I don’t have to worry about my local storage, and it works wonders.
calisro
(Rob)
March 22, 2019, 3:15am
#20
It does download and then upload, unless the source and destination are the same remote and the remote supports server-side copies.
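For example (a sketch only - g10: as in your earlier command, and papers-backup is just a made-up destination), this would be copied server-side within Google Drive without the data touching your machine:
rclone copy g10:papers g10:papers-backup -vv
whereas your --http-url copy has to stream every file through the box running rclone before it goes up to Drive.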
imvedere
(Victor)
March 22, 2019, 9:57pm
#21
For issues with crawling some opendirectories, may I know if I should report them on GitHub?
I’m unable to scrape https://pastpapers.co/cie/ and https://pastpapers.papacambridge.com/?dir=Cambridge%20International%20Examinations%20(CIE)
Also, scraping https://papers.gceguide.com/ returns the files as HTML instead of PDF.
Thank you in advance.