How does rclone determine duplicate files

Hey, I’ve been using rclone to clone upwards of 30TB of files onto GDrive and donated 50 USD in gratitude, and this is the first time I’ve created an account to ask something, as I’m genuinely curious.

As the title says, how does it avoid ignoring files that should be downloaded when they are not actually duplicates? I’m currently cloning an opendirectory hosted by my friend, and it returns more than 100 messages saying “Duplicate object found in source - ignoring”, but he states that he does not have a single duplicate file. I’m worried that rclone will miss out on 100+ files.

Thank you in advance.

Google Drive allows duplicate files (two entries with the same name in the same folder), which most tools don’t handle well. You probably want to dedupe your GD and your source.

https://rclone.org/commands/rclone_dedupe/

What factor does rclone use to determine whether a file is a duplicate? My friend states that he does not have duplicate files, so your statement doesn’t answer my question.

The error message is pretty specific in saying there are duplicate files.

You can run the command with -vv and share the full command and log output and we can see what files are duplicates.

A duplicate file would have the same size and modification time.

rclone copy /etc/hosts GD: -vv
2019/03/21 10:39:32 DEBUG : rclone: Version "v1.46" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2019/03/21 10:39:32 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2019/03/21 10:39:33 DEBUG : hosts: Size and modification time the same (differ by -24.588µs, within tolerance 1ms)
2019/03/21 10:39:33 DEBUG : hosts: Unchanged skipping
2019/03/21 10:39:33 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors:                 0
Checks:                 1 / 1, 100%
Transferred:            0 / 0, -
Elapsed time:       400ms

2019/03/21 10:39:33 DEBUG : 4 go routines active
2019/03/21 10:39:33 DEBUG : rclone: Version "v1.46" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]

2019/03/21 14:48:39 NOTICE: Edexcel/International GCSE/Physics (4PH0)/2012/January/Question-paper: Duplicate directory found in source - ignoring
2019/03/21 14:48:39 NOTICE: IB/Group 5 - Mathematics/Mathematics SL/2004 May Examination Session/French Papers/Mathematical_methods_paper_1_TZ2_SL_French.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:39 NOTICE: IB/Group 5 - Mathematics/Mathematics SL/2004 May Examination Session/French Papers/Mathematical_methods_paper_2_TZ2_SL_French.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International Lower Secondary Curriculum/Science (LSC01)/2014/June/Examiner-report: Duplicate directory found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International Lower Secondary Curriculum/Science (LSC01)/2014/June/Mark-scheme: Duplicate directory found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International Lower Secondary Curriculum/Science (LSC01)/2014/June/Question-paper: Duplicate directory found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International GCSE/History (4HI0)/2011/June/Mark-scheme/Markscheme-Paper1-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: IB/Group 4 - The sciences/Biology HL/2001 May Examination Session/English Papers: Duplicate directory found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper1F-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper2F-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper3H-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper4H-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2001: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2002: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2003: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2004: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2005: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2006: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2007: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2008: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2009: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2010: Duplicate directory found in source - ignoring

Since he is hosting past year papers, I doubt there would be any duplicates.

You didn’t include the full command/output, so I’m missing what you are running and which rclone version you are on, as I asked for that too.

You should run rclone dedupe on the source to see what duplicates there are.

What that means is that you have two files with the identical name and path. This isn’t possible on a normal file system, but it is on Google Drive.

One question: is it possible to dedupe a read-only directory? The opendirectory is hosted on a website.

You can check it, but it won’t be able to remove dupes.

Can you share the config to hit it, or is it private? There could be issues related to encoding, as I thought you were going from GD to GD as well.

rclone copy --http-url "https://paperarchive.space" :http: g10:papers --ignore-existing --transfers 10 --checkers 8 --tpslimit 15 --tpslimit-burst 10 --fast-list -P

There you go; it’s not exactly private, as he posted it on Reddit as well.

Also, may I know what commands I should use to enforce a directory-crawling limit in rclone? I’d like it to crawl a sub-directory, clone everything in that sub-directory, and then proceed to crawl the next sub-directory.

The reason is that the-eye.eu enforces a strict rate limit, and rclone returns a bunch of 429 errors.

Try with the latest beta - the http backend used to put duplicates in the listings, but that was fixed recently.

May I know what the issue with that opendirectory was? I’d like to hear the technical explanation.

Here is a listing with the latest beta

$ rclone lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf

And here is the listing with 1.46 - note the duplicated file names. These are duplicated because there are duplicated links on the web page - the beta removes the duplicates.

$ rclone-v1.46 lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf

How about my question above on limiting the crawl? Thank you in advance.

You could probably use --tpslimit to slow it down, and maybe --max-depth to limit the depth of crawling. Not 100% sure if those work with --http-url.

And may I know whether “rclone copy” actually downloads files from the source before uploading them to the destination? I use rclone because I don’t have to worry about my storage, and it works wonders.

It does download and then upload, unless source and destination are on the same remote and the remote supports server-side copies.


For issues with crawling some opendirectories, may I know if I should report them on GitHub?

I’m unable to scrape https://pastpapers.co/cie/ and https://pastpapers.papacambridge.com/?dir=Cambridge%20International%20Examinations%20(CIE)

Also, scraping https://papers.gceguide.com/ returns files as HTML instead of PDF.

Thank you in advance.