How does rclone determine duplicate files

Hey, I’ve been using rclone to clone upwards of 30TB of files onto GDrive and donated 50 USD in gratitude, and this is the first time I’ve created an account to ask something, as I’m genuinely curious.

As the title says, how does it avoid ignoring files that should be downloaded when they are not actually duplicates? I’m currently cloning an opendirectory hosted by my friend, and it returns more than 100 messages saying “Duplicate object found in source - ignoring”, but he states that he does not have a single duplicate file. I’m worried that rclone will miss out on 100+ files.

Thank you in advance.

Google Drive allows duplicate files (two entries with the same name in the same folder), which most tools don’t handle well. You probably want to dedupe your GD and your source.

https://rclone.org/commands/rclone_dedupe/

What factor does rclone use to determine whether a file is a duplicate? My friend states that he does not have duplicate files, so your statement doesn’t answer my question.

The error message is pretty specific in saying there are duplicate files.

You can run the command with -vv and share the full command and log output and we can see what files are duplicates.

A duplicate file would have the same size and modification time.

rclone copy /etc/hosts GD: -vv
2019/03/21 10:39:32 DEBUG : rclone: Version "v1.46" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2019/03/21 10:39:32 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2019/03/21 10:39:33 DEBUG : hosts: Size and modification time the same (differ by -24.588µs, within tolerance 1ms)
2019/03/21 10:39:33 DEBUG : hosts: Unchanged skipping
2019/03/21 10:39:33 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors:                 0
Checks:                 1 / 1, 100%
Transferred:            0 / 0, -
Elapsed time:       400ms

2019/03/21 10:39:33 DEBUG : 4 go routines active
2019/03/21 10:39:33 DEBUG : rclone: Version "v1.46" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]

2019/03/21 14:48:39 NOTICE: Edexcel/International GCSE/Physics (4PH0)/2012/January/Question-paper: Duplicate directory found in source - ignoring
2019/03/21 14:48:39 NOTICE: IB/Group 5 - Mathematics/Mathematics SL/2004 May Examination Session/French Papers/Mathematical_methods_paper_1_TZ2_SL_French.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:39 NOTICE: IB/Group 5 - Mathematics/Mathematics SL/2004 May Examination Session/French Papers/Mathematical_methods_paper_2_TZ2_SL_French.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International Lower Secondary Curriculum/Science (LSC01)/2014/June/Examiner-report: Duplicate directory found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International Lower Secondary Curriculum/Science (LSC01)/2014/June/Mark-scheme: Duplicate directory found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International Lower Secondary Curriculum/Science (LSC01)/2014/June/Question-paper: Duplicate directory found in source - ignoring
2019/03/21 14:48:40 NOTICE: Edexcel/International GCSE/History (4HI0)/2011/June/Mark-scheme/Markscheme-Paper1-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: IB/Group 4 - The sciences/Biology HL/2001 May Examination Session/English Papers: Duplicate directory found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper1F-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper2F-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper3H-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:41 NOTICE: Edexcel/International GCSE/Mathematics A (4MA0)/2011/June/Examiner-report/Examinerreport-Paper4H-June2011.pdf: Duplicate object found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2001: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2002: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2003: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2004: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2005: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2006: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2007: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2008: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2009: Duplicate directory found in source - ignoring
2019/03/21 14:48:42 NOTICE: CIE/O Level/English - Language (1123)/2010: Duplicate directory found in source - ignoring

Since he is hosting past year papers, I doubt there would be any duplicates.

You didn’t include the full command/output, so I’m missing what you are running and which rclone version you are on, as I asked for that too.

You should run rclone dedupe on the source to see what duplicates there are.

What that means is that you have two files with the identical name and path. This isn’t possible on a normal file system, but it is on Google Drive.

One question: is it possible to dedupe a read-only directory? The opendirectory is hosted on a website.

You can check it, but it won’t be able to remove dupes.

Can you share the config to hit it, or is it private? There could be issues related to encoding, as I thought you were going from GD to GD as well.

rclone copy --http-url "https://paperarchive.space" :http: g10:papers --ignore-existing --transfers 10 --checkers 8 --tpslimit 15 --tpslimit-burst 10 --fast-list -P

There you go; it’s not exactly private, as he posted it on Reddit as well.

Also, may I know what commands I should use to enforce a directory-crawling limit in rclone? I’d like it to crawl a sub-directory, clone everything in that sub-directory, and then proceed to crawl the next sub-directory.

The reason is that the-eye.eu enforces a strict rate limit, and rclone returns a bunch of 429 errors.

Try with the latest beta - the http backend used to put duplicates in the listings, but that was fixed recently.

May I know what the issue with that opendirectory was? I’d like to hear the technical explanation.

Here is a listing with the latest beta

$ rclone lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf

And here is the listing with 1.46 - note the duplicated file names. These are duplicated because there are duplicated links on the web page - the beta removes the duplicates.

$ rclone-v1.46 lsf --http-url "https://paperarchive.space" :http:'Past Papers/AQA/GCSE/Biology (4401)/2017/June'
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-QP-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1FP-W-MS-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-QP-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL1HP-W-MS-JUN17.pdf
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-QP-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2FP-W-MS-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-QP-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL2HP-W-MS-JUN17.PDF
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-QP-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3FP-W-MS-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-QP-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf
AQA-BL3HP-W-MS-JUN17.pdf

How about my question above on limiting the crawl? Thank you in advance.

You could probably use --tpslimit to slow it down, and maybe --max-depth to limit the depth of crawling. Not 100% sure if those work with --http-url.

And may I know whether “rclone copy” actually downloads files from the source before uploading them to the destination? I use rclone because I don’t have to worry about my storage, and it works wonders.

It does download and then upload, unless source and destination are on the same remote and the remote supports server-side copies.


For issues with crawling some opendirectories, may I know if I should report them on GitHub?

I’m unable to scrape https://pastpapers.co/cie/ and https://pastpapers.papacambridge.com/?dir=Cambridge%20International%20Examinations%20(CIE)

Also, scraping https://papers.gceguide.com/ returns files as HTML instead of PDF.

Thank you in advance.