User talk:EarwigBot/Copyvios/Exclusions

Protected edit request on 30 October 2014

This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.

1. Please replace all <tt>...</tt> with <code>...</code> or <kbd>...</kbd>, as appropriate on a case-by-case basis.
2. Please replace all of the http:// in the internal Wikipedia sites section with https?://, since Wikipedia uses a secure server.

Thank you. -- Technical 13 (e • t • c) 17:02, 30 October 2014 (UTC)

Should probably ping The Earwig here as well. Thanks again. -- Technical 13 (e • t • c) 17:04, 30 October 2014 (UTC)

Re 1: Done. I just replaced all the tt tags with code tags... not sure what else you would have wanted?
Re 2: "Exceptions are protocol-insensitive (so the rule http://en.wikipedia.org will match https://en.wikipedia.org/wiki/Foo)", so this is not necessary and would break the rules since they are not regular expressions without the re: prefix. — Earwig ^talk 18:53, 30 October 2014 (UTC)Reply
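
A minimal sketch of the rule semantics described in the reply above. It assumes (hypothetically) that a plain rule is compared as a literal prefix of the URL once the scheme is stripped, and that only rules starting with re: are treated as regular expressions. The helper names strip_scheme and rule_matches are illustrative and are not taken from EarwigBot's source.

```python
import re

def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so comparisons ignore the protocol."""
    return re.sub(r"^https?://", "", url, flags=re.IGNORECASE)

def rule_matches(rule: str, url: str) -> bool:
    """Return True if `url` is covered by `rule` (assumed semantics, see lead-in)."""
    if rule.startswith("re:"):
        # Explicit regex rule: search the scheme-less URL.
        return re.search(rule[3:], strip_scheme(url)) is not None
    # Plain rule: literal, protocol-insensitive prefix match.
    return strip_scheme(url).startswith(strip_scheme(rule))

# The plain http:// rule already covers the https:// URL.
print(rule_matches("http://en.wikipedia.org", "https://en.wikipedia.org/wiki/Foo"))
# Writing https?:// in a plain rule is taken literally, so it stops matching.
print(rule_matches("https?://en.wikipedia.org", "https://en.wikipedia.org/wiki/Foo"))
```

Under these assumptions, the requested https?:// change would only be meaningful inside a re: rule, which is the point made in the reply.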

common blacklist?

Hi! I noticed that you're maintaining a blacklist for EarwigBot and the copyvio tool on Labs, and that there's another one, User:EranBot/Copyright/Blacklist, being maintained for User:EranBot, which User:ערן and User:Doc James have been working on lately. Would it be feasible to work from a common blacklist? I noticed a bunch of mirrors (covered by EranBot's blacklist) coming up when I tried out the copyvio tool.--ragesoss (talk) 00:20, 12 December 2014 (UTC)

So does the black list for EarwigBot use the same format and is it also a collection of mirrors of Wikipedia? Doc James (talk · contribs · email) 01:10, 12 December 2014 (UTC)Reply

User:Doc James: The format is a little different (the user page of this Talk page is it), but they are both essentially just lists of regexes, so it might be simple to modify EarwigBot to use the much larger list of mirrors we've compiled so far for EranBot. (I anticipate using the same mirror list for Wiki Ed's plagiarism prevention system.)--Sage (Wiki Ed) (talk) 01:14, 12 December 2014 (UTC)Reply

This is a good idea; I wasn't aware there was another list of mirrors (and I should really watch this page more often!). I've created an issue for it which I'll handle soon. — Earwig ^talk 18:02, 23 January 2015 (UTC)Reply

Done now. The bot uses User:EranBot/Copyright/Blacklist too. — Earwig ^talk 00:19, 27 January 2015 (UTC)Reply

Add http://www.reference.com/

E.g. http://www.reference.com/browse/Ganesha --Redtigerxyz ^Talk 10:52, 26 December 2014 (UTC)Reply

Done. — Earwig ^talk 18:02, 23 January 2015 (UTC)Reply

+

Please change quickiwiki.com/en to quickiwiki.com, as the site also supports other-language Wikipedias (e.g. 1, 2).

— Revi 06:49, 23 January 2015 (UTC)Reply

@The Earwig: Ping. — Revi 06:49, 23 January 2015 (UTC)Reply

Done. — Earwig ^talk 18:02, 23 January 2015 (UTC)Reply

Add http://www.donehealth.com/

--Redtigerxyz ^Talk 18:15, 7 February 2015 (UTC)Reply

Add http://www.questpedia.org

Thanks! --AlessioMela (talk) 11:39, 2 March 2015 (UTC)Reply

Edit request

This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.

Add "url = http://*.gpo.gov". This is the United States Government Printing Office, which prints official versions of US documents; it's all freely licensed as US government publications. Thanks! Kharkiv07 ^Talk 21:35, 3 April 2015 (UTC)

Done Rjd0060 (talk) 23:26, 25 April 2015 (UTC)Reply

Edit request

This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.

Please add "url = http://*.usgs.gov". As with gpo.gov, above, USGS is primarily works of the US government and public domain.

This article creation was flagged as a potential copyvio. TJRC (talk) 00:40, 26 June 2015 (UTC)Reply

Added — Martin (MSGJ · talk) 10:14, 2 July 2015 (UTC)Reply

New exclusions

This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.

Russian Wikipedia mirrors:

rfwiki.org
gruzdoff.ru
dic.academic.ru/dic.nsf/ruwiki/*
www.wikiwand.com

Thanks in advance! --Fastboy (talk) 11:10, 15 August 2015 (UTC)Reply

Done Nakon 22:51, 15 August 2015 (UTC)Reply

More exclusions

This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.

Freely licensed, available under CC 4.0 (https://creativecommons.org/licenses/by/4.0/deed.ru):

Clones:

--Fastboy (talk) 10:23, 18 August 2015 (UTC)Reply

Done, thank you! — Earwig ^talk 03:12, 19 August 2015 (UTC)Reply

Excluding project Gutenberg

We should probably exclude gutenberg.org as a public domain source as well (as I've seen it show up in false positives). Kaldari (talk) 23:18, 24 August 2015 (UTC)Reply

Done; I added gutenberg.org itself only and no subdomains. — Earwig ^talk 00:18, 25 August 2015 (UTC)Reply

rfwiki.org

2 Exclusions. --Pessimist 10:00, 9 September 2015 (UTC)

[1]

Done. — Earwig ^talk 19:26, 13 September 2015 (UTC)Reply

Exclusion

http://wreferat.baza-referat.ru copies articles from Russian Wikipedia.--Arbnos (talk) 18:08, 15 September 2015 (UTC)

Done. — Earwig ^talk 18:32, 15 September 2015 (UTC)Reply

Add http://enzyklo.de

(Search engine) -- FriedhelmW (talk) 18:40, 1 October 2015 (UTC)Reply

Done — Earwig ^talk 20:25, 1 October 2015 (UTC)Reply

Exclusion

http://ensiklopedia.ru/wiki/* copies articles from Russian Wikipedia.--Arbnos (talk) 00:02, 8 November 2015 (UTC)

Done. — Earwig ^talk 00:28, 8 November 2015 (UTC)Reply

Add http://encyclo.co.uk/

Thanks! -- FriedhelmW (talk) 19:01, 8 November 2015 (UTC)Reply

Done. — Earwig ^talk 21:36, 8 November 2015 (UTC)Reply

Add http://research.omicsgroup.org/

This website seems to copy Wikipedia articles in the format hxxp://research.omicsgroup.org/index.php/ARTICLENAME. Example: http://research.omicsgroup.org/index.php/Statue_of_liberty for Statue of Liberty. epic genius (talk) 02:25, 12 November 2015 (UTC)Reply

Done. — Earwig ^talk 10:00, 12 November 2015 (UTC)Reply

Add http://worldebooklibrary.net/articles/

Seems to be found a lot lately. Collect (talk) 23:24, 22 November 2015 (UTC)Reply

Done. — Earwig ^talk 10:23, 23 November 2015 (UTC)Reply

Exclusion

http://wikigraff.ru/ copies articles from Russian Wikipedia.--Arbnos (talk) 11:08, 27 November 2015 (UTC)

Done. — Earwig ^talk 14:53, 27 November 2015 (UTC)Reply

Add http://nosmut.com/

This website seems to copy articles from Wikipedia... I think, although the home page is a bit weird. Anyway, http://nosmut.com/New_York_City_Subway.html, for example, duplicates New York City Subway. epic genius (talk) 04:25, 1 December 2015 (UTC)Reply

Done. A strange site indeed. — Earwig ^talk 04:29, 1 December 2015 (UTC)Reply

richestcelebrities.org

Appears to use Wikipedia for some material - alas. Compare Death of Antonio Calvo to http://richestcelebrities.org/richest-actors/antonio-calco-net-worth/ . Collect (talk) 16:18, 22 December 2015 (UTC)Reply

Also add libreriauniversitaria.it for similar use of Wikipedia articles. Thanks. Collect (talk) 16:22, 22 December 2015 (UTC)

First is done. @Collect: can you give an example of the second copying Wikipedia content? I can't find any. — Earwig ^talk 21:52, 22 December 2015 (UTC)Reply

http://www.libreriauniversitaria.it/tom-riall-betascript-publishing/book/9786133017016 (appears to use material from Betascript Publishing, a repackager of Wikipedia - perhaps the filter should simply look for "betascript"?) should suffice. Collect (talk) 22:24, 22 December 2015 (UTC)Reply

Done. — Earwig ^talk 23:14, 22 December 2015 (UTC)Reply

Excluding some sites

This is probably me not knowing how to use the Copyvio % tool. I am having difficulty with Our Lady of the Good Success; it contains material virtually identical to http://[add www.]fisheaters.com/forums/index.php?topic=3468895.0, but the latter site copies Wikipedia and puts, without comment, https://en.wikipedia.org/wiki/Nuestra_Se%C3%B1ora_del_Buen_Suceso_de_Para%C3%B1aque at the bottom. I tried to add the site to Wikipedia:Mirrors and forks/Mno (section), but this was not accepted as fisheaters.com is blacklisted. So the question I ask, which is due either to me missing something obvious, a bit of documentation that needs adding, or even an actual limitation of the Copyvio tool: How can specific sites be excluded, based on user requirements?

I have recently started using this tool; a scenario I find is that a site which is copied from Wikipedia obviously comes up as the prime suspect, obviously shouldn't be considered, and masks true copyedits from other sites. I would like to be able to click a found site (without going through the procedure of adding it to a permanent list) and for it then to be ignored in subsequent searches. Is this possible somehow? Also, it might make sense to exclude from the search (perhaps controlled by an option) sites which are blacklisted by Wikipedia.

Apologies for taking your time with what is probably user inexperience of the tool, best wishes, and congratulations on a clever and much-needed tool, Pol098 (talk) 12:45, 24 December 2015 (UTC) P.S. I can't save this message with a valid link to fisheaters. com in it! Surely this should be OK on a Talk page?

@Pol098: this will happen frequently for pages that have an established history, where they are widely copied around the Internet. The tool works best on new pages. In this case, I don't usually like adding mirrors to this list that only mirror a single page, because then the list would become too large and unmaintainable. You should be able to simply ignore the mirror result and review any other matches the tool finds, using the direct compare option. — Earwig ^talk 19:22, 24 December 2015 (UTC)Reply

Thanks for the response. I take your point about new pages, which isn't what I've been looking at. I did add one site to your list; I got the impression that it had a lot of material from Wikipedia, not just a one-off (if I remember rightly, I was editing several articles); you may wish to remove it, in which case I apologise for the unwanted addition. As a comment (no need to respond): a page with a 95% match because it's a copy from Wikipedia is a nuisance, and can mask others. The comparison often reports to the effect that no more pages will be compared because there are already a lot of hits. What I would like to see, though it may not be of general use or sensible to implement, is a way to set up a one-off temporary list of sites not to be checked in this particular case, and/or a way to tag listed sites with matches so that they are excluded from a later run. I've found an awful lot of long-standing pages with many edits that are crammed with swathes of copied text. I haven't been seriously using the tool for long, and may be talking nonsense: if so, ignore. Best wishes, Pol098 (talk) 19:48, 24 December 2015 (UTC)

hearplanet.com

http://www.hearplanet.com/article/930747 uses Wikipedia as a source Collect (talk) 09:39, 31 December 2015 (UTC)Reply

Done. — Earwig ^talk 09:47, 31 December 2015 (UTC)Reply

Also note a bunch of users on youtube.com quote Wikipedia a lot - and as youtube is not a reliable source in any event - probably should be excluded. Collect (talk) 09:40, 31 December 2015 (UTC)Reply

I don't think this is a good blanket exclusion. Being a reliable source is mostly irrelevant (ELPEREN aside); it's more about whether the site has the potential for people to copy from it, and I think the answer there is yes. You're probably right that the reverse is much more common, but I'd rather people need to wade through some false positives than get a false negative. — Earwig ^talk 09:47, 31 December 2015 (UTC)Reply

Wondering - could one simply add "wikipedia" to be a marker for not showing a result? Many of these do have "wikipedia" somewhere in their source code <g>. Collect (talk) 09:43, 31 December 2015 (UTC)Reply

Right now it automatically excludes pages that link back to Wikipedia—I skipped just a text search to avoid rare false negatives—but that might be worth looking into more... — Earwig ^talk 09:47, 31 December 2015 (UTC)Reply

Hold on, I noticed hearplanet.com does that already. That's a mistake... will take a look in the morning. — Earwig ^talk 09:48, 31 December 2015 (UTC)Reply

Long morning. Fixed now; auto-excluding pages that link directly back to the article being searched. Still a bit strict, but should help somewhat. — Earwig ^talk 10:26, 15 January 2016 (UTC)Reply
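
A rough sketch of the kind of check described above: excluding a page only when it links directly back to the article being searched. The function name and the href-scanning approach are assumptions for illustration; the tool's actual implementation is not shown in this thread.

```python
import re
from urllib.parse import unquote

def links_back_to_article(html: str, lang: str, title: str) -> bool:
    """Return True if `html` contains an href pointing at the given article."""
    target = f"{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}".lower()
    for href in re.findall(r'href=["\']([^"\']+)["\']', html, flags=re.IGNORECASE):
        # Normalize: decode percent-escapes, drop the scheme, lowercase.
        normalized = re.sub(r"^(https?:)?//", "", unquote(href)).lower()
        # Ignore fragments and query strings when comparing.
        normalized = re.split(r"[#?]", normalized)[0]
        if normalized == target:
            return True
    return False

page = '<p>Source: <a href="https://en.wikipedia.org/wiki/New_York_City_Subway">Wikipedia</a></p>'
print(links_back_to_article(page, "en", "New York City Subway"))
```

A page that merely links to the Wikipedia main page, or to a different article, is not excluded by this check, which matches the "still a bit strict" behaviour mentioned in the comment.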

turnitin

Alas - gives all the false positives which this list tries to avoid - can its results be tweaked to avoid long lists of "99% of words copied"? In fact maybe the folks there should be given the suggestion that once "wikipedia" is referenced in the source, that it not be listed as a violation separately from the Wikipedia violation? Collect (talk) 16:47, 22 January 2016 (UTC)Reply

I'll look into this; I really want to change the way the turnitin output is shown. Ideally we just use it as a source for URLs like the search engine, but the reason the WMF chose not to do that when submitting their patch is that many turnitin results are behind paywalls. — Earwig ^talk 20:14, 22 January 2016 (UTC)Reply

datab.us

http://datab.us/i/Wichita,%20Kansas is very suspiciously like the Wikipedia article (like about 100%) - and gives no attribution. It does use commons images - meaning I have no doubt this is an unattributed copy of Wikipedia. Thanks. Collect (talk) 21:59, 4 February 2016 (UTC)Reply

Added. — Earwig ^talk 00:19, 5 February 2016 (UTC)Reply

wikitree.com

In fact, can you do a general exclusion of all sites beginning with "wiki"? There appear to be a bunch of them, and it might save some time in the exclusion process. Thanks. Collect (talk) 13:06, 14 February 2016 (UTC)

I don't know. Would need to do further research on how often sites with "wiki" in their name are not mirroring Wikipedia, because it could reasonably happen, though because the tool reports what sites were excluded now it's less likely to be an issue. — Earwig ^talk 20:34, 14 February 2016 (UTC)Reply

my-definitions.com

Please add http://my-definitions.com/fr/definition/, which copies a lot of French WP articles (e.g. [2] and analysis [3]). Thank you --Framawiki (talk) 14:52, 1 April 2016 (UTC)

Done. — Ǝɐɹʍıƃ ^ʇɐlʞ 21:15, 1 April 2016 (UTC)Reply

Add http://fr.academic.ru/dic.nsf/frwiki/*

Can you add this website to the exclusion list: http://fr.academic.ru/dic.nsf/frwiki/* ? Thanks! --Bastenbas (talk) 13:39, 2 April 2016 (UTC)

Done. — Earwig ^talk 15:14, 2 April 2016 (UTC)Reply

Add http://lanimalchat.com/index.html

Hello, this website uses content from French Wikipedia. --Bastenbas (talk) 12:41, 3 April 2016 (UTC)

Done. — Earwig ^talk 19:46, 3 April 2016 (UTC)Reply

Gutenberg.us

Uses "World Heritage Encyclopedia" which is Wikipedia as a source. Considered a "sham encyclopedia". Collect (talk) 22:54, 21 May 2016 (UTC)Reply

Done. — Earwig ^talk 04:22, 22 May 2016 (UTC)Reply

livingnewdeal.org/projects/ariel-rios-federal-building-murals-washington-dc/

Uses Wikipedia - but as a URL which might not necessarily be caught in "source Notes". Is the new check for "Wikipedia" on a page going to pick these URLs up? Thanks! Collect (talk) 23:27, 21 May 2016 (UTC)Reply

I prefer to leave these cases for manual review, but maybe I'll try something more greedy. — Earwig ^talk 04:23, 22 May 2016 (UTC)Reply

nekropole.info/en/Cary-Grant

Presumably uses a lot more - the phrase "Creative Commons" might actually be better than a simple look for "Wikipedia" on such sites, as this one only uses that phrase.

www.rusc.com/old-time-radio/Cary-Grant.aspx?t=256 actually credits Wikipedia. Collect (talk) 12:54, 2 June 2016 (UTC)Reply

allstarpics.famousfix.com/pictures/chloe-madeley

Credits "en.wikipedia.org". Collect (talk) 14:45, 3 June 2016 (UTC)Reply

Add youtube

Should youtube be added? It seems quite unlikely that an article would be copied from a youtube description or comment (but many Youtube videos, e.g. trailers and such, use Wikipedia articles in their descriptions) Intelligentsium 13:26, 30 June 2016 (UTC)Reply

I don't think so, per my comments above in #hearplanet.com. Such a match generally warrants investigation. Exclusions are best suited for things that are always mirrors. — Earwig ^talk 13:27, 30 June 2016 (UTC)Reply

Special:Diff/802116377

Will you be able to resolve this false result in which the two links are copying the Wikipedia article? -- 1989 17:14, 24 September 2017 (UTC)Reply

add wikimapia?

http://wikimapia.org/terms_reference.html — Preceding unsigned comment added by Sergkarman (talk • contribs) 16:46, 19 February 2018 (UTC)Reply

https://kids.kiddle.co/Brooklyn_Navy_Yard

@The Earwig: This website copied an earlier version of Brooklyn Navy Yard and actually credits Wikipedia. I'm not sure if there are other articles on the same site that copy from Wikipedia as well. epicgenius (talk) 18:30, 20 October 2018 (UTC)Reply

Thanks, looks like a lot of WP-based articles, added. — Earwig ^talk 18:39, 20 October 2018 (UTC)Reply

Add https://www.govinfo.gov/

Can this be added to the exclusion list? — Preceding unsigned comment added by Pdxdoglover (talk • contribs) 00:03, 21 January 2019 (UTC)Reply

Pdxdoglover (talk) 21:23, 21 January 2019 (UTC)Reply

vk.com

It's a Russian Facebook, users freely copy information from Wikipedia, which skews the Copyvio rate. For example, when I wrote this article in August 2017 https://tools.wmflabs.org/copyvios/?lang=ru&project=wikipedia&title=Пудовкин,_Денис_Евгеньевич , the rate of confidence was 14,5%, but then it rose to 67% after some vk.com user quoted the article extensively in one of her posts in September 2018. Could you add vk.com to the exclusion list, please? Arbeite19 (talk) 12:07, 9 April 2019 (UTC)Reply

https://www.toursandtravel.app/en/points-of-interests/new-york/brooklyn-bridge-park/85

@The Earwig: I think the above page is almost entirely copy-pasted from an early version of Brooklyn Bridge Park without attribution. Compare the article version from 2015 and the above linked website. The only thing the other website did is to remove the "History" section. epicgenius (talk) 01:17, 12 July 2019 (UTC)Reply

Yep, it's copying multiple articles. Added, thanks. — Earwig ^talk 04:15, 12 July 2019 (UTC)Reply

https://www.cruisebe.com/morningside-park-new-york-city-ny

@The Earwig: this looks like it was sloppily copied from a previous version of Morningside Park (Manhattan). It even says at the bottom: Source: https://en.wikipedia.org/wiki/Morningside_Park_(New_York_City) epicgenius (talk) 20:09, 30 July 2019 (UTC)Reply

Added, thanks! — Earwig ^talk 00:48, 31 July 2019 (UTC)Reply

http://www.nyc-architecture.com/HAR/HAR002.htm and related pages

@The Earwig: This website has likely copied old versions of Wikipedia pages without attribution. For instance,

Cathedral of St. John the Divine - nyc-architecture, versus our article in 2007, comparison seen here. The nyc-architecture website copied the footnote number but removed any maintenance tags. There was a 98% match.
Grand Central Terminal - nyc-architecture versus our article in 2007, comparison seen here. The nyc-architecture website still has the reference numbers and "citation needed" tag. There was a 98% match (again).
St. Patrick's Cathedral (Manhattan) - nyc-architecture versus our article in 2007, comparison seen here. The nyc-architecture website still has the "citation needed" tag. There was a 98% match (again).
- More examples of this sort can be found by Google search: https://www.google.com/search?q=%5Bcitation+needed+site%3Anyc-architecture.com&oq=%5Bcitation+needed+site%3Anyc-architecture.com&aqs=chrome..69i57.5797j0j1&sourceid=chrome&ie=UTF-8

It's very likely that the other website copied from Wikipedia, since very few other websites have a need to use the citation needed tag, and since there is such similarity between each of the pages from 2007. Granted, this website still has original content, but I am more concerned about the false positives from Wikipedia. epicgenius (talk) 05:40, 1 December 2019 (UTC)Reply

Added; thanks for your investigation! — Earwig ^talk 06:06, 1 December 2019 (UTC)Reply

http://worddisk.com/wiki/

This appears to be a mirror of Wikipedia without attribution: it even has our main page at http://worddisk.com/wiki/search Caeciliusinhorto-public (talk) 14:49, 16 January 2020 (UTC)Reply

Done Darylgolden^(talk) Ping when replying 04:21, 8 April 2020 (UTC)Reply

Onlineradiobox

This site copied content from Udaya Geetham, and should therefore be excluded from EarwigBot. --Kailash29792 (talk) 17:21, 9 February 2020 (UTC)Reply

Not done. That site is copying from a copyrighted source, so neither should be excluded. Crow^Caw 15:14, 30 April 2020 (UTC)

British and Irish Legal Information Institute (BAILII)

Earwig is flagging articles that quote from Irish Supreme Court case decisions hosted on BAILII. BAILII (here) and the Irish Courts Service (here) allow for direct quotation. British decisions can also be quoted. Could BAILII be removed from Earwig? AugusteBlanqui (talk) 13:42, 30 April 2020 (UTC)Reply

Is there a good subdomain of the site for just the court decisions? In addition to allowing quotes of the court decisions, it also states: "The copyright in the text of legislation and judgments displayed on BAILII's website may belong to courts, other government bodies, judges, and/or to commercial publishers. BAILII cannot authorize any copying of such material." So if we whitelist the whole BAILII site we may miss catching some of those other cases. Crow^Caw 14:10, 30 April 2020 (UTC)

@Crow: This subdomain is safe to whitelist: https://www.bailii.org/ie/cases/IESC/ and this one: https://www.bailii.org/ie/cases/IEHC/ Thanks! AugusteBlanqui (talk) 14:21, 30 April 2020 (UTC)Reply

Added. I note that their re-use policy just says "re-use", which is a little ambiguous. So to avoid any issues, please always quote the text (rather than incorporating it directly) and cite the web sites. But this should stop the Earwig matches. To other CopyPatrol users: this will not stop EranBot from flagging these, which is probably a good thing. Crow^Caw 15:11, 30 April 2020 (UTC)

Thanks @Crow:. We will cite/quote from BAILII. A question, if you don't mind, on how subdomains work for Earwig. So https://www.bailii.org/ie/cases/IESC/ is the landing page for Irish decisions. All the Irish decisions have web addresses that start after the IESC, for example https://www.bailii.org/ie/cases/IESC/2007/S28.html . Is that what it means to whitelist a subdomain? The pages 'below' that IESC address are included (technology not my strongest domain). AugusteBlanqui (talk) 16:43, 30 April 2020 (UTC)Reply

Yes that entry should whitelist everything after the trailing / in the url. If it doesn't, let me know. Thanks! Crow^Caw 16:47, 30 April 2020 (UTC)Reply

Historic American Engineering Record articles hosted on nycsubway.org

@The Earwig: The following pages on nycsubway.org copy from the Historic American Engineering Record, a public domain source, and may bring up false positives.

May I request that only these specific pages be added to the exclusion list? Epicgenius (talk) 18:24, 30 December 2020 (UTC)Reply

Done — The Earwig ^talk 19:15, 30 December 2020 (UTC)Reply

Please add https://handwiki.org

It's a mirror site. Sudonet (talk) 09:13, 7 January 2021 (UTC)Reply

Done — The Earwig ^talk 05:02, 8 January 2021 (UTC)Reply

google-info.org

I thought I'd added google-info.org with this edit 12 April, but amp.en.google-info.org is still sullying the copyvio results (odd that the mighty corporate hand of Google hasn't yet come down to smite them). Should I have done something differently? BlackcurrantTea (talk) 16:19, 18 April 2021 (UTC)Reply

@BlackcurrantTea: I fixed this last week, but didn't notice you had brought it up here. Bug on my end. — The Earwig (talk) 02:24, 14 May 2021 (UTC)Reply

Exclusion

http://wikiorg.ru/wiki/*, because it's a clone of Russian Wikipedia. 78.37.129.71 (talk) 19:43, 13 May 2021 (UTC)

Done. — The Earwig (talk) 02:24, 14 May 2021 (UTC)Reply

please add https://wordsimilarity.com

Could someone please add https://wordsimilarity.com/? It appears to be using Wikipedia directly, in, for example, https://wordsimilarity.com/en/avolition, messing with the Copyvio detector. Thanks!

EDIT: I also found https://eng.ichacha.net/zaoju/ , which seems to be sourcing text from Wikipedia for at least some of its pages, as well as https://en.glosbe.com/, which seems to often source from something called "WikiMatrix" (I haven't really looked into it).

Yitz (talk) 18:33, 14 May 2021 (UTC)Reply

Done, added the three. — The Earwig _alt (talk) 18:48, 14 May 2021 (UTC)Reply

Add spellchecker.net

It's not a mirror, but it seems to scrape random chunks of text from articles. Cheers, Estheim (talk) 22:47, 19 September 2021 (UTC)Reply

Added, thanks. — The Earwig (talk) 01:55, 20 September 2021 (UTC)Reply

Wikipedia:Mirrors and forks

Please add all of the WP mirrors listed under Wikipedia:Mirrors and forks. Thank you Jamplevia (talk) 23:52, 2 November 2021 (UTC)Reply

@Jamplevia, this is already done: read the first line of the exclusions page. If there's a particular mirror you're still getting results for, it may be getting parsed incorrectly by the tool; if so, please indicate which. — The Earwig (talk) 02:52, 3 November 2021 (UTC)Reply

nina.az

Hi,

I've got a question about "url = http://wikipedia.*.nina.az/". The tool still includes URLs starting with wikipedia.de.nina.az (see for example Copyvios Gioia) but the regular expression is supposed to ensure that these URLs are excluded. I didn't add the regex, but can anybody figure out why it doesn't work? I've tried matching the regex and the URL with re.match() in Python and there it worked. Thanks in advance. --CaroFraTyskland (talk) 09:30, 7 November 2021 (UTC)Reply
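
One way to reproduce the local test described above, as a sketch only. It assumes the * in the rule is meant as a shell-style wildcard and uses Python's fnmatch.translate to turn it into a regular expression before calling re.match. How the tool itself interprets * is not stated in this thread, and the function name and the example page path are made up for illustration.

```python
import fnmatch
import re

def wildcard_rule_matches(rule: str, url: str) -> bool:
    """Match `url` against a wildcard `rule`, ignoring the scheme (assumed semantics)."""
    def strip_scheme(s: str) -> str:
        return re.sub(r"^https?://", "", s)
    # Append * so that anything after the rule's trailing slash is accepted.
    pattern = fnmatch.translate(strip_scheme(rule) + "*")
    return re.match(pattern, strip_scheme(url)) is not None

rule = "http://wikipedia.*.nina.az/"
# Hypothetical page path on the host mentioned in the question.
print(wildcard_rule_matches(rule, "https://wikipedia.de.nina.az/Gioia.html"))
```

If a check like this succeeds locally while the tool still reports the URL, the difference is more likely in how the tool parses or normalizes the rule than in the pattern itself.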

Please add https://hmong.ru/

It's another mirror site. Thanks, SamWilson989 (talk) 23:12, 29 May 2022 (UTC)Reply

And https://wiki2.net too 92.242.69.182 (talk) 19:01, 29 June 2022 (UTC)Reply

Additions

If I spot sites that seem to be plagiarising Wikipedia without attribution should I just add them to the list and forget about them or should they be reported somewhere else for the Wikimedia Foundation to lean on?

For context, I was cleaning up Nick Weir and I found:

Both of these seem to be (badly) processed from old versions of our articles, possibly by an AI. Which category should those go in? They are not exactly mirrors. DanielRigal (talk) 15:21, 20 August 2022 (UTC)Reply

Please add http://ikonysrebrnegoekranu.blogspot.com/

This blog (http://ikonysrebrnegoekranu.blogspot.com/2017/01/) contains a nearly plagiarized version of the article from Polish-language Wikipedia (https://pl.wikipedia.org/wiki/Popi%C3%B3%C5%82_i_diament_(film)), which was expanded back in 2012/2013; meanwhile, Copyvio returns a false 96% plagiarism score on the Wikipedia side. Ironupiwada (talk) 12:10, 5 September 2022 (UTC)Reply

Please add https://frwiki.wiki/

This is another mirror site of Wikipedia. Ironupiwada (talk) 12:21, 5 September 2022 (UTC)Reply

Please add https://timenote.info/

Another fork of Wikipedia, it even mentions Wikipedia as a source, which leads to false positive Copyvio results (like here). Ironupiwada (talk) 12:58, 5 September 2022 (UTC)Reply

Please add https://wiki.edu.vn/wiki25/

Fork, it mentions wikipedia as a source. Friniate^talk 15:30, 12 October 2022 (UTC)Reply

Please add latitude.to

Please add latitude.to to the exclusion list. It's not exactly a mirror, more of a Wikipedia link farm, but it still gave me a false positive. It's already on the link spam blacklist, as I found when I tried to add a link to this comment as an example :) Apocheir (talk) 03:13, 8 April 2023 (UTC)Reply

Scrapes of wiki pages

It seems that many of the sites in the blacklist are there because they are scraping the wiki. As these appear and disappear frequently, I suspect this leads to a lot of maintenance workload. I'm wondering if there is not another solution to the problem that is based on back-testing?

The example that led me here is this one for toroidal solenoid:

Earwig's Copyvio Detector indicates a similarity to an article on Zeta (fusion reactor) in Hellenicaworld. Can we be certain that the Toroidal solenoid article predates the Hellenicaworld article? Nolabob (talk) 12:10, 28 July 2023 (UTC)Reply

That Zeta article is a copy of the one I wrote here on the Wiki some years ago. My new article does indeed have bits in common with Zeta, and that is entirely deliberate, both are early UK fusion systems. The copyvio between the two pages here on the wiki is of course suppressed, but not the one with this 3rd party scrape.

It would seem that this could be avoided by testing to see if the external hit is a scrape. In this case, it would match to some very high degree, and thus be "likely a wikipedia scrape". This would require two matches on each possible hit, and I'm not sure what that would do to the performance, but I think it might avoid a lot of false positives? Maury Markowitz (talk) 17:58, 28 July 2023 (UTC)Reply
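
The back-testing idea above could be sketched roughly as follows, using difflib from the Python standard library as a stand-in similarity measure. The function name, the 0.9 threshold, and the idea of passing in a dictionary of candidate article texts are all assumptions for illustration, not a description of how the Copyvio Detector works.

```python
from difflib import SequenceMatcher

def likely_wikipedia_scrape(external_text, candidate_articles, threshold=0.9):
    """Return the title of the first article the external page nearly duplicates, else None.

    `candidate_articles` maps article titles to their plain text.
    """
    for title, text in candidate_articles.items():
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        if SequenceMatcher(None, external_text, text).ratio() >= threshold:
            return title
    return None

articles = {"Zeta": "ZETA was an early fusion experiment built in the United Kingdom."}
external = "ZETA was an early fusion experiment built in the United Kingdom."
print(likely_wikipedia_scrape(external, articles))
```

As the comment notes, this means a second comparison for every external hit, so the cost grows with the number of candidate articles checked per hit.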

Good idea, perhaps at least having a leaderboard with the sorted list of domains by match would be a good start to discover easily new mirrors. Thanks, Framawiki (please notify me when you reply) 19:31, 28 July 2023 (UTC)Reply

@Framawiki: Oh, yes, that might be a great intermediate solution. Maury Markowitz (talk) 21:00, 28 July 2023 (UTC)Reply
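
The leaderboard suggestion could look something like this minimal sketch, which simply counts how often each host appears among reported match URLs. The function name and the input format (a flat list of URLs) are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_leaderboard(hit_urls, top=10):
    """Return (hostname, count) pairs for the most frequently matched hosts."""
    hosts = (urlsplit(url).hostname for url in hit_urls)
    # Skip entries where no hostname could be parsed.
    counts = Counter(host for host in hosts if host)
    return counts.most_common(top)

hits = ["http://a.example/1", "http://a.example/2", "https://b.example/x"]
print(domain_leaderboard(hits))
```

Hosts that rise to the top of such a list would still need manual review before being added, since a frequently matched site is not necessarily a mirror.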

Add https://www.populartimelines.com/

Says on the second search bar that it uses wikipedia as a source. I have also found this article https://medium.com/@populartimelines/timelines-of-famous-people-events-companies-and-more-726de9cb8950, but I'm unsure how reliable this website is. 2001:8F8:1123:D698:493A:CC2:EDDC:5AED (talk) 08:28, 5 November 2023 (UTC)Reply

Done. LittlePuppers (talk) 16:36, 5 November 2023 (UTC)Reply

Add vintageisthenewold.com

Noticed here during a DYK nom. Seems to copy information verbatim from sites including Wikipedia. IceWelder [✉] 09:45, 14 November 2023 (UTC)Reply

@IceWelder: some of their content is definitely copied from WP, but there's also a lot that is not; I'm a bit hesitant to put it on the list because I can't see a good way to isolate just the copied-from-Wikipedia pages. LittlePuppers (talk) 18:38, 14 November 2023 (UTC)Reply

There is no original writing, so if it does contain something from a third-party source that happens to be infringed on, surely the tool would also find the original source? IceWelder [✉] 00:59, 15 November 2023 (UTC)
