User talk:EarwigBot/Copyvios/Exclusions

Protected edit request on 30 October 2014

  1. Please replace all <tt>...</tt> with <code>...</code> or <kbd>...</kbd> as is appropriate on a case-by-case basis.
  2. Please replace all of the http:// in the internal Wikipedia sites section to https?:// since Wikipedia uses a secure server.
Thank you. — Technical 13 (etc) 17:02, 30 October 2014 (UTC)

  • Re 1: Done. I just replaced all the tt tags with code tags... not sure what else you would have wanted?
  • Re 2: "Exceptions are protocol-insensitive (so the rule http://en.wikipedia.org will match https://en.wikipedia.org/wiki/Foo)", so this is not necessary and would break the rules since they are not regular expressions without the re: prefix. — Earwig talk 18:53, 30 October 2014 (UTC)Reply
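The matching behaviour described in the reply above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the function names, the literal-prefix interpretation of plain rules, and the handling of the `re:` prefix are all assumptions made for the example.

```python
import re


def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so matching ignores the protocol."""
    return re.sub(r"^https?://", "", url, flags=re.IGNORECASE)


def is_excluded(url: str, rules: list[str]) -> bool:
    """Return True if any rule covers the URL (hypothetical rule semantics)."""
    target = strip_scheme(url)
    for rule in rules:
        if rule.startswith("re:"):
            # Assumed convention: everything after "re:" is a regular expression.
            if re.match(rule[3:], target):
                return True
        elif target.startswith(strip_scheme(rule)):
            # Plain rules are treated as literal prefixes, not as regexes.
            return True
    return False


# An http:// rule also covers the https:// address.
print(is_excluded("https://en.wikipedia.org/wiki/Foo", ["http://en.wikipedia.org"]))
# Written as a plain rule, "https?://" is taken literally and matches nothing.
print(is_excluded("https://en.wikipedia.org/wiki/Foo", ["https?://en.wikipedia.org"]))
```

Under these assumptions the first call returns True and the second returns False, which is why rewriting the plain rules with `https?://` would break them.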

common blacklist?

Hi! I noticed that you're maintaining a blacklist for EarwigBot and the copyvio tool on Labs, and there's another, User:EranBot/Copyright/Blacklist, being maintained for User:EranBot, which User:ערן and User:Doc James have been working on lately. Would it be feasible to work from a common blacklist? I noticed a bunch of mirrors (covered by EranBot's blacklist) coming up when I tried out the copyvio tool. --ragesoss (talk) 00:20, 12 December 2014 (UTC)

So does the black list for EarwigBot use the same format and is it also a collection of mirrors of Wikipedia? Doc James (talk · contribs · email) 01:10, 12 December 2014 (UTC)Reply
User:Doc James: The format is a little different (the user page of this Talk page is it), but they are both essentially just lists of regexes, so it might be simple to modify EarwigBot to use the much larger list of mirrors we've compiled so far for EranBot. (I anticipate using the same mirror list for Wiki Ed's plagiarism prevention system.)--Sage (Wiki Ed) (talk) 01:14, 12 December 2014 (UTC)Reply
This is a good idea; I wasn't aware there was another list of mirrors (and I should really watch this page more often!). I've created an issue for it which I'll handle soon. — Earwig talk 18:02, 23 January 2015 (UTC)Reply
Done now. The bot uses User:EranBot/Copyright/Blacklist too. — Earwig talk 00:19, 27 January 2015 (UTC)Reply

Add http://www.reference.com/

E.g. http://www.reference.com/browse/Ganesha --Redtigerxyz Talk 10:52, 26 December 2014 (UTC)Reply

Done. — Earwig talk 18:02, 23 January 2015 (UTC)Reply

+

Please modify quickiwiki.com/en to quickiwiki.com, as there is support for other-language Wikipedias, e.g. 1, 2

— Revi 06:49, 23 January 2015 (UTC)Reply

@The Earwig: Ping. — Revi 06:49, 23 January 2015 (UTC)Reply
Done. — Earwig talk 18:02, 23 January 2015 (UTC)Reply

Add http://www.donehealth.com/

--Redtigerxyz Talk 18:15, 7 February 2015 (UTC)Reply

Add http://www.questpedia.org

Thanks! --AlessioMela (talk) 11:39, 2 March 2015 (UTC)Reply

Edit request

Add "url = http://*.gpo.gov". This is the United States Government Printing Office, which prints official versions of US documents; it's all freely licensed as US government publications. Thanks! Kharkiv07Talk 21:35, 3 April 2015 (UTC)

  Done Rjd0060 (talk) 23:26, 25 April 2015 (UTC)Reply

Edit request

Please add "url = http://*.usgs.gov". As with gpo.gov above, USGS content consists primarily of works of the US government and is in the public domain.

This article creation was flagged as a potential copyvio. TJRC (talk) 00:40, 26 June 2015 (UTC)Reply

  Added — Martin (MSGJ · talk) 10:14, 2 July 2015 (UTC)Reply

New exclusions

Russian Wikipedia mirrors:

  • rfwiki.org
  • gruzdoff.ru
  • dic.academic.ru/dic.nsf/ruwiki/*
  • www.wikiwand.com

Thanks in advance! --Fastboy (talk) 11:10, 15 August 2015 (UTC)Reply

  Done Nakon 22:51, 15 August 2015 (UTC)Reply

More exclusions

Freely licensed, available under CC 4.0 (https://creativecommons.org/licenses/by/4.0/deed.ru):

Clones:

--Fastboy (talk) 10:23, 18 August 2015 (UTC)Reply

  Done, thank you! — Earwig talk 03:12, 19 August 2015 (UTC)Reply

Excluding Project Gutenberg

We should probably exclude gutenberg.org as a public domain source as well (as I've seen it show up in false positives). Kaldari (talk) 23:18, 24 August 2015 (UTC)Reply

  Done; I added gutenberg.org itself only and no subdomains. — Earwig talk 00:18, 25 August 2015 (UTC)Reply

rfwiki.org

2 Exclusions. --Pessimist 10:00, 9 September 2015 (UTC)

[1]

Done. — Earwig talk 19:26, 13 September 2015 (UTC)Reply

Exclusion

http://wreferat.baza-referat.ru copies articles from the Russian Wikipedia. --Arbnos (talk) 18:08, 15 September 2015 (UTC)

Done. — Earwig talk 18:32, 15 September 2015 (UTC)Reply

Add http://enzyklo.de

(Search engine) -- FriedhelmW (talk) 18:40, 1 October 2015 (UTC)Reply

  Done — Earwig talk 20:25, 1 October 2015 (UTC)Reply

Exclusion

http://ensiklopedia.ru/wiki/* copies articles from the Russian Wikipedia. --Arbnos (talk) 00:02, 8 November 2015 (UTC)

Done. — Earwig talk 00:28, 8 November 2015 (UTC)Reply

Add http://encyclo.co.uk/

Thanks! -- FriedhelmW (talk) 19:01, 8 November 2015 (UTC)Reply

Done. — Earwig talk 21:36, 8 November 2015 (UTC)Reply

Add http://research.omicsgroup.org/

This website seems to copy Wikipedia articles in the format hxxp://research.omicsgroup.org/index.php/ARTICLENAME. Example: http://research.omicsgroup.org/index.php/Statue_of_liberty for Statue of Liberty. epic genius (talk) 02:25, 12 November 2015 (UTC)Reply

Done. — Earwig talk 10:00, 12 November 2015 (UTC)Reply

Add http://worldebooklibrary.net/articles/

Seems to be found a lot lately. Collect (talk) 23:24, 22 November 2015 (UTC)Reply

Done. — Earwig talk 10:23, 23 November 2015 (UTC)Reply

Exclusion

http://wikigraff.ru/ copies articles from the Russian Wikipedia. --Arbnos (talk) 11:08, 27 November 2015 (UTC)

Done. — Earwig talk 14:53, 27 November 2015 (UTC)Reply

Add http://nosmut.com/

This website seems to copy articles from Wikipedia... I think, although the home page is a bit weird. Anyway, http://nosmut.com/New_York_City_Subway.html, for example, duplicates New York City Subway. epic genius (talk) 04:25, 1 December 2015 (UTC)Reply

Done. A strange site indeed. — Earwig talk 04:29, 1 December 2015 (UTC)Reply

richestcelebrities.org

Appears to use Wikipedia for some material - alas. Compare Death of Antonio Calvo to http://richestcelebrities.org/richest-actors/antonio-calco-net-worth/ . Collect (talk) 16:18, 22 December 2015 (UTC)Reply


Also add libreriauniversitaria.it for similar use of Wikipedia articles. Thanks. Collect (talk) 16:22, 22 December 2015 (UTC)

First is done. @Collect: can you give an example of the second copying Wikipedia content? I can't find any. — Earwig talk 21:52, 22 December 2015 (UTC)Reply
http://www.libreriauniversitaria.it/tom-riall-betascript-publishing/book/9786133017016 (appears to use material from Betascript Publishing, a repackager of Wikipedia - perhaps the filter should simply look for "betascript"?) should suffice. Collect (talk) 22:24, 22 December 2015 (UTC)Reply
Done. — Earwig talk 23:14, 22 December 2015 (UTC)Reply

Excluding some sites

This is probably me not knowing how to use the Copyvio % tool. I am having difficulty with Our Lady of the Good Success; it contains material virtually identical to http://[add www.]fisheaters.com/forums/index.php?topic=3468895.0, but the latter site copies Wikipedia and puts, without comment, https://en.wikipedia.org/wiki/Nuestra_Se%C3%B1ora_del_Buen_Suceso_de_Para%C3%B1aque at the bottom. I tried to add the site to Wikipedia:Mirrors and forks/Mno (section), but this was not accepted as fisheaters.com is blacklisted. So the question I ask, which is due either to me missing something obvious, a bit of documentation that needs adding, or even an actual limitation of the Copyvio tool: How can specific sites be excluded, based on user requirements?

I have recently started using this tool; a scenario I find is that a site which is copied from Wikipedia obviously comes up as the prime suspect, obviously shouldn't be considered, and masks true copyedits from other sites. I would like to be able to click a found site (without going through the procedure of adding it to a permanent list) and for it then to be ignored in subsequent searches. Is this possible somehow? Also, it might make sense to exclude from the search (perhaps controlled by an option) sites which are blacklisted by Wikipedia.

Apologies for taking your time with what is probably user inexperience of the tool, best wishes, and congratulations on a clever and much-needed tool, Pol098 (talk) 12:45, 24 December 2015 (UTC) P.S. I can't save this message with a valid link to fisheaters. com in it! Surely this should be OK on a Talk page?

@Pol098: this will happen frequently for pages that have an established history, where they are widely copied around the Internet. The tool works best on new pages. In this case, I don't usually like adding mirrors to this list that only mirror a single page, because then the list would become too large and unmaintainable. You should be able to simply ignore the mirror result and review any other matches the tool finds, using the direct compare option. — Earwig talk 19:22, 24 December 2015 (UTC)Reply
Thanks for the response. I take your point about new pages, which isn't what I've been looking at. I did add one site to your list; I got the impression that it had a lot of material from Wikipedia, not just a one-off (if I remember rightly, I was editing several articles); you may wish to remove it, in which case I apologise for the unwanted addition. As a comment (no need to respond): a page with a 95% match because it's a copy from Wikipedia is a nuisance, and can mask others. The comparison often reports to the effect that no more pages will be compared because there are already a lot of hits. What I would like to see, though it may not be of general use or sensible to implement, is a one-off temporary list of sites not to be checked in this particular case, and/or a way to tag listed sites with matches so that they are excluded from a later run. I've found an awful lot of long-standing pages with many edits that are crammed with swathes of copied text. I haven't been seriously using the tool for long, and may be talking nonsense: if so, ignore. Best wishes, Pol098 (talk) 19:48, 24 December 2015 (UTC)

hearplanet.com

http://www.hearplanet.com/article/930747 uses Wikipedia as a source Collect (talk) 09:39, 31 December 2015 (UTC)Reply

Done. — Earwig talk 09:47, 31 December 2015 (UTC)Reply

Also note a bunch of users on youtube.com quote Wikipedia a lot - and as youtube is not a reliable source in any event - probably should be excluded. Collect (talk) 09:40, 31 December 2015 (UTC)Reply

I don't think this is a good blanket exclusion. Being a reliable source is mostly irrelevant (ELPEREN aside); it's more about whether the site has the potential for people to copy from it, and I think the answer there is yes. You're probably right that the reverse is much more common, but I'd rather people need to wade through some false positives than get a false negative. — Earwig talk 09:47, 31 December 2015 (UTC)Reply

Wondering - could one simply add "wikipedia" to be a marker for not showing a result? Many of these do have "wikipedia" somewhere in their source code <g>. Collect (talk) 09:43, 31 December 2015 (UTC)Reply

Right now it automatically excludes pages that link back to Wikipedia—I skipped just a text search to avoid rare false negatives—but that might be worth looking into more... — Earwig talk 09:47, 31 December 2015 (UTC)Reply
Hold on, I noticed hearplanet.com does that already. That's a mistake... will take a look in the morning. — Earwig talk 09:48, 31 December 2015 (UTC)Reply
Long morning. Fixed now; auto-excluding pages that link directly back to the article being searched. Still a bit strict, but should help somewhat. — Earwig talk 10:26, 15 January 2016 (UTC)Reply
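The auto-exclusion described above (skip a result page when it links directly back to the article being checked) could look something like the sketch below. It uses only the Python standard library; the class and function names and the normalization steps are assumptions for illustration, not the tool's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(url):
    """Reduce a URL to (host, decoded path) so scheme and escaping are ignored."""
    parts = urlsplit(url)
    return parts.netloc.lower(), unquote(parts.path).replace(" ", "_")


def links_back_to(html, article_url):
    """True if the page contains a link pointing at exactly this article."""
    collector = LinkCollector()
    collector.feed(html)
    target = normalize(article_url)
    return any(normalize(href) == target for href in collector.hrefs)
```

Requiring a link to the exact article, rather than any mention of "wikipedia", is the stricter choice the thread settles on: it avoids false negatives from pages that merely cite Wikipedia somewhere.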

turnitin

Alas - gives all the false positives which this list tries to avoid - can its results be tweaked to avoid long lists of "99% of words copied"? In fact maybe the folks there should be given the suggestion that once "wikipedia" is referenced in the source, that it not be listed as a violation separately from the Wikipedia violation? Collect (talk) 16:47, 22 January 2016 (UTC)Reply

I'll look into this; I really want to change the way the turnitin output is shown. Ideally we just use it as a source for URLs like the search engine, but the reason the WMF chose not to do that when submitting their patch is that many turnitin results are behind paywalls. — Earwig talk 20:14, 22 January 2016 (UTC)Reply

datab.us

http://datab.us/i/Wichita,%20Kansas is very suspiciously like the Wikipedia article (like about 100%) - and gives no attribution. It does use commons images - meaning I have no doubt this is an unattributed copy of Wikipedia. Thanks. Collect (talk) 21:59, 4 February 2016 (UTC)Reply

Added. — Earwig talk 00:19, 5 February 2016 (UTC)Reply

wikitree.com

In fact, can you do a general exclusion of all sites beginning with "wiki"? There appear to be a bunch of them, and it might save some time in the exclusion process. Thanks. Collect (talk) 13:06, 14 February 2016 (UTC)

I don't know. Would need to do further research on how often sites with "wiki" in their name are not mirroring Wikipedia, because it could reasonably happen, though because the tool reports what sites were excluded now it's less likely to be an issue. — Earwig talk 20:34, 14 February 2016 (UTC)Reply

my-definitions.com

Please add http://my-definitions.com/fr/definition/, which copies a lot of French WP articles (e.g. [2] and analysis [3]). Thank you --Framawiki (talk) 14:52, 1 April 2016 (UTC)

Done. — Ǝɐɹʍıƃ ʇɐlʞ 21:15, 1 April 2016 (UTC)Reply

Add http://fr.academic.ru/dic.nsf/frwiki/*

Can you add this website to the exclusions: http://fr.academic.ru/dic.nsf/frwiki/*? Thanks! --Bastenbas (talk) 13:39, 2 April 2016 (UTC)

Done. — Earwig talk 15:14, 2 April 2016 (UTC)Reply

Add http://lanimalchat.com/index.html

Hello, this website uses content from the French Wikipedia. --Bastenbas (talk) 12:41, 3 April 2016 (UTC)

Done. — Earwig talk 19:46, 3 April 2016 (UTC)Reply

Gutenberg.us

Uses "World Heritage Encyclopedia" which is Wikipedia as a source. Considered a "sham encyclopedia". Collect (talk) 22:54, 21 May 2016 (UTC)Reply

Done. — Earwig talk 04:22, 22 May 2016 (UTC)Reply

livingnewdeal.org/projects/ariel-rios-federal-building-murals-washington-dc/

Uses Wikipedia - but as a URL which might not necessarily be caught in "source Notes". Is the new check for "Wikipedia" on a page going to pick these URLs up? Thanks! Collect (talk) 23:27, 21 May 2016 (UTC)Reply

I prefer to leave these cases for manual review, but maybe I'll try something more greedy. — Earwig talk 04:23, 22 May 2016 (UTC)Reply

nekropole.info/en/Cary-Grant

Presumably uses a lot more - the phrase "Creative Commons" might actually be better than a simple look for "Wikipedia" on such sites, as this one only uses that phrase.

www.rusc.com/old-time-radio/Cary-Grant.aspx?t=256 actually credits Wikipedia. Collect (talk) 12:54, 2 June 2016 (UTC)Reply

allstarpics.famousfix.com/pictures/chloe-madeley

Credits "en.wikipedia.org". Collect (talk) 14:45, 3 June 2016 (UTC)Reply

Add youtube

Should youtube be added? It seems quite unlikely that an article would be copied from a youtube description or comment (but many Youtube videos, e.g. trailers and such, use Wikipedia articles in their descriptions) Intelligentsium 13:26, 30 June 2016 (UTC)Reply

I don't think so, per my comments above in #hearplanet.com. Such a match generally warrants investigation. Exclusions are best suited for things that are always mirrors. — Earwig talk 13:27, 30 June 2016 (UTC)Reply

Special:Diff/802116377

Will you be able to resolve this false result in which the two links are copying the Wikipedia article? -- 1989 17:14, 24 September 2017 (UTC)Reply

add wikimapia?

http://wikimapia.org/terms_reference.html — Preceding unsigned comment added by Sergkarman (talk • contribs) 16:46, 19 February 2018 (UTC)

https://kids.kiddle.co/Brooklyn_Navy_Yard

@The Earwig: This website copied an earlier version of Brooklyn Navy Yard and actually credits Wikipedia. I'm not sure if there are other articles on the same site that copy from Wikipedia as well. epicgenius (talk) 18:30, 20 October 2018 (UTC)Reply

Thanks, looks like a lot of WP-based articles, added. — Earwig talk 18:39, 20 October 2018 (UTC)Reply

Add https://www.govinfo.gov/

Can this be added to the exclusion list? — Preceding unsigned comment added by Pdxdoglover (talk • contribs) 00:03, 21 January 2019 (UTC)

Pdxdoglover (talk) 21:23, 21 January 2019 (UTC)Reply

vk.com

It's a Russian equivalent of Facebook; users freely copy information from Wikipedia, which skews the Copyvio rate. For example, when I wrote this article in August 2017 https://tools.wmflabs.org/copyvios/?lang=ru&project=wikipedia&title=Пудовкин,_Денис_Евгеньевич , the confidence rate was 14.5%, but then it rose to 67% after some vk.com user quoted the article extensively in one of her posts in September 2018. Could you add vk.com to the exclusion list, please? Arbeite19 (talk) 12:07, 9 April 2019 (UTC)

@The Earwig: I think the above page is almost entirely copy-pasted from an early version of Brooklyn Bridge Park without attribution. Compare the article version from 2015 and the above linked website. The only thing the other website did is to remove the "History" section. epicgenius (talk) 01:17, 12 July 2019 (UTC)Reply

Yep, it's copying multiple articles. Added, thanks. — Earwig talk 04:15, 12 July 2019 (UTC)Reply

@The Earwig: this looks like it was sloppily copied from a previous version of Morningside Park (Manhattan). It even says at the bottom: Source: https://en.wikipedia.org/wiki/Morningside_Park_(New_York_City) epicgenius (talk) 20:09, 30 July 2019 (UTC)Reply

Added, thanks! — Earwig talk 00:48, 31 July 2019 (UTC)Reply

http://www.nyc-architecture.com/HAR/HAR002.htm and related pages

@The Earwig: This website has likely copied old versions of Wikipedia pages without attribution. For instance,

It's very likely that the other website copied from Wikipedia, since very few other websites have a need to use the citation needed tag, and since there is such similarity between each of the pages from 2007. Granted, this website still has original content, but I am more concerned about the false positives from Wikipedia. epicgenius (talk) 05:40, 1 December 2019 (UTC)Reply

Added; thanks for your investigation! — Earwig talk 06:06, 1 December 2019 (UTC)Reply

http://worddisk.com/wiki/

This appears to be a mirror of Wikipedia without attribution: it even has our main page at http://worddisk.com/wiki/search Caeciliusinhorto-public (talk) 14:49, 16 January 2020 (UTC)Reply

  Done Darylgolden(talk) Ping when replying 04:21, 8 April 2020 (UTC)Reply

Onlineradiobox

This site copied content from Udaya Geetham, and should therefore be excluded from EarwigBot. --Kailash29792 (talk) 17:21, 9 February 2020 (UTC)Reply

  Not done. That site is copying from a copyrighted source, so neither should be excluded. CrowCaw 15:14, 30 April 2020 (UTC)

British and Irish Legal Information Institute (BAILII)

Earwig is flagging articles that quote from Irish Supreme Court case decisions hosted on BAILII. BAILII (here) and the Irish Courts Service (here) allow for direct quotation. British decisions can also be quoted. Could BAILII be removed from Earwig? AugusteBlanqui (talk) 13:42, 30 April 2020 (UTC)Reply

  • Is there a good subdomain of the site for just the court decisions? In addition to allowing quotes of the court decisions, it also states: "The copyright in the text of legislation and judgments displayed on BAILII's website may belong to courts, other government bodies, judges, and/or to commercial publishers. BAILII cannot authorize any copying of such material." So if we whitelist the whole BAILII site we may miss catching some of those other cases. CrowCaw 14:10, 30 April 2020 (UTC)
@Crow: This subdomain is safe to whitelist: https://www.bailii.org/ie/cases/IESC/ and this one: https://www.bailii.org/ie/cases/IEHC/ Thanks! AugusteBlanqui (talk) 14:21, 30 April 2020 (UTC)Reply
  • Added. I note that their re-use policy just says "re-use", which is a little ambiguous. So to avoid any issues, please always quote the text (rather than incorporating it directly) and cite the web sites. But this should stop the Earwig matches. To other CopyPatrol users: this will not stop EranBot from flagging these, which is probably a good thing. CrowCaw 15:11, 30 April 2020 (UTC)
Thanks @Crow:. We will cite/quote from BAILII. A question, if you don't mind, on how subdomains work for Earwig. So https://www.bailii.org/ie/cases/IESC/ is the landing page for Irish decisions. All the Irish decisions have web addresses that start after the IESC, for example https://www.bailii.org/ie/cases/IESC/2007/S28.html . Is that what it means to whitelist a subdomain? The pages 'below' that IESC address are included (technology not my strongest domain). AugusteBlanqui (talk) 16:43, 30 April 2020 (UTC)Reply
  • Yes that entry should whitelist everything after the trailing / in the url. If it doesn't, let me know. Thanks! CrowCaw 16:47, 30 April 2020 (UTC)Reply
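To make the answer above concrete: the two BAILII entries are path prefixes on the same host rather than subdomains in the DNS sense, and "everything after the trailing /" amounts to a prefix test on the URL path. The sketch below shows that reading; the helper names are hypothetical and the tool's real matching code may differ.

```python
from urllib.parse import urlsplit


def split_host_path(url):
    """Return (lowercased host, path) for a URL."""
    parts = urlsplit(url)
    return parts.netloc.lower(), parts.path


def covered_by(url, rule):
    """True if url is on the rule's host and its path starts with the rule's path."""
    host, path = split_host_path(url)
    rule_host, rule_path = split_host_path(rule)
    return host == rule_host and path.startswith(rule_path)


rule = "https://www.bailii.org/ie/cases/IESC/"
# An individual Irish Supreme Court decision sits below the whitelisted prefix.
print(covered_by("https://www.bailii.org/ie/cases/IESC/2007/S28.html", rule))
# Other parts of the site do not, so they would still be checked.
print(covered_by("https://www.bailii.org/uk/cases/UKSC/2010/1.html", rule))
```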

Historic American Engineering Record articles hosted on nycsubway.org

@The Earwig: The following pages on nycsubway.org copy from the Historic American Engineering Record, a public domain source, and may bring up false positives.

May I request that only these specific pages be added to the exclusion list? Epicgenius (talk) 18:24, 30 December 2020 (UTC)Reply

  Done — The Earwig talk 19:15, 30 December 2020 (UTC)Reply

Please add https://handwiki.org

It's a mirror site. Sudonet (talk) 09:13, 7 January 2021 (UTC)Reply

  Done — The Earwig talk 05:02, 8 January 2021 (UTC)Reply

google-info.org

I thought I'd added google-info.org with this edit 12 April, but amp.en.google-info.org is still sullying the copyvio results (odd that the mighty corporate hand of Google hasn't yet come down to smite them). Should I have done something differently? BlackcurrantTea (talk) 16:19, 18 April 2021 (UTC)Reply

@BlackcurrantTea: I fixed this last week, but didn't notice you had brought it up here. Bug on my end. — The Earwig (talk) 02:24, 14 May 2021 (UTC)Reply

Exclusion

http://wikiorg.ru/wiki/*, because it's a clone of the Russian Wikipedia. 78.37.129.71 (talk) 19:43, 13 May 2021 (UTC)

  Done. — The Earwig (talk) 02:24, 14 May 2021 (UTC)Reply

please add https://wordsimilarity.com

Could someone please add https://wordsimilarity.com/? It appears to be using Wikipedia directly, in, for example, https://wordsimilarity.com/en/avolition, messing with the Copyvio detector. Thanks!

EDIT: I also found https://eng.ichacha.net/zaoju/ , which seems to be sourcing text from Wikipedia for at least some of its pages, as well as https://en.glosbe.com/, which seems to often source from something called "WikiMatrix" (I haven't really looked into it).

Yitz (talk) 18:33, 14 May 2021 (UTC)Reply

  Done, added the three. — The Earwig alt (talk) 18:48, 14 May 2021 (UTC)Reply

Add spellchecker.net

It's not a mirror, but it seems to scrape random chunks of text from articles. Cheers, Estheim (talk) 22:47, 19 September 2021 (UTC)Reply

Added, thanks. — The Earwig (talk) 01:55, 20 September 2021 (UTC)Reply

Wikipedia:Mirrors and forks

Please add all of the WP mirrors listed under Wikipedia:Mirrors and forks. Thank you Jamplevia (talk) 23:52, 2 November 2021 (UTC)Reply

@Jamplevia, this is already done: read the first line of the exclusions page. If there's a particular mirror you're still getting results for, it may be getting parsed incorrectly by the tool; if so, please indicate which. — The Earwig (talk) 02:52, 3 November 2021 (UTC)Reply

nina.az

Hi,

I've got a question about "url = http://wikipedia.*.nina.az/". The tool still includes URLs starting with wikipedia.de.nina.az (see for example Copyvios Gioia) but the regular expression is supposed to ensure that these URLs are excluded. I didn't add the regex, but can anybody figure out why it doesn't work? I've tried matching the regex and the URL with re.match() in Python and there it worked. Thanks in advance. --CaroFraTyskland (talk) 09:30, 7 November 2021 (UTC)Reply
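The thread does not establish how the tool interprets the `*` in this rule, so the following is only a way to compare three plausible readings side by side: as a regular expression (the check described above with `re.match()`), as a shell-style wildcard, and as a literal prefix. The example path `/Gioia` is made up for illustration.

```python
import fnmatch
import re

# Rule and URL with the scheme already stripped; the path is hypothetical.
RULE = "wikipedia.*.nina.az/"
URL = "wikipedia.de.nina.az/Gioia"

# 1. Regular expression: ".*" matches any run of characters.
as_regex = re.match(RULE, URL) is not None

# 2. Shell-style wildcard: "*" matches any run; a trailing "*" covers the path.
as_glob = fnmatch.fnmatchcase(URL, RULE + "*")

# 3. Literal prefix: "*" is an ordinary character and must appear verbatim.
as_literal = URL.startswith(RULE)

print(as_regex, as_glob, as_literal)
```

If a rule without the `re:` prefix were compared literally, the third reading would explain a miss even though the regex test succeeds; whether that is what happens here would need to be confirmed against the tool itself.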

Please add https://hmong.ru/

It's another mirror site. Thanks, SamWilson989 (talk) 23:12, 29 May 2022 (UTC)Reply

And https://wiki2.net too 92.242.69.182 (talk) 19:01, 29 June 2022 (UTC)Reply

Additions

If I spot sites that seem to be plagiarising Wikipedia without attribution should I just add them to the list and forget about them or should they be reported somewhere else for the Wikimedia Foundation to lean on?

For context, I was cleaning up Nick Weir and I found:

Both of these seem to be (badly) processed from old versions of our articles, possibly by an AI. Which category should those go in? They are not exactly mirrors. DanielRigal (talk) 15:21, 20 August 2022 (UTC)Reply

Please add http://ikonysrebrnegoekranu.blogspot.com/

This blog (http://ikonysrebrnegoekranu.blogspot.com/2017/01/) contains a nearly plagiarized version of the article from Polish-language Wikipedia (https://pl.wikipedia.org/wiki/Popi%C3%B3%C5%82_i_diament_(film)), which was expanded back in 2012/2013; meanwhile, Copyvio returns a false 96% plagiarism score on the Wikipedia side. Ironupiwada (talk) 12:10, 5 September 2022 (UTC)Reply

Please add https://frwiki.wiki/

This is another mirror site of Wikipedia. Ironupiwada (talk) 12:21, 5 September 2022 (UTC)Reply

Please add https://timenote.info/

Another fork of Wikipedia, it even mentions Wikipedia as a source, which leads to false positive Copyvio results (like here). Ironupiwada (talk) 12:58, 5 September 2022 (UTC)Reply

Please add https://wiki.edu.vn/wiki25/

Fork; it mentions Wikipedia as a source. Friniatetalk 15:30, 12 October 2022 (UTC)

Please add latitude.to

Please add latitude.to to the exclusion list. It's not exactly a mirror, more of a Wikipedia link farm, but it still gave me a false positive. It's already on the link spam blacklist, as I found when I tried to add a link to this comment as an example :) Apocheir (talk) 03:13, 8 April 2023 (UTC)Reply

Scrapes of wiki pages

It seems that many of the sites in the blacklist are there because they are scraping the wiki. As these appear and disappear frequently, I suspect this leads to a lot of maintenance workload. I'm wondering if there is not another solution to the problem that is based on back-testing?

The example that led me here is this one for toroidal solenoid:

Earwig's Copyvio Detector indicates a similarity to an article on Zeta (fusion reactor) in Hellenicaworld. Can we be certain that the Toroidal solenoid article predates the Hellenicaworld article? Nolabob (talk) 12:10, 28 July 2023 (UTC)Reply

That Zeta article is a copy of the one I wrote here on the Wiki some years ago. My new article does indeed have bits in common with Zeta, and that is entirely deliberate, both are early UK fusion systems. The copyvio between the two pages here on the wiki is of course suppressed, but not the one with this 3rd party scrape.

It would seem that this could be avoided by testing to see if the external hit is a scrape. In this case, it would match to some very high degree, and thus be "likely a wikipedia scrape". This would require two matches on each possible hit, and I'm not sure what that would do to the performance, but I think it might avoid a lot of false positives? Maury Markowitz (talk) 17:58, 28 July 2023 (UTC)Reply

Good idea; perhaps at least having a leaderboard with the list of domains sorted by match would be a good start for easily discovering new mirrors. Thanks, Framawiki (please notify me when you reply) 19:31, 28 July 2023 (UTC)
@Framawiki: Oh, yes, that might be a great intermediate solution. Maury Markowitz (talk) 21:00, 28 July 2023 (UTC)Reply
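The two ideas in this thread, back-testing a hit against existing Wikipedia text and ranking domains by match, can be sketched as below. This is a rough illustration only: the 0.9 threshold, the function names, and the use of `difflib` for word-level similarity are all assumptions, and a real implementation would have to consider the performance cost raised above.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

SCRAPE_THRESHOLD = 0.9  # hypothetical cut-off for "likely a Wikipedia scrape"


def similarity(a, b):
    """Word-level similarity ratio between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def likely_scrape(hit_text, wiki_texts, threshold=SCRAPE_THRESHOLD):
    """True if the external page nearly duplicates any of the given wiki texts."""
    return any(similarity(hit_text, text) >= threshold for text in wiki_texts)


def domain_leaderboard(hits):
    """Rank domains by their best match score; hits is an iterable of (url, score)."""
    best = {}
    for url, score in hits:
        domain = urlsplit(url).netloc.lower()
        best[domain] = max(score, best.get(domain, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Domains that repeatedly sit at the top of such a ranking would be natural candidates for manual review before being added to the exclusion list.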

Add https://www.populartimelines.com/

It says on the second search bar that it uses Wikipedia as a source. I have also found this article https://medium.com/@populartimelines/timelines-of-famous-people-events-companies-and-more-726de9cb8950, but I'm unsure how reliable this website is. 2001:8F8:1123:D698:493A:CC2:EDDC:5AED (talk) 08:28, 5 November 2023 (UTC)

  Done. LittlePuppers (talk) 16:36, 5 November 2023 (UTC)Reply

Add vintageisthenewold.com

Noticed here during a DYK nom. Seems to copy information verbatim from sites including Wikipedia. IceWelder [] 09:45, 14 November 2023 (UTC)Reply

@IceWelder: some of their content is definitely copied from WP, but there's also a lot that is not; I'm a bit hesitant to put it on the list because I can't see a good way to isolate just the copied-from-Wikipedia pages. LittlePuppers (talk) 18:38, 14 November 2023 (UTC)Reply
There is no original writing, so if it does contain something from a third-party source that happens to be infringed on, surely the tool would also find the original source? IceWelder [] 00:59, 15 November 2023 (UTC)