Wikipedia talk:Wikidata/2017 State of affairs/Archive 6


"In contrast, Wikidata has sources to most of the statements, these sources are reliable by Wikipedia standards, and the share of these sourced statements increases."

This claim, and ones similar to it, have been made a few times in these discussions. However, somehow these reliably sourced statements seem often to elude me when I actually look at Wikidata items. Using the "random item" selector, I get (excluding categories, templates, ...)

The pattern seems very obvious: if you have a Wikidata item which corresponds with a Wikipedia article, the vast, vast majority of statements is unsourced (or Wikipedia sourced, which is essentially the same). The well-sourced items are bot-created, database-scraped entities which nearly always don't correspond with Wikipedia articles. So for our purposes, to judge whether Wikidata items are well-sourced or not compared to Wikipedia items, the answer has to be a clear "no". Fram (talk) 13:35, 18 September 2017 (UTC)

Well, if something escaped your attention it does not yet mean it does not exist. Nikkimaria, a staunch opponent of Wikidata, brought this reference above in the discussion: [1]. It shows that over 50% of Wikidata statements are sourced to sources different from Wikimedia projects. In fact, the absolute majority of these statements are databases which obey WP:RS. The small minority are junk sources such as Find a Grave. The same graph shows that the share steadily grows. The real problems are that (i) most of these referenced statements come in the items such as Q27970632 (note that the image is sourced to Commons which in this way is a perfectly valid reference, as you yourself argue above, and the reference URL is not sourced because it could be only sourced to itself - and thus the item is only 80% formally reliably sourced though in fact it is 100%) which are not among the first topics we see on Wikipedia. (ii) at some point, a mistake (in my opinion) was made, and bots brought a lot of Wikipedia statements to Wikidata adding "imported from xxx Wikipedia" rather than making an effort to find actual sources. This is being corrected but still can take a couple of years to reduce the share of such sources to a negligible number; (iii) descriptions can not be sourced by construction, and they seem to be the most visible things for the time being. In my opinion, they are useful for working on Wikidata (no opinion for Wikipedia, I do not use the mobile version as I find it extremely inconvenient for editing), but likely not ready for use in Wikipedia.--Ymblanter (talk) 14:17, 18 September 2017 (UTC)
The main thing I got here from Fram was "The well-sourced items are bot-created, database-scraped entities which nearly always don't correspond with Wikipedia articles." Agree, disagree? It's irrelevant (for the discussion we're having on this page) if Wikidata has well-sourced items, if those items aren't likely to be displayed on Wikipedia. - Dank (push to talk) 14:22, 18 September 2017 (UTC)
Yes, I agree. Wikidata currently has 39 million items, and this number can not be maintained by human effort.--Ymblanter (talk) 14:47, 18 September 2017 (UTC)
Well, actually, to make a positive statement, indeed, a lot of statements were imported to Wikidata and labeled as "Wikipedia sourced". However, they, whereas being unsourced, contain the same information Wikipedia already has. (let us not talk about differences between different language versions, this is a small fraction anyway). Restricting this info will likely have no effect on Wikipedia. However, information imported by bots directly from databases is reliably sourced, up-to-date, and typically is not on Wikipedia. Restricting this information is luddism.--Ymblanter (talk) 14:57, 18 September 2017 (UTC)
"However, they, whereas being unsourced, contain the same information Wikipedia already has." No, they have at best the information Wikipedia had at that time. "However, information imported by bots directly from databases is reliably sourced, up-to-date, and typically is not on Wikipedia." The percentage of such information in items with an enwiki article is very, very small (a few specific niche topics excluded). Most bot-included "information" is long lists of scientific articles, newspaper articles, ... or long lists of usually not very interesting (or at worst unreliable) identifiers. Actual, encyclopedia-like information on enwiki subjects is very hard to find among the bot edits, and is usually still the work of humans. "Luddism" is a nice FUD call, but has very little to do with this. Fram (talk) 07:22, 19 September 2017 (UTC)
I do not think you address my point. Indeed, Wikidata contains all kind of info, of which 90% is possibly not even usable on Wikipedia. On the other hand, only a very small share of information in Wikipedia articles can even in principle (current problems aside) be imported from Wikidata. Therefore it is irrelevant what percentage of information from Wikidata is currently used on Wikipedia, and what percentage of Wikipedia information is coming from Wikidata. The relevant question (for Wikipedia) is what information in principle can be imported from Wikidata, and what would be the provenance of this information. There are indeed some "specific niche topics", as you call them, and, indeed, the usage of Wikidata will never go beyond these specific niche topics. This is perfectly ok, because the Default Language Wikipedia is not the only Wikidata reuser (I am not sure it is actually the biggest reuser). Therefore, for example, it is irrelevant that references are stored in separate items on Wikidata, and these items will never have Wikipedia articles about them. What is relevant is that some (a very small fraction) of these sources are used on Wikipedia, and that they can be directly loaded to any Wikipedia page (which will become more complicated after {{Cite Q}} is deleted). And these references were imported by bots from databases, and I do not see (apart from some technical problems which are fixable) how one can claim that these sources are not reliable or have unclear provenance (actually if they have unclear provenance they can be deleted according to Wikidata policies).--Ymblanter (talk) 07:39, 19 September 2017 (UTC)
What's the point of writing an article, using a source, and then needing to go to Wikidata to check if that source is perhaps available there as well, and then try to get the source to be displayed correctly through some template (which in most current cases was problematic or simply wrong)? All of this because perhaps someday the reference needs a change, and then if the source is used on multiple pages it would be faster to do this on Wikidata, if you are lucky. When you couple this with the effect that the page gets more opaque for casual editors, making it harder to correct a source in many cases, I don't see where the benefit is supposed to be. That's not luddism, that's comparing pros and cons and drawing a different conclusion than you and some others do. Fram (talk) 08:22, 19 September 2017 (UTC)
References might not be the best example because they indeed change very seldom if ever. There are, however, two issues. Indeed, references are typically added by humans in Wikipedia articles, and often it would be inappropriate to add them automatically and by bots. There are examples however when references are quite appropriate to add by bot, for example, for listed buildings which are referenced to national databases of protected heritage. Especially if these databases are not in English, it is unlikely that references would be added by editors en masse, but they can be added by bots no problem from Wikidata, and I do not see any issues with this automatic addition. Furthermore, there are other things which change on a regular basis, are used in Wikipedia, and are a pain in the ass to update by humans. An example is population; there are some issues with European databases (license incompatibility), but for the US and Canada I believe Wikidata is reliable (or, if I am wrong, it can be easily made reliable), and then the population data in zillions of Wikipedia articles could be either read directly from Wikidata, or updated by bots on a regular basis.--Ymblanter (talk) 09:13, 19 September 2017 (UTC)
I'm not a "staunch opponent of Wikidata". I think that in theory the concept has merit. What I am an opponent of is poor sourcing practices and standards, as demonstrated in discussions like this one or this. And unfortunately, that graph underestimates the number of poorly sourced statements. On the one hand, it includes statements that this wiki would term BLUE, or that are truly self-sourcing. On the other, it includes hundreds of Wikia references, thousands for IMDb, etc. Nikkimaria (talk) 14:35, 18 September 2017 (UTC)
My apologies for labeling you as a Wikidata opponent, this was my impression which I got from your statements I have seen on Wikidata and here. Anyway, I fully agree that statements sourced to junk (such as Wikia) should not propagate anywhere. It is not difficult to organize, and, in my opinion, this would be a much more constructive approach than "OMG! They have citations to Wikia!! They are nowhere close to reliable!!! We should prohibit any propagation of data from Wikidata to the (default language) Wikipedia".--Ymblanter (talk) 14:51, 18 September 2017 (UTC)
The English Wikipedia includes thousands of Wikia references and (at least) tens of thousands of IMDb references when measured this way. Regards, Tbayer (WMF) (talk) 14:54, 18 September 2017 (UTC)
No, they include thousands of links to these sites. Have you even looked at what you linked? In your first 100 so-called "Wikia references", there is 1 (ONE) mainspace article, Wikia itself. Looking at the bottom of your list, these are external links, not references. These are comparable to the identifiers Wikidata sees fit to include to Freebase, Quora, Findagrave... but are not what we are discussing here. For someone so eager to dismiss all Wikidata criticism for perceived failures, you really should do a better job in presenting your side of the equation. Fram (talk) 15:16, 18 September 2017 (UTC)
My claim is not that enwiki is perfect with regards to sourcing by any means. However, (a) many of those links are not used as sources, and (b) enwiki does have policies and practices that allow poor sources, and material supported by poor sources, to be removed, whereas at the moment Wikidata does not. The major sticking point as I see it isn't that Wikidata has unsourced or poorly sourced statements, it's that there seems to be a widespread culture that does not perceive that as a problem to be addressed, and that there are no agreed-upon and enforced site policies for addressing it. Nikkimaria (talk) 15:35, 18 September 2017 (UTC)
Wait, Fram, so you are proclaiming the outcome of a comparison between Wikidata and Wikipedia ("to judge whether Wikidata items are well-sourced or not compared to Wikipedia items") without having examined any statements on Wikipedia? What's the percentage of sourced statements there? Speaking as a volunteer Wikipedia editor who has put a lot of work into enforcing sourcing (you can find many thousand edit summaries in my contributions list that refer to WP:V) I continue to be surprised at some of the assumptions about Wikipedia in the discussions on this page. Regards, Tbayer (WMF) (talk) 14:54, 18 September 2017 (UTC)
Feel free to present your case wrt Wikipedia sourcing. Does Wikidata even have a "fact" tag or anything similar? Lambda-CDM model, one of the Wikidata items above, clearly has a lot more references here than on Wikidata. Multiracism, the other one with an enwiki article, also has relevant, reliable references. Do all articles here have those? Of course not. But at least I don't proclaim the superiority of Wikipedia based on some inflated figures which fall apart when you take a closer look at them. Using Wikidata in Wikipedia items "because more than half of the Wikidata items are reliably sourced" (to paraphrase some statements) is starting from a completely wrong position. Fram (talk) 15:23, 18 September 2017 (UTC)

Does anyone have any idea what happened between 24 July 2017 and 21 August 2017[2]? In these 28 days, Wikidata got 4 million new items, and nearly doubled the number of "reliable" sources (i.e. "Statements referenced to other sources") from 55 million to 102 million statements.

This means that in 28 days, 47 million statements were sourced (not to Wikipedia). I can't immediately find info on which bots are responsible for most of this (assuming the numbers are correct). It doesn't seem to be Succubot (the most active bot), which doesn't add sources to statements usually apparently. Botninja isn't the culprit either. Neither is ValterVBot, Dexbot, Krbot, ... Fram (talk) 15:10, 18 September 2017 (UTC)
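The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch using the rounded numbers given in this thread (55 million and 102 million referenced statements, 28 days); the variable names are illustrative and the per-day average assumes, without evidence, that the additions were spread evenly over the period.

```python
# Rounded figures as quoted in the discussion above (not re-derived from the stats page).
before = 55_000_000   # statements referenced to other sources, 24 July 2017
after = 102_000_000   # same measure, 21 August 2017
days = 28

added = after - before    # newly sourced statements in the period
per_day = added / days    # average per day, assuming an even spread
growth = after / before   # factor by which the count grew

print(f"added: {added:,}")
print(f"average per day: {per_day:,.0f}")
print(f"growth factor: {growth:.2f}")
```

A growth factor of about 1.85 is consistent with "nearly doubled", and the daily average of roughly 1.7 million sourced statements suggests one or more large batch imports rather than routine editing.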

Doing the above again, but now only for items that also have an Enwiki article.

So, a small sample of 5 items again confirms that, for items which have an enwiki article, sourcing on Wikidata is on average considerably worse than on enwiki. Fram (talk) 16:35, 18 September 2017 (UTC)

Vandalism on Wikidata

Above, user:Jkatz (WMF) said

"Regardless of our progress, independent research is also beginning to provide us with a more thorough understanding of vandalism on Wikidata. A new study found that vandalism rates on Wikidata have decreased significantly in recent years (from 2013 to 2016) and are quite low: ”... -lately- only between 0.1% and 0.2% of [human edits on Wikidata] are malicious”."

Technically, this may be correct and seem impressive. However, when I actually look at Wikidata recent changes, it seems to me that a) this is an underestimation, b) this doesn't take into account the median time between vandalism and revert, which seems to be a lot higher than on e.g. enwiki, and c) this doesn't take into account the completely different nature of editing on Wikidata.

On enwiki and on Wikidata, a vandal often makes a quick, small vandal edit. On enwiki, a positive edit is often relatively large though. On Wikidata, by its nature, all edits are "small" edits, and if you want to create an item you need to make many edits. Many positive edits I encounter are semi-automated, repeated tasks, with uncontroversial and often hardly worthwhile data. An edit like this is not worth vandalizing.

To put it differently: an "article" on Wikidata consists of many properties, but only a few are worthwhile for a vandal. The positive edits vs vandal edit rate there is much closer to what we are used to on enwiki, and the reversion times are usually much slower.

Looking at 100 recent changes by IPs, made between 13.45 and 14.20, I note many edits I can't judge (Chinese labels and the like), some which seem positive, and at least 25 vandalism edits. This would mean that in the same period, some 12,000 to 25,000 positive edits must have been made to get the cited percentage (and this assumes that no registered editors made vandal edits). Now, we may well reach this figure, but in this period, approximately 95% of the edits was made by one editor using "Quickstatements", making her (Harmonia Amanda) essentially a maintenance bot.

  • [4] 2 edits first vandalizing and then removing correct information
  • [5] 2 edits vandalizing already incorrect information, which were then semi-reverted, semi-vandalized by [6] 2 further IP edits
  • [7] 8 blatant vandalism edits
  • [8] 4 edits vandalizing Internet
  • [9] 7 vandal edits, reverted 17 minutes later by Izno (thanks)
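The "12,000 to 25,000" estimate above follows from dividing the observed vandal edits by the cited vandalism rate. A minimal sketch of that calculation, assuming the 25 observed IP edits were the only vandalism in the period (as the comment itself assumes); the function name is illustrative and not taken from the cited study:

```python
def edits_needed(vandal_edits: int, vandal_rate: float) -> float:
    """Total human edits required for `vandal_edits` to make up `vandal_rate` of them."""
    if not 0 < vandal_rate <= 1:
        raise ValueError("vandal_rate must be in (0, 1]")
    return vandal_edits / vandal_rate


observed = 25                           # vandal edits counted in the sample above
low = edits_needed(observed, 0.002)     # if 0.2% of human edits are malicious
high = edits_needed(observed, 0.001)    # if 0.1% of human edits are malicious

print(f"{low:,.0f} to {high:,.0f} total edits")
```

With the cited range of 0.1% to 0.2%, this gives 12,500 to 25,000 total edits, which is where the rough "12,000 to 25,000" figure comes from.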

I know, anecdote isn't data, but I have looked at Wikidata way too much recently, and this rate (and low reversion rate) seems quite typical.

So please, take all statements about the low vandalism rates on Wikidata with a very big grain of salt. Wikidata has a small group of dedicated editors using tools to make small changes very rapidly, and apart from that a very small group of more "regular" editors who are at the moment unable to adequately, rapidly deal with vandalism. This is not the fault of those people who actually do vandalism reversions, and this section is not intended to blame anyone. But the end result is that, compared to the number of editors, Wikidata has actually a high vandalism rate, which doesn't get reverted very fast (I often see easy-to-spot vandalism remaining for hours and days). Fram (talk) 14:40, 14 September 2017 (UTC)

Seems like the issue may be more "an overly large proportion of Wikidata pages contains vandalism at a given point of time". Jo-Jo Eumerus (talk, contributions) 14:44, 14 September 2017 (UTC)
As you say, this is anecdotal. Would a comparative analysis of the vandalism and revert rates between Wikidata and English Wikipedia be helpful in <s>putting your mind at rest</s> addressing your concerns, as I perceive them, over vandalism and revert rates? I don't know if I can make it possible to do so, but I can inquire. CKoerner (WMF) (talk) 16:05, 14 September 2017 (UTC)
This reply is already problematic, in its assumption of the desired conclusion ("put your mind at rest") before even doing the investigation that might lead to that conclusion. It suggests that the investigation will be reported by cherry-picking whatever leads to putting your mind at rest, rather than in a properly unbiased way. Again, why is WMF trying to promote the invasive use of a sister project here? —David Eppstein (talk) 16:09, 14 September 2017 (UTC)
Well, revert ratios don't tell us how long vandalism stays around before it's caught, so that wouldn't help (and what about vandalism that wasn't reverted yet). Jo-Jo Eumerus (talk, contributions) 16:17, 14 September 2017 (UTC)
I was asking Fram, and others, if this work would be helpful to bring clarity where some lacks. I don't want to make a request if it's not desired as it could take a substantial amount of work. I've clarified my statement. CKoerner (WMF) (talk) 16:22, 14 September 2017 (UTC)
Here are some examples of longstanding, gross Wikidata vandalism:
  • [10], [11]. From May 2013 to June 2015, the description of Larry Silverstein on Wikidata read, "In May of 2013, Larry and his pal Tom, confessed to the implosion of the WTC, and later were sentenced to multiple life terms in prison." instead of "American business man". The edit that vandalised the entry conveniently had the full text of that BLP violation in its edit summary.
  • [12][13] For five months in 2014, Wikidata gave "Adolf Hitler" as an alias of Franklin D. Roosevelt.
  • I promised myself that I would be able to find a new bit of similar nonsense on Wikidata in less than five minutes. I did. The Ozzy Osbourne entry on Wikidata states that "Dross" is an alias of Osbourne's. (Dross Rotzank is a YouTuber who, it appears, looks a bit like Osbourne.) That alias has been on Wikidata since early December 2016.
There isn't the remotest chance that any of those howlers could have stood unchallenged for that long in the respective Wikipedia biographies. Do you acknowledge that? Andreas JN466 19:14, 14 September 2017 (UTC)
I think part of the reason is that not enough people see these vandalised entries (and of course not enough people do anti-vandal work on Wikidata). If enough people add the code at d:User:Yair rand/WikidataInfo.js to their Special:Mypage/common.js (or something equivalent), we might have a chance to notice Wikidata vandalism more easily. —Kusma (t·c) 20:23, 14 September 2017 (UTC)
I'm sorry if Fram, or others, got the impression I was ignoring the list. I see the examples and understand folks consider them to be egregious. I should have acknowledged that before suggesting action items. A list of vandalism isn't very instructive if I have nothing to compare it to. Is it because we don't have as many eyes on Wikidata? Is it worse than what we'd expect here on English Wikipedia? I don't know. Folks telling me it is concerning is useful to know. I'm suggesting maybe we find out more together. CKoerner (WMF) (talk) 21:04, 14 September 2017 (UTC)
User:CKoerner (WMF), how can you ask "Is it worse than what we'd expect here on English Wikipedia?" Do you know that little about Wikipedia? There are plenty of WMF blog posts pointing out that gross vandalism in the English Wikipedia is usually quickly removed. Even if some of those blog posts are a little over-confident about Wikipedia's ability to spot and remove such vandalism, is it not very, very obvious that the situation in Wikidata is far worse than what you'd find in Wikipedia? Gross nonsense, including nonsense with horrifying BLP implications, lasts for months and years on Wikidata, even in items about public figures that are household names.
Much the same vandalism that happened about Larry Silverstein on Wikidata happened here on Wikipedia as well: see [14] and then step through the following 20 edits or so to compare the response in Wikipedia, ending here, to that in Wikidata. Andreas JN466 21:24, 14 September 2017 (UTC)
To be honest, I come across vandalism on the English Wikipedia which stays in the articles for years on a regular basis. Just one example: this edit from November 2011 was undetected; 13 users in good standing (plus one bot) edited the article as if nothing had happened, until I found and reverted it in November 2012 [15].--Ymblanter (talk) 06:57, 15 September 2017 (UTC)
It happens, but most of the vandalism gets reverted within minutes. On Wikidata, this is within hours instead. Fram (talk) 08:44, 15 September 2017 (UTC)
This is correct, but since there is much more vandalism on Wikipedia than on Wikidata, I suspect that there is still more vandalism which stays for hours on Wikipedia than on Wikidata. It would be great to have some quantitative estimates.--Ymblanter (talk) 08:52, 15 September 2017 (UTC)
What you had there, Ymblanter, was someone deleting a couple of lines of text and a heading in a text on Greek mythology. No one accused anyone of being Hitler, or of having blown up the World Trade Center, and caused that to be displayed at the top of mobile readers' screens for a year. --Andreas JN466 11:16, 15 September 2017 (UTC)
I find daily on my watchlist in the morning vandalism, sometimes BLP vandalism, which was added during the night and stays there for hours until I revert it. I believe Hitler in the description on Wikidata would be blocked by an edit filter though I might be wrong on this one.--Ymblanter (talk) 13:56, 15 September 2017 (UTC)
One problem with Wikidata surely is how prominently the vandalism is displayed. If you search for the European Commissioner article on the Spanish Wikipedia mobile site right now, you see "caquita" ("little shit") displayed in the search drop-down, and then displayed again right under the article title when you read the article itself. A site that's supposed to serve as a central repository informing lots of other WMF and non-WMF projects needs to be a bit more robust and reliable in my view. Andreas JN466 15:17, 15 September 2017 (UTC)
Andreas, I find it quite surprising to suddenly see you coming out as a defender of the English Wikipedia's quality and vandalism resistance, considering that for many years you/Wikipediocracy have been advocating exactly the opposite opinion (as can be seen, for example, in various articles that are still linked on your user page), highlighting apparently every case of Wikipedia hoaxes, vandalism and BLP violations that you could get hold of.
I think your conclusions about Wikipedia back then had the same flaw that your conclusions about Wikidata have now: Cherry-picked example edits are not a substitute for systematic evaluations of the quality of the content as seen by the reader.
Regards, Tbayer (WMF) (talk) 22:27, 15 September 2017 (UTC)
I mentioned above (and elsewhere) that in my view WMF's public communications have had an unfortunate tendency to overstate the reliability of Wikipedia, encouraging blind trust in readers rather than critical reading, including checking of references before propagating Wikipedia content and so forth. In my view, this has been an actual dereliction of duty by the WMF, which is harming the integrity of human knowledge. Wikipedia doesn't need credulous readers and re-users, it needs critical and discerning readers and re-users who understand and are alert to the inherent vulnerabilities of Wikipedia's crowdsourcing process.
But whatever problems Wikipedia has, Wikidata raises them to a whole new level by importing masses of data from Wikipedia projects, many of which have a far less developed culture than the English Wikipedia and some of which (such as the Croatian, Kazakh and Azeri Wikipedias) have known, very significant problems that the WMF also keeps mum about in public. Wikidata lacks robust standards (or any standards, really) for reliable sourcing and BLP content. Moreover, as a number of people on this page have pointed out – convincingly to my mind – vandalism control on Wikidata isn't well developed. That's inherent in Wikidata's ambitious scope: people are adding content in dozens of languages that's then prominently displayed to mobile readers. A vandalism patroller on Wikidata has to be able to spot vandalism in Arabic, Hebrew, Pashto and Turkish as well as French, Spanish, Portuguese and English and dozens of other languages ... which makes the task a lot more time-consuming unless you're fluent in 55 languages. And I don't think there are equivalents to ClueBot NG (which does a brilliant job in en:WP) in many of these languages. Those are problems Wikidata needs to address. The pushback you're seeing here is in good part a result of the fact that Wikidata has been struggling to do so to date. --Andreas JN466 09:52, 16 September 2017 (UTC)
Well first, at least with regard to the use of Wikidata descriptions in the app (that has been the focus of contention here), these concerns seem unrelated - the app won't display Pashto Wikidata descriptions above English Wikipedia articles. In general, I personally agree that there are some interesting questions here, but I would also be curious how widespread such issues are in practice - have you observed actual mass imports of problematic BLP statements from, say, the Kazakh Wikipedia into Wikidata, or is this more a theoretical concern so far? And concerning patrolling, keep in mind that most of the information on Wikidata is language-agnostic (it's meant to be machine-readable, after all), and the language-dependent information (descriptions and labels, in particular) is in a very restricted format that makes it easier to patrol. Speaking as a volunteer editor who has been a fairly active RC patroller on Wikidata for over a year now, here are some vandalism reverts I have made for descriptions in languages I don't speak: [16] [17] [18]. That said, for this slice of content on Wikidata that is language-dependent, it's obviously more efficient if patrollers can focus on changes in the languages they are fluent in, and earlier this year some tools became available for that. Regards, Tbayer (WMF) (talk) 13:13, 16 September 2017 (UTC)
The concerns are related in that it takes a patroller far longer to identify vandalism in a language they don't speak. It means that patrollers are stretched thinner, for all languages. If you spend a few minutes figuring out whether נבוא גלוון המלך של סין is vandalism or not, that means you're missing a number of edits in other languages. As for mass imports from Wikipedias known to be politically corrupted, this is something the WMF could profitably study and keep an eye on long-term (as indeed I have argued WMF and/or partners should study the political content of the Wikipedias themselves that are known to have, or suspected of having, problems around political freedom, with the resulting findings reported to the public). Beyond that, I think it is self-evident that many minor-language Wikipedia projects have sourcing standards that are at least a decade behind those of the English Wikipedia. I can't see how the content imported to Wikidata from these Wikipedias can remain unaffected by this. Regards. --Andreas JN466 14:25, 16 September 2017 (UTC)
Incidentally, Tilman, is a relative of ClueBot even used to spot English-language vandalism on Wikidata? And are there any other bots designed to revert vandalism in other languages? Something like "caquita" or "pringao" even going through as a description (which is then prominently displayed for mobile users) without instantly being bot-reverted, as it would be here, leads me to think that there isn't one for Spanish at least, but do correct me if I am labouring under a misconception. It's always better to have agreed facts on the table. --Andreas JN466 11:02, 16 September 2017 (UTC)
See this discussion which I started several days ago. Apparently, there is interest in developing such a bot, but it has not happened yet.--Ymblanter (talk) 11:07, 16 September 2017 (UTC)
Andreas, to quote from Jon's statement above:
"Here are some steps particularly around the area of vandalism: [...] Introduction of anti-vandalism bots making automated reverts of edits that are very likely to be vandalism"
The bot linked there has been up and running, here are some of its recent auto-reverts: [19] [20] [21]
Regards, Tbayer (WMF) (talk) 13:13, 16 September 2017 (UTC)
These are good reverts, but the system's level of sophistication still seems to lag considerably behind ClueBot NG, judging by the obvious cases still slipping through. And can't ClueBot NG be adapted to run on WD? There may be problems with that idea that I'm ignorant about, but I'm surprised this wasn't done years ago. Best, --Andreas JN466 14:25, 16 September 2017 (UTC)

Hi Fram, there are some thoughtful observations and hypotheses in what you wrote. However, I'm not convinced by these assumptions:

  • On enwiki and on Wikidata, a vandal often makes a quick, small vandal edit. - you mean a small impact on the content?
  • On enwiki, a positive edit is often relatively large though. - and often not, just look at the amount of minor edits in recent changes.
  • On Wikidata, by its nature, all edits are "small" edits, and if you want to create an item you need to make many edits. - The creation of an article on Wikipedia often involves many edits too, and in any case such initial creation activities form only a small minority of edits overall. The actual number of edits per article or item contradicts your assumption: it's much higher on Wikipedia (96) than on Wikidata (17).
  • Many positive edits I encounter are semi-automated, repeated tasks, with uncontroversial and often hardly worthwhile data - same on Wikipedia: formatting tweaks, typo fixes, edits that update a single number...
  • approximately 95% of the edits was made by one editor using "Quickstatements", making her (Harmonia Amanda) essentially a maintenance bot - it may be a big mistake to exclude such edits from calculations, considering that a lot of the descriptions (the vast majority as I seem to recall from one data analysis I did a while ago, would need to check the exact numbers) are added or substantially changed by Wikidata editors using such tools. For example, generic descriptions such as "American actor" or "surname".
I'm all for exploring the subtleties of such research results, and different ways of looking at the data. (By the way, if someone wants to actually read the paper and write a review, these are always welcome in the "Recent research" section of the Signpost, doubling as the Wikimedia Research Newsletter, that I have been editing with others since 2011. Let me know if you are interested.)
But so far, as an attempt to discredit "all [!] statements about the low vandalism rates on Wikidata", the above looks not very convincing to me. To explain away such a conclusion by seven independent academic researchers, one needs more than such unproven assumptions (some likely false, see above) combined with anecdotal data and minuscule samples.
Regards, Tbayer (WMF) (talk) 22:30, 15 September 2017 (UTC)


So, after I posted the above list of vandal edits, I took another set of seven vandalized articles to see how fast such vandalism gets reverted at Wikidata.

I didn't filter out immediately reverted vandalism, these were simply the first ones I encountered and easily spotted in the recent changes. It is obvious, every time I look at it, that vandalism patrol at Wikidata is way, way too slow to be acceptable as a source for any info on enwiki at the moment. Fram (talk) 07:31, 15 September 2017 (UTC)

There seems to be a lack of manpower in Wikidata ... and it's a multilingual problem. I pity the patrollers. User:YMS seems to have done about half the recent vandalism reverts by himself.
Even so, the following were easy to find; they are all unreverted IP vandalism in Wikidata's recent changes. I focused on descriptions only, as displayed in searches and at the top of mobile screens: 2 days, 2 days, 2 days, 1 day, 1 hour ("dickhead").
You'll need to find more people volunteering for what's a Sisyphean and mind-numbing job, along with ClueBot-level AI tools in dozens of languages. A tall order. --Andreas JN466 14:12, 15 September 2017 (UTC)
@Jayen466: Sorry, I don't have the time to dive into this discussion now, but I would be personally interested in how you came to the conclusion that I am doing half of the patrols. Was it just from looking in the logs, or is there a way or place where I can find accumulated numbers? Thanks for telling me. --YMS (talk) 11:28, 16 September 2017 (UTC)
@YMS: It was just based on my scanning recent changes for vandalism. Of the ones I found that had been reverted, more often than not it was you who had done so. Best, --Andreas JN466 22:35, 16 September 2017 (UTC)
Found some statistics now: [22] (includes data up to December 31, 2016) shows 16 users with more than 10'000 reverts. However, 9 of those are bots, and they are hardly used to actively fight vandalism, but more likely just randomly restore old versions if someone e.g. removed a label or edited a sandbox item. Of the 7 human users, I'm number 3 there, but I very much increased my vandal-fighting activities in 2017, while I'm not aware of any particular recent activities in this field by the top 2. @Erik Zachte: Would it be possible without too much effort to get a more up-to-date version of those stats? Because since February, I've been using an RC tool that I developed, which is not stable enough for public use yet, and I was unsure whether I should ever release it. If it turns out that I'm now actually doing an overly large portion of the RC work, I guess this would indicate that this tool is actually quite powerful, and should be made available to others. --YMS (talk) 15:36, 18 September 2017 (UTC)
Here is a look at the Wikipedia vandalism reverts I have made as a volunteer editor so far this month (September 2017), and how long the vandalism had remained live in each case:
(Does not include some reverts of possibly good-faith edits, such as this likely hoax: 3 months.)
Regards, Tbayer (WMF) (talk) 14:31, 15 September 2017 (UTC)
Which shows that we are far from perfect, but ignores the number of vandalism reverts made immediately or near immediately. The ones you show are the ones that fell through the cracks (and yes, there are too many of those); the ones I showed were all the vandalism edits from that time period from IPs (all the ones I could easily spot at least, I haven't checked identifiers, Arabic descriptions, ...). The experience you have with some edits remaining unnoticed for too long is what happens to (nearly) all vandalism on Wikidata. Fram (talk) 15:08, 15 September 2017 (UTC)

Let's take a step back up and get a sense of perspective:

  • According to the most recent data available on Wikistats [23], the ratio of reverted article edits on the English Wikipedia is 8.1% (higher when excluding bot edits - likely >10% -, and 19.4% for anonymous edits)
  • According to the cited paper by Crescenzi et al, the ratio of reverted (damaging) edits on Wikidata, excluding bots, is between 0.1% and 0.2%.

In other words, the ratio of vandalism edits on Wikipedia may be up to two orders of magnitude (50 to 100 times) higher than on Wikidata - quite contrary to the impression given in some comments above.

And according to the same Wikistats page, English Wikipedia sees 3.8 million reverts per year. That's enough to cherry-pick from to fill entire books with embarrassing examples.

Yes, there are some subtleties, e.g. around the exact definition of reverts (note though that each ratio is an apples to apples comparison). And I think Fram made a valid point that the speed of vandalism cleanup (e.g. the mean or median time to revert) needs to be considered too. But so far nobody has presented solid data on the latter. (Again, a spot check of only a few edits at a specific moment in time is not a valid way to do this.) And even if the vandalism cleanup on Wikidata were, say, 10 times slower than on Wikipedia on average, that would still mean that the relative impact of vandalism on what the reader sees is 5-10 times higher on Wikipedia than on Wikidata. That's not even considering that Wikidata has vastly more items (35 million) than English Wikipedia has articles (5 million).

The data question that's IMHO most relevant for the current debate has not been examined yet: How likely it is that the reader will see a vandalized version when they look either at a Wikipedia article or its corresponding Wikidata description. Gathering this data might take some time but could be worthwhile.

Regards, Tbayer (WMF) (talk) 23:10, 18 September 2017 (UTC)
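For readers who want to retrace the arithmetic in the comment above, here is a minimal sketch. It assumes, as the comment does, that exposure scales with the share of damaging edits multiplied by the time they stay live. The function names are made up for this illustration, and the inputs are the figures quoted above (roughly 10% for non-bot Wikipedia edits, 0.1% to 0.2% for Wikidata, and a hypothetical tenfold slower cleanup), not new measurements.

```python
def rate_ratio(wikipedia_pct, wikidata_pct):
    """How many times higher the reverted-edit share is on Wikipedia."""
    return wikipedia_pct / wikidata_pct


def relative_impact(ratio, wikidata_slowdown):
    """Exposure ratio once Wikidata's slower cleanup is factored in.

    Assumes exposure is proportional to (share of bad edits) x (time live).
    """
    return ratio / wikidata_slowdown


# Figures quoted in the thread; the 10x slowdown is a "say" value, not data.
low = rate_ratio(10.0, 0.2)
high = rate_ratio(10.0, 0.1)
print("rate ratio range:", low, "to", high)
print("impact range with 10x slower cleanup:",
      relative_impact(low, 10), "to", relative_impact(high, 10))
```

The sketch only reproduces the stated reasoning; it says nothing about whether the underlying revert percentages are comparable between the two projects.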

Not sure if, when you say, "the current debate", you mean this round of discussion of Wikidata in WP. Assuming you are, IMHO vandalism is not the most important issue. I would put it way below the WMF making content decisions without getting consensus, and the differences in policy and governance between Wikidata and en-WP.
Vandalism is a problem everywhere but is not really The Problem. The more meaningful thing to measure would be "bad edits". Whether they are intentional or not, bad edits are what harm WP, and there are definitely way more than whatever is being counted as "vandalism".
The problem with the pneumonia article that Doc James mentioned above was created by someone running a bot at Wikidata that added what they thought was "useful" information about "drugs useful to treat X", which filled that field with garbage. Bad edit. Not vandalism. And via infoboxes that bad edit to Wikidata affected many, many en-WP articles. We killed that field so that can't happen anymore. But this is the kind of thing that needs to be considered; not just "vandalism". Jytdog (talk) 23:33, 18 September 2017 (UTC)
Hi Jytdog, I said "the data question that's IMHO most relevant for the current debate", not necessarily the most important issue overall. I did get the impression that vandalism seemed to be the most important concern to several commenters on this talk page (cf. the length of this section), but you clearly feel differently and I did not mean to dismiss that.
Regarding the pneumonia example (i.e. this controversy between medical Wikipedia editors and scientists from the ProteinBoxBot team), I lack the topic expertise necessary for forming a definite opinion about who was right, so I'll take your word for it. We should keep in mind though that this case was about displaying Wikidata information in an article's infobox, which is separate from the issue of displaying Wikidata descriptions in the apps. The apps certainly don't do the former on their own and (cf. Jon's statement) the Foundation does not plan to change that.
Regards, Tbayer (WMF) (talk) 00:10, 19 September 2017 (UTC)
Where to start with this one... That Wikidata has vastly more items is because they are indiscriminately importing whole databases, creating e.g. every potential source as an item. These are not used anywhere, not shown anywhere, and vandals would have very little reason to edit these. That Wikidata has vastly more unproblematic edits is because they need vastly more edits to get some result (e.g. adding the same label in 10 languages = 10 edits), and many of the "edits" at Wikidata are not even real edits there, just like many "editors" at Wikidata have never actually edited Wikidata (I supposedly have some 250 Wikidata edits, but I never edited Wikidata except three or four test edits in 2014). "The data question that's IMHO most relevant for the current debate has not been examined yet: How likely it is that the reader will see a vandalized version when they look either at a Wikipedia article or its corresponding Wikidata description." Like I said below, this is a general page about the use of Wikidata on enwiki, not specific to the case of the descriptions. I started this as a separate section to highlight vandalism issues in general, not vandalism of the descriptions specifically (though that is one of the more often vandalized statements).
I also asked below where the 50 million or so sources in the last month have come from, and which "editors" made these. It is the kind of thing that makes me doubt any statistics about this, as it seems unlikely. Fram (talk) 07:14, 19 September 2017 (UTC)
Your final statement seems right to me: the key question is "How likely is it that a reader looking at corresponding data on the two sites will see vandalism?" Re your earlier comments: pulling some numbers from memory, I think I recall studies showing that the average revert time on en-wiki is about five minutes. Certainly Cluebot must bring the average down pretty low. If that's right, then if there are, as you say, 100 times as many vandalism edits on en-wiki as on Wikidata, the revert time on Wikidata would have to be more than about 500 minutes -- 8 hours or so -- for vandalism there to be as visible as on en-wiki. Fram's sample data is certainly not rigorous, but it does at least make it plausible that the average revert time is much longer than that.
I also suspect (with no data to support me) that vandalism is more likely on items corresponding to Wikipedia pages, and less likely on data imported from vast technical databases such as genetic data; if so, that would increase the visibility of vandalism on Wikidata.
Despite both the above points, I think vandalism is a secondary issue. If the interface to Wikidata permitted an editor to see article vandalism (as the script for Wikidata descriptions does); and to see it in one's watchlist too (which is not usefully possible at the moment); and to revert it from a WP page (not possible at all); then I think the vandalism would mostly get reverted from en-wp. Those enhancements seem necessary to me for other reasons, so I don't think vandalism is the real reason for resistance to Wikidata integration. It's just the consequence of the lack of integration. Mike Christie (talk - contribs - library) 23:41, 18 September 2017 (UTC)
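The break-even estimate in the comment above can be retraced with a short sketch. This is an illustration only: the five-minute average is a figure recalled from memory in that comment, the factor of 100 is the upper end of the ratio quoted earlier in the thread, and the function name is hypothetical.

```python
def break_even_revert_minutes(wikipedia_revert_minutes, vandalism_ratio):
    """Wikidata revert time at which total exposure matches Wikipedia's.

    Assumes exposure is proportional to (number of vandal edits) x (time
    each one stays live), so a site with `vandalism_ratio` times fewer
    vandal edits can afford reverts that are that many times slower.
    """
    return wikipedia_revert_minutes * vandalism_ratio


minutes = break_even_revert_minutes(5, 100)
print(minutes, "minutes, i.e. roughly", round(minutes / 60, 1), "hours")
```

Any average Wikidata revert time above that threshold would, under these assumptions, mean vandalism is live for longer in total there than on en-wiki.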
  • This is kind of a dumb discussion. Wikidata isn't "read" and there aren't "readers" of Wikidata. Wikidata is "used". You run queries, build graphs, etc. It is data. With regard to vandalism, if I move a decimal in a field in Wikidata, somebody "seeing" that is a really different process than somebody "seeing" vandalism in en-WP. (People maybe "see" vandalism in Wikidata by running some query and having one of the results be a funky outlier, and they go to find out why, and find a vandalized field. That sort of thing.)
Part of the problem with the WMF grabbing the "description" field and slathering it all over the place is that the description field is probably one of the very few fields in Wikidata that is meant to be read by humans, and it is the kind of content where Wikidata is weakest, policy and governance-wise.... It's just not the kind of thing that Wikidata was built for - and this difference in kinds of content is probably the main reason why the policies and governance are so different between the projects.
But WMF chose probably the weakest field in Wikidata, to put all this weight on. Which is one of the things that is so hard for me to understand. Jytdog (talk) 00:43, 20 September 2017 (UTC)