Essay

Machine-written articles: a new challenge for Wikipedia


Interview with ChatGPT


This interaction with ChatGPT was recorded on January 29, 2023.

Author: Are any Wikipedia articles written by ChatGPT?

ChatGPT: No, ChatGPT is not capable of writing articles on Wikipedia. It can only generate text based on the input it receives and its training data.

Author: What should I do with a Wikipedia article someone wrote using ChatGPT?

ChatGPT: Articles generated by language models like ChatGPT are not suitable for use on Wikipedia as they are not written by humans and lack context and accuracy. If you have come across such an article on Wikipedia, you should report it to the Wikipedia community for review and potential removal, as Wikipedia has strict policies regarding the sources and reliability of the information on the platform.

My Experience

Near the end of January, I ran across some odd-sounding articles in draftspace. All the articles had the same structure and layout, even though they were about different things. The articles also had impeccable English, though the author's comments did not. Suspecting something was up, I found an artificial intelligence (AI) article detector, which pronounced all these articles machine-generated with greater than 99.9% confidence. Not knowing what to do, I went to the calm environs of the Administrators' noticeboard for incidents. That discussion is worth reading, as it shows a bunch of knowledgeable Wikipedians struggling and debating how to deal with these articles. Were they hoaxes? Copyright violations? Good for publication? How does generated text fit within Wikipedia's requirements for articles?

Policies?

As it turns out, an effort was already underway to develop a policy regarding articles written by ChatGPT and its relatives: Wikipedia:Large language models. There, and on its associated talk page, you can see the reasoning related to these articles. In short, AI-generated text is not reliably correct, may not have a neutral point of view, needs verification, can occasionally violate copyright, and can downright lie. All of this is inherent in how such a model works. It is fed information from a large corpus of text, much of which would not meet Wikipedia's sourcing and neutrality criteria, and it synthesizes its output without regard to whether the text maps to a real source. To quote the ChatGPT general FAQ: "These models were trained on vast amounts of data from the internet written by humans, including conversations, so the responses it provides may sound human-like. It is important to keep in mind that this is a direct result of the system's design (i.e. maximizing the similarity between outputs and the dataset the models were trained on) and that such outputs may be inaccurate, untruthful, and otherwise misleading at times."

Finding More

I then started going through recent drafts and new articles looking for text reminiscent of the text I had seen in the first articles I identified. It didn't take long to find more. The current limiting factor is that I don't have the tools for rapidly reviewing Drafts as I do for new articles as a new page patroller, and I don't have the resources (including time and patience) needed to do this consistently and daily.

I've put the {{AI-generated}} template on those articles, and not one author has disagreed with the finding. You can search for the template by entering hastemplate:AI-generated in the Wikipedia search box. Expand the search to Drafts to see the drafts so marked. About sixty articles have been tagged. Several of the previously tagged articles have either been deleted or de-tagged once the generated text was replaced with real text, but many remain. You can read those remaining examples to get a feel for AI-generated articles.
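The same search can also be expressed as a query to the MediaWiki search API. The following is only a minimal sketch: it assumes the English Wikipedia endpoint and that the Draft namespace there is number 118, and build_search_url and its parameters are hypothetical names chosen for illustration. It builds the URL and does not send any request.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(template="AI-generated", namespaces=(0, 118), limit=50):
    """Build a search URL for pages that transclude the given template.

    Namespace 0 is the article namespace; 118 is assumed to be Draft.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"hastemplate:{template}",
        "srnamespace": "|".join(str(n) for n in namespaces),
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url())
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the tagged pages as JSON if the assumptions above hold.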

Editor's note: If you are reading this in the future and none of these are available, some representative drafts have been retained more permanently at User:JPxG/LLM dungeon.

I have been conservative in identifying articles: I tested only articles that had a similar appearance, and I tagged them only when the detector reported greater than 99% confidence that the text was machine-generated. I am sure I've missed many more articles. I was looking for typical phrases like "in conclusion" starting the last paragraph, repetitive use of the article's title without abbreviation or variation, and consistent sentence and paragraph length. A more sophisticated AI user would give the AI software better prompts and produce harder-to-detect output; the ones I've found were typically produced by asking ChatGPT something like "Write a Wikipedia article about XXXX."
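The surface cues I describe here can be turned into rough numbers. What follows is a hedged sketch, not the method I actually used (I read the drafts by eye and then ran a detector): style_signals and the signal names are hypothetical, the sentence splitter is deliberately naive, and no thresholds are suggested because I have not calibrated any.

```python
import re
from statistics import mean, pstdev


def _variation(lengths):
    """Coefficient of variation; low values mean very uniform lengths."""
    if not lengths:
        return 0.0
    return pstdev(lengths) / mean(lengths)


def style_signals(title, text):
    """Compute rough style cues for a draft given as plain text.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_count = len(text.split())
    title_hits = len(re.findall(re.escape(title), text, flags=re.IGNORECASE))
    last_paragraph = paragraphs[-1].lower() if paragraphs else ""
    return {
        "ends_with_in_conclusion": last_paragraph.startswith("in conclusion"),
        "title_mentions_per_100_words": (
            100 * title_hits / word_count if word_count else 0.0
        ),
        "sentence_length_variation": _variation(
            [len(s.split()) for s in sentences]
        ),
        "paragraph_length_variation": _variation(
            [len(p.split()) for p in paragraphs]
        ),
    }


if __name__ == "__main__":
    draft = (
        "Foo Bar is a town. Foo Bar has a mill.\n\n"
        "In conclusion, Foo Bar is a nice place."
    )
    for name, value in style_signals("Foo Bar", draft).items():
        print(name, value)
```

Signals like these can only suggest which drafts deserve a closer look; they say nothing certain about how a text was written.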

I typically test articles using https://openai-openai-detector.hf.space/, though other sites exist, including https://detector.dng.ai/, https://gptzero.me/, https://platform.openai.com/ai-text-classifier, and https://contentatscale.ai/ai-content-detector/. Before testing, I remove headings, inline references, and other text and markup that appear to have been added after text generation, as those can confuse the analyzer.
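That cleanup step can be approximated in code. This is a minimal sketch under stated assumptions, not the procedure I followed (I cleaned the text by hand): strip_wiki_markup is a hypothetical helper, it relies on simple regular expressions rather than a real wikitext parser, and it handles only common, non-nested markup.

```python
import re


def strip_wiki_markup(wikitext):
    """Remove headings, inline references and simple markup from wikitext."""
    # Self-closing references such as <ref name="a" />.
    text = re.sub(r"<ref\b[^>/]*/>", "", wikitext)
    # Paired references: <ref ...> ... </ref>.
    text = re.sub(r"<ref\b[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Section headings such as == History ==.
    text = re.sub(r"^=+.*?=+[ \t]*$", "", text, flags=re.MULTILINE)
    # Non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Internal links: keep the label of [[target|label]] or [[target]].
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Bold and italic quote markers.
    text = re.sub(r"'{2,}", "", text)
    # Collapse the blank lines left behind.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = (
        "== History ==\n"
        "'''Foo''' is a [[town|small town]] in [[Bar]]."
        "<ref>Smith 2020</ref>{{citation needed}}"
    )
    print(strip_wiki_markup(sample))  # Foo is a small town in Bar.
```

Nested templates, tables, and infoboxes would need a proper parser, so the output should still be checked by eye before it is pasted into a detector.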

Problem?

Is this a real problem? I believe it is. Many of the articles sound reasonable but may have serious errors. The conversation at the administrators' noticeboard includes an analysis of a generated article on geckos. The article contains a lot of specific plausible-sounding information (e.g., size range), much of which is wrong.

ChatGPT will even provide references if requested, but those references are synthesized from its input text and, while sounding correct, usually do not point to real articles. For example, when I asked it for references on an article ChatGPT wrote for me on Sabethes cyaneus (a mosquito), one of the references was "Sabethes cyaneus" (Encyclopedia of Life): https://eol.org/pages/133674. That page does exist, but is for Clavaria flavopurpurea, a fungus. Another reference it provided also had a link, but the link pointed to an article about a mink, and the reference itself was fictional.

There are efforts to improve these programs, and I am sure that eventually they will be successful. Currently, however, those efforts fall well short. One such effort, "Elicit" (elicit.org), only searches research papers and summarizes them. I asked Elicit "What are the characteristics of Sabethes cyaneus?" It summarized one reference as "Sabethes cyaneus is a species of frog."

These false but plausible answers are an inherent property of the current models; the phenomenon is called "hallucination".

Even once these programs improve, there will still be significant concerns limiting the direct use of generated text, such as:

  1. Did the program only use sources acceptable to Wikipedia?
  2. Is the information up to date?
  3. Can the program identify correct references? (One of the biggest hurdles, as the inherent nature of current models is that the output is not linked to a specific source.)
  4. Will the program avoid hallucinating?

However, programs like ChatGPT are great for generating ideas for articles and helping to "mock up" a good article, if the user takes a sophisticated view of the output, using it more for inspiration than as a source of truth. The proposed policy Wikipedia:Large language models has additional information on how these programs can be used to improve Wikipedia.