Wikipedia talk:Neglected articles/Archive

Vague idea of checking old articles with few edits.

I have this concept - need to work out details - but I'd like to be able to look at articles of greater than x age with fewer than y total edits, the idea being that this will smoke out junk that was dropped in and forgotten. Say, for the sake of argument, that x is a year and y is 3 total edits - can this be done? Can it be done in a plug-in-the-variable sense, to get an idea of the scope using different variables? BD2412 T 04:04, 7 November 2005 (UTC)

Hmm...interesting idea. We currently have Special:Ancientpages, but that's only a list of the articles that have gone the longest without editing. Wikipedia:Pages edited by one author only has also been requested in the past, but not yet created. Certainly something along these lines should be straightforward enough to implement. I'm thinking of the following procedure:

  • Download a *full* dump from http://download.wikimedia.org/wikipedia/en/. This would require between 3 and 20 GB of free disk space.
  • Uncompress the dump, and analyze it article-by-article, on the fly. This requires parsing XML to extract just four things:
    • The title of the article
    • The date (perhaps convert to number of days before the present) of the first or last edit (or both?)
    • The number of edits
    • The number of editors
  • Dump these fields into a text file, which should take up less than 100 MB.
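The extraction steps above can be sketched roughly as follows. This is only a minimal illustration, not the script that was eventually run: it assumes the dump follows the usual MediaWiki XML export layout (`<page>`, `<title>`, `<revision>`, `<timestamp>`, and a `<contributor>` holding either `<username>` or `<ip>`), and the names `summarize_pages`, `write_summary`, and `SAMPLE_DUMP` are made up for the example.

```python
import io
import sys
import xml.etree.ElementTree as ET

# Tiny stand-in for a real dump (hypothetical data, for illustration only).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><title>Example</title>
<revision><timestamp>2004-01-02T00:00:00Z</timestamp>
<contributor><ip>192.0.2.1</ip></contributor><text>a</text></revision>
<revision><timestamp>2005-03-04T00:00:00Z</timestamp>
<contributor><username>ExampleUser</username><id>7</id></contributor>
<text>b</text></revision>
</page></mediawiki>"""


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def summarize_pages(stream):
    """Yield (title, first_edit, last_edit, n_edits, n_editors) per page."""
    title, stamps, editors = None, [], set()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            for child in elem.iter():
                kind = local(child.tag)
                if kind == "timestamp":
                    stamps.append(child.text)
                elif kind in ("username", "ip"):
                    editors.add(child.text)
            elem.clear()  # discard the revision text to keep memory bounded
        elif name == "page":
            if stamps:
                # ISO 8601 UTC timestamps sort correctly as plain strings.
                yield title, min(stamps), max(stamps), len(stamps), len(editors)
            title, stamps, editors = None, [], set()
            elem.clear()


def write_summary(stream, out):
    """Write one tab-separated line per page to the text file `out`."""
    for row in summarize_pages(stream):
        out.write("\t".join(str(field) for field in row) + "\n")


if __name__ == "__main__":
    write_summary(io.BytesIO(SAMPLE_DUMP), sys.stdout)
```

For a real run, the stream would be the decompressed dump file opened in binary mode; if the dump is bz2-compressed, something like `bz2.open(path, "rb")` could be passed instead, so the data is analyzed on the fly without being unpacked to disk.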

You would then be able to try different combinations of variables to see how many articles each set of criteria would include. Though I think you might do well just to come up with a ranked list. I would recommend sorting by number of editors first, then by number of edits, then by creation date. You could then post a list of the worst 500 to the wiki.
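As a rough sketch of the plug-in-the-variable filtering and the ranking described above: the record type `PageStats` and the functions `filter_neglected` and `rank_worst` are hypothetical names, and sorting the oldest creation date first is an assumption, since the direction of the creation-date tiebreak is not specified.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class PageStats(NamedTuple):
    """Per-article fields taken from the summary file (hypothetical layout)."""
    title: str
    first_edit: datetime
    edits: int
    editors: int


def filter_neglected(pages, now, min_age_days, max_edits):
    """Keep articles older than min_age_days with fewer than max_edits edits."""
    cutoff = now - timedelta(days=min_age_days)
    return [p for p in pages if p.first_edit < cutoff and p.edits < max_edits]


def rank_worst(pages, limit=500):
    """Sort by fewest editors, then fewest edits, then oldest creation date
    (the oldest-first direction is an assumption)."""
    ordered = sorted(pages, key=lambda p: (p.editors, p.edits, p.first_edit))
    return ordered[:limit]
```

With x set to a year and y set to 3 total edits, the call would look like `filter_neglected(pages, now, min_age_days=365, max_edits=3)`; rerunning it with other values gives an idea of the scope under different criteria.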

But what's going to happen to these articles after they are found? Will there need to be a mechanism to suppress listing certain articles?

Stuff that's so bad it needs to be deleted is pretty much taken care of - it won't show up on the next report, and in the meantime, everyone will be able to see that it is a redlink. I expect that people will start to triage the articles, and add {{wikify}} and {{cleanup}} and whatnot. If your listing is going to be triage-only, you can have people delete the listing once they've applied a cleanup tag, since then they will be brought to the attention of other editors through the normal cleanup process. Adding such a tag will actually increase the number of editors and number of edits associated with an article, so you might not have to worry about preventing these articles from showing up at the top of your list the next time around. If you want to make it more than a triage-only listing, editors can be instructed to remove listings only after substantial improvements have been made, though after a few editors touch them, articles will no longer appear at the top of your list. If you are just doing triage, though, you might want to exclude articles that are already flagged for some kind of cleanup, if they clog up the list. (I'm guessing that won't happen, but you never know.) -- Beland 06:56, 7 November 2005 (UTC)

  • My main concern is in removing "bad stuff" from Wikipedia. I like the idea of assigning a score based on variables, and the only thing I can think to add (in retrospect) is a factor that would take the length of the article into account (also captured by Wikipedia:Shortpages). Since certain factors are clearly red flags for articles, I'd like to find a way to mix in those factors to indicate which flags are the reddest, and worry about what to do with the information (i.e. setting up a triage system) once the info is on the table. BD2412 T 15:29, 7 November 2005 (UTC)
  • Well, now all we need is a working implementation. I am currently downloading and analyzing only current revisions, and I am still rewriting many scripts so they will work with the new XML dumps. It wouldn't be that hard for me to do, though I will probably need to upgrade my desktop machine first. Since it might be a while before I get around to it, I'll post this idea in case anyone else wants to tackle it. -- Beland 06:14, 13 November 2005 (UTC)
    • I like it, but - needing to find something to complain about - I'd name it Red flag articles. BD2412 T 19:02, 13 November 2005 (UTC)

Claiming blocks

  • Is there any way of breaking it up and having people lay "claim" to certain blocks? Or are you just going to run the system once in a while and remove the edited ones?--Rayc 18:08, 16 December 2005 (UTC)
    • No need for that - just delete the articles as they are tagged/fixed/redirected/etc. - the point of the list is to smoke out long-dormant junk. bd2412 T 18:24, 16 December 2005 (UTC)
You mean to say "..delete the articles from this list as...", right? :-) Awolf002 18:28, 16 December 2005 (UTC)
Nah, just delete them from Wikipedia altogether. The preceding statement is a joke, and only a joke. If it was a serious statement... well, let's not go there. bd2412 T 18:37, 16 December 2005 (UTC)
Well... most of the tagging is AfD. But, yes, from this list. Actually, I like the idea of keeping them here on the list. Most of the info is good, just not clean. If you don't re-run it I can just clean up the articles without worrying about verification.--Rayc 18:43, 16 December 2005 (UTC)
My impression was that this list was generated with a tool, and it would be "re-created" periodically. Awolf002 18:54, 16 December 2005 (UTC)
  • It is true that the list will need to be updated occasionally, and that any article which has been edited will almost certainly not be re-listed. However, depending on the frequency of database dumps and how often I can get around to doing it, it might be anywhere from a week to a month between updates. (Though it's easy to add more recently created articles to the list from the dump I've already processed, and there are tens of thousands of those which I haven't posted since they would make the page too long.) The "please remove when finished" advice is simply so that during this weekly or monthly period, other editors don't come by and try to fix articles that have already been fixed. If you are going to be working on a whole bunch of articles all at once, it would certainly be annoying to delete them one by one, and given how active the page is, it's not inconceivable that someone will try to work on the same articles as you unless there's a note. So, by all means feel free to put a note around a certain block that says "I'm working on these today". (I would recommend a relatively short timeframe, like "today", so that in case you get distracted and can't actually fix everything in the block, the articles won't fester if you forget to remove your note.) Thanks for your help, by the way! -- Beland 22:57, 16 December 2005 (UTC)


Status, Dec 2005

While sitting around waiting for UPS today, I whipped up a script to implement this idea. However, because of the enormous amount of data to be processed, it looks like it's going to take several days to run. Thank you for your patience. -- Beland 22:34, 14 December 2005 (UTC)

Well, my linear extrapolation was (fortunately) quite pessimistic. As you might expect, newer articles have fewer revisions than older ones, so the later analysis goes much faster. There were the usual snafus, and some debugging to properly exclude redirects, but I'm happy to say that the lists are now ready for posting. -- Beland 20:59, 15 December 2005 (UTC)


Created WP:NEG as shortcut. -- Preceding unsigned comment added by jmason888 (talk • contribs)

Suggestion: Suppress images

  • After this run-through, images should be suppressed - I presume that the upload itself is not counting as an edit, or we'd be seeing most of them, but aside from one broken link, the ones here seem fine. bd2412 T 22:09, 15 December 2005 (UTC)
    • Mmm...well, the first image description page I randomly checked was for George W. Bush.jpg or something like that, and said simply, "America sucks", so I deleted it. I checked two more at random just now, and it looks like the description pages here should be deleted or blanked, because they duplicate content on Commons. -- Beland 22:36, 15 December 2005 (UTC)

Other suggestions

(These have since been refactored.)

  • We may need to explicitly suppress listing "tagged for cleanup" articles, since we are more interested in doing triage than finalizing articles, but we'll see how it goes. -- Beland 22:34, 14 December 2005 (UTC)
  • We'll have to figure out some other method for determining whether or not the content on the category description page is likely to be bad. One red flag is if the page is very long; sometimes people dump article or web page text into category pages. Probably another thing we could do is check to see if there is any text other than the link to the parent category; not sure how we would then do further ranking, though. (Is there a foul language-o-meter? Heh.) -- Beland 20:40, 15 December 2005 (UTC)
    • Is there a way to measure for text dumps (i.e. very long single edits)? bd2412 T 20:43, 15 December 2005 (UTC)
    • Sure. Since I'm cycling through the entire database now, I have access to the full text of any revision. Right now I'm not using that information, but it's easy to measure the length of the text in the current version. -- Beland 21:57, 15 December 2005 (UTC)
  • Suppress (or separately list) disambiguation pages - these are also more likely to be made with one edit and not edited thereafter; and anyone who adds a page with a built-in disambig tag probably is familiar enough with Wikipedia to not be dropping in garbage. Or they're a very clever vandal... bd2412 T 22:43, 16 December 2005 (UTC)
  • Drop another 50 points for articles that are longer than some preset length and not wikified. It seems to me that articles that are long and by a single (especially anon) author have a good chance of being copyvios. Dalf | Talk 06:13, 21 December 2005 (UTC)
    • I see you had that already, though I think long and with a single anon editor is more significant. Dalf | Talk 06:13, 21 December 2005 (UTC)
    • Sadly, long and anonymous first edit does indeed seem like a good indicator for copyright violations. (Lengthy, well-written original prose doesn't spontaneously appear that often, I guess.) Extreme shortness is also a good indication of "poop!" vandalism and the like, though that's pretty quickly cleaned up if there are multiple editors. I'll definitely keep long-anon-first-edit in mind as a factor for future improvement. -- Beland 07:23, 24 December 2005 (UTC)
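One possible way to express the long-anon-first-edit factor discussed above is sketched here. It is only an illustration: the 5000-character threshold is an arbitrary placeholder, the function names are hypothetical, and treating any contributor name that parses as an IP address as anonymous is an assumption about how anonymous edits are recorded.

```python
import ipaddress


def is_anonymous(contributor):
    """Treat a contributor whose name parses as an IP address as anonymous."""
    try:
        ipaddress.ip_address(contributor)
    except ValueError:
        return False
    return True


def possible_text_dump(first_text, first_contributor, n_editors, min_length=5000):
    """Flag a long first revision by a sole anonymous author.

    min_length is a placeholder threshold, not a value from the discussion.
    """
    return (
        len(first_text) >= min_length
        and is_anonymous(first_contributor)
        and n_editors == 1
    )
```

A flag like this would only mark an article for a human to check; it says nothing by itself about whether the text is actually a copyright violation.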

Orphaned articles?

Could you remove another, say, 50 points if the article is orphaned? That could narrow things down a bit. :) r3m0t talk 12:00, 16 December 2005 (UTC)

  • And another 50 if the page is unwikified! Provided that I am correct in understanding "orphaned" to mean that the "What links here" list will be empty. bd2412 T 23:13, 16 December 2005 (UTC)
    • It sometimes would also mean the "what links here" list has just one link, and it is to the article itself... (I still don't get why that happens.) --cesarb 00:46, 18 December 2005 (UTC)
      • I think some of the stub templates do cause that to happen, because of the edit-link they contain. Jamie (talk/contribs) 01:26, 18 December 2005 (UTC)
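A small sketch of how the two 50-point deductions suggested in this thread might be applied to an existing score. The base score is taken as given, since the scoring formula itself is not spelled out here. Treating a page as orphaned when nothing but the page itself links to it follows the discussion above; treating a page as unwikified when its text contains no `[[` link markup is a simplifying assumption, and all function names are hypothetical.

```python
ORPHAN_PENALTY = 50      # "remove another, say, 50 points" for orphans
UNWIKIFIED_PENALTY = 50  # "another 50 if the page is unwikified"


def is_orphaned(title, inbound_links):
    """True if "What links here" is empty or lists only the page itself."""
    return not (set(inbound_links) - {title})


def is_unwikified(wikitext):
    """Assumption: a page with no [[...]] links counts as unwikified."""
    return "[[" not in wikitext


def adjust_score(base_score, title, inbound_links, wikitext):
    """Subtract the suggested penalties; a lower score ranks as more neglected."""
    score = base_score
    if is_orphaned(title, inbound_links):
        score -= ORPHAN_PENALTY
    if is_unwikified(wikitext):
        score -= UNWIKIFIED_PENALTY
    return score
```

For example, a page with a base score of 1000 that is linked only from itself and contains no wiki links would come out at 900 under these assumptions.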


First success story of this project

I caught this article Michael Minnig. Lotsofissues 09:49, 17 December 2005 (UTC)

ZAP! Yay! -- Beland 03:06, 21 December 2005 (UTC)

Thought for the day

Out of curiosity, I looked at the original (now deleted) revisions of the controversy-generating neglected article, John Seigenthaler Sr. At the time of its discovery, it would have scored 2903 on the current scale. This would put it at about number 181,000 (out of 894,096 "real" articles ranked). So, I guess there is some work yet to be done. At least now we are getting the remaining worst of the lot more promptly. -- Beland 03:03, 21 December 2005 (UTC)

  • What factors made it 2903? Would any of the above suggested tweaks have changed that? Suppose we limit the project to articles for the time being, on the theory that hoaxters/vandals are more likely to start in that namespace? bd2412 T 03:20, 21 December 2005 (UTC)
  • The score is based on the following: There were three editors, two IPs and one logged-in user, with one revision each. The first edit was anonymous. -- Beland 08:57, 25 December 2005 (UTC)
  • I think the initial dump was unwikified; that would have dropped it down to about 2850 by the suggestion above. It was not a copyvio and not a long dump, so that metric might not have helped much. -- Beland 08:57, 25 December 2005 (UTC)

List management

I noticed that despite the instructions saying to remove articles from the list after you address them, people don't tend to do it. In my case this is because of a feeling that most of these articles still could use more attention even if I have given them some already. The result is that people add notes to the entries and leave them. This was making the list hard to read, so I moved the articles that had been looked at under two subheadings: one for articles that were fixed in some way, and one for articles that look OK and can be dropped from this list. If no one objects to this, we should update the instructions at the top to tell people to do the same. Dalf | Talk 05:41, 21 December 2005 (UTC)

Updated. -- Beland 08:57, 25 December 2005 (UTC)