User:Olaf Davis/Bot specification


As I see it, the bot needs to store:

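  • A list of danger words (e.g. "poop"), and for each one:
      ◦ A list of safe phrases in which the word is acceptable
      ◦ A safe list of pages which have already been checked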

Operations it needs to perform are:

  1. For a given danger word, perform a search for any page which contains the danger word outside of all safe phrases; discard pages on the safe list; output the list of remaining pages
  2. Add pages which have been checked to a word's safe list, or remove pages, at user instruction
  3. Amend a word's list of safe phrases, at user instruction
  4. Add a new danger word, at user instruction
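
Operation 1 could be sketched roughly as follows. This is only an illustration and not the bot's actual code: the function names are made up, and page text is passed in as a plain dictionary rather than fetched from the wiki. Safe phrases are blanked out first, then whatever is left is searched for the word.

```python
import re


def contains_outside_safe_phrases(text, word, safe_phrases):
    """Return True if `word` occurs in `text` other than inside a safe phrase."""
    # Blank out every safe phrase, longest first, so that a short phrase
    # cannot break up a longer one that contains it.
    for phrase in sorted(safe_phrases, key=len, reverse=True):
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    # Whatever remains is searched for the word as a whole word.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def pages_to_check(pages, word, safe_phrases, safe_pages):
    """Return titles that need a human look.

    `pages` maps title -> page text (a stand-in for a real wiki search).
    """
    return sorted(
        title
        for title, text in pages.items()
        if title not in safe_pages
        and contains_outside_safe_phrases(text, word, safe_phrases)
    )


# Hypothetical sample data, for illustration only.
sample_pages = {
    "Boat": "The poop deck is at the stern.",
    "Vandalised": "He is a poop head.",
    "Checked": "poop appears here in a quotation",
}
remaining = pages_to_check(sample_pages, "poop", ["poop deck"], {"Checked"})
```

Here `remaining` holds only "Vandalised": "Boat" matches only inside a safe phrase, and "Checked" is on the safe list.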

The most obvious approach is to have all the storage done in the bot's (or someone else's) userspace: perhaps a subpage with a list of danger words, and then for each word a separate page for safe phrases and pages. This has the advantage that most user interaction (nos. 2-4) can be done quite simply by editing these pages. If we're worried about vandals tampering with the safelist, I could have the bot refuse to run if the last edit to one of its subpages was from a non-trusted user. There's also the issue of a public safelist being a BEANSy vandal-magnet: to be honest I doubt it would be that serious, though, and if they did target the 'safe' pages it would just leave us in the same situation we are in now, and no worse.
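
A minimal sketch of how the on-wiki lists and the trusted-editor check might look, assuming each subpage is a plain bulleted list with one entry per `*` line. The page titles, usernames and function names below are hypothetical, and in a real bot the last editor of each subpage would come from its page history.

```python
def parse_list_page(wikitext):
    """Extract the entries of a bulleted list ("* item" lines) from wikitext."""
    items = []
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("*"):
            item = line.lstrip("*").strip()
            if item:  # ignore empty bullets
                items.append(item)
    return items


def may_run(last_editors, trusted_users):
    """Decide whether the bot should run.

    `last_editors` maps subpage title -> username of its most recent editor.
    Returns (ok, untrusted), where `untrusted` lists the offending subpages.
    """
    untrusted = {
        page: user
        for page, user in last_editors.items()
        if user not in trusted_users
    }
    return (not untrusted, untrusted)


# Hypothetical example: one subpage was last edited by an unknown account.
phrases = parse_list_page("Safe phrases for poop:\n* poop deck\n*\n")
ok, problems = may_run(
    {"Bot/poop/safe phrases": "Olaf Davis", "Bot/poop/safe pages": "Someone"},
    trusted_users={"Olaf Davis"},
)
```

With this data `ok` is False and `problems` names the one subpage to review, so an operator can revert or approve the edit before the next run.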

Alternatives are storing things on an off-wiki webpage (publicly viewable but not publicly editable) or on the machine the bot's running from. I'd be inclined to stick with on-wiki unless you disagree.

Thanks, that sounds near perfect. I agree that the pages it uses and creates should be on-wiki; whether the words it's checking, or the pages and phrases it ignores, should be unprotected, semi-protected or fully protected I think we can leave for the Bot approval group to advise on. But it will need some sort of throttle or cutoff just in case some helpful person decides to put "and" or "the" in as a word to check for. Of course the pages it creates need no protection. I'm not sure I like the phrase "danger word": this would work fine when it's used for poop or pubic, but "posses" errors and calvary/cavalry confusion are, I'm sure, just typos; perhaps "target word" would be more neutral? We need to decide the run frequency: I'm assuming this would be ad hoc, but usually after the latest set of pages had been worked through? If this works as well as I hope it will, then we will need some process for others to request that new words be added, which I assume revolves around how many Wikipedia articles contain that word: pubic is around 700 and posses I try to keep at 200. I think the largest of these projects I ever started was trough, with over 100 of 2400 being typos; staring still has 2100 occurrences after I fixed over 100 recently, but starring is at 97,000 and probably way beyond the scope of this sort of project. Wikipedia:Typo Team/works completed has more examples, many quite small like preformer. ϢereSpielChequers 15:17, 9 January 2010 (UTC)
PS I'm assuming this would just run in mainspace articles, as I have no interest in typos in talk pages, but perhaps template and portal space should also be included. Thanks ϢereSpielChequers 15:22, 9 January 2010 (UTC)
I've coded up most of the bot's functions: it's successfully reading safe lists and phrases from my userspace and outputting article lists on-screen. All that needs adding is a place for it to output to, and a page it can check for new job requests. For output, would the best solution be to have a subpage (or section) for each search in its own userspace and let operators copy results to their own sandboxes if they want to work on them there, or to deliver it straight to the user? The first would help people to keep track of which searches everyone else is doing, but the results page(s) might get pretty long/numerous... Any thoughts?
Re. throttling: yes, good idea. The exact threshold shouldn't be too important, I don't think: if it just brings up the first, say, 500 pages from a search then the user can deal with those and run the search again to get the next lot. Changing "danger" to "target" is also a good idea. Olaf Davis (talk) 22:22, 11 January 2010 (UTC)
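
The throttle described above might be sketched like this. It is an illustration only: `next_batch` is a made-up name, the 500 limit is the "say, 500" figure from the comment, and it assumes that pages a user has dealt with are added to the word's safe list so that the next run moves on to fresh pages.

```python
from itertools import islice


def next_batch(matching_titles, safe_pages, limit=500):
    """Return at most `limit` matching titles that are not yet on the safe list."""
    unchecked = (title for title in matching_titles if title not in safe_pages)
    # islice stops after `limit` items, so a huge result set is never
    # collected in full.
    return list(islice(unchecked, limit))


# Hypothetical run: 1200 matches, worked through in batches of 500.
titles = [f"Page {i}" for i in range(1200)]
safe = set()

first = next_batch(titles, safe)
safe.update(first)   # user has checked these pages
second = next_batch(titles, safe)
safe.update(second)
third = next_batch(titles, safe)
```

Each run picks up where the previous one left off, and the final batch is simply shorter than the limit.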