User:SelectionBot/0.7/Notes

Notes about the 0.7 selection process.

Phase 1: Selection

SelectionBot

SelectionBot is not really a bot; it is a script that makes an automated selection of articles. It ran well and selected about 30,000 articles for the 0.7 release.

Generating the list required parsing the ratings data and some external info (hitcounts, pagelink counts, interwiki counts). We used a database dump to gather the info. The motivations for using the database dumps were:

1. Speedy data access once the dump is downloaded.
2. The data was stable and we could replicate results in testing.

Disadvantages/issues:
1. We had to write parsers for the files (User:Kelson helped with this).
2. The hitcount data is not in a convenient form.
3. The dumps got more and more outdated, because the WP dump system failed and was disabled, and the release took longer than expected.
4. The selection was presented on a collection of static web pages on toolserver. These were not user friendly and could not be easily updated as the selection changed.
Source code:
1. https://svn.toolserver.org/svnroot/cbm/SelectionBot/
Suggestions for future releases:
1. Generate the selection directly from toolserver, without using database dumps.
2. Develop a CGI system for storing and querying the contents of the selection.

Manual selection

The manual selection listed articles that were not picked up by the automated selection but, for various reasons, needed to be included in the release.

Disadvantages/issues:
1. It was difficult to maintain the large list by hand. The list entries used inconsistent formats, and redirects or duplicates sometimes appeared.
2. Automated processing of the list was difficult and error-prone.
Suggestions for future releases:
1. Develop an automated system (javascript/cgi) that lets WP 1.0 reviewers add articles to a list of manually-selected articles stored on a web server. It must be possible to query this list to see which articles are on it.
2. Integrate the listing of the manual selection into the listing of the automated selection using CGI.
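
The format and duplicate problems described above could be reduced by normalizing entries before any processing. The following is a minimal sketch, not the code used for 0.7: the function names, the accepted entry formats, and the `redirects` mapping (redirect title to target title, assumed to be built elsewhere) are all assumptions made for illustration.

```python
def normalize_title(entry):
    """Reduce one hand-written list entry to a plain article title."""
    title = entry.strip().lstrip("*").strip()    # drop list bullets
    if title.startswith("[[") and title.endswith("]]"):
        title = title[2:-2]                      # drop wiki link brackets
    title = title.split("|", 1)[0]               # drop piped link text
    title = " ".join(title.replace("_", " ").split())
    return title[:1].upper() + title[1:]         # first letter is case-insensitive


def clean_manual_list(entries, redirects):
    """Normalize entries, follow known redirects, and drop duplicates.

    `redirects` is a hypothetical dict mapping redirect titles to targets.
    The order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for entry in entries:
        title = normalize_title(entry)
        if not title:
            continue
        title = redirects.get(title, title)
        if title not in seen:
            seen.add(title)
            cleaned.append(title)
    return cleaned
```

Following only one level of redirects keeps the sketch short; a real list would also need to handle redirect chains and pages that are renamed after the mapping is built.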

Phase 2: Revision ID selection

This phase required selecting a revision id for each article in the release. This was relatively straightforward to implement using toolserver to query revision histories.

Disadvantages/issues:
1. It was difficult to know if good criteria had been chosen to select revision ids. The 0.7 criteria used a simple heuristic to avoid IP edits and edits that appeared to be reverts.
2. The database dump used to select articles was many months out of date by the time the revision ids were selected. Thus articles had often been renamed or deleted by the time we got to them.
Suggestions for future releases:
1. Develop and test a heuristic for selecting revision ids.
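
As a starting point for such a heuristic, the 0.7 idea of skipping IP edits and apparent reverts might be sketched as below. This is an illustration under stated assumptions, not the script that was actually run: the `Revision` record, the `REVERT_WORDS` list, and the rule of taking the newest acceptable revision are all hypothetical.

```python
import ipaddress
import re
from dataclasses import dataclass


@dataclass
class Revision:
    rev_id: int
    user: str      # user name, or an IP address for anonymous edits
    comment: str   # edit summary


# Assumed markers of a revert in an edit summary.
REVERT_WORDS = {"rv", "rvv", "revert", "reverted", "reverting", "undid"}


def is_ip_user(user):
    """True if the user name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(user)
        return True
    except ValueError:
        return False


def looks_like_revert(comment):
    """True if any word of the edit summary is a known revert marker."""
    words = re.findall(r"[a-z]+", comment.lower())
    return any(word in REVERT_WORDS for word in words)


def pick_revision(revisions):
    """Return the first acceptable revision, or None.

    `revisions` is assumed to be ordered newest first.
    """
    for rev in revisions:
        if not is_ip_user(rev.user) and not looks_like_revert(rev.comment):
            return rev
    return None
```

Testing a heuristic like this would mean running it over a sample of articles and having reviewers check whether the chosen revisions are actually free of vandalism.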

Phase 3: Indexing

We decided on several indexes: a topical index, a geographical index, and an alphabetical list of all articles.

There is a lot of metadata available to help: a list of all wikiprojects that have assessed each article, a list of all categories on the article, and a list of all categories on the talk page.

Topical index

A WP 1.0 "category" was assigned to each wikiproject manually. Then the articles were sorted by project and by the project's category. The projects were manually broken down into numerous pages, and the pages were assembled into an index.
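
The sorting step can be sketched as follows. This is a minimal illustration, assuming the manual project-to-category assignment is available as a dict and each article comes with the list of wikiprojects that assessed it; the function and parameter names are hypothetical.

```python
from collections import defaultdict


def build_topical_index(project_category, article_projects):
    """Group articles by WP 1.0 category, then by wikiproject.

    project_category: dict mapping wikiproject name to its category.
    article_projects: dict mapping article title to a list of wikiprojects.
    Projects with no assigned category are skipped.
    """
    index = defaultdict(lambda: defaultdict(list))
    for article, projects in article_projects.items():
        for project in projects:
            category = project_category.get(project)
            if category is not None:
                index[category][project].append(article)
    # Sort categories, projects, and articles alphabetically.
    return {
        category: {project: sorted(articles)
                   for project, articles in sorted(projects.items())}
        for category, projects in sorted(index.items())
    }
```

An article assessed by several projects appears once under each of them, which is one reason pages for broad categories can grow long.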

Disadvantages/issues:
1. Some of the index pages, e.g. for Biographies, were very long and may be hard to use.
Suggestions for future releases:
1. Develop a better way to assign WP 1.0 categories to articles.

Geographical index

A list was generated of all categories of the selected articles, and the category names were broken down into words. Many of these words were manually assigned to geographical locations. This assignment was then used to give each article one or more locations. The articles were sorted by location and then by WP 1.0 category, using the same system as the topical index. Finally, a page was made for each country.
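
The word-matching step might look roughly like this. It is a sketch under assumptions, not the original code: `word_to_location` stands in for the manual word assignment, and splitting category names on non-word characters is one guess at how the names were broken down.

```python
import re


def category_words(category):
    """Split a category name into lowercase words."""
    return set(re.findall(r"\w+", category.lower()))


def assign_locations(article_categories, word_to_location):
    """Assign zero or more locations to each article.

    article_categories: dict mapping article title to its category names.
    word_to_location: hypothetical manual mapping of lowercase words to
    locations, e.g. {"french": "France"}.
    """
    result = {}
    for article, categories in article_categories.items():
        locations = set()
        for category in categories:
            for word in category_words(category):
                if word in word_to_location:
                    locations.add(word_to_location[word])
        result[article] = sorted(locations)
    return result
```

Matching single words is simple but can misfire on names that contain a place word without being about that place, so the results still need manual review.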

Disadvantages/issues:
1. Some of the index pages, e.g. for the United States, were very long and may be hard to use.
Suggestions for future releases:
1. Use the coordinate data, e.g. Template:coord, to sort articles by location or place them on a map.
2. Refine the current method to give more accurate results.

Other issues

Pages are renamed or redirected to other locations very often. I had expected that the articles being selected would be very stable, but this was not true, so data collected at different times was often inconsistent. This problem was compounded by the long preparation time for the release.

The talk page tags were not completely consistent with the manual selection or the previous release selections.