Wikipedia talk:Version 1.0 Editorial Team/Archive 18

why so few articles?

Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria says "Release Version 0.7 - Nominations open - For this release we are aiming for around 30,000 articles, so the scope will be much wider than for Version 0.5." That was a long time ago; we're now at the 2008/9 release and only have 5,500 articles. What's with the change?

Over 10,000 articles are currently in "good article" status or better, and over 20,000 are marked as "top importance". If you just include, as the release version criteria says, "GA+ articles of mid importance or higher", that's already over 8,000 articles, not including featured lists. Add in "B articles of high importance or higher" and you're already over your stated 30,000 article goal. So why the **** are we only at 5500?! Kevin Baastalk 18:28, 23 February 2010 (UTC)

Okay, I read on the site that it's ostensibly due to how much can fit on a DVD. But then: a) is it compressed? b) how are the images stored? can they be better compressed or at lower res? c) can unimportant images be removed to make way for more content? d) are math formulas stored as images, or is it running a LaTeX processor? and e) it's only at about 3GB - you have a whole 'nother GB. So "as much as could fit" is a little disingenuous. And...
Why don't you make a multi-DVD version to complement the single-DVD version: one DVD with a main index and the most-read articles, and the remaining DVDs (say, 5/6 total = over 30,000 articles?) storing the rest, in decreasing order of how often the articles are read. It could be an "extended set" just for the school library or the teacher. Kevin Baastalk 18:45, 23 February 2010 (UTC)
And actually, nowadays you could probably store it all on a single USB stick: 20 bucks instead of 1x(5 or 6 DVDs). Kevin Baastalk 18:49, 23 February 2010 (UTC)
I am the French editor of the technical solution; the main reason is that nobody helps us do this work. Pmartin (talk) 08:27, 26 February 2010 (UTC)
Well, you don't have to go through all the articles manually. You can just write a little script and/or bot; then it's completely scalable and independent of how much help you get. Kevin Baastalk 17:14, 26 February 2010 (UTC)
There are around 31,000, as listed at Wikipedia:Version_0.7, selected by importance and quality. That sentence on the page is out of date - I just updated it. As it said just below that, the manual selection was just used to supplement the selection done using this bot. The selection for Version 0.7 was made in 2008, and an electronic version (in Kiwix) is already available and (for example) in use in schools in Africa.
We currently have a few problems to confront. Once we can resolve these, Version 0.8 will be better, and available within a few weeks. If you would like to help us work on these, your efforts would be very welcome! The problems remaining:
  • Version selection. We had the articles picked out by late 2008, but it took us until August 2009 to filter out vandalism from those 31,000 articles. We're now working with the WikiTrust people, who have adapted WikiTrust to select "trustworthy versions". Can you help us evaluate that?
  • Writing an index. We have a rather crude script currently, based on category keyword associations with core concepts (principally high level categories and countries). It works, mostly, but it omits quite a lot that should be in, and includes some things in the wrong places. Can you help us work on this?
  • The software. We've been fixing bugs in the software for around six months; the Windows and Unix versions of Okawix are now good, but the Mac version still may have some bugs - we're still not sure. A major problem is a lack of people testing Okawix and Kiwix software. Can you help? (Especially if you're a Mac user!)
  • Tagging of articles. That's where you got the 5500 number from, because the bot selection still hasn't been tagged. User:CBM has been too busy with the new 1.0 bot, and I don't know how to write scripts. If you can do this, you could tag the relevant articles with the 1.0 template.
Please let us know if you can help! Thanks, Walkerma (talk) 01:02, 27 February 2010 (UTC)
Regarding the indexing thing, there's a neat way to do it semantically by using information theory: find how often each word is used in the entire Wikipedia, and how often it is used in each article, then compare the article's word frequency with the encyclopedia's. The words that stand out as unique to the article are a sort of "semantic signature" for the article. Something like:
word usage surprise factor = log( (total word count of encyclopedia * frequency of word in article) / (total word count of article * frequency of word in encyclopedia) )
Then just take every word with a surprise factor greater than a certain threshold as an index word for that article. This would be a fairly quick, easy, and accurate way to compile an index. Kevin Baastalk 14:12, 27 February 2010 (UTC)
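A minimal Java sketch of that surprise-factor formula, for illustration (the map names and count inputs are hypothetical, and a word missing from the corpus counts is crudely floored to avoid log(0)):

    import java.util.HashMap;
    import java.util.Map;

    public class Surprise {
        // log((corpusTotal * countInArticle) / (articleTotal * countInCorpus)),
        // i.e. the log-ratio of the word's relative frequency in the article
        // to its relative frequency in the whole encyclopedia.
        static Map<String, Double> surpriseFactors(Map<String, Long> articleCounts,
                                                   Map<String, Long> corpusCounts) {
            long articleTotal = articleCounts.values().stream().mapToLong(Long::longValue).sum();
            long corpusTotal = corpusCounts.values().stream().mapToLong(Long::longValue).sum();
            Map<String, Double> surprise = new HashMap<>();
            for (Map.Entry<String, Long> e : articleCounts.entrySet()) {
                long corpusCount = corpusCounts.getOrDefault(e.getKey(), 1L); // crude guard against log(0)
                surprise.put(e.getKey(), Math.log(
                        ((double) corpusTotal * e.getValue())
                      / ((double) articleTotal * corpusCount)));
            }
            return surprise;
        }
    }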
You could also tweak it by, for instance, double-weighting words in section headers and triple-weighting words in article names. Kevin Baastalk 14:15, 27 February 2010 (UTC)
IMO that's the only real "correct" way to compile an index. If you want "meta-categories" on top of that, then you could total up the word usage figures of the articles you consider in that category, and that's your "category signature". Take the dot product of that and an individual article's semantic signature, and that should give you a "score" for how well that article fits in that category. Kevin Baastalk 14:22, 27 February 2010 (UTC)
Let me think about that last part more - there are better ways to meta-categorize. For one, it should use something more like KL-divergence instead of a dot product. Kevin Baastalk 14:48, 27 February 2010 (UTC)
Well, my other idea was some kind of SVD variant that takes the possibility of category overlap into account, i.e. for every pair of category signatures it would store an overlap signature or something like that. However interesting, it would be far more complex and CPU-intensive, so I'd say just use the KL-divergence of word usage from the category average to score topic dissimilarity. You can then even show these scores in your index. Kevin Baastalk 15:05, 27 February 2010 (UTC)
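A rough sketch of that KL-divergence scoring in Java (it assumes the word distributions have already been normalized and smoothed so no probability is zero; all names here are made up for illustration):

    import java.util.Map;

    public class CategoryScore {
        // KL-divergence of an article's word distribution from a category's
        // average ("signature") distribution; lower = closer to the category.
        static double klDivergence(Map<String, Double> articleDist,
                                   Map<String, Double> categoryDist) {
            double kl = 0.0;
            for (Map.Entry<String, Double> e : articleDist.entrySet()) {
                double p = e.getValue();                                // p(word | article)
                double q = categoryDist.getOrDefault(e.getKey(), 1e-9); // p(word | category), floored
                kl += p * Math.log(p / q);
            }
            return kl;
        }
    }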
This sounds very interesting. I'm an organic chemist, so I'm out of my depth when it comes to this sort of thing - for me, the semantic web means making chemical structures searchable (another problem!). If you would like to see what CBM did for Version 0.7, see this thread, and this code. I meant to mention - regarding the tagging of articles, CBM is working on that, see the thread above. I think within a couple of months he should be able to get that done so the totals match up OK. Thanks, Walkerma (talk) 15:59, 27 February 2010 (UTC)
Hmmm, I'd have to brush up on my Perl a little, but I think I get the basic algorithm. (I couldn't find any explanation of it in the links you provided.) As far as I can tell it doesn't do any semantic indexing. I'd be more comfortable writing something in C or Java rather than Perl; then I'd just need to know how to interface with the article content. Just read it from files? I'd prefer that. Then I can just output a file, or a file for each word, or something like that. Also, I thought about it a little more, and the accuracy can be improved by taking word correlations into account. One can do this by constructing a covariance matrix, then doing principal component analysis (or, slightly better, independent component analysis) on it, then scoring articles based on principal component frequency rather than word frequency. (For the layman, think of a principal component as a "semantically latent idea" (or just "idea") as opposed to a "word".) That's what you want, though, right: a sort of "index" as distinct from a "table of contents"? Kevin Baastalk 16:27, 27 February 2010 (UTC)
It would be totally easy to do from Java by using java.util.Hashtable, String.split, and String.replaceAll. But I'm not going to put any time into it unless I know it's going to be used and I can get the articles in a format that I can easily read in Java (such as files). Kevin Baastalk 17:00, 27 February 2010 (UTC)
Making chemical structures searchable sounds pretty easy. You're just looking for all the chemical formulas (e.g. CHO2) that contain a given string of characters (e.g. O2), right? From a computer programming perspective, that's utterly trivial. Kevin Baastalk 19:16, 27 February 2010 (UTC)
Regarding chemical structures, I mean being able to search for a structure and find a PNG or GIF like this one. In fact, the InChI and InChIKey are a pretty good way, but they're still not well established. Anyway, on WP topics, I'll reply to your other points below. Thanks, Walkerma (talk) 21:18, 27 February 2010 (UTC)
I should mention, regarding the keywords, there is a lookup table that I did for CBM, which connects category keywords (about 11000 of them) to general categories. (NB: A simple category hierarchy doesn't work well) This is an Excel file which I can email to you if you like. It makes connections like Babylonian => Iraq, or Birds => Natural sciences. It's far from perfect, but it gave us a good starting point. CBM used this lookup table along with the perl code to create the index. Walkerma (talk) 21:37, 27 February 2010 (UTC)
I'm curious: is there something like a chemical informatics database that contains a web of all known chemical interactions? I imagine that would be SUPER-useful. For instance, you could then write computer programs that you can ask to "figure out for me the cheapest way to make such-and-such", and it would do some kind of evolutionary search and actually give you a really good answer. Kevin Baastalk 22:34, 27 February 2010 (UTC)
My PhD adviser (James B Hendrickson) has worked on this type of thing since 1967, though his work looks at more general classifications (he reduces organic chemistry to 123 types of reaction). His SYNGEN program has a database of commercially available compounds and gives you the cheapest route it can think of. The closest anyone has come to doing what you said is probably Corey's Lhasa (computing) project, but this grew to be rather big and complex. Remember that over one million papers are abstracted by Chemical Abstracts each year, and some of these papers may contain over 100 reactions, and many of these involve quite nuanced reactivity ("the additional methoxy group on position 33 promotes the enolate formation at carbon 36 instead of the alkylation at carbon 11") - I think you begin to see the problem. If every concept could be reduced to text, it might be do-able, but how do you make a computer understand the more subtle details of chemical reactivity? Walkerma (talk) 19:11, 28 February 2010 (UTC)

how would i get a dump of the wp articles for the next 1.0 release as a directory of text files?

I want to run my algorithm on my own (fairly powerful) computer, but to do it I'd need a dump of the articles: preferably one big (zipped) directory with one file per article, named simply [title of the article].txt, the content of the file being just the wikisource. How would I get this? Kevin Baastalk 19:33, 27 February 2010 (UTC)

(BTW, it's got 4x7200RPM drives in a RAID 0 configuration and 8GB of memory, so the bottleneck's clearly gonna be the CPU. I built it for data-parallel applications such as, well, this.) Kevin Baastalk 19:38, 27 February 2010 (UTC)

The standard format we have for the offline releases is a ZIM (actually, openZIM) file, and this can be read using a ZIM reader. See this archived discussion for details. Does this give you what you need? If you can get a working index, we would definitely use it, as there is no one else actively working on this (in fact it's the biggest gap that we need to fill at the moment). If User:CBM is around, let him know what you're doing - though he normally watches this page carefully, and he'll probably comment soon if he is online. Thanks, Walkerma (talk) 21:27, 27 February 2010 (UTC)
BTW, if you get things nicely working, CBM may be able to migrate working code onto the toolserver, so it can work at maximum efficiency and be shared with other languages. The French and Hungarian Wikipedias also use our article tagging system, and others are certainly watching what we're doing with automated article selection. Cheers, Walkerma (talk) 21:30, 27 February 2010 (UTC)
For specifics: I'm making a keyword index generator, written in Java. The first thing it'll do is generate a list of word frequencies for the entire dataset, as well as for each article individually, and store them as .csv files (comma-separated values). From that, one should be able to do anything one wants with a little matrix algebra. The next logical steps would be to select index words based on entropy and to calculate a term-term covariance matrix for more sophisticated indexing.
If I do a principal component analysis or SVD (for latent indexing) it won't be in Java. It will probably be in CUDA (a language for GPGPUs) and simply use an open source linear algebra library - if anyone knows what to use, that would be great - and it would take a .csv of the covariance matrix and produce a .csv of the resulting orthogonal vectors. That would be something worth publishing and freely distributing in and of itself. First things first, though; I'm not saying I would do that, just that I think it would be the logical next step.
I downloaded the wiki-schools project and it appears to have the pages as a collection of .html files sorted alphabetically. I suppose I can just grab the main article text section and ignore everything in angle brackets, so that should do. Kevin Baastalk 22:21, 27 February 2010 (UTC)
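A sketch of what that first stage might look like - strip the angle-bracket markup, count words, write one CSV per article. The directory layout and file names here are assumptions for illustration, not the actual generator (a snapshot of the real source is posted later in this thread):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;

    public class WordCounts {
        public static void main(String[] args) throws IOException {
            // assumes the .html article files sit in a directory called "articles"
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("articles"), "*.html")) {
                for (Path page : dir) {
                    String text = new String(Files.readAllBytes(page))
                            .replaceAll("<[^>]*>", " ")   // ignore everything in angle brackets
                            .toLowerCase();
                    Map<String, Integer> counts = new TreeMap<>();
                    for (String w : text.split("[^a-z0-9]+"))
                        if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
                    try (PrintWriter out = new PrintWriter(page.getFileName() + ".csv")) {
                        for (Map.Entry<String, Integer> e : counts.entrySet())
                            out.println(e.getKey() + "," + e.getValue());
                    }
                }
            }
        }
    }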
Sounds great! I don't understand a lot of the theory, but it looks to be exactly what we've been waiting for. One word of warning - with such a large collection, at any point there are article versions that are vandalized or even near-blanked. You don't want to find that your top keyword for Manchester United FC is w*nk**s. (Actually, speaking as a Newcastle supporter, I think maybe that is appropriate?). If you use the offline versions of these collections from either the ZIM file (see link above) or here (for the SOS Schools project), you will avoid most or all of these issues, because both have been extensively checked for vandalism. Walkerma (talk) 02:20, 28 February 2010 (UTC)
The code's relatively small, so I could theoretically just paste it all as I did above, but I presume this isn't the most appropriate place to do so. Where would be a more appropriate place? I would prefer to post it on a wiki so that it becomes "publicly owned", in a sense.
Also, what format should the keyword index ultimately be in? And where would I upload it?
I would like to upload the raw word counts and word covariances too, so others don't have to recreate them, and I think there should be a central place where anyone can get the most up-to-date copy of them - maybe torrent them like the final product is. I can see a lot of applications for the data: basically anyone doing anything with natural language processing would probably love to have those kinds of word stats from a ginormous online encyclopedia such as this. Kevin Baastalk 02:41, 28 February 2010 (UTC)
With things like SelectionBot, we simply created sub pages like Wikipedia:Version 1.0 Editorial Team/SelectionBot, which explain what the tool does. In some cases the subpage includes the code written out as you did above; sometimes people list the code in a subpage of their user page. These are also listed at m:Static_version_tools, so that other projects can see what's available. If you need a server to make the data available for download, let us know - this group has access to one or two such servers. Regarding format, the final index itself should be a wiki page; for the keyword list used to generate that, I suspect that any machine-readable format would be fine, but CBM is really the person to ask. Does this answer your questions? Cheers, Walkerma (talk) 07:25, 28 February 2010 (UTC)
I have the impression you are trying to create a search engine type index for the release. That's a good idea, but it doesn't seem to be the same type of "index" that was made for 0.7. The type of index I am thinking of is intended for manual browsing, not for searching. However, having a search engine in the reader would be a good idea, if there is not one already. I know very little about the kiwix reader project.
There were three types of human-readable indexes that we created for WP 0.7:
  • Alphabetical - very easy to create
  • By topic - more difficult. Each WikiProject was designated as a "topic" (occasionally several projects were merged into one "topic"). The topics were arranged hierarchically to make an overall index. This index is at Wikipedia:0.7/0.7index
  • By country - this was the most difficult. We used categories on the articles to match articles with countries. We used an association between wikiprojects and fields (Arts, Geography, etc) to subdivide the articles on each county into sections. This index is visible at Wikipedia:0.7/0.7geo
A central difficulty in creating human-readable indexes is removing false positives. A search engine can return false results, and nobody cares. But if our index says "Tony Blair" is related to Nepal because his article says he traveled to Nepal once, that's a problem. We don't want to list every politician who visited Nepal on the index page for Nepal. So simply scanning all the words in each article does not seem like it will generate a human-readable index. There is a lot of exciting work that can be done in harnessing our category and Wikiproject tagging system to pull out the semantic information.
On a different topic: the easiest way to get the source code of the pages in the release is to simply download them all with the API. Alternatively, you could download a database dump of our articles and then parse out the pages you want. The "selection tools" project that Emmanuel and I worked on has tools to parse out all the categorylinks, redirects, etc. from a database dump and create text files with the information. This is available at [1]. For reference, the code for the 0.7 indexes is at [2]. But it was not a fully-automated process; I used the tools to create the indexes with a lot of hand-holding and manual work.
Like I was saying, I think that there is a lot of exciting work that can be done in the indexes; we have a huge amount of semantic information available. Indexing all the words in the released articles will probably be helpful for whatever type of indexes are created for 0.8. Good luck! — Carl (CBM · talk) 14:20, 28 February 2010 (UTC)
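For what it's worth, pulling one article's wikitext through the API might look something like this in Java - a sketch only, with no error handling or JSON parsing, and the article title is just an example:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.Scanner;

    public class FetchPage {
        public static void main(String[] args) throws Exception {
            String title = URLEncoder.encode("Tony Blair", "UTF-8");
            // standard query/revisions API call for the current page text
            URL url = new URL("https://en.wikipedia.org/w/api.php"
                    + "?action=query&prop=revisions&rvprop=content&format=json&titles=" + title);
            try (InputStream in = url.openStream();
                 Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                System.out.println(s.hasNext() ? s.next() : ""); // dump the raw JSON response
            }
        }
    }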
Thanks a lot for that, Carl! The Okawix software does indeed include quite a sophisticated search engine - you should download that from here. I think that may pull out words from articles too. There is a searchable online HTML version here, also available for download. What we need is a better version of the indexes you can see at Wikipedia:Version 0.7 which Carl referred to; can your keyword list do that more effectively? All of those collections are based on the set of 31,000 articles that make up Version 0.7. (Earlier you had mentioned doing the Selection for Schools, which is a smaller, hand-picked collection, so I'm not surprised you got a smaller total.) Walkerma (talk) 20:59, 28 February 2010 (UTC)

tentative results

Alright, the program's running and it's actually fairly quick. The total word stats came back and there are just over 25,000 distinct words, but only about half of those are used more than once, and only about 3,000 of them are used at least 10 times. Top words are, as expected: the, of, and, a, in, is, to, as... with "the" being used just over 13,000 times and "a" just over 5,000 times. I'm still not sure exactly how to select the best words for the index. To store the whole term-document matrix would be 30,000 x 25,000 = 750,000,000 floating point numbers, which is in the multi-gigabyte range. So I'm gonna cut it off at a minimum of 3 usages (just over 9,000 words) and go from there. If anyone has any bright ideas on how to statistically / mathematically pick the best words for the index, I'd be interested. Kevin Baastalk 16:33, 28 February 2010 (UTC)

And by my count there are only 6,471 articles in the 2008/9 WP CD selection - no wonder it was so much quicker than I expected. I suppose that's decent for testing, but it's not a big enough sample size to be of any real semantic indexing value. Where would I get something like a 30,000-article selection? Kevin Baastalk 16:47, 28 February 2010 (UTC)

I'm thinking of an alphabetical human-readable index, like what you see in the back of a book: for each term, the index shows the pages it's referenced on. Right now I've decided to pick out (use for the index) the terms with the most "document-discriminatory information", according to the following formulas:

  • p(not x) = 1 - p(x)
  • term naive binary self-entropy = -( p(term)*log(p(term)) + p(not term)*log(p(not term)) )
  • term binary entropy knowing document = -sum( p(doc) * [ p(term|doc)*log(p(term|doc)) + p(not term|doc)*log(p(not term|doc)) ] )_over all docs
  • document discriminatory information of term = term naive binary self-entropy - term binary entropy knowing document

And the number of pages each term in the index links to would be roughly proportional to the amount of document-discriminatory information that term gives. That's the current plan. The goal is to have it pick out the most "interesting" terms, and for this method I'm interpreting "interesting" as unevenly distributed among articles.

I'll try the download link provided in the above section and see if this selection criterion gives a nice list. If all goes well I'll have a big zipped file of entropy and probability numbers looking for a home. Kevin Baastalk 21:47, 28 February 2010 (UTC)
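A small Java rendering of the formulas above, as a sketch (the probability inputs are assumed to come from the word-count tables):

    public class DiscriminatoryInfo {
        // Binary entropy of a two-outcome distribution (term present / absent).
        static double binaryEntropy(double p) {
            if (p <= 0.0 || p >= 1.0) return 0.0;
            return -(p * Math.log(p) + (1 - p) * Math.log(1 - p));
        }

        // docProb[d] = p(doc d); termGivenDoc[d] = p(term | doc d); termProb = p(term).
        // Information about the documents gained by knowing how the term is distributed.
        static double documentDiscriminatoryInfo(double termProb,
                                                 double[] docProb, double[] termGivenDoc) {
            double naive = binaryEntropy(termProb);        // term naive binary self-entropy
            double knowingDoc = 0.0;                       // term binary entropy knowing document
            for (int d = 0; d < docProb.length; d++)
                knowingDoc += docProb[d] * binaryEntropy(termGivenDoc[d]);
            return naive - knowingDoc;                     // >= 0; bigger = more discriminatory
        }
    }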

97,814 distinct words / numbers. That's more like it. (Funny, my program "runs out of memory" on Vista, but works fine on XP. They're both 64-bit and have enough memory to handle it. Yet another way in which Vista sucks, it seems.) Definitely gonna need to trim that list down. A good way to test my "interesting" term selection method, I suppose. Kevin Baastalk 04:52, 1 March 2010 (UTC)

Good - it's big, but that sounds like quite a manageable number - only about three words for each article. I'll be interested to see how the words look - can you post a few examples (such as, the article on Sunderland has these 10 index words, etc?). Cheers, Walkerma (talk) 13:25, 1 March 2010 (UTC)
I had an error: I had "if(notword) break;" instead of "if(notword) continue;", which means the word count is actually substantially higher. Ah well. With errors comes innovation - and I've innovated: instead of relying on the Java libraries, I rewrote the parsing stage to parse out the words in a single pass. It runs much faster now and uses much less memory. (I didn't realize the Java libraries were so inefficient!) I plan on uploading the whole result-set for public use, with a detailed explanation of how it's organized. I'm gathering 16 entropy statistics for each article and 16 for each term - a lot of the 16 are going to be redundant / useless, but I didn't know what was going to be useful, so I figured better too many than too few. Then I can figure out how to use them to select index words. The raw document-word counts are gonna be pretty big, so I'm thinking those should be separate from the stats. And I might generate a word-word covariance matrix later on (a.k.a. "higher-order statistics"), if all goes well, which should also be a separate download.
FWIW, the stats I plan on generating are (where H(x,y) = cross-entropy(x,y) = -x * log(y)):
  1. p(doc) * H(p(term|doc),p(term|doc)) = H(p(doc,term),p(term|doc))
  2. p(doc) * H(p(term|doc),p(term)) = H(p(doc,term),p(term))
  3. p(doc) * H(1-p(term|doc),1-p(term|doc)) = H(p(doc,!term),p(!term|doc))
  4. p(doc) * H(1-p(term|doc),1-p(term)) = H(p(doc,!term),p(!term))
  5. p(term) * H(p(doc|term),p(doc|term)) = H(p(doc,term),p(doc|term))
  6. p(term) * H(p(doc|term),p(doc)) = H(p(doc,term),p(doc))
  7. p(term) * H(1-p(doc|term),1-p(doc|term))= H(p(!doc,term),p(!doc|term))
  8. p(term) * H(1-p(doc|term),1-p(doc)) = H(p(!doc,term),p(!doc))
  9. H(p(doc,term),p(doc,term))
  10. count(how many times the term is in the document)
  11. count(how many times the term is in the document) >= 1 ? 1 : 0
  12. H(p(doc,term),p(doc))
  13. H(p(doc,term),p(term))
which are then:
  • totaled over all terms, for each document, and
  • totaled over all documents, for each term
From that, anyone should be able to calculate pretty much anything they might need (or want) to know that wouldn't require higher-order statistics. Or such, at least, is the idea. Kevin Baastalk 20:28, 1 March 2010 (UTC)
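As a sketch of how one of those statistics could be totaled from such probabilities (the arrays are placeholders for values derived from the counts; the stat numbering follows the list above):

    public class Stats {
        // H(x, y) = -x * log(y), the cross-entropy term defined above.
        static double h(double x, double y) {
            return (x == 0.0) ? 0.0 : -x * Math.log(y);
        }

        // Stat 2 for one term, totaled over all documents:
        // sum_d p(doc) * H(p(term|doc), p(term))  =  sum_d H(p(doc,term), p(term)).
        static double stat2(double termProb, double[] docProb, double[] termGivenDoc) {
            double total = 0.0;
            for (int d = 0; d < docProb.length; d++)
                total += docProb[d] * h(termGivenDoc[d], termProb);
            return total;
        }
    }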
Regarding an example such as "Sunderland": sure. I'm gonna play around with the stats to see what formula works best; I might have a few to pick from. If Excel could handle that many rows, anyone could play around with the main word stats in Excel. I'll see what I can do to get it under 64k significant terms so people can do that. One could do the same for individual articles, too. One step at a time, though. Kevin Baastalk 21:26, 1 March 2010 (UTC)

I got all the article-level and global word counts zipped into a 161MB .zip file. The global word count only includes words / numbers used at least 5 times, which brings the total down from just over a million to just over a quarter million distinct words. It's generating the global stats now - a much slower process, because it involves floating-point division and multiplication for every word-article combo (.25 million words * 30 thousand articles * 16 stats * 3 ops per stat is a lot). You'd be surprised at some of the words. For example: flix, flit, flin, flim, flir, flip. All words in Wikipedia. All show up if you search for them with that little search bar to your left. Kevin Baastalk 01:45, 2 March 2010 (UTC)

I'm going to try 16 different sort orders for the terms, each based on KL-divergence. I'm not too sure about the descriptions below being accurate, which is one reason why I'm trying multiple combinations - the other reason being that the descriptions are confusing enough anyway:

  1. stat4-stat5: "kl-div doc|term" - given that the current word you're reading is this term, on average, how well do you know what document you're in, assuming you know how the term is distributed among the documents, compared to how well you would know if you didn't know how the term was distributed among the documents?
  2. stat0-stat1: "kl-div term|doc" - how well do you know whether or not the current word you're reading is this term, given that you know what document you're in and how the term is distributed among the documents, compared to how well you would know if you didn't know how the term was distributed among the documents?
  3. (stat4+stat6)-(stat5+stat7): "kl-div binary doc|term" - given that you know whether or not the current word you're reading is this term, on average, how well do you know what document you're in, assuming you know how the term is distributed among the documents, compared to how well you would know if you didn't know how the term was distributed among the documents?
  4. (stat0+stat2)-(stat1+stat3) "kl-div binary term|doc" - my brain is already fried.
  • each of these, then divided by stat10 (total number of occurrences of the term)
    • each of those, multiplied by -1 (i.e. in reverse order)

Essentially I'm working off the idea that by choosing whether a word is in the index, you're deciding whether or not to retain information about how that word is distributed among the documents. So to make your decision, you need some measure of that information's "value" or, more precisely, "informativeness". After reviewing the results of the sorts, I'll probably narrow it down to 4 different sort orders, and post the list-toppers of the top lists here for comparison. Kevin Baastalk 20:49, 2 March 2010 (UTC)
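For concreteness, building the first of those sort keys might look like this - assuming each term's totaled stats sit in a double[] indexed the same 0-based way as the "stat4"/"stat5" names above (all names hypothetical):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    public class SortOrders {
        // Sort terms by "kl-div doc|term" = stat4 - stat5, biggest key first
        // (sign conventions as in the thread above).
        static List<String> byKlDivDocGivenTerm(Map<String, double[]> termStats) {
            List<String> terms = new ArrayList<>(termStats.keySet());
            terms.sort(Comparator.comparingDouble(
                    (String t) -> -(termStats.get(t)[4] - termStats.get(t)[5])));
            return terms;
        }
    }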

Thanks - please keep us posted. Walkerma (talk) 16:43, 3 March 2010 (UTC)
Currently I'm filtering out any words used in fewer than 5 different articles or fewer than 10 times total. Ideally the sort criteria should push such outliers out, but this is a quick way to take some computational burden off. I have the preliminary sorted lists, but the list-toppers are currently way too esoteric (mostly words you've never heard of), so I'm gonna try some more variations. I also plan on eventually adding regularization via shrinkage to improve the accuracy a little, and maybe also weighting each document's probability by importance - low=1, mid=2, high=4 (right now it's just by word count). But frankly, right now I'm disappointed in the results: the keywords are too unique (e.g. "mori"). I was concerned about that effect but was hoping it wouldn't be as pronounced. I guess I'll see how the new variations turn out; hopefully at least one of them is much better. (And BTW, if anyone wants the raw word counts, just let me know where to upload them. They're in .csv's, so they should be pretty easy to read with just about anything.) Kevin Baastalk 19:16, 3 March 2010 (UTC)

I just posted a snapshot of the source code for the statistics generator, as of 3/4/2010, at User:Kevin_Baas/stat_generator_code. Anyone can use it for whatever, provided it's not malicious. Kevin Baastalk 19:03, 4 March 2010 (UTC)

Are the results in a form you can post on Zoho or Google docs as an Excel-type file? If so, can you let us know where? If not, can you at least email them to me? Thanks, Walkerma (talk) 06:00, 5 March 2010 (UTC)
Only problem is the files are too big for Excel: Excel only allows 64k rows, and the results have at least twice that many. Other than that, yes, they can be opened by Excel (or OpenOffice, etc.) in their current format - if they were a lot smaller, that is. I'm still not satisfied with the results; I haven't had time to look through all the variations yet, but it looks like the terms, though a little better, are still too esoteric. My plan is, if I find some good sort orders, to trim them so that they'll fit in Excel, but that hasn't happened yet. If I can't find a good sort order, I've got three ideas from here:
  1. maybe I should be picking from the middle of the list instead of the end (in which case, what is the "middle"?),
  2. doing it per-article instead of globally, i.e. picking out the most notable terms from each article, and then totaling that up somehow (I'd start by just tacking the global word counts onto the per-article word count files), or
  3. if all else fails, this might require full-scale PCA / SVD on the terms, and that would require a LOT of computation time. Writing that for a modern graphics processor (e.g. NVIDIA 8800 or higher) instead of a traditional CPU would speed it up about 50x, but those are a lot more difficult to write for, even using CUDA, so I'd much rather find a program that already does it.
I'm probably going to go with option 2 here; then I'd just pile that into the 161MB word count zip. That'd probably bring the size up to 200-something MB. I could email you that, and you could open any of the files in Excel (or OpenOffice, etc.), with the exception of the _term_stats.csv file, and only because that file is too big. Then you can play around with formulas on the word counts, and sort orders by them, all you want - on a single article at a time, at least. Kevin Baastalk 14:36, 5 March 2010 (UTC)
Or maybe I just start out with only the 60k most frequently used terms and go from there? (In addition to option 2.) For option 1 it would be nice to chart the data - another project. Kevin Baastalk 14:47, 5 March 2010 (UTC)
It would be useful to have statistics on how often each article is requested on Wikipedia - something like this, except over a decade and for every article in the static version. I could then add those numbers into the doc stats file, and they could be used as multipliers for data mining such as, well, keyword extraction. Kevin Baastalk 16:35, 5 March 2010 (UTC)
And likewise for terms: the respective frequency with which they are searched for. Kevin Baastalk 16:49, 5 March 2010 (UTC)
Sure, I'd really like to see #2 - how it looks for a few typical articles. I think that would show whether the index is really picking out the right words for each article. Let's see how those look before you go committing more of your time! Please email me that file if it's not too big, though I'm not sure our college server will allow a 200 MB attachment through. For historical stats going by month (with daily numbers) over the last couple of years, have you seen this (a site I look at a lot)? CBM uses those same stats in calculating the number of hits on a page for the 1.0 selection bot. Walkerma (talk) 02:25, 6 March 2010 (UTC)
Will do. I've finally got some good variations (top words like science, film, actor, sports...): -p(term)*(ln(p(term,doc)) - ln(p(term))) and -p(term)*(ln(p(term,doc)) - ln(p(doc))), and either of those times p(term) again. (Total (doc,term) entropy for the term, minus term entropy and document entropy, respectively.) I'm going to add a few more variations on that theme (as in, like 2), then cut those lists off at 10k or so, sort them alphabetically, and put them in one file for visual comparison. I'm also thinking of removing all words common to all the resulting lists, to show just the words unique to each. I'll probably have them (and #2) sometime this week. Kevin Baastalk 15:56, 7 March 2010 (UTC)
RE article request frequency list: yeah, I saw that, and those are the stats I'm looking for, but for my purposes I need a complete list (with respect to the offline version, at least), instead of just the top 1000. And I would need a larger sample size to average out short-term trends and seasonal variations - that would mean a minimum of a few years. Kevin Baastalk 16:15, 7 March 2010 (UTC)
Oh, I just saw the link on that page to the raw data: [3]. Gigabytes of it, it seems. A project in itself to download, sort through, and compile into what I want, yet it only goes back a few months, so it's not large enough anyway. Kevin Baastalk 16:24, 7 March 2010 (UTC)
OK, thanks. I'll look forward to seeing your list when it's ready. If you want rough statistics on hit counts for the Version 0.7 articles, then you could use the figures listed here - click on the "By score - All" then read the "hit count" for each article (I think it's supposed to be for a one month period from around mid-2008). As a "snapshot" this may not be as accurate as time-averaged data, but it would be convenient and it would give you a "ball park" number that would be fine except for a few cases (such as Sarah Palin). In a month or two we should have the 0.7 tags in the main 1.0 bot data, so then you could get a snapshot of up-to-date statistics very easily. If that won't work for you, I guess we'll have to wait on that part. Walkerma (talk) 04:51, 8 March 2010 (UTC)

Kiwix installation

Last weekend I installed Kiwix with the 30,000-article ZIM file at a school near Johannesburg. kiwix-index builds the search index specific to that ZIM file. It seems fast and snappy, served via HTTP from one Linux machine. I blogged about it at http://blog.wizzy.com/ Wizzy 13:29, 1 March 2010 (UTC)

Bot isn't generating a title

The bot has found the wikiproject, added it to the index, but has not created the table. How can I get this onto Wikipedia?

Also the Ontario Road (with a capital R) entry needs to be removed from the index. - ʄɭoʏɗiaɲ τ ¢ 17:27, 9 March 2010 (UTC)

Running the manual update on the web tool creates that page automatically. The real table is at User:WP 1.0 bot/Tables/Project/Ontario road and has been updating since March 3. The newly created page is just a placeholder of sorts.
I deleted the "Ontario Road" project from the index. — Carl (CBM · talk) 19:59, 9 March 2010 (UTC)
Thanks! I tried the manual tool, but the one linked from this project has been disabled. I guess everything changed a bit with the new bot? - ʄɭoʏɗiaɲ τ ¢ 03:11, 10 March 2010 (UTC)
Where is that link? I need to update it. There are so many wiki pages, and I didn't write them, so it's hard for me to track down everything that I need to update. The new bot is completely separate from the old one, so all the old links need to be updated. — Carl (CBM · talk) 03:17, 10 March 2010 (UTC)

Ireland

What happened to the Ireland statistics? Wikipedia:Version 1.0 Editorial Team/Ireland articles by quality statistics? ww2censor (talk) 23:11, 18 March 2010 (UTC)

Yes, many of the philosophy task forces aren't showing up!? [4] Greg Bard 23:53, 18 March 2010 (UTC)
Actually, Why no articles? seems to answer the question. ww2censor (talk) 02:22, 19 March 2010 (UTC)
Update: problem solved. ww2censor (talk) 15:31, 19 March 2010 (UTC)
The problem was that the toolserver database for the bot was inadvertently lost during some maintenance work by the server admins, and they had to restore it from a backup. Now that it is restored, the bot seems to be running correctly again. Sorry for the inconvenience. — Carl (CBM · talk) 02:04, 20 March 2010 (UTC)

Bottom importance

I swear I saw somewhere (I can never find things in this WikiProject, it's just too big) that the new assessment scheme recently rolled out allows importance to be set to 'bottom'... Though the fact that it's not working for me makes me assume I'm wrong and I dreamt something. - ʄɭoʏɗiaɲ τ ¢ 18:23, 3 March 2010 (UTC)

Some projects do use bottom-importance. It is not a "default" rating, but the new WP bot does support it as a per-WikiProject option, and I have set it up for some WikiProjects. If you want me to set it up for your project, let me know. — Carl (CBM · talk) 18:45, 3 March 2010 (UTC)
If you could, that would be awesome! (I'm also trying to get the bot to create an assessment table - it made the entry in the index, but no table.) The project is WP:ONRD and the template is {{WPOR}}. Thank you in advance :) - ʄɭoʏɗiaɲ τ ¢ 20:34, 3 March 2010 (UTC)
The table for the Ontario road project looks ok (you want the one with the lowercase 'road'; the uppercase one seems to be due to a category error).
I just went to set up bottom-importance for you in the bot, but it looks like the template itself is not set up for it. Eventually, the pages need to end up in Category:Bottom-importance Ontario road articles. I am not an expert with the wikiproject templates, so I will ask MSGJ if he can set that up. — Carl (CBM · talk) 12:23, 4 March 2010 (UTC)
  Done. The custom importance mask is held at Template:WikiProject Ontario Roads/importance. — Martin (MSGJ · talk) 12:43, 4 March 2010 (UTC)
Thanks. Once there are a couple pages in the category, I will set up the bot to read it (via {{ReleaseVersionParameters}}; all the setup is on-wiki). I prefer to wait until I can see the pages in the bot's output, so that I can be sure that everything is working when I set it up. — Carl (CBM · talk) 13:12, 4 March 2010 (UTC)

Question: How do I get a bottom-importance column in the 1.0 table, as opposed to it being on the other side of 'NA' in an 'Other' column? - ʄɭoʏɗiaɲ τ ¢ 02:33, 30 March 2010 (UTC)

I set it up for you. The setup just requires adding a template to Category:Ontario road articles by quality with the necessary info, and then updating the project data or waiting for the next daily update. I did all that and it seems to be working. — Carl (CBM · talk) 02:48, 30 March 2010 (UTC)

WikiProject banner censorship

We've got a problem at Talk:Johnny Weir with a couple of editors refusing to permit the {{WikiProject LGBT}} project banner.

The specious argument is that Weir has refused to publicly announce a specific label for his sexuality, and therefore they say WP:WikiProject LGBT studies shouldn't be allowed to place a project banner on the page to list the article in their assessment categories. Instead, a pair of editors would like silent, invisible support from the project. WhatamIdoing (talk) 18:28, 2 April 2010 (UTC)

Amazing - I don't recall tagging ever being so controversial before! I've been very busy in "real life" lately and missed a lot of WP events in the meantime. It looks like this is a case where a debate over the content has spilled over into the tagging aspect. Of course, the WikiProject has the right to tag as it sees fit, but in this heated debate the tag has become a symbol to one side - "if we let them have that on the page we are conceding that we've lost." I see that Julie Andrews and Lady Gaga uncontroversially have LGBT tags on their talk pages - that doesn't indicate that they are gay - merely that WP:LGBT wants to keep an eye on those articles because they consider them relevant. (I also consider them appropriate LGBT tags, not "bagging") I like the way there is an explanation section in the template used on the Julie Andrews page - a nice touch, though it maybe hints at a past debate.
In principle, I completely support the right of LGBT to tag the article, and personally I'm happy to endorse that anywhere if requested. From a tactical point of view, however, I suspect that until the bigger debate has been resolved, tagging will just be seen as provocative at this point. I don't think it's worth a big fight for one article. If this situation starts to spread to other LGBT pages (or other project tags), then that is a much more serious issue. Thanks a lot for alerting us. Walkerma (talk) 22:56, 2 April 2010 (UTC)
Yes, it's amazing (and not in a good way).
This is all after an RfC on the general topic, which concluded against censorship and in favor of projects' freedom to tag whatever they wanted.
If you'd put Talk:Johnny Weir on your watchlist for the next few weeks, I'd appreciate it. (I'm also advocating for the liberal use of the |explanation= parameter.) WhatamIdoing (talk) 02:38, 3 April 2010 (UTC)

A class status

We could use   to demonstrate A class articles, lists or portals. --Extra999 (Contact me + contribs) 03:51, 21 March 2010 (UTC)

The standards for A-Class can vary widely from one project to another. I think it makes more sense to limit the special markers to just "featured" content (or remove them all, but I don't care either way). — Carl (CBM · talk) 11:47, 9 April 2010 (UTC)

Kiwix 0.9 alpha1 + WP1 0.7

I have released the first alpha of the Kiwix 0.9 dev branch. Files are available here. I have also prepared a special ZIP with the Windows binary, ZIM file (all 30,000 articles), and search engine index here. It works out of the box, is fast, and needs very little CPU/memory. Enjoy it! Kelson (talk) 11:40, 22 April 2010 (UTC)

I wrote up a howto as well. Wizzy 08:56, 27 April 2010 (UTC)

Status update

The status section says "...we hope to release in 2009". Well, did you?? -- Harry Wood (talk) 11:17, 16 April 2010 (UTC)

Well, it's released. Thanks for pointing out this oversight. — Carl (CBM · talk) 11:23, 16 April 2010 (UTC)
Sorry Harry, I'm basically on Wikibreak right now, but I finally got a bit of time to update things. Thanks for the prompt! I'll be back in July, and hopefully we'll be able to get started on Version 0.8 then. Walkerma (talk) 03:58, 15 May 2010 (UTC)

Possible list of suggested topics for Wikipedia:School and university projects?

Maybe this seems like an odd idea, but I get the impression it might be a lot easier for instructors trying to use Wikipedia as a class exercise if they had readily available a few lists of topics or articles which editors here thought merited improvement - maybe 20 suggested topics for each semester - which could at least serve as a basis for consideration. Would this be of interest to the school projects, and would anyone like to help prepare such listings? John Carter (talk) 17:43, 14 May 2010 (UTC)

Would you like us to use the bot to pick out a selection of articles that have high importance but poor quality? Or groups of related topics that have mainly poor-quality articles? I think we could probably do such a thing, if the school & university project is interested. Speaking personally though, I'm basically on Wikibreak for the next two months or so, so I couldn't do this now myself. Walkerma (talk) 03:56, 15 May 2010 (UTC)
That would be one possibility. Another might be to contact related WikiProjects and ask them which topics related to their subjects they consider to be possibly the easiest to develop for schools. Maybe something like "Economy of (wherever)", which could include companies of the area, a given discipline ("history of religions" a la Joseph Campbell and Mircea Eliade, for example), or whatever. John Carter (talk) 21:07, 15 May 2010 (UTC)

User:WP 1.0 bot Quality statistics

Is it possible to program the bot to create three separate quality numbers for class=File, class=FP, and class=VP? Thus, if a file gets promoted, it would show up in the reports. --TonyTheTiger (T/C/BIO/WP:CHICAGO/WP:FOUR) 03:26, 1 June 2010 (UTC)

Yes, this can be done by any project, because the bot is configurable for custom ratings. It will first require changing the project's assessment template so that the three ratings are marked by different categories. Then I will write the configuration to make the bot read those categories. I am going to be traveling until June 8, so I will need to postpone it until then, but that should give you time to make the template change. — Carl (CBM · talk) 03:53, 1 June 2010 (UTC)

So Wikipedia on CD Version 0.7....

When is this going to be happening? I heard sometime in 2008, but it is now 2010. 64.26.68.82 (talk) 19:03, 3 June 2010 (UTC)

I found it! Wikipedia:Version_0.7 There is a download link here, a bit over 2.6 GB. Zell Faze (talk) 19:19, 3 June 2010 (UTC)

Massive removal

Why is there a massive removal of articles from the task forces of WP:PHILO, as evidenced at the log pages of each one? (See: Template:Phil-logs) Greg Bard 20:45, 17 June 2010 (UTC)

I suspect it has to do with this edit. Pinging MSGJ. Titoxd(?!? - cool stuff) 21:58, 17 June 2010 (UTC)
Yes, I see. Recently the name of the banner was changed. So how do we deal with this most efficiently? Shall we (a) get a bot to retag every talk page, or (b) make some edit to WPBannerMeta, or its supporting pages, that will enable both "{{phil..." and "{{WikiProject Phil..." to work? Greg Bard 22:02, 17 June 2010 (UTC)
All fixed, I hope. Sorry for any inconvenience. By the way, you can carry on using {{philosophy}} as a shortcut and it will work just fine. — Martin (MSGJ · talk) 22:30, 17 June 2010 (UTC)
Wow, well done. I was dreading the potential change, I am usually the one who tags all the new articles. Thank you. Greg Bard 22:44, 17 June 2010 (UTC)

Logs

Neither Wikipedia:Version 1.0 Editorial Team/Chicago articles by quality log‎ nor Wikipedia:Version 1.0 Editorial Team/WikiProject Illinois articles by quality log‎ has been updated since the 15th.--TonyTheTiger (T/C/BIO/WP:CHICAGO/WP:FOUR) 02:19, 18 June 2010 (UTC)

Thanks for pointing it out. I started a manual run, which will also catch up on all the older logs that didn't get uploaded, and I'll figure out why they didn't run automatically. — Carl (CBM · talk) 02:38, 18 June 2010 (UTC)
Thanks.--TonyTheTiger (T/C/BIO/WP:CHICAGO/WP:FOUR) 13:02, 18 June 2010 (UTC)

IRC on producing print versions of Wikipedia content

POSTING FROM 2009

We will be holding an IRC meeting with PediaPress to discuss the production of books based on Wikipedia content. More specifically, we are interested in article collections designed by WikiProjects, for example "Chemical Elements" or "Atlantic Hurricanes since 2000". Please join us, especially if your WikiProject is interested in producing such a book.

Time
Tuesday, October 20th, 2009 at 1600h UTC (noon US EDT).
Location
#wikipedia-1.0
Planning to attend
Please sign below

Walkerma (talk) 14:05, 19 July 2010 (UTC) (Actually done in 2009)

Report on Wikipedia Version 0.7

As we prepare for new offline releases of Wikipedia, I wanted to review the successes and failures of our last release, Version 0.7. Some of the issues – including the two major ones - have already been addressed, but some remain in place as we prepare for Version 0.8.

Successes

Scaling up

The main progress since Version 0.5 is that we were able to produce a collection large enough to constitute a "viable" encyclopedia – we scaled up from 2000 articles to 31,000. This was possible because of various developments, each of which represents a major step forward:

  • The WP1.0 assessment scheme was very well established when we made the selection, and it allowed us to realistically find nearly all of the high quality articles on the English Wikipedia. This system was barely established for Version 0.5.
  • SelectionBot allowed topics to be selected for importance. The system proved to be remarkably reliable; in most cases, WikiProjects were happy with our selections, and indeed they used the selection to highlight "important but neglected" articles. The system was still flexible enough to allow manual nominations, though many of these were already selected by the bot.
Consulting with WikiProjects
  • When the selection was made, we contacted WikiProjects for feedback. Although some did not respond, many WikiProjects did give us a lot of useful feedback, and became engaged in the process of making an offline selection.
OpenZIM format

The OpenZIM format for storing the corpus proved to be very effective, and gave us no significant problems.

Software improvements

Both the Okawix and Kiwix readers are improved compared to the Kiwix software used for the Version 0.5 release. In particular, the search capability is now much better.

Use

Although this collection was long delayed, it still proved to be useful. The Okawix release has been used in Zambia, and the Kiwix release was provided to schools in South Africa.

Problems

Long time to publication

Version 0.7 took around 18 months from start to finish. This was caused by a variety of issues, but two factors dominated:

  • Article version selection: We cannot allow heavily vandalized versions of articles into the selection, but we did find some of these present in our initial dump. We checked for "bad words" and identified more than 20,000 examples. These had to be checked manually – nearly all were done by one person – and this process took around six months, and it only addressed vandalism containing profanities and simple shoutouts. I (Walkerma) would consider this the most significant problem encountered in the production of the V0.7 collection. Probably resolved using WikiTrust, flagged revisions (possibly), and other tools (I'll post more on these developments soon).
  • Reader software: The Okawix software was undergoing development during this period, and some bugs were found once the 0.7 corpus became available. There were only a couple of testers for the Mac version, and this delayed the final release. The Kiwix version was also being developed - initially it was only available for Unix and Mac, and a PC version was only released later. Now resolved (as far as we know).
Resolving manual and bot-generated selections
  • Tagging of autoselected articles: Many articles on major topics were nominated manually, because people did not realize that they had already been selected by SelectionBot. Many users also requested that selected articles be tagged on their talk pages. Not yet resolved.
  • Integrating manual and automated selections: Combining the manual selections into the bot selections was not a simple task, and involved a lot of "hacking". We would like to find a more elegant way to achieve this. Not yet resolved.
Other issues
  • Index generation: For Version 0.5, a set of index pages was prepared manually, based on familiarity with the chosen 2000 articles. For a selection of 31,000 articles this approach was not practicable, and we needed a computer-generated index. Code was written which used keywords from categories (assigned manually), but bugs caused significant errors in the index pages, and lack of debugging time meant that many of these errors remain in the final release. We need someone with (a) an interest in the indexing issue and (b) coding experience to fix this problem, either by fixing the existing code or possibly by starting from scratch with different code or even a different approach. Not yet resolved.
  • Distribution: Since the release of Version 0.7 was delayed – and the article versions were rather old at time of publication – the announcement of the release was deliberately very low-key, and large-scale distribution was not considered. (Version 0.5 showed the power of good publicity from the Foundation.) However, we need to prepare for large-scale distribution. Although Linterweb does not have experience of wide distribution, it is believed that they are gearing up for this. Not yet resolved.
  • Collaboration with Wikipedia for Schools release: There was significant consultation with User:BozMo, but divergence in methods and timelines between the two releases prevented the groups from sharing a lot of resources. The 2008 Schools release (around 6000 articles, manually checked) proved to be very successful, and almost two years later it is still heavily downloaded. BozMo's team did use the Version 0.7 selection to provide ideas for their manual selection, but closer collaboration would be very helpful for future releases. Not yet resolved.

Updated release

Although the articles chosen for Version 0.7 are still very relevant, the versions of those articles are rather outdated (mostly from December 2008). The WikiTrust team and others are currently working (this weekend) on producing an updated set of article versions, to be called Version 0.701. This will not only make the collection more up-to-date, it will also allow us to test out (and debug) the automated version selection process before we need it for Version 0.8. More information on this system will be posted soon.

I think once we can turn all of the red and orange issues to green, we will be ready to release Version 1.0. If there are other things that I have missed, please add these. Walkerma (talk) 01:19, 18 July 2010 (UTC)

IRC meeting on Thursday to plan the next offline release

We will be holding an IRC meeting to plan our next offline release, tentatively called Version 0.8. This may well be the first of several meetings. User:CBM is gathering statistics for us to use for the selection; meanwhile, the WikiTrust people are testing their version-selection software on the Version 0.7 articles.

Time
Thursday, July 22nd, 2010 at 1500h UTC (11am US EDT).
Location
#wikipedia-1.0
Proposed agenda items (please add more as needed)
  • Lessons from Version 0.7 (let's limit the time on this!)
  • Main issues to be addressed (many taken from 0.7 lessons), and who will work on these
  • Scope & format of Version 0.8 release
  • Timeline for Version 0.8 release
  • Coordination of our efforts with Wikipedia for Schools and One Laptop Per Child
  • Formal publication, or just made available for download?
Planning to attend
Please sign below

Summary of meeting

We didn't manage to cover every agenda item, but we did discuss some very important topics, mainly focused around (a) producing an updated collection for Version 0.7 (tentatively called Version 0.701), (b) producing a new collection (Version 0.8) and (c) coordinating this work with Wikipedia for Schools. This report includes a few items discussed by email or on IRC shortly after the actual meeting.

  • A lot of work has been done already on preparing the 2010 version of Wikipedia for Schools. Volunteers are needed to help with things like copyright statements.
  • The Wikipedia for Schools group is interested in seeing the upcoming general releases, and particularly the WikiTrust method for article versionID selection.
  • The WikiTrust group has prepared an updated list of VersionIDs for the Version 0.701 release. However, around 5000 do not have any edits in the last month, and perhaps another 5000 have only a couple of edits, which makes it hard to know whether they have been vandalized.
  • BozMo mentioned a frustrating problem - when selected articles get moved (renamed) or turned into disambiguation pages. How can we catch these? The WikiTrust group will add another check to detect big changes in article size, and use this to flag such cases.
  • It was noted that there is a big pent-up demand for vandalism-free collections to use in schools; educators daren't use the "live online" version because of the "giant penis problem" (the risk of finding a page vandalized with inappropriate content).
  • The Wikipedia for Schools group have found that portals provide a nice way to navigate through subject areas, and are proving popular with users. Walkerma admitted that nothing has been done on this for Version 0.7/0.8, but we should look into it.
  • By the beginning of August, the WikiTrust group will aim to produce a new set of versionIDs that they believe are (mostly) vandalism-free, and this should include checks for sudden changes in article size.
  • Carl should be able to have the article selections for Version 0.701 and for Version 0.8 done in the next few days.
  • We are proposing that Version 0.8 will be around 40,000-50,000 articles in size.
  • Carl has prepared a system for adding in manually selected articles, here. Walkerma will be processing the articles nominated manually as soon as Carl has the autoselections completed.
  • Kelson is ready and waiting to produce openZim and Kiwix versions of the Version 0.701 and 0.8 collections.
  • Walkerma raised the issue (discussed recently at Wikimania) with Carl and Kelson of how other-language Wikipedias might be able to adapt and use the tools (assessment, importance ranking, article selection, versionID selection, etc.) developed for en:WP. Gerard Meijssen at Wikimania mentioned Translatewiki.net as a possible way to facilitate this. We believe that the French Wikipedia may be a good place to begin, since they already have a successful WikiProject-based assessment scheme in place.
  • We will have another IRC meeting on August 5th to review how the developments discussed above are progressing. (We are tentatively considering a separate IRC on July 29, to discuss other aspects, i.e., publication and dissemination of offline content.)
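To make the size-change idea concrete, here is a minimal sketch (not the WikiTrust code) of the kind of check described in the list above; the revision IDs and byte counts are illustrative data, and the 50% threshold is an assumption:

 # Flag any revision that suddenly loses most of the previous revision's
 # size - e.g. an article turned into a disambiguation page, or blanked.
 revision_sizes = [(1001, 45200), (1002, 45350), (1003, 612)]  # (revid, bytes), oldest first; made-up data

 def flag_suspect_revisions(revs, max_shrink=0.5):
     """Return revids that shrink by more than max_shrink versus the previous revision."""
     flagged = []
     for (_, prev_size), (revid, size) in zip(revs, revs[1:]):
         if size < prev_size * (1 - max_shrink):
             flagged.append(revid)
     return flagged

 print(flag_suspect_revisions(revision_sizes))  # [1003]: the article shrank by ~99%

A real check would tune the threshold to trade missed catches against false positives on legitimate trims.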

If I've missed anything, please add this in. Cheers, Walkerma (talk) 19:11, 24 July 2010 (UTC)

Question about listing articles

Hi WP:1.0! Is it possible to list all articles within a WikiProject in WikiMarkup format? I want to make a RecentChangesLinked page for the project and would like it to auto-update whenever the 1.0 bot is run. Thanks, Ynhockey (Talk) 21:24, 26 July 2010 (UTC)

I've cross-posted this here, where the tech people are more likely to see your request. I would think it would be possible. Walkerma (talk) 23:42, 26 July 2010 (UTC)
Which wikiproject? This is easy for me to do with the WP 1.0 bot, but it will only work as long as the project is small enough that all the articles fit on a single page. — Carl (CBM · talk) 23:49, 26 July 2010 (UTC)
The project is Wikipedia:WikiProject Israel. If there's a problem with size, I believe it will be possible to slowly split it into task forces, but at this point I'd prefer not to. Thanks, Ynhockey (Talk) 04:49, 27 July 2010 (UTC)
I'm afraid that 8,000 articles is too many. It would make a page over 400kb in size to list them all, which is too large for the wiki system to handle well (even editing the page does not always work when it's that big).
However, I think I can do something better. I should be able to write a tool that will just use the list already in the toolserver database and directly generate the list of recent changes, without copying the list to the wiki. Let me try that for a couple of days and I will contact you on your user page. — Carl (CBM · talk) 05:18, 27 July 2010 (UTC)
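For the record, here is a minimal sketch of the kind of tool described above, reading the article list straight from the database rather than from a wiki page. This is not Carl's actual code: the recentchanges columns are standard MediaWiki schema, but the WP 1.0 bot table name ("ratings"), its columns (r_project, r_article), and the u_wp10 database name are hypothetical stand-ins:

 # Print recent changes to the articles a WikiProject tracks, without
 # ever copying the (8,000-entry) article list onto a wiki page.
 import os
 import MySQLdb  # provided by the mysqlclient package

 conn = MySQLdb.connect(read_default_file=os.path.expanduser("~/.my.cnf"),
                        db="enwiki_p", charset="utf8")
 cur = conn.cursor()
 cur.execute("""
     SELECT rc_timestamp, rc_title, rc_user_text
       FROM recentchanges
       JOIN u_wp10.ratings ON r_article = rc_title  -- hypothetical table/columns
      WHERE r_project = %s AND rc_namespace = 0
      ORDER BY rc_timestamp DESC
      LIMIT 500""", ("Israel",))
 for ts, title, user in cur.fetchall():
     if isinstance(title, bytes):  # binary columns may come back as bytes
         title = title.decode("utf-8")
     print(ts, title.replace("_", " "), user)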

IRC meeting on Thursday to plan publication and distribution of Version 0.8

We are having another IRC meeting on Version 0.8, this time to focus on publication and distribution. The date and time have now been fixed.

Time
Thursday, July 29th, 2010 at 1500h UTC (11am US EDT).
Location
#wikipedia-1.0
Proposed agenda items (please add more as needed)
  • Ways to publish the collection
  • Formats for publication - downloads, DVD, flash drive, mobile phone versions?
  • Methods for distribution
  • Publicity
Planning to attend
Please sign below

No formal IRC on August 5th

We had talked about a possible IRC meeting on August 5th, but it looks as if we should wait at least a week until more progress has been made in writing code. I'll log onto IRC (#wikipedia-1.0) anyway, in case anyone wants to discuss anything. Walkerma (talk) 06:01, 5 August 2010 (UTC)

Next IRC meeting Aug 12th to discuss the Version 0.8 release

We are having another IRC meeting on Version 0.8, to review where we stand with article & version selection. Quite a lot of progress has been made since the first meeting (thanks to all who helped with that!). The date and time have now been fixed.

Time
Thursday, August 12th, 2010 at 1500h UTC (11am US EDT, 5pm British Summer Time).
Location
#wikipedia-1.0
Proposed agenda items (please add more as needed)
  • Article selection for Version 0.8
  • RevisionID selection for Version 0.8
  • Time window for revisionID selection?
  • Schedule for completion of article + versionID selection
Planning to attend
Please sign below:

Walkerma (talk) 03:10, 10 August 2010 (UTC)

Question about a history of article quality

Hi Editorial Team, I am looking for the data used to construct the article quality and importance table at various time intervals, in order to trace the history of quality changes across all articles. Does anyone have longitudinal data on article quality? I am reachable at nemoto [at] mit.edu. Thank you for your help.--Nemoto76 (talk) 21:19, 18 August 2010 (UTC)

Wikipedia books

This seems to me a very good idea that could be very beneficial. It does, however, at least potentially present a few problems.

We currently have a policy regarding all content being "encyclopedic". I think that there is a very real chance of having more than a bit of unencyclopedic content added if we were seen to be trying to create books for subjects which have, well, comparatively little potentially encyclopedic content, like some short video game series, performers without much work under their belts, and the like.

This could be somewhat circumvented by creating two classes of "books": "booklets", which have complete and encyclopedic coverage of a given range of content but are of comparatively short length, and full "books", which would be the same but meet a minimum length guideline/requirement. We could thus have "booklets" available on very short-run TV series or one-hit-wonder performers, for example, and books for such things as Physics, Religion, and other topics where the encyclopedic content does meet the minimum length requirements.

There would be a wide range of topics which might, in some way, meet book-length requirements while crossing a number of topic areas. Gold might include information on the chemical element itself, gold mining, the symbolic meaning of the metal, how it is refined, processed, and finally turned into jewelry or whatever, and so on. I would think that such topics would be best made as "booklets". Then the relevant material for a given type of book could be added to that book, for instance a book on Elements or a book on Jewelry, while the booklet would be available for anyone who is particularly interested in that subject itself.

In general, I think that, for print/distribution, we might best aim at full-length books initially. Only 5 US presidents, the first 3, Lincoln, and FDR, are included in the Britannica Macropedia. And, honestly, I have trouble seeing how John Adams would have sufficient encyclopedic content for a separate book, much as I like His Rotundity. Each president would probably be substantive enough for a booklet, however, and assembling them all into a book on the US presidency might be the most easily accomplished first-stage objective. Something similar might apply for other topics as well.

Also, for full-length books, there seem to me, anyway, to be two likely structural variations: the topical book and the encyclopedia/historical-encyclopedia variations. Both would be equally valid, although the former would be more work to put together. John Carter (talk) 15:16, 6 August 2010 (UTC)

I think perhaps we should try putting together a couple of topical book selections over the next few weeks. Are you thinking of something in the region of 1000 pages (book pages, not articles)? What do you think? Perhaps you could work on one on the US presidency, and I could do one (say) on organic name reactions. I'm in quite regular contact with the PediaPress people, so I'll get their comments on this. Thanks, Walkerma (talk) 03:29, 10 August 2010 (UTC)
Hello, John & Martin (and others too).
I don't have a specific reply to anything said here, but I think I should mention that there are already several well-structured collections of articles, many of which can be found in Category:Wikipedia books (community books). These are maintained by WikiProjects, particularly WikiProject Wikipedia-Books, and several WikiProjects have adopted the book-class (a bit short of 200 projects and task forces at the time of writing). Concerning chemical elements, many already have books (of varying quality). There are also a bunch of them on various musical artists, such as Book:Megadeth, Book:Metallica, Book:The Supremes, Book:Justice (French band), etc., and on video games, such as Book:StarCraft, Book:The Elder Scrolls series, Book:Castlevania series, etc.
Where books really shine, IMO, is when a well-defined collection of articles is made, such as Book:American Carriers, Book:Chemical elements (sorted by number) or Book:Messier objects, although books such as Book:Canada and Book:Hadronic Matter are also of very high quality (I'll admit to creating four of these, and to having heavily edited Book:American Carriers as well).
So yeah that's about what I had to say. I'm with the PediaPress team, BTW, so if you have any questions just let me know. Headbomb {talk / contribs / physics / books} 06:04, 20 August 2010 (UTC)
I would think books of maybe 300 pages or more would potentially be more than sufficient, although there is the question of what size type is used, which would affect the number of pages.
A couple of other questions come to mind, which might be worth mentioning.
Right now, I think there is some difficulty for newer editors, particularly students and teachers, in figuring out both which articles to improve and where to find sources for them. I know we have the list of 30,000 CD selections, but it would still be an amount of work to find out which ones need the most work. Maybe having a statistics chart of those articles available somewhere would help students find the ones which can most easily be improved.
There is also the question, which a lot of students and teachers might have, of what sources are considered acceptable and, in some cases, how to access them. I think having a list of reliable sources, like periodicals, available for the various topics, perhaps on related WikiProject pages, might make finding sources easier for some people, particularly if some are available for free online.
And, finally, one last comment/concern. I can see two types of books from us being most useful. One would be the kind which can be made available to students in poorer areas where having books available can be a problem. Another would be on topics about which there are a few extant good books, but not many, and not necessarily widely circulated. Things like the Jehovah's Witnesses, Scientology, Global warming, some smaller countries, recent wars, and the like come to mind. I do think that some schools might also like having "History of (name the state)" books available for middle school or junior high students.
Right now there is the Wikipedia:Online Ambassadors idea. If it works, maybe next term we could have a page of material for schools, listing the CD articles, any books and articles for books people want, and other "ideas" for students and classes. With any luck, we might have some of the magazine/newspaper lists for the various topics available as well, making it easier for them to find sources to develop the content. With all that together, and maybe a prominent link to the relevant page on the community portal and one of the welcome templates, we might have a bit more success in developing some of that type of material. John Carter (talk) 17:33, 22 August 2010 (UTC)
I'm travelling till Aug 31, but when I get back I'd like to discuss this in more detail. (I'll try to catch up with you if I can.) We discussed some of these topics before, on the Offline strategy task force, but I'm now interested in getting some real books made and printed. BTW, I gave a talk on Wikipedia chemistry at the American Chem Soc conference today, and passed a PediaPress chemistry book around the audience! Walkerma (talk) 03:04, 23 August 2010 (UTC)

Recruiting Assessment Team for Public Policy Initiative

Hi Editorial Team, the Public Policy Initiative is recruiting Wikipedians to assess article quality improvement over the course of the project. We are testing the metric for consistency, and to see if there are differences between Wikipedian scores and subject-matter expert scores. We are looking to identify the strengths and weaknesses of the current assessment system, since we are using that system to evaluate article quality improvement through the project. Check out WikiProject: U.S. Public Policy if interested. Thanks! ARoth (Public Policy Initiative) (talk) 23:28, 7 September 2010 (UTC)

Another stupid idea from the village idiot

OK, I have noticed that the Biography project did, ultimately, manage to get all of its 200 core biographies up to B level. This is a rather remarkable degree of effectiveness, particularly considering the obscurity of some of them.

Unfortunately, since then, the effort there has died. Rather predictable, I suppose. I think that a lot of the "excitement" here might have dropped off on the same basis - that people addressed all the articles that they were interested in.

On the talk page of the Biography project, I posted a link to a new proposal at Wikipedia:WikiProject Council/Proposals/Halls of Fame, which would basically expand the effort of the core biographies group to other biographies which are probably, at least often, of the next level of importance.

After giving the matter some further (or maybe first) thoughts, it struck me that maybe something similar could be done with the efforts here. Maybe, and this is just a maybe, interest would pick up if editors who worked on articles already included in the 1.0 selections got a chance to nominate articles for inclusion in a broader field of articles which are, effectively, of the next level of importance. Then I, who might seriously love German shepherds, could work on any article with the purpose of getting the article on German shepherds included in the "development set" or "farm team", and potentially get some help from other editors who want to have their favorite articles placed at that level as well. We might also be able to get a bit more interest if we did like DYK does and rewarded editors for bringing important articles to B level. And we might get a few more people to collaborate on articles which might otherwise be excluded from a release version because of problems with the article, if we gave some sort of points for participation in collaborations, with article nominations included as well. Then maybe all the nominations of a given period could be reviewed at once, probably weeding out most of those which clearly don't meet the basic "importance" or whatever requirement.

Most of the entries in the Timetables of History are supposed to be among the most important events of their time, and thus probably qualify as at least potentially important enough for consideration for inclusion in a release version. And some of them may well be more appealing to editors. There would be a notable gap in some areas, like maybe kinds of animals, chemicals, and mathematical articles, but those could be included during the review above, or individual articles could be nominated by those who improve others.

Anyway, just an idea. John Carter (talk) 01:10, 13 September 2010 (UTC)

John, please don't call yourself the village idiot - you're one of the most creative and dedicated Wikipedians I know! I think this is a brilliant idea! I think that we should perhaps re-brand the now-essentially-defunct "Core Topics" subproject, and have it do exactly what you suggest. I would propose using the bot importance ratings to find that "next level" of important topics, some of which may be languishing as Start-Class. If you want to take the lead on this, I will also work with you - I can commit a bit of time to it over the next few months; I can also contact a few 1.0 people who care about this sort of thing. I think this would be an excellent way to improve the quality of the articles that get a lot of online traffic, and which form the backbone of any offline collection. Let me know what you think. Thanks, Walkerma (talk) 16:41, 13 September 2010 (UTC)
Just for purposes of clarification, I would want to make it clear that I don't think this expanded number of articles would, as it were, be considered included in some subsequent release version just by being nominated for inclusion in this expanded group. There will, I'm thinking, be more baseball, football, and basketball hall-of-famers nominated than probably any other group, and I don't think that they would all necessarily merit inclusion over other articles. However, by being included in the "development" list, the likelihood of those articles being improved to a higher level than other articles would be significantly increased, thus increasing the likelihood of those articles being included. And, on further thinking, The Timetables of History probably includes far too many items in its listing to be a practicable baseline. Maybe, at the beginning, just add the articles specifically nominated by improvers of other articles, and maybe a few "groups" of articles determined to be sufficiently important, like "History of" articles for individual countries or whatever. John Carter (talk) 15:36, 16 September 2010 (UTC)

Version 0.8 announcement to WikiProjects

Apparently we now have a collection of around 47,000 articles, with revisionIDs, ready for Version 0.8. Once we have the URLs for each WikiProject's selection established, we will start contacting all the WikiProject talk pages to solicit feedback. Please take a look at the draft announcement and edit/leave comments as you see fit. Thanks, Walkerma (talk) 04:25, 13 September 2010 (UTC)

OK, it looks like we're ready to start contacting the WikiProjects tomorrow. Walkerma (talk) 05:51, 17 September 2010 (UTC)
Hello Martin, it is good to see this announcement!
Does the list of "WikiProjectName articles and revisionIDs we have chosen" exist? (currently a red link)
RickJP (talk) 06:07, 18 September 2010 (UTC)
The text there is a sort of template. When it gets posted to each wikiproject's page, we replace that link with an actual link to their articles.
Last night I ran the announcement script for the first 100 WikiProjects. I'll wait until Sunday and then announce the rest - to see if there is any feedback today that I can take into account right away. For an example of what the announcement looks like, see Wikipedia_talk:WikiProject_Arthropods. — Carl (CBM · talk) 11:22, 18 September 2010 (UTC)
I see. How can we know whether an article is included in the release? Will selected articles be tagged on their discussion pages, or will we maintain a global list on Wikipedia (reflecting the list on ~enwp10/release-data) to make it easy to check whether a certain article is included?
Also, was there any progress on the okawix issues found in release 0.7?
RickJP (talk) 19:16, 18 September 2010 (UTC)
There are plans to tag the talk pages, but a few details remain to be settled. For the moment, the way to tell is to use the web tool at http://toolserver.org/~enwp10 . I don't know the answer about Okawix; hopefully Walkerma can answer that. — Carl (CBM · talk) 20:45, 18 September 2010 (UTC)
For the record: see also the discussion thread about the help for the web tool. RickJP (talk) 19:57, 20 September 2010 (UTC)

IRC meeting on Thursday to plan final steps for Version 0.8

The feedback from WikiProjects has been received and partly reviewed, and we now need to prepare for making the offline version of the release. I'm proposing we meet on IRC on Thursday. Please sign up if you plan to attend, and add any agenda items you think need covering.

  • Channel: #wikipedia-1.0
  • Time: Thursday, October 21st at 1500h UTC (11am EDT, 1700h CET) (note correction!).
Agenda
  • Producing an index
  • Producing the ZIM file from the wiki selection. Will we have a mirror version online?
  • Publication formats - Kiwix, Okawix, BitTorrent
  • Announcements/publicity?
Planning to attend
(please add four tildes)

"Redirect" choice

How does one add the "Redirect" choice to "Latter Day Saint movement"? Right now it shows up as "NA" class and not "Redirect" class. I've spent some time trying to figure it out with no success.--ARTEST4ECHO (talk|contribs) 13:34, 5 November 2010 (UTC)

Invitation to participate!

Hello! As you may be aware, the Wikimedia Foundation is gearing up for our annual fundraiser. We want to hit our goal, and hit it as soon as possible, so that we can focus on Wikipedia's tenth anniversary (January 15) and on our new project, the Contribution Team.

I'm posting across WikiProjects to engage you, the community, in working to build Wikipedia not only through financial donations, but also through collaboration in building content. You can find more information in Philippe Beaudette's memo to the communities here.

Please visit the Contribution Team page and the Fundraising page to find out how you can help us support and spread free knowledge. DanRosenthal Wikipedia Contribution Team 18:57, 15 November 2010 (UTC)

CRWP-related question

Just a quick question. We've re-created the former banner template for WP:WikiProject Canada Roads (CRWP) and we are getting ready to deploy it. The new banner uses an updated set of assessment categories for "<province/Canada> road transport articles" plus "Trans-Canada Highway articles". Do we need to do anything more than what has been done to get them set up? Imzadi 1979  00:39, 3 December 2010 (UTC)

Sorry I missed this at the time! It looks as if everything is already working smoothly for you. If you see any snags please let us know. Cheers, Walkerma (talk) 03:46, 11 December 2010 (UTC)

again about okawix

(solved) Hi, I have a problem with Okawix. I have installed the corpus on a removable disk, so Windows sometimes changes the drive letter. I sometimes receive the message "en.wikipedia vanished, do you want to remove it?", depending on whether or not the drive letter has changed. Is there a way to tell Okawix where to find the corpus on the disk? (Of course, I could try to change the drive letter!) Something similar has been discussed here. By the way, why not create the page Okawix? Thank you!--Popopp (talk) 17:39, 12 December 2010 (UTC)

PS: Sorry in case this is not the right page for this discussion! :(--Popopp (talk) 17:40, 12 December 2010 (UTC)
Wow, it seems that by right-clicking you get the option "add a corpus from local file"!--Popopp (talk) 18:09, 12 December 2010 (UTC)
I was hoping someone from Linterweb would respond, because I don't know the answer. I've sent the developers an email, asking them to respond here. Cheers, Walkerma (talk) 01:39, 22 December 2010 (UTC)
Hi. There is no simple way to change the location of an already installed book (corpus):
The two normal ways of choosing the location on disk where the data is stored: 1) When you download a book from the download page of the software, it lets you choose where to install it on disk. 2) When you install a .okawix file using "add a corpus from local file" (right-click menu), the data is stored in a directory beside the .okawix file concerned. Note that "add a corpus from local file" works only with ".okawix" files.
A less normal way: you could move an already installed book by moving its data (e.g. the whole en.wikipedia directory) and telling Okawix where the new location is. This can be done by editing the corpora.xml configuration file, which should be found (under Windows) in something like Documents and Settings\<user name>\Application Data\okawix, and under Linux in $HOME/.linterweb/okawix/xxxxx.default. Mononoke Hime (talk) 09:30, 22 December 2010 (UTC)
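To make the "less normal way" concrete, here is a minimal sketch in Python, assuming the Windows config location given above; the old and new corpus directories are made-up examples, and since the internal layout of corpora.xml isn't documented here, this just swaps the path strings blindly (hence the backup):

 # Point Okawix at a moved corpus by rewriting paths in corpora.xml.
 import os, shutil

 conf = os.path.expandvars(r"%APPDATA%\okawix\corpora.xml")  # Windows; see the Linux path above
 old_dir = r"D:\okawix\en.wikipedia"  # hypothetical current location
 new_dir = r"E:\wikis\en.wikipedia"   # hypothetical new location

 shutil.copy2(conf, conf + ".bak")  # keep a backup before touching the file
 with open(conf, encoding="utf-8") as f:
     text = f.read()
 with open(conf, "w", encoding="utf-8") as f:
     f.write(text.replace(old_dir, new_dir))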

GA and FA

Are all GA and FA in the offline version? Thanks--Iankap99 (talk) 01:36, 19 December 2010 (UTC)

No. Importance is a criterion too. For Version 0.5 we did include all FAs, and we ended up with FAs like Exploding Whale and Heavy Metal Umlaut, yet we didn't have many major cities, etc. FAs are big, so they take up a lot of space, and so if GAs/FAs are on relatively obscure topics we don't include them. Walkerma (talk) 01:27, 22 December 2010 (UTC)
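A toy illustration of the rule described above (this is not the actual selection code; the scores and threshold are made up for the example):

 # Include an article only if quality plus importance clears a bar,
 # so an obscure FA can still miss the cut.
 QUALITY = {"FA": 500, "GA": 400, "B": 300}                      # illustrative scores
 IMPORTANCE = {"Top": 400, "High": 300, "Mid": 200, "Low": 100}  # illustrative scores

 def selected(quality, importance, threshold=700):
     return QUALITY.get(quality, 0) + IMPORTANCE.get(importance, 0) >= threshold

 print(selected("FA", "Low"))  # False: an obscure FA is left out
 print(selected("B", "Top"))   # True: an important B-Class article gets in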

Version 0.8 almost ready

Just an update on the behind-the-scenes activities. The full collection of Wikipedia:Version 0.8 articles and article versions (revIDs) is now ready and is being prepared as a ZIM file. It looks as if the new WikiTrust revID selection tool worked very well, making it much easier for us to make future offline releases. Version 0.8 should be ready for mid-December, with Kiwix, Okawix and Torrent releases coming out shortly after that. I'll post here when the release is finally ready. Walkerma (talk) 03:53, 11 December 2010 (UTC)

What is the size of 0.8 (approximately, in gigabytes)? Axl ¤ [Talk] 10:54, 11 December 2010 (UTC)
Probably around 4-5 GB. We were hoping to fit it onto a DVD, but we won't really be sure until we produce the ZIM file. With the advent of cheap flash drives and memory cards, the need for a DVD is less than it used to be. Cheers, Walkerma (talk) 06:58, 12 December 2010 (UTC)
Okay, thanks. Axl ¤ [Talk] 10:37, 13 December 2010 (UTC)
As people may have gathered, we have had a delay - for details, see Wikipedia_talk:Version_0.8. When the ZIM file was compiled, there were some bugs in it, and these are being fixed at the moment. I was told by email that we should have a beta version to look at next week. Walkerma (talk) 21:16, 13 January 2011 (UTC)
OK, I've been informed that most of the bugs have now been addressed, and we're awaiting a beta version with bated breath....! Walkerma (talk) 13:44, 12 February 2011 (UTC)

Purpose of A-class

Please see /Assessment#A-class. Simply south (talk) and their tree 22:19, 22 December 2010 (UTC)

Please test Version 0.8 Kiwix beta - available now

We think we're about ready to publish the Version 0.8 Kiwix release. We're looking for people to do a final check before publication. Details are here. Thanks, Walkerma (talk) 17:05, 16 February 2011 (UTC)

Proposed release of Version 0.8 on March 1st

Since our recent testing has not shown any major problems, I'm proposing that we make the Version 0.8 release official on March 1st. This would include both a Kiwix version and an Okawix version. Please test and leave feedback, and please report ASAP any serious issues that might prevent launch of the collection on Tuesday. Thanks, Walkerma (talk) 21:38, 27 February 2011 (UTC)

Based on feedback on the above page, we postponed the release until today - I'll write something and post a release announcement later today. Walkerma (talk) 20:51, 2 March 2011 (UTC)

Version 0.8 final release now available

Wikipedia:Version 0.8 is now available for free download as a Kiwix or Okawix release. Links are given at Wikipedia:Version 0.8/downloads.

Hello, on the main page it says that this version is still in a work-in-progress state. Isn't it time to update that page to show that the project is under active development? (I'm not involved enough to update it myself.) Cheers, Jona (talk) 03:17, 11 August 2011 (UTC)

Wikipedia:Version 1.0 Editorial Team/Chicago articles by quality log

Wikipedia:Version 1.0 Editorial Team/Chicago articles by quality log has not been updated in 6 days. Wikipedia:Version 1.0 Editorial Team/WikiProject Illinois articles by quality log has gone 5 days without an update.--TonyTheTiger (T/C/BIO/WP:CHICAGO/WP:FOUR) 07:48, 30 May 2011 (UTC)

New class proposal

See Wikipedia:Village_pump_(proposals)#A_new_class_for_Featured_media. Rd232 talk 00:31, 1 June 2011 (UTC)

Hi, WikiProject Essex recently merged with, and is now a task force of, WikiProject East Anglia. I was wondering whether it would be possible to have the Wikipedia:Version 1.0 Editorial Team/Essex articles by quality statistics take the class data from articles with the WikiProject East Anglia banner, and the importance data from the Essex task force category? Thanks -- Thomas888b (Say Hi) 19:57, 9 June 2011 (UTC)

Yes, this is just a matter of updating that banner to put the articles' talk pages into the desired assessment categories. I am not an expert in the WPBannerMeta template, but if you ask on its talk page someone should be able to set it up for you. — Carl (CBM · talk) 20:51, 9 June 2011 (UTC)

Bot request for assessing articles

Folks here might like to consider some of the ideas being kicked around at Wikipedia:Bot requests/Archive 42#Project_template_fixes_and_assessment. On the list is the possibility of a bot to repair articles that are incorrectly tagged as being "Redirect" class when they're not redirects, or a normal article class (e.g., "Stub") when it's actually a redirect. WhatamIdoing (talk) 22:06, 22 June 2011 (UTC)
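For anyone curious, here is a minimal sketch of the consistency check such a bot would make, comparing whether a page really is a redirect against its assessed class. It uses the standard MediaWiki API (prop=info reports a "redirect" flag); the page titles and tagged classes are illustrative data:

 # Report pages whose "Redirect"/non-redirect assessment disagrees with reality.
 import requests

 API = "https://en.wikipedia.org/w/api.php"
 pages_to_check = {"Some article": "Redirect", "Some redirect": "Stub"}  # title -> tagged class (made up)

 for title, tagged in pages_to_check.items():
     data = requests.get(API, params={"action": "query", "titles": title,
                                      "prop": "info", "format": "json"}).json()
     page = next(iter(data["query"]["pages"].values()))
     is_redirect = "redirect" in page
     if is_redirect != (tagged == "Redirect"):
         print("Mismatch:", title, "- tagged", tagged, "- actually a redirect:", is_redirect)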


RfC regarding pregnancy image

Another go round here [5] Doc James (talk · contribs · email) 06:18, 21 October 2011 (UTC)