The Rough Idea


I cannot divulge too many details here because of the Hawthorne effect. But suffice it to say that I am investigating the cause-and-effect relationship between motivations and knowledge contributions. The ‘cause’ (motivations) is measured in the survey, whereas the ‘effect’ (contributions) is measured in the archival data (the database dump).

Sample Selection


A brief explanation of how I selected the target sample for the survey:

I did a database download in Dec 2006. The data file [stub-meta-current.xml.gz] contained the edit metadata for the most recent edit made to each article in Wikipedia. Now it has become useful again, because we are going to pick our sample from this database.

Note: While this is certainly not the best way to get our sample, it is the next best alternative. It is better than going to Wikipedia and trying to fish around for 600 names (an approach that is not even random). The Dec database captures a snapshot of editing activity as of early Nov 2006, when the database dump began.

The sampling frame consists of all the editors whose usernames appear in the Dec database. From here, I ran two sampling stages, keeping only:

  • Users who are not ‘bot’ (robot) accounts.
  • Users who are not anonymous IP addresses.

The result of these two sampling stages is Sampling Unit A, which has 278,423 users.
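
For the curious, here is a minimal sketch of the kind of query behind these two stages, assuming the stub dump was loaded into a standard MediaWiki schema (a revision table with rev_user and rev_user_text, plus a user_groups table listing the ‘bot’ accounts). The exact table and column names are illustrative, not a transcript of what I actually ran:

  -- Sampling Unit A: distinct registered, non-bot editors.
  -- Anonymous edits carry rev_user = 0, so requiring a non-zero user id
  -- drops the IP-address editors; the subquery drops accounts flagged
  -- as bots in user_groups (table/column names are assumptions).
  SELECT DISTINCT rev_user, rev_user_text
  FROM revision
  WHERE rev_user <> 0
    AND rev_user NOT IN (
      SELECT ug_user FROM user_groups WHERE ug_group = 'bot'
    );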

Next, I want to stratify Sampling Unit A into two groups: privileged and non-privileged users. Privileged users are those who have administrative powers. The Dec database has the information on who all* the privileged users in the English Wikipedia were at that point in time. Non-privileged users are registered users (not anonymous IP addresses) without such powers. This stratification can be used as a control variable later on. I am doing non-probabilistic sampling so that I can hear opinions from both sides.
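
Again purely as a sketch: assuming the 278,423 users above were saved into a working table (I will call it sampling_unit_a here, a made-up name), the stratification could be recorded with a flag like this. ‘sysop’ is the MediaWiki group name for administrators; whether other groups should also count as ‘privileged’ is a separate judgment call.

  -- Stratify Sampling Unit A: privileged = 1 for administrators,
  -- 0 for ordinary registered accounts. sampling_unit_a is a
  -- hypothetical working table holding the 278,423 users above.
  SELECT a.rev_user,
         a.rev_user_text,
         CASE WHEN ug.ug_user IS NULL THEN 0 ELSE 1 END AS privileged
  FROM sampling_unit_a AS a
  LEFT JOIN user_groups AS ug
         ON ug.ug_user = a.rev_user
        AND ug.ug_group = 'sysop';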

Firstly, let’s look at all the non-privileged users in Sampling Unit A. After filtering out the privileged users from Sampling Unit A, we have 277,350 users; this is Sampling Unit B. I then called a pseudorandom function in the database to shuffle Sampling Unit B like a deck of cards and drew the first 300 names. These 300 non-privileged users are passed on to Sampling Unit C.

Secondly, we want to find out how many privileged users exist in Sampling Unit A. The query shows there are 1,073 of them. Again, I called the pseudorandom function to shuffle these 1,073 names and drew the first 300. These 300 privileged users are passed on to Sampling Unit C.
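
In MySQL terms, the shuffle-and-draw for the two strata could look like the sketch below, with ORDER BY RAND() playing the role of the pseudorandom shuffle. The table sampling_unit_a_strata is again a hypothetical name for a working table carrying the privileged flag from the previous sketch.

  -- Non-privileged stratum (Sampling Unit B): shuffle and take 300 names.
  SELECT rev_user_text
  FROM sampling_unit_a_strata
  WHERE privileged = 0
  ORDER BY RAND()
  LIMIT 300;

  -- Privileged stratum (the 1,073 administrators): shuffle and take 300.
  SELECT rev_user_text
  FROM sampling_unit_a_strata
  WHERE privileged = 1
  ORDER BY RAND()
  LIMIT 300;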

Our final result:
Sampling Unit C has a total of 600 usernames, divided equally between privileged and non-privileged users (300 in each group). Direct invitations are extended to Sampling Unit C over the weekend of Mar 3-4, 2007.

Technicalities


The problem is, as I have stated on my user page, that the data import is taking a really long time. Specifically, I am importing the data from the 3.1 GB stub dump [stub-meta-history.xml.gz]. In contrast to the file mentioned earlier in the section 'Sample Selection', this data file contains the entire history of all edits made prior to Nov 2006. This explains why I require survey participants to have registered their accounts before Jan 2006, so that I can track the edits they made leading up to Nov 2006.

After uncompressing, the file ballooned to 20+ GB; after running it through the mwdumper tool, it shrinks to 10+ GB and is finally ready for insertion into the database. Please do not be concerned by the variance in file size -- no data is lost. This is expected when you convert a .xml file (a structured, human-readable format that carries a lot of redundant markup) to a .sql file. And so, the data import has been ongoing since Jan 12, 2007, which means it has been running for 50 days now (as at Mar 4).
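
Once this import completes, the contribution measure can be pulled from the full revision table with a query roughly like the one below. This is only a sketch: the Nov 2006 cut-off mirrors the dump date mentioned above, and counting raw revisions per username is just one possible way of operationalising ‘contributions’.

  -- Edits per registered editor up to the Nov 2006 dump cut-off.
  -- In MediaWiki, rev_timestamp is stored as a YYYYMMDDHHMMSS string.
  SELECT rev_user_text, COUNT(*) AS edit_count
  FROM revision
  WHERE rev_user <> 0
    AND rev_timestamp < '20061101000000'
  GROUP BY rev_user_text;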