Wikipedia:Search engine indexing (proposal)

Search engines such as Google and Bing deliver search results by using computer programs called web crawlers to 'surf' the internet looking for new pages to add to search indices, and for updates to previously 'crawled' pages. These potentially-intrusive programs are governed by a set of standards that allow website owners to control which pages the crawlers are allowed to visit, and which links they are allowed to follow to reach new pages. In the context of Wikipedia, this means that we have the ability to control which pages are accessible to web crawlers, and hence which pages are returned by search engines such as Google.
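As a rough illustration of how a well-behaved crawler consults these rules before visiting a page, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the bot names, the `example.org` URLs, and the `/talk/` and `/articles/` paths are all hypothetical and chosen only for the example; they are not Wikipedia's actual rules or URL layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not Wikipedia's real robots.txt):
# one misbehaving crawler is blocked entirely, and every other crawler is
# asked to stay out of an assumed "/talk/" path prefix.
EXAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /talk/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)
for agent, url in [
    ("GoodBot", "https://example.org/articles/Physics"),
    ("GoodBot", "https://example.org/talk/Physics"),
    ("BadBot", "https://example.org/articles/Physics"),
]:
    print(agent, url, may_crawl(parser, agent, url))
```

Compliance is voluntary: the file only tells a crawler what the site owner wants, so this check is something a cooperative crawler performs on itself.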

Background


From Wikipedia's foundation, all of its content was made accessible to web crawlers and search engines. Robots.txt, the file that controls web crawler access, was used primarily to block individual web crawlers that were making excessively long or rapid crawls and hence were draining system resources. This meant that, in addition to all our encyclopedic content, enormous amounts of discussion, dispute, and drama were made available to external searches. This material is the subject of a considerable number of complaints to the OTRS service, and can often contain unwanted personal information about users, undesirably heated debates about article subjects, and other content that does nothing to enhance Wikipedia's reputation as a professional encyclopedia. In 2006 the German Wikipedia held a 'Meinungsbild' (roughly analogous to an RfC), and asked the developers to exclude all talk namespaces from web crawlers (see T6937), in an attempt to control some of this content.

Wikipedia's powerful presence as the internet's eighth most-popular website gives all our pages very heavy weighting in search engine rankings; a Wikipedia page that matches the search term entered is almost guaranteed a place in the top ten results, regardless of the actual page content. While this is an extremely positive status for our articles and content, it is not always beneficial:


In June 2006, MediaWiki was enhanced to allow developers to exclude individual namespaces from being indexed by web crawlers. This functionality was extended in February 2008 to allow developers to set indexing policy on individual pages. Finally, in July 2008, users were given the ability to manually set the indexing policy for an individual page using two magic words, __INDEX__ and __NOINDEX__; the developers can configure the pages on which these magic words function.
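From a crawler's point of view, a per-page indexing policy is typically communicated through a "robots" meta tag in the page's HTML. The sketch below shows how such a tag might be read using Python's standard `html.parser` module. It assumes that a noindexed page carries a tag of the form `<meta name="robots" content="noindex,follow">`; the sample pages and the names `RobotsMetaParser` and `page_is_indexable` are hypothetical and used only for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # The content attribute is a comma-separated list such as "noindex,follow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def page_is_indexable(html: str) -> bool:
    """Treat a page as indexable unless it explicitly says "noindex"."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical sample pages, assumed markup for illustration only.
NOINDEXED_PAGE = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
PLAIN_PAGE = "<html><head><title>Physics</title></head></html>"

print(page_is_indexable(NOINDEXED_PAGE))
print(page_is_indexable(PLAIN_PAGE))
```

A page with no robots meta tag is treated as indexable here, which matches the usual default behaviour of search engines.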

Until late 2008, the poor quality of Wikipedia's own internal search engine meant that editors relied upon Google to find material for internal purposes, such as past discussions, useful help pages, and other information. In October 2008, the internal search function was significantly improved, providing all the functionality already available through search engines such as Google, and also incorporating a number of features unique to Wikipedia, such as automatic identification of redirects and page sections, and more appropriate search rankings. This made the internal search a better way of finding internal content than external search engines like Google. In December 2008, new updates to the MediaWiki software enabled the insertion of inline search buttons to search through sets of subpages, such as the archives of talk pages or the Administrators' noticeboard.


As a result, all editorial pages have been spidered (that is, picked up by search engines such as Google). When Wikipedia was a smaller website this was not a big deal. As a "top 5-10 website" it is. Discussion about Wikipedia users, including their internal actions as editors, is routinely a "top hit" for those individuals long after they have edited, and pages other than mainspace and the well-patrolled parts of other namespaces may contain large amounts of unchecked, unverified user writing, which any user may place within a variety of namespaces. Unless significantly problematic and actively noticed, such pages may go unchecked and be spidered as Wikipedia content for years.

Our visitors and readers look for encyclopedic content, not inward-facing discussions and disputes between users. Our readers come first. There is considerable content we want the public to find and see. That is the end product of the project.

The rest, including popular project pages such as AFD, all "talk" namespaces, dispute resolution pages, user pages, etc, is not of great benefit to the project if indexed on search engines. Many of these pages also raise considerable concerns about privacy and about the ease of finding harmful material (user disputes and allegations) on Google, and these concerns far outweigh any help the pages give the project. We don't need those publicized. They are internal (editorial use) pages.

It is proposed that it's finally time to close the gap. Instead of NOINDEXing individual pages mostly ad-hoc, I can't see any strong current continuing rationale for any "internal" page to be spidered at all, and I can see problems reduced by killing it. Use internal search to find such material, and kill off spidering of anything that's not really of genuine public note as our "output/product".

A prior discussion has taken place at Wikipedia:Village pump (policy)#NOINDEX of all non-content namespaces (Dec 2008 - Jan 2009). This proposal is being set up to formally see if consensus exists to request these changes, and to identify the technical means to do so.

Proposal

Namespace                                                  Default state   Override allowed?
Mainspace                                                  Indexed         No
User:                                                      Noindexed       Yes
Wikipedia:                                                 Noindexed       Yes
File:                                                      Indexed         Yes
MediaWiki:                                                 Noindexed       No
Template:                                                  Noindexed       Yes
Help:                                                      Indexed         No
Category:                                                  Indexed         Yes
Portal:                                                    Indexed         Yes
All Talk namespaces (Talk:, User talk:, File talk:, etc)   Noindexed       No
Changes from the current setting are highlighted

The proposed changes fall into two areas: technical, and procedural, as described below.

Technical


The Wikipedia:, MediaWiki: and Template: subject namespaces, and all talk namespaces, are set to be not indexed by default; that is, pages in these namespaces will not be indexed by web crawlers and hence will not appear in search engine results, although all pages will continue to be visible in Wikipedia's own internal search results.

In addition, the magic words __INDEX__ and __NOINDEX__ are disabled in the MediaWiki: and Help: subject namespaces, and in all talk namespaces. This has the effect of 'locking in' the default setting so it cannot be changed on a per-page basis.

The new indexing settings are summarised in the table above.
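The interaction between a namespace's default state and its "override allowed" setting can be modelled in a few lines. The sketch below is only a model of the proposed rules, not MediaWiki's implementation: the values are transcribed from the proposal's table, while the names `NamespacePolicy`, `PROPOSED_POLICY` and `is_indexed` are hypothetical, and the rule that the last magic word on a page wins when both appear is an assumption made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NamespacePolicy:
    indexed_by_default: bool
    override_allowed: bool  # may __INDEX__ / __NOINDEX__ change the default?


# Transcribed from the proposal's table (subject namespaces).
PROPOSED_POLICY = {
    "Main": NamespacePolicy(indexed_by_default=True, override_allowed=False),
    "User": NamespacePolicy(indexed_by_default=False, override_allowed=True),
    "Wikipedia": NamespacePolicy(indexed_by_default=False, override_allowed=True),
    "File": NamespacePolicy(indexed_by_default=True, override_allowed=True),
    "MediaWiki": NamespacePolicy(indexed_by_default=False, override_allowed=False),
    "Template": NamespacePolicy(indexed_by_default=False, override_allowed=True),
    "Help": NamespacePolicy(indexed_by_default=True, override_allowed=False),
    "Category": NamespacePolicy(indexed_by_default=True, override_allowed=True),
    "Portal": NamespacePolicy(indexed_by_default=True, override_allowed=True),
}

# All talk namespaces share one locked policy under the proposal.
TALK_POLICY = NamespacePolicy(indexed_by_default=False, override_allowed=False)

MAGIC_WORD = re.compile(r"__(NOINDEX|INDEX)__")


def is_talk(namespace: str) -> bool:
    return namespace == "Talk" or namespace.endswith(" talk")


def is_indexed(namespace: str, wikitext: str) -> bool:
    """Return whether a page would be indexed under the proposed settings."""
    policy = TALK_POLICY if is_talk(namespace) else PROPOSED_POLICY[namespace]
    if policy.override_allowed:
        found = MAGIC_WORD.findall(wikitext)
        if found:
            # Assumption for this sketch: the last magic word on the page wins.
            return found[-1] == "INDEX"
    # Locked namespaces ignore the magic words entirely.
    return policy.indexed_by_default


print(is_indexed("Wikipedia", "__INDEX__ This page documents a policy."))
print(is_indexed("User talk", "__INDEX__ Please index my talk page."))
```

In this model a policy page in the Wikipedia: namespace can opt in to indexing, whereas the same magic word on a talk page has no effect because the talk default is locked.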

Procedural


With these changes, it becomes necessary to develop new guidelines to govern the use of the magic words __INDEX__ and __NOINDEX__ in those namespaces where they function.

INDEX in User: namespace
INDEX in Wikipedia: namespace
  • Pages such as policies, guidelines, and 'any well-recognized stable reference pages' (consensus basis) will remain indexed.
  • Other pages may be individually indexed on a case-by-case basis (consensus basis).
NOINDEX in File: namespace

Some content (non-encyclopedic material such as bug reports, internal project logos, etc) may be noindexed on a consensus basis. A discussion of NOINDEXing non-free media is likely to take place separately from this proposal.

INDEX in Template: namespace
NOINDEX in Category: namespace

'Maintenance' categories will be manually NOINDEXed; all other categories (i.e. content categories) should not be overridden and shall remain indexed.

NOINDEX in Portal: namespace

Implementation

  • Once this page is complete, the community will be asked to consider the proposals to change the index status of the various namespaces as described above. The different parts of this proposal will be put to the community separately so that editors may pick and choose their preferences on a per-namespace basis.
  • For those namespaces where consensus is reached, WMF and technical users will be asked to determine the most appropriate way to implement the decision.
  • Will this be a problem if users rely on Google to find non-content in Wikipedia?
No. In November 2008 the site's internal search was enhanced. The new search handles complex queries of the same kind as Google, and has other features that make it better than Google for searching these spaces.
For example, internal search can handle the same boolean expressions and "page title" searches as Google advanced search, but it now also understands namespaces and page "sections", can look for words with wildcards in them, and so on, which Google cannot. In addition, the many pages that are already NOINDEXed can be searched by internal search, whereas Google cannot see them.
  • What will users need to know?
Users will need to use internal search rather than external search to find material within past discussions. Once they get used to clicking "search" rather than "Google", they will find that the same formats as Google Advanced Search are accepted, and that information more directly useful to Wikipedians searching past discussions is available, such as limiting the search to specific namespaces, or "section" and "section title" information, which they did not have when using Google.
Such a change requires clear advance notice. Users would be notified of the change a month in advance, by a clear banner and noticeboard posts, and directed to a useful link and help information. Other means of making the switchover easy would also be used as fully as possible. New users would pick up "this is how one searches discussions" in the same way that they pick up how to review history revisions, or markup, or any other Wikipedia editorial know-how.
  • What else might happen during the month's advance notice?
By the time the technical side is discussed and a month's notice has passed, it's likely that most of the obvious project space pages needing to be INDEXed, or those where consensus would happen, will have been tagged as INDEXed. Users will be unlikely to wait :)
  • Will this affect Wikipedia's rankings?
Wikipedia is ranked near the top on many topics because its content is very heavily referenced. The impact of this proposal is very difficult to predict.
  • Why is Project space being proposed to be indexed the way it is?
Short answer: the pages we'd want to spider in Projectspace are likely to change relatively slowly in number or location. The ones we don't want to spider will be created at the drop of a hat or be obscure, and are likely to far outnumber them. So we default to not indexing unless decided otherwise.

  • Can a namespace actually be set as "no index, not overridable"?
Short answer: Yes, both MediaWiki developers and en.wiki admins can make these settings, although the most effective solution involves a combination of both.
  • Isn't this page pointless, since the community has already decided that it wants to let pages outside mainspace be indexed?
The community has never had the opportunity to form a consensus on this issue; as explained above, the ability to restrict web crawler access to pages was implemented long after the formation of Wikipedia, and until recently the poor internal search function made noindexing an impossibility. Now that the situation has changed, we can form a legitimate consensus. Don't forget that, even if the community had decided previously that non-mainspace pages should be indexed (which it hasn't), such a consensus can change over time as the situation changes, such as the updated internal search.

See also
