Proximity search (text)

In text processing, a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search.

For example, a search could be used to find "red brick house", and match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered or spread across a page or in unrelated articles in an anthology.

Rationale

edit

The basic linguistic assumption of proximity searching is that the proximity of the words in a document implies a relationship between the words. Given that authors of documents try to formulate sentences which contain a single idea, or cluster of related ideas within neighboring sentences or organized into paragraphs, there is an inherent, relatively high, probability within the document structure that words used together are related. On the other hand, when two words are on the opposite ends of a book, the probability of a relationship between the words is relatively weak. By limiting search results to only include matches where the words are within the specified maximum proximity, or distance, the search results are assumed to be of higher relevance than the matches where the words are scattered.

Commercial internet search engines tend to produce too many matches (known as recall) for the average search query. Proximity searching is one method of reducing the number of pages matches, and to improve the relevance of the matched pages by using word proximity to assist in ranking. As an added benefit, proximity searching helps combat spamdexing by avoiding webpages which contain dictionary lists or shotgun lists of thousands of words, which would otherwise rank highly if the search engine was heavily biased toward word frequency.

Boolean syntax and operators

edit

Note that a proximity search can designate that only some keywords must be within a specified distance. Proximity searching can be used with other search syntax and/or controls to allow more articulate search queries. Sometimes query operators like NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE or FAR are used to indicate a proximity-search limit between specified keywords: for example, "brick NEAR house".

Usage in commercial search engines

edit

In regards to implicit/automatic versus explicit proximity search, as of November 2008, most Internet search engines only implement an implicit proximity search functionality. That is, they automatically rank those search results higher where the user keywords have a good "overall proximity score" in such results. If only two keywords are in the search query, this has no difference from an explicit proximity search which puts a NEAR operator between the two keywords. However, if three or more than three keywords are present, it is often important for the user to specify which subsets of these keywords expect a proximity in search results. This is useful if the user wants to do a prior art search (e.g. finding an existing approach to complete a specific task, finding a document that discloses a system that exhibits a procedural behavior collaboratively conducted by several components and links between these components).

Web search engines which support proximity search via an explicit proximity operator in their query language include Walhello, Exalead, Yandex, Yahoo!, Altavista, and Bing:

  • When using the Walhello search-engine, the proximity can be defined by the number of characters between the keywords.[1]
  • The search engine Exalead allows the user to specify the required proximity, as the maximum number of words between keywords. The syntax is (keyword1 NEAR/n keyword2) where n is the number of words.[2]
  • Yandex uses the syntax keyword1 /n keyword2 to search for two keywords separated by at most   words, and supports a few other variations of this syntax.[3]
  • Yahoo! and Altavista both support an undocumented NEAR operator.[4][5] The syntax is keyword1 NEAR keyword2.
  • Google Search supports AROUND(#).[6][7]
  • Bing supports NEAR.[8] The syntax is keyword1 near:n keyword2 where n=the number of maximum separating words.

Ordered search within the Google and Yahoo! search engines is possible using the asterisk (*) full-word wildcards: in Google this matches one or more words,[9] and an in Yahoo! Search this matches exactly one word.[10] (This is easily verified by searching for the following phrase in both Google and Yahoo!: "addictive * of biblioscopy".)

To emulate unordered search of the NEAR operator can be done using a combination of ordered searches. For example, to specify a close co-occurrence of "house" and "dog", the following search-expression could be specified: "house dog" OR "dog house" OR "house * dog" OR "dog * house" OR "house * * dog" OR "dog * * house".

See also

edit

Notes

edit