Parikh's theorem in theoretical computer science says that if one looks only at the number of occurrences of each terminal symbol in a context-free language, without regard to their order, then the language is indistinguishable from a regular language.[1] It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar.[2] It was first proved by Rohit Parikh in 1961[3] and republished in 1966.[4]

Definitions and formal statement edit

Let   be an alphabet. The Parikh vector of a word is defined as the function  , given by[1]

 
where   denotes the number of occurrences of the letter   in the word  .

A subset of   is said to be linear if it is of the form

 
for some vectors  . A subset of   is said to be semi-linear if it is a union of finitely many linear subsets.

Theorem — Let   be a context-free language or a regular language, let   be the set of Parikh vectors of words in  , that is,  . Then   is a semi-linear set.

If   is any semi-linear set, then there exists a regular language (which a fortiori is context-free) whose Parikh vectors is  .

In short, the image under   of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets.

Two languages are said to be commutatively equivalent if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.

Proof edit

The second part is easy to prove.

Proof

Given semi-linear set  , to construct a regular language whose set of Parikh vectors is  .

  is a union of 0 or more linear sets. Since the empty language is regular, and union of regular languages is regular, it suffices to prove that any linear set is the set of Parikh vectors of a regular language.

Let  , then it is the set of Parikh vectors of  , where each   has Parikh vector  .

The first part is less easy. The following proof is credited to Goldstine.[5]

First we need a small strengthening of the pumping lemma for context-free languages:

Lemma — If   is generated by a Chomsky normal form grammar, then  , such that

For any  , and for any   with  , there exists a way to split   into segments  , and a nonterminal symbol  , such that

  for all  , and  

 

The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find   copies of some nonterminal symbol   in the longest path in the shortest derivation tree.

Now we prove the first part of Parikh's theorem, making use of the above lemma.

Proof

First, construct a Chomsky normal form grammar for  .

For each finite nonempty subset of nonterminals  , define   to be the set of sentences in   such that there exists a derivation that uses every nonterminal in  , no more and no less. It is clear that  , so it suffices to prove that each   is a semilinear set.

Now fix some  , and let  . We construct two finite sets  , such that  , which is obviously semilinear.

For notational clarity, write   to mean "there exists a derivation using no more (but possibly less) than nonterminals in  . With that, we define   as follows:

 
 

To prove  , we induct on the length of  .

If  , then  , so  . Otherwise, by the strengthened pumping lemma, there exists a derivation of   using precisely the elements of  , and is of the form

 
where  ,  , and  .
Since there are only   elements in  , but there are   sub-derivations   in the middle, we may delete one sub-derivation  , and obtain a shorter   that is still in  , with

 
By induction,  , and by construction,  , so  .

To prove  , we induct on the length of  .

If   then by construction,  .
Otherwise,   for some   and  .
By induction,   for some  . By construction of  , there exists some   such that  .
By construction of  , the symbol   appears in a derivation of   using exactly all of  . Then we can interpolate   into that derivation to obtain some   such that

 

Strengthening for bounded languages edit

A language   is bounded if   for some fixed words  . Ginsburg and Spanier [6] gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages.

Call a linear set stratified, if in its definition for each   the vector   has the property that it has at most two non-zero coordinates, and for each   if each of the vectors   has two non-zero coordinates,   and  , respectively, then their order is not  . A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.

Ginsburg-Spanier — A bounded language   is context-free if and only if   is a stratified semi-linear set.

Significance edit

The theorem has multiple interpretations. It shows that a context-free language over a singleton alphabet must be a regular language and that some context-free languages can only have ambiguous grammars[further explanation needed]. Such languages are called inherently ambiguous languages. From a formal grammar perspective, this means that some ambiguous context-free grammars cannot be converted to equivalent unambiguous context-free grammars.

References edit

  1. ^ a b Kozen, Dexter (1997). Automata and Computability. New York: Springer-Verlag. ISBN 3-540-78105-6.
  2. ^ Håkan Lindqvist. "Parikh's theorem" (PDF). Umeå Universitet.
  3. ^ Parikh, Rohit (1961). "Language Generating Devices". Quarterly Progress Report, Research Laboratory of Electronics, MIT.
  4. ^ Parikh, Rohit (1966). "On Context-Free Languages". Journal of the Association for Computing Machinery. 13 (4): 570–581. doi:10.1145/321356.321364. S2CID 12263468.
  5. ^ Goldstine, J. (1977-01-01). "A simplified proof of parikh's theorem". Discrete Mathematics. 19 (3): 235–239. doi:10.1016/0012-365X(77)90103-0. ISSN 0012-365X.
  6. ^ Ginsburg, Seymour; Spanier, Edwin H. (1966). "Presburger formulas, and languages". Pacific Journal of Mathematics. 16 (2): 285–296. doi:10.2140/pjm.1966.16.285.