Chomsky–Schützenberger enumeration theorem

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let $\mathbb {N}$ denote the set of nonnegative integers. A power series over $\mathbb {N}$ is an infinite series of the form

f=f(x)=\sum _{k=0}^{\infty }a_{k}x^{k}=a_{0}+a_{1}x^{1}+a_{2}x^{2}+a_{3}x^{3}+\cdots

with coefficients $a_{k}$ in $\mathbb {N}$. The product of two formal power series $f(x)=\sum _{k}a_{k}x^{k}$ and $g(x)=\sum _{k}b_{k}x^{k}$ is defined in the expected way, as the convolution of the coefficient sequences $a_{k}$ and $b_{k}$:

f(x)\cdot g(x)=\sum _{k=0}^{\infty }\left(\sum _{i=0}^{k}a_{i}b_{k-i}\right)x^{k}.
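The convolution formula can be tried out on truncated series. The following is a minimal sketch, assuming a series is stored as a Python list whose entry `k` is the coefficient of $x^{k}$; the function name `multiply` and the truncation parameter `n_terms` are illustrative choices, not notation from the sources.

```python
def multiply(f, g, n_terms):
    """Return the first n_terms coefficients of the product f * g.

    f and g are lists of coefficients: f[k] is the coefficient of x^k.
    """
    product = [0] * n_terms
    for k in range(n_terms):
        for i in range(k + 1):
            # Coefficients beyond the end of a list are treated as zero.
            if i < len(f) and k - i < len(g):
                product[k] += f[i] * g[k - i]
    return product


if __name__ == "__main__":
    geometric = [1] * 6  # 1 + x + x^2 + ... truncated after six terms
    print(multiply(geometric, geometric, 6))  # prints [1, 2, 3, 4, 5, 6]
```

Only the coefficients up to degree `n_terms - 1` are meaningful, since the coefficient of $x^{k}$ in the product depends only on coefficients of degree at most $k$ in the factors.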

In particular, we write $f^{2}=f(x)\cdot f(x)$, $f^{3}=f(x)\cdot f(x)\cdot f(x)$, and so on. In analogy to algebraic numbers, a power series $f(x)$ is called algebraic over $\mathbb {Q} (x)$ if there exist finitely many polynomials $p_{0}(x),p_{1}(x),p_{2}(x),\ldots ,p_{n}(x)$, each with rational coefficients and not all zero, such that

p_{0}(x)+p_{1}(x)\cdot f+p_{2}(x)\cdot f^{2}+\cdots +p_{n}(x)\cdot f^{n}=0.
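A standard illustration, not taken from the sources cited here, is the generating function $C(x)$ of the Catalan numbers, which satisfies $1-C+x\cdot C^{2}=0$, so that $p_{0}(x)=1$, $p_{1}(x)=-1$ and $p_{2}(x)=x$. The sketch below checks this identity on truncated series; all function names are hypothetical helpers introduced for the example.

```python
from math import comb


def catalan_series(n_terms):
    """First n_terms Catalan numbers, as coefficients of C(x)."""
    return [comb(2 * k, k) // (k + 1) for k in range(n_terms)]


def multiply(f, g, n_terms):
    """Truncated product of two coefficient lists of length >= n_terms."""
    return [sum(f[i] * g[k - i] for i in range(k + 1)) for k in range(n_terms)]


def residual(n_terms):
    """Coefficients of 1 - C(x) + x * C(x)^2, truncated to n_terms terms."""
    c = catalan_series(n_terms)
    c_squared = multiply(c, c, n_terms)
    x_c_squared = ([0] + c_squared)[:n_terms]  # multiplying by x shifts by one
    constant = [1] + [0] * (n_terms - 1)
    return [constant[k] - c[k] + x_c_squared[k] for k in range(n_terms)]


if __name__ == "__main__":
    print(catalan_series(6))  # prints [1, 1, 2, 5, 14, 42]
    print(residual(6))        # prints [0, 0, 0, 0, 0, 0]
```

A vanishing residual on finitely many coefficients is only a sanity check, not a proof of algebraicity.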

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. With these notions in place, the theorem can be stated as follows.

Chomsky–Schützenberger theorem. If $L$ is a context-free language admitting an unambiguous context-free grammar, and $a_{k}:=|L\cap \Sigma ^{k}|$ is the number of words of length $k$ in $L$, then $G(x)=\sum _{k=0}^{\infty }a_{k}x^{k}$ is a power series over $\mathbb {N}$ that is algebraic over $\mathbb {Q} (x)$.

Proofs of this theorem are given by Kuich & Salomaa (1985), and by Panholzer (2005).

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar as n grows large. The following example is given by Gruber, Lee & Shallit (2012): the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules:

S → M | U

M → 0M1M | ε

U → 0S | 0M1U.

To obtain an algebraic representation of the power series $G(x)$ associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer 1, each occurrence of '→' by '=', and each occurrence of '|' by '+'. Concatenation on the right-hand side of each rule corresponds to multiplication in the resulting equations. This yields the following system of equations:

S = M + U

M = M²x² + 1

U = Sx + MUx²
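One way to read word counts off such a system is to iterate it on truncated power series, starting from zero, until the coefficients stop changing. The following is a minimal sketch of that idea for the three equations above, assuming series are stored as coefficient lists; the helper names and the choice of fixed-point iteration are illustrative and are not claimed to be the method used by Gruber, Lee & Shallit.

```python
def multiply(f, g, n_terms):
    """Truncated product of two coefficient lists of length n_terms."""
    return [sum(f[i] * g[k - i] for i in range(k + 1)) for k in range(n_terms)]


def shift(f, d, n_terms):
    """Multiply the series f by x^d and truncate to n_terms coefficients."""
    return ([0] * d + f)[:n_terms]


def add(f, g):
    """Coefficient-wise sum of two series of equal length."""
    return [a + b for a, b in zip(f, g)]


def word_counts(n_terms):
    """First n_terms coefficients of S, obtained by iterating the system."""
    one = ([1] + [0] * n_terms)[:n_terms]
    S = [0] * n_terms
    M = [0] * n_terms
    U = [0] * n_terms
    # Each pass fixes at least one further coefficient of S, so
    # n_terms + 1 passes are enough for the requested precision.
    for _ in range(n_terms + 1):
        M = add(shift(multiply(M, M, n_terms), 2, n_terms), one)  # M = M^2 x^2 + 1
        U = add(shift(S, 1, n_terms),
                shift(multiply(M, U, n_terms), 2, n_terms))       # U = S x + M U x^2
        S = add(M, U)                                             # S = M + U
    return S


if __name__ == "__main__":
    print(word_counts(8))  # prints [1, 1, 2, 3, 6, 10, 20, 35]
```

The iteration converges because every product on the right-hand sides carries a factor of x, so the coefficient of degree k only depends on coefficients of lower degree.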

In this system of equations, S, M, and U are functions of x, so one could also write $S(x)$, $M(x)$, and $U(x)$. Eliminating M and U and solving for S yields a single algebraic equation:

x(2x-1)S^{2}+(2x-1)S+1=0.

This quadratic equation has two solutions for S, one of which is the algebraic power series $G(x)$. By applying methods from complex analysis to this equation, the number $a_{n}$ of words of length n generated by G can be estimated as n grows large. In this case, one obtains $a_{n}\in O((2+\epsilon )^{n})$ but $a_{n}\notin O((2-\epsilon )^{n})$ for each $\epsilon >0$.^[1]
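The growth rate can also be observed numerically. Expanding the quadratic and isolating S gives $S=1+2xS-xS^{2}+2x^{2}S^{2}$, in which every term on the right except the constant carries a factor of x; comparing coefficients therefore determines $a_{k}$ from earlier coefficients. The sketch below uses this recurrence and prints $a_{n}^{1/n}$ for a few values of n. It is an illustration under these assumptions, with a hypothetical function name, and it is no substitute for the complex-analytic argument.

```python
def counts_from_quadratic(n_terms):
    """Coefficients a[0..n_terms-1] of the power series solution S.

    Uses S = 1 + 2xS - xS^2 + 2x^2 S^2, read off coefficient by coefficient.
    """
    a = []       # a[k]: coefficient of x^k in S
    square = []  # square[k]: coefficient of x^k in S^2
    for k in range(n_terms):
        value = 1 if k == 0 else 0
        if k >= 1:
            value += 2 * a[k - 1] - square[k - 1]
        if k >= 2:
            value += 2 * square[k - 2]
        a.append(value)
        square.append(sum(a[i] * a[k - i] for i in range(k + 1)))
    return a


if __name__ == "__main__":
    a = counts_from_quadratic(201)
    print(a[:8])  # prints [1, 1, 2, 3, 6, 10, 20, 35]
    for n in (10, 50, 200):
        # The n-th root of a[n] should approach 2 from below as n grows.
        print(n, a[n] ** (1.0 / n))
```

The recurrence selects the power series solution of the quadratic automatically, because the other solution has no expansion in nonnegative powers of x.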

The following example is from Bassino & Nicaud (2011):

\left\{{\begin{array}{l }{S\rightarrow XY}\\{T\rightarrow aT|TbT|YcY}\\{Y\rightarrow YaY|cY|abTaYYa|X}\\{X\rightarrow a|b|c}\end{array}}\Rightarrow \left\{{\begin{array}{l}s(z)=x(z)y(z)\\t(z)=zt(z)+zt(z)^{2}+zy(z)^{2}\\y(z)=zy(z)^{2}+zy(z)+z^{4}t(z)y(z)^{2}+x(z)\\x(z)=3z\end{array}}\right.\right.

which simplifies to

s(z)^{8}-27\left(z^{3}-z^{2}\right)s(z)^{5}+\ldots +59049z^{10}=0

Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language $L_{G}$ over the alphabet $\{a,b\}$ consists of the words $a^{n_{1}}ba^{n_{2}}b\cdots a^{n_{p}}b$ with $p\geq 1$ , $n_{i}>0$ for $i\in \{1,2,\ldots ,p\}$ , and $n_{j}\neq j$ for some $j\in \{1,2,\ldots ,p\}$ .

It is comparatively easy to show that the language $L_{G}$ is context-free.^[2] The harder part is to show that no unambiguous grammar generates $L_{G}$. This can be proved as follows: if $g_{k}$ denotes the number of words of length $k$ in $L_{G}$, then the associated power series is $G(x)=\sum _{k=0}^{\infty }g_{k}x^{k}={\frac {1-x}{1-2x}}-{\frac {1}{x}}\sum _{k\geq 1}x^{k(k+1)/2-1}$. Using methods from complex analysis, one can prove that this function is not algebraic over $\mathbb {Q} (x)$. By the Chomsky–Schützenberger theorem, it follows that $L_{G}$ does not admit an unambiguous context-free grammar.^[3]

Notes

[1] See Gruber, Lee & Shallit (2012) for a detailed exposition.
[2] Berstel & Boasson (1990).
[3] See Berstel & Boasson (1990) for a detailed account.

References

Bassino, Frederique; Nicaud, Cyril (December 16, 2011). "Philippe Flajolet & Analytic Combinatorics: Inherent Ambiguity of Context-Free Languages" (PDF). inria.fr. Retrieved 5 April 2023.
Berstel, Jean; Boasson, Luc (1990). "Context-free languages" (PDF). In van Leeuwen, Jan (ed.). Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics. Elsevier and MIT press. pp. 59–102. ISBN 0-444-88074-7.
Chomsky, Noam; Schützenberger, Marcel-Paul (1963). "The Algebraic Theory of Context-Free Languages" (PDF). In P. Braffort and D. Hirschberg, eds., Computer Programming and Formal Systems (pp. 118–161). Amsterdam: North-Holland.
Flajolet, Philippe; Sedgewick, Robert (2009). Analytic Combinatorics. Cambridge: Cambridge University Press. ISBN 978-0-521-89806-5.
Gruber, Hermann; Lee, Jonathan; Shallit, Jeffrey (2012). "Enumerating regular expressions and their languages". arXiv:1204.4982 [cs.FL].
Kuich, Werner; Salomaa, Arto (1985). Semirings, Automata, Languages. Berlin: Springer-Verlag. ISBN 978-3-642-69961-0.
Panholzer, Alois (2005). "Gröbner Bases and the Defining Polynomial of a Context-free Grammar Generating Function". Journal of Automata, Languages and Combinatorics. 10: 79–97.
