Skip to content

Normal Forms for Context-Free Grammars

A CFG can be transformed into equivalent but structurally simpler forms. These normal forms are essential for parsing algorithms (like CYK) and theoretical proofs.


Eliminating Useless Variables

A symbol is useful if it appears in some derivation SαXβw (from start to terminal string). Otherwise it is useless.

Two Kinds of Uselessness

1. Variables That Derive Nothing

Example: SAB,AaAa,BAB

B can never derive a terminal string (every production for B contains B itself). S also derives nothing because it depends on B.

Discovery Algorithm:

  • Basis: If Aw where wΣ, then A derives something
  • Induction: If AX1X2Xk and all Xi either are terminals or already discovered variables, then A derives something
  • Elimination: Remove all undiscovered variables and all productions containing them

Proof: Induction on parse tree height — if Aw, then A will be discovered in at most h rounds where h is the tree height.

2. Unreachable Symbols

A symbol is reachable if it appears in some sentential form derived from S.

Discovery Algorithm:

  • Basis: S is reachable
  • Induction: If A is reachable and Aα, then all symbols in α are reachable
  • Elimination: Remove unreachable symbols and their productions

Correct Two-Step Elimination

Important: Eliminate non-deriving variables FIRST, then unreachable symbols. Doing it in reverse order can leave useless symbols.

Why: Eliminating non-deriving variables may make previously reachable symbols unreachable.


Eliminating ε-Productions

A production of the form Aε is called an ε-production.

Goal: Eliminate all ε-productions (except possibly Sε if εL(G)).

Nullable Symbols

A is nullable if Aε.

Discovery Algorithm:

  • Basis: If Aε, A is nullable
  • Induction: If AX1X2Xk and ALL Xi are nullable, then A is nullable

Elimination Method

Key idea: Turn each production AX1X2Xk into a family of productions — for each subset of nullable Xi, create a version where those symbols are omitted.

Example: SAB,AaAε,BbBA

  • Nullable: A (because of Aε), then B (because BA), then S (because SAB)
  • New productions for S: SABAB (omit A, omit B, omit both — but Sε removed)
  • A: AaAa (omit nullable A in body)
  • B: BbBbA

Handling εL(G)

If ε is in the language, we introduce a new start symbol S0 with productions S0Sε. The rest of the grammar has no ε-productions.


Eliminating Unit Productions

A unit production is one whose body consists of a single variable: AB.

Finding Unit Pairs

Compute all pairs (A,B) such that AB using only unit productions.

Algorithm: Find all such pairs by transitive closure.

  • Basis: (A,A) is a unit pair for all A
  • Induction: If (A,B) is a unit pair and BC, then (A,C) is a unit pair

Elimination

For each unit pair (A,B), add all non-unit productions of B to A:

  • If Bα (where α is not a single variable), add Aα

Remove all unit productions.


Chomsky Normal Form (CNF)

A CFG is in Chomsky Normal Form if every production is of the form:

  1. ABC (two variables)
  2. Aa (single terminal)

(Plus Sε allowed if εL(G))

CNF Theorem

Every CFL (without ε) has a grammar in CNF.

Proof (Construction)

Step 0 (ε handling): If ε is in the language, handle with S0Sε.

Step 1 (Clean grammar): Eliminate useless variables, ε-productions, and unit productions. Now every production body is either a single terminal or a string of 2 symbols.

Step 2: For every terminal a that appears in a body of length >1, create a new variable A with production Aa, and replace a by A in all longer bodies.

Example: ABcDe → create Cc, Ee, then ABCDE

Step 3: Break right sides longer than 2 into a chain of productions, each with two variables.

Example: ABCDEABC1,C1CD1,D1DE

Significance of CNF

  • Every derivation of a string of length n uses exactly 2n1 steps
  • Parse trees are binary — every interior node has exactly 2 children
  • Enables the CYK parsing algorithm (O(n3))