September 29 * Algorithms for DFAs * DFA State Minimization Although FSM cannot perform all tasks, they are important building blocks As such, we need to find ways to minimize FSM representation (minimize the number of states in a DFA) Minimization algorithm 1. Delete unreachable states 2. Merge indistinguishable (equivalent) states b +====+ a +--+--->+====+ b +--+ --->||q1||--->|q2| a ||q3||<---|q8| +====+ +--+<---+====+ +--+ | ^ | | ^ | ^ b| |a a| b| |a a| |b | | | | | | | v | v v | v | +--+ b +--+ b +--+ a +====+ |q4|---->|q5|<----|q6|<---||q7|| +--+ +--+ +--+ +====+ ^ | | | +--+ a,b The language of this machine is (ab v ba)* 1. Delete unreachable states R = {s} while there is a state p in R and a in Sigma s.t. delta(p,a) not in R do add delta(p,a) to R In this way we remove all states that have no path from the start state. Reachable states are the closure of {s} under the relation {(p,q): delta(p,a) = q for some a in Sigma}. R = {q1} R = {q1,q2,q4} R = {q1,q2,q3,q4,q5} R = {q1,q2,q3,q4,q5,q6} Delete q7 and q8 b +====+ a +--+--->+====+ --->||q1||--->|q2| a ||q3|| +====+ +--+<---+====+ | ^ | | ^ b| |a a| b| |a | | | | | v | v v | +--+ b +--+ b +--+ |q4|---->|q5|<----|q6| +--+ +--+ +--+ ^ | | | +--+ a,b 2. Merge indistinguishable (equivalent) states Two states are equivalent if, from either state, the same set of strings is accepted. States x and y are "equivalent with respect to language L", L in Sigma*, or x ~~_L y ("~~" is actually one ~~ on top of another), if for all z in Sigma* xy in L iff yz in L. ~~_L here is an equivalence relation (clusters states). So x ~~_L y if both strings belong to L or neither is in L AND if appending any string to both of them results in strings both in L or both not in L. Remember that we denote an equivalence class by listing any member of one of the classes in square brackets. An equivalence class here is the set of strings such that for any two elements x and y in the class, x ~~_L y The example DFA has four equivalence classes: (1) [e] This is the language itself (2) [a] This is a string in the language followed by a single a (3) [b] This is a string in the language followed by a single b (4) [aa] This is a string in the language followed by either aa or bb followed by anything (no string after this will lead to a final state) Two strings x,y in Sigma* are "equivalent with respect to M", x ~_M y, if they both drive M from s to the same state. x ~_M y if there is a state q such that (s,x) |-*M (q,e) and (x,y) |-* (q,e). The relation "~_M" is also an equivalence relation - its equivalence classes can be identified by states of M that are reachable from s and have at least one string in the corresponding equivalence class. The class corresponding to state qi is E_qi. The example DFA has six equivalence classes: (1) E_q1 = (ba)* All of these strings end up in state q1 (2) E_q2 = abLa String ab followed by any string in L followed by a (3) E_q3 = abL (4) E_q4 = b(ab)* (5) E_q5 = L(b v aa)Sigma* (6) E_q6 = abLb For any DFA M and strings x,y in Sigma*, if x ~_M y, then x ~~_L y. Thus ~_M is a "refinement" of ~~_L in that ~_M implies ~~_L but not the other way around. The equivalence relation that relates any two cities in the US that are in the same county is a refinement of the relation between any two cities that are in the same state. If we can whittle the equivalence classes of ~_M down to the equivalence classes of ~~_L, we have minimized the DFA. Myhill-Nerode Theorem Let L in Sigma* be a regular language. Then there is a DFA with the same number of states as there are equivalence classes in ~~L. In our example, we can merge q1 and q3 and merge q4 and q6 yielding a four-state DFA b +====+ a +--+ --->||q1||--->|q2| +====+ +--+ | ^ | b| |a a| | | | v | v +--+ b +--+ |q4|---->|q5| +--+ +--+ ^ | | | +--+ a,b Algorithm to distinguish states (1) Every final state is distinguished from every nonfinal state (2) Forall pairs p,q not distinguished, check if there exists a s.t. delta(p,q), delta(q,a) are distinguished. If so, make p,q distinguished. Example 1 1 +-+ +------+ | | | | v | v | +-+ 0 +-+ 0 +-+ 0 +===+ |a|--->|b|--->|c|--->||e|| +-+ +-+ +-+ +===+ ^ \1 ^-\ / 1| | \ / | 0 1| \ / \ | | \+v v+ \1v | +-+ 0 +===+ +------|d|--->||f|| +-+ +===+ ^ | | | +---+ 0 a +-+ b |3| +-+-+ c |2|2| +-+-+-+ d |2|2| | +-+-+-+-+ e |1|1|1|1| +-+-+-+-+-+ f |1|1|1|1| | +-+-+-+-+-+ a b c d e PAIRS INPUT RESULT RESULT a,b 0,1 b,c a,d a,c 0,1 *b,e* a,b a,d 0,1 *b,f* a,b b,c 0,1 *c,e* b,d b,d 0,1 *c,f* d,b c,d 0,1 e,f b,b e,f 0,1 f,f d,c a,b 0,1 *b,c* a,c c,d 0,1 e,f b,b e,f 0,1 f,f d,c 1 0 +-+ +--+ | | | | v | 0,1 0 v | +-+ 0 +-+--->+---+--->+=====+ |a|--->|b| |c,d| ||e,f|| +-+ +-+<---+---+<---+=====+ 1 1 * Context-Free Languages Proper subset of regular languages +-------+ | CFL | | | | +---+ | | |REG| | | +---+ | | | +-------+ For example, {0^n 1^n : n>=0} is a context-free language. We will look at a language generator and a language recognizer for CFL. Humans do this regularly, we know that "the cad is in the hat" is syntactically correct, while "hat the the in is cat" is not. Humans generate valid language when they speak and when they write. Consider the re a(a* v b*)b. How can we rewrite this re using a set of rules? Let S be a new "start" symbol that we will eventually replace with a string in the language. S is replaced with a single a followed by a middle part followed by a b. Let M be a new symbol that is a place holder for the middle part. We can then write our rule as S -> aMb Now we need a rule for the middle part. M can be rewritten as a string of 0 or more "a"s or a string of 0 or more "b"s. We will let A be a place holder for the string of "a"s and let B be a place holder for the string of "b"s: M -> A M -> B (either rule can be selected) How do we rewrite A and B? A can be replaced by e or by a single a followed by a string of 0 or more "a"s. We can rewrite B similarly. A -> e A -> aA B -> e B -> bB To generate the string abbbbb we start with S and apply the rules appropriately. S -> aMb -> aBb -> abBb -> abbBb -> abbbBb -> abbbbBb -> abbbbb Context-Free Grammars (CFG) G = (V, Sigma, R, S) V: alphabet Sigma: Sigma in V (finite set of terminals) R: finite subset of (V - Sigma) x V* (rules) S: S in V - Sigma (start symbol) Example G = ({S}, {0,1}, {S->0, S->1, S->e, S->0S0, S->1S1}, S) generates palindromes. Note that these rules are context free because we only see a single nonterminal on the left hand side of each rewrite rule. We rewrite the symbol independent of symbols around the nonterminal, or independently of the symbol's context.