Lexical Analysis (CS440/540)
Lexical Analysis
• Process: converting input string (source program) into substrings (tokens)
• Input: source program
• Output: a sequence of tokens
• The component that does this is called a lexer, tokenizer, or scanner
Token and Lexeme
• Token: a syntactic category
• Lexeme: instance of the token
Token        Sample lexemes
keyword      if, else, for, while, …
whitespace   ‘ ’, ‘\t’, ‘\n’, …
comparison   <, >, ==, !=, …
identifier   total, score, name, …
number       1, 3.14159, 0, …
literal      “Super nice cool compiler”, “ComS”, …
Basic design
1. Define a finite set of tokens.
   • Keyword, whitespace, identifier, …
2. Describe which strings belong to each token.
   • keyword: “if” or “else” or “for” or …
   • whitespace: non-empty sequence of blanks, newlines, and tabs
   • identifier: strings of letters or digits, starting with a letter
Analysis example
    if (i == j)
        z = 0;
    else
        z = 1;
As a character string: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Identifier: i, j, z
• Keyword: if, else
• Comparison: ==
• Number: 0, 1
• Whitespace: ‘ ’, ‘\t’, ‘\n’
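The classification above can be reproduced with a small regex-based scanner. This is only a sketch in Python: the `punct` class (for `(`, `)`, `=`, `;`) is an assumption added so the scan does not stop on characters the slide leaves unclassified, and the number pattern covers integers only.

```python
import re

# Token classes from the slide, in priority order. "punct" is an assumed
# extra class; the slide does not classify '(', ')', '=' or ';'.
TOKEN_SPEC = [
    ("keyword",    r"\b(?:if|else|for|while)\b"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("number",     r"[0-9]+"),
    ("comparison", r"==|!=|<|>"),
    ("whitespace", r"[ \t\n]+"),
    ("punct",      r"[();=]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs; raise on an unknown character."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group()))  # lastgroup = name of the class that matched
        pos = m.end()
    return tokens

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
tokens = tokenize(source)
for kind in ("identifier", "keyword", "comparison", "number"):
    print(kind, sorted({lexeme for token, lexeme in tokens if token == kind}))
```

Concatenating the lexemes in order gives back the original string, which is a handy sanity check for any scanner.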
What would you do?
• Foo<Bar<Bazz>>
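• These are nested templates in C++.
• However, do you see any conflict?
• Compare Foo<Bar<Bazz>> with cin >> var: the characters “>>” close two template argument lists in the first case but form a single operator in the second.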
Alphabet, String, and Language
• Alphabet (Σ)
  • Any finite set of symbols.
• String over an alphabet
  • A finite sequence of symbols drawn from that alphabet.
• Language (L)
  • Any countable set of strings over some fixed alphabet.
  • Formally: let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.
Alphabet             Language
English characters   English sentences
ASCII                C programs
Operations on Languages
• Single character
  • ‘c’ = {"c"}
• Epsilon
  • ε = {""}
• Union
  • A + B = {s | s ∈ A or s ∈ B}
• Concatenation
  • AB = {ab | a ∈ A and b ∈ B}
• Iteration
  • A* = ⋃_{i ≥ 0} A^i, where A^i = A…A (i times) and A^0 = {""}
Example
• L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}
• L + D
  • the set of letters and digits: each string is either one letter or one digit
  • “A”, “g”, “1”, …
• LD
  • the set of strings of length two, each consisting of one letter followed by one digit
  • “c4”, “j8”, “y6”, …
• L^4
  • the set of all 4-letter strings
  • “else”, “Code”, “wxyz”, …
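For finite languages, the operations can be tried out directly on Python sets. A minimal sketch: the function names are illustrative, and since A* is infinite whenever A contains a non-empty string, `star_up_to` only builds the finite part A^0 ∪ … ∪ A^n.

```python
from itertools import product
import string

def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}

def power(a, i):
    """A^i: A concatenated with itself i times; A^0 = {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result

def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0 .. A^n."""
    out = set()
    for i in range(n + 1):
        out |= power(a, i)
    return out

L = set(string.ascii_letters)   # 52 one-letter strings
D = set(string.digits)          # 10 one-digit strings

print(len(union(L, D)))             # 62
print(len(concat(L, D)))            # 520
print("c4" in concat(L, D))         # True
print(sorted(star_up_to({"a"}, 3))) # ['', 'a', 'aa', 'aaa']
```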
Regular Expressions
• A regular expression describes a language by combining the language operations above, applied to the symbols of some alphabet.
Example
• Keyword
  • “if” or “else” or “for” or …
  • keyword = ‘if’ + ‘else’ + ‘for’ + …
Examples
• Integer
  • non-empty string of digits
  • digit = ‘0’ + ‘1’ + … + ‘9’
  • integer = digit digit*
• Definitions
  • A*: zero or more occurrences of A
  • A+ = AA*: one or more occurrences of A (postfix +, distinct from the binary + used for union), so integer = digit+
  • A?: zero or one occurrence of A
Examples
• Identifier
  • Strings of letters or digits, starting with a letter
  • letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’
  • digit = ‘0’ + ‘1’ + … + ‘9’
  • identifier = letter (letter + digit)*
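The integer and identifier definitions translate almost symbol for symbol into Python's `re` syntax, where union is written `|` and character ranges replace the long unions of single characters. A small sketch, with helper names chosen only for illustration:

```python
import re

DIGIT = "[0-9]"        # digit  = '0' + '1' + ... + '9'
LETTER = "[A-Za-z]"    # letter = 'A' + ... + 'Z' + 'a' + ... + 'z'

INTEGER = re.compile(f"{DIGIT}{DIGIT}*")                    # digit digit*  (same language as digit+)
IDENTIFIER = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")   # letter (letter + digit)*

def is_integer(s):
    """True if the whole string s is in L(integer)."""
    return INTEGER.fullmatch(s) is not None

def is_identifier(s):
    """True if the whole string s is in L(identifier)."""
    return IDENTIFIER.fullmatch(s) is not None

print(is_integer("2592"), is_integer(""))         # True False
print(is_identifier("x1"), is_identifier("1x"))   # True False
```

`fullmatch` is used because membership in L(R) is a statement about the whole string, not about a prefix of it.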
More Examples
• Phone number
  • (515)-294-8813
  • Σ = digits ∪ {-, (, )}
  • area = digit^3
  • exchange = digit^3
  • phone = digit^4
  • phone number = ‘(’ area ‘)-’ exchange ‘-’ phone
More Examples
• Email address
  • [email protected]
  • Σ = letters ∪ {., @}
  • name = letter+
  • address = name ‘@’ name ‘.’ name
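Both definitions can be checked mechanically. The sketch below writes them as Python regular expressions; `user@example.com` is a made-up address used only for testing, and the pattern deliberately follows the simplified slide definition (letters only) rather than real email syntax.

```python
import re

# phone number = '(' area ')-' exchange '-' phone, with area = exchange = digit^3, phone = digit^4
PHONE = re.compile(r"\([0-9]{3}\)-[0-9]{3}-[0-9]{4}")

# address = name '@' name '.' name, with name = letter+
NAME = r"[A-Za-z]+"
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}")

print(PHONE.fullmatch("(515)-294-8813") is not None)     # True
print(PHONE.fullmatch("515-294-8813") is not None)       # False
print(ADDRESS.fullmatch("user@example.com") is not None) # True (hypothetical address)
print(ADDRESS.fullmatch("user@example") is not None)     # False
```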
An algorithm for lexical analysis
• Transition diagram
  • Flowchart with states and edges; each edge is labelled with characters; a certain subset of the states is marked as “final states.”
  • Transitions from state to state proceed along edges according to the next input character.
  • Every string that ends up at a final state is accepted.
  • If the diagram gets “stuck” (there is no transition for a given character), it is an error.
  • Transition diagrams can be easily translated to programs using if or case statements.
Implementation
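state0:
    c = getchar();
    if (isalpha(c)) { token += c; goto state1; }  // first character must be a letter
    error();                                      // assumed not to return
state1:
    c = getchar();
    if (isalpha(c) || isdigit(c)) { token += c; goto state1; }  // rest: letters or digits
    if (isdelimiter(c)) goto state2;              // a delimiter ends the identifier
    error();
state2:
    return token;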
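The same three-state diagram can be written without goto by keeping the state in a variable. The Python sketch below makes several assumptions the slide leaves open: the delimiter set is invented, end of input is treated as a delimiter, and the delimiter is left unread for the next token.

```python
DELIMITERS = set(" \t\n();=<>!")   # assumed; the slide does not define isdelimiter

def scan_identifier(text, pos=0):
    """Run the identifier diagram from text[pos]; return (token, next_pos).

    Raises ValueError where the diagram gets stuck.
    """
    state, token = 0, ""
    while True:
        c = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == 0:
            if c.isascii() and c.isalpha():        # state0: must see a letter
                token += c
                pos += 1
                state = 1
            else:
                raise ValueError(f"stuck in state0 on {c!r}")
        else:
            if c.isascii() and c.isalnum():        # state1: letters or digits
                token += c
                pos += 1
            elif c == "" or c in DELIMITERS:       # state2: accept
                return token, pos
            else:
                raise ValueError(f"stuck in state1 on {c!r}")

print(scan_identifier("total = 1"))   # ('total', 5)
```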
Finite automata
• Finite automata
  • Deterministic Finite Automata (DFAs)
  • Non-deterministic Finite Automata (NFAs)
Notation
• Given a string s and a regular expression R, is s ∈ L(R)?
• There is variation in regular expression notation
  • Union: A + B ≡ A | B
  • Option: A + ε ≡ A?
  • Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z]
  • Excluded range: complement of [a-z] ≡ [^a-z]
Lexical Spec → Regular Expressions (1)
1. Write a regular expression for the lexemes of each token
   • Number = digit+
   • Keyword = ‘if’ + ‘else’ + …
   • Identifier = letter (letter + digit)*
   • OpenPar = ‘(’
   • ClosePar = ‘)’
2. Construct R, matching all lexemes for all tokens
   • R = Keyword + Identifier + Number + … = R1 + R2 + R3 + …
Lexical Spec → Regular Expressions (2)
3. Let the input be x1…xn
   • For 1 ≤ i ≤ n, check whether x1…xi ∈ L(R)
4. If the check succeeds, then we know that
   • x1…xi ∈ L(Rj) for some j
5. Remove x1…xi from the input and go to (3)
Example input: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Ambiguities
• What if x1…xi ∈ L(R) and also x1…xj ∈ L(R), with i ≠ j?
  • Possible rule: pick the longest possible string in L(R)
• What if x1…xi ∈ L(Rj) and also x1…xi ∈ L(Rk), with j ≠ k?
  • Possible rule: use the rule listed first
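Both rules can be made explicit in a scanner loop: try every rule at the current position, keep the longest match, and let the earlier rule win a tie. A minimal Python sketch follows; the rule list, including the `punct` class, is an assumption chosen for illustration.

```python
import re

# Rules in priority order (R1, R2, ...). Earlier rules win ties.
RULES = [
    ("keyword",    re.compile(r"if|else|for|while")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number",     re.compile(r"[0-9]+")),
    ("comparison", re.compile(r"==|!=|<|>")),
    ("punct",      re.compile(r"[();=]")),          # assumed extra class
    ("whitespace", re.compile(r"[ \t\n]+")),
]

def lex(text):
    """Split text into (token, lexeme) pairs using longest match, then rule order."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

print(lex("iffy == if"))
```

Here `iffy` is an identifier because the identifier match (4 characters) is longer than the keyword match `if` (2 characters), while a bare `if` ties at 2 characters and goes to the keyword rule, which is listed first.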
Finite Automata
• A finite automaton consists of
  • An input alphabet Σ
  • A set of states S
  • A start state n
  • A set of accepting states F ⊆ S
  • A set of transitions state --input--> state
Finite Automata
• Transition
  • s1 --a--> s2
• Is read:
  • In state s1 on input “a”, go to state s2
• If the input has ended and the automaton is in an accepting state, accept
• Otherwise, reject
Finite Automata State Graphs
Simple examples
• A finite automaton that accepts only “1”
• A finite automaton accepting any number of 1’s followed by a single 0
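The five components of a finite automaton map directly onto a small data structure. The sketch below is one possible Python encoding of the two simple examples; the state names `A` and `B` are assumptions, since the state graphs themselves are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    alphabet: set       # input alphabet Σ
    states: set         # set of states S
    start: str          # start state n
    accepting: set      # accepting states F ⊆ S
    transitions: dict   # (state, symbol) -> state

    def accepts(self, text):
        """Follow one transition per input symbol; reject if stuck."""
        state = self.start
        for symbol in text:
            state = self.transitions.get((state, symbol))
            if state is None:          # no transition for this symbol
                return False
        return state in self.accepting

# Accepts only "1".
only_one = DFA({"0", "1"}, {"A", "B"}, "A", {"B"}, {("A", "1"): "B"})

# Accepts any number of 1's followed by a single 0.
ones_then_zero = DFA({"0", "1"}, {"A", "B"}, "A", {"B"},
                     {("A", "1"): "A", ("A", "0"): "B"})

print(only_one.accepts("1"), only_one.accepts("11"))                # True False
print(ones_then_zero.accepts("1110"), ones_then_zero.accepts("100")) # True False
```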
And Another Example
• Alphabet {0,1}
• What language does this recognize?
  • (1*0(0+1?|1))+
Epsilon Moves
• Machine can move from state A to state B without reading input
Deterministic and Nondeterministic Automata
• Deterministic Finite Automata (DFA)
  • One transition per input per state
  • No ε-moves
• Nondeterministic Finite Automata (NFA)
  • Can have multiple transitions for one input in a given state
  • Can have ε-moves
Execution of Finite Automata
• A DFA can take only one path through the state graph
  • Completely determined by input
• NFAs can choose
  • Whether to make ε-moves
  • Which of multiple transitions for a single input to take
Acceptance of NFAs
• An NFA can be in several states at once after reading the same input
• Rule: an NFA accepts if some choice of moves gets it to a final state when the input ends
• Input: 100
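One way to apply the acceptance rule without guessing is to track the whole set of states the NFA could be in. The Python sketch below does this for a hypothetical NFA (not the one drawn on the slide) that accepts strings over {0,1} ending in “00”; it has one ε-move and one state with two transitions on the same input.

```python
EPS = ""   # label used here for ε-moves

# Hypothetical NFA: S --ε--> A, A loops on 0 and 1, A --0--> B, B --0--> C (final).
DELTA = {
    ("S", EPS): {"A"},
    ("A", "0"): {"A", "B"},
    ("A", "1"): {"A"},
    ("B", "0"): {"C"},
}
START, ACCEPTING = "S", {"C"}

def eps_closure(states, delta):
    """All states reachable from `states` using only ε-moves."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_accepts(text, start=START, accepting=ACCEPTING, delta=DELTA):
    """Accept if at least one path through the NFA ends in a final state."""
    current = eps_closure({start}, delta)
    for symbol in text:
        moved = set()
        for q in current:
            moved |= delta.get((q, symbol), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)

print(nfa_accepts("100"))   # True
print(nfa_accepts("10"))    # False
```

On input 100 the state sets are {S, A}, then {A}, then {A, B}, then {A, B, C}; the last set contains the final state C, so the input is accepted.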
NFA vs. DFA
• NFAs and DFAs recognize the same set of languages (regular languages)
• DFAs are faster to execute
• A DFA can be exponentially larger than an equivalent NFA
• For a given language, an NFA can be simpler than a DFA
(1*0(0|1)0*1?)+
Regular Expressions to NFA (1)
• For each kind of regular expression, define an NFA
  • Notation: NFA for regular expression M
• For ε
• For input a
Regular Expressions to NFA (2)
• For AB
• For A | B
Regular Expressions to NFA (3)
• For A*
Example: RegExp → NFA conversion
• Consider the regular expression
  • (1|0)*1
• The NFA is
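Since the diagram is not reproduced here, the construction can be sketched in code instead. The Python below uses the standard Thompson-style building blocks (one fragment per kind of regular expression, glued with ε-moves); the exact wiring and state numbering are assumptions and may differ from the slide's figure.

```python
from itertools import count

EPS = None            # label for ε-moves
_fresh = count()      # generator of fresh state numbers

# A fragment is (start, accept, edges); each edge is (from, label, to).
def symbol(a):
    s, f = next(_fresh), next(_fresh)
    return (s, f, [(s, a, f)])

def concat(m, n):     # AB: ε-move from A's accept state to B's start state
    return (m[0], n[1], m[2] + n[2] + [(m[1], EPS, n[0])])

def union(m, n):      # A | B: new start branches to both, both rejoin at a new accept
    s, f = next(_fresh), next(_fresh)
    return (s, f, m[2] + n[2] + [(s, EPS, m[0]), (s, EPS, n[0]),
                                 (m[1], EPS, f), (n[1], EPS, f)])

def star(m):          # A*: allow skipping A entirely or repeating it
    s, f = next(_fresh), next(_fresh)
    return (s, f, m[2] + [(s, EPS, m[0]), (s, EPS, f),
                          (m[1], EPS, m[0]), (m[1], EPS, f)])

def accepts(nfa, text):
    start, accept, edges = nfa
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for p, a, r in edges:
                if p == q and a is EPS and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen
    current = closure({start})
    for ch in text:
        current = closure({r for p, a, r in edges if p in current and a == ch})
    return accept in current

# (1|0)*1
nfa = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
print(accepts(nfa, "0101"), accepts(nfa, "10"))   # True False
```

The resulting NFA has ten states: two per symbol occurrence, two for the union, and two for the star.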
NFA → DFA
• Simulate the NFA
• Each state of the DFA
  • a non-empty subset of the states of the NFA
• Start state
  • the set of NFA states reachable through ε-moves from the NFA start state
• Add a transition S --a--> S’ to the DFA iff
  • S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
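The rules above amount to a worklist algorithm (the subset construction). A Python sketch under stated assumptions: the input NFA is a small hand-written machine for (1|0)*1 with states `q0` and `q1`, not the NFA from the slides, and empty subsets are skipped so that every DFA state is a non-empty set.

```python
from collections import deque

EPS = ""   # label used here for ε-moves

def eps_closure(states, delta):
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_to_dfa(start, accepting, delta, alphabet):
    """Return (dfa_start, dfa_accepting, dfa_transitions); DFA states are frozensets."""
    d_start = frozenset(eps_closure({start}, delta))
    d_trans, d_accepting = {}, set()
    seen, queue = {d_start}, deque([d_start])
    while queue:
        S = queue.popleft()
        if S & accepting:                  # contains some NFA final state
            d_accepting.add(S)
        for a in alphabet:
            moved = set()
            for q in S:
                moved |= delta.get((q, a), set())
            T = frozenset(eps_closure(moved, delta))
            if not T:                      # no NFA state reachable: no DFA transition
                continue
            d_trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return d_start, d_accepting, d_trans

def dfa_accepts(text, d_start, d_accepting, d_trans):
    state = d_start
    for ch in text:
        state = d_trans.get((state, ch))
        if state is None:
            return False
    return state in d_accepting

# Hypothetical NFA for (1|0)*1: q0 loops on 0 and 1, q0 --1--> q1 (final).
DELTA = {("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"}}
dfa = nfa_to_dfa("q0", {"q1"}, DELTA, ["0", "1"])
print(dfa_accepts("0101", *dfa), dfa_accepts("10", *dfa))   # True False
```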
NFA → DFA: Example
• DFA states, each named by the NFA states it contains: S = ABCDH, T = FGHABCD, U = EGHIABCDI
Implementation
• A DFA can be implemented by a 2D table T
  • One dimension is “states”
  • The other dimension is “input symbol”
  • For every transition Si --a--> Sk, define T[i,a] = k
• DFA “execution”
  • If in state Si on input a, read T[i,a] = k and move to state Sk
  • Very efficient
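A table-driven DFA needs only an array lookup per input character. The Python sketch below encodes the earlier “any number of 1's followed by a single 0” automaton; the state numbering and the `ERROR` marker for missing transitions are assumptions made for this example.

```python
ERROR = -1                       # marks "no transition" (assumed convention)
COLUMN = {"0": 0, "1": 1}        # column index of each input symbol

#          on 0   on 1
TABLE = [
    [1,     0],      # state 0: start; loop on 1, go to state 1 on 0
    [ERROR, ERROR],  # state 1: accepting; no outgoing transitions
]
START, ACCEPTING = 0, {1}

def run(text):
    """Execute the DFA: one table lookup T[i,a] per input symbol."""
    state = START
    for ch in text:
        col = COLUMN.get(ch)
        if col is None:          # symbol outside the alphabet
            return False
        state = TABLE[state][col]
        if state == ERROR:       # stuck: reject
            return False
    return state in ACCEPTING

print(run("1110"), run("100"), run("111"))   # True False False
```

The loop does a constant amount of work per character, which is why the table form is considered very efficient; the cost is the table itself, one entry per (state, symbol) pair.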
Table Implementation of a DFA