Lexical analysis - Iowa State...

Lexical analysis CS440/540


Lexical analysis - Iowa State University, web.cs.iastate.edu/~weile/cs440.540/2.LexicalAnalysis.pdf


Lexical Analysis

• Process: converting input string (source program) into substrings (tokens)

• Input: source program

• Output: a sequence of tokens

• Also called: lexer, tokenizer, scanner


Token and Lexeme

• Token: a syntactic category

• Lexeme: instance of the token

Token        Sample lexemes
keyword      if, else, for, while, …
whitespace   ‘ ’, ‘\t’, ‘\n’, …
comparison   <, >, ==, !=, …
identifier   total, score, name, …
number       1, 3.14159, 0, …
literal      “Super nice cool compiler”, “ComS”, …


Basic design

1. Define a finite set of tokens.

  • Keyword, whitespace, identifier, …

2. Describe which strings belong to each token.

  • Keyword: “if” or “else” or “for” or …

  • Whitespace: non-empty sequence of blanks, newlines, and tabs

  • Identifier: strings of letters or digits, starting with a letter


Analysis example

    if (i == j)
        z = 0;
    else
        z = 1;

The same input as the character sequence the lexer reads:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Identifier: i, j, z

• Keyword: if, else

• Comparison: ==

• Number: 0, 1

• Whitespace: ‘ ’, \t, \n
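The classification above can be checked with a minimal Python sketch that tests each lexeme against one pattern per token. The table `TOKEN_PATTERNS` and the helper `classify` are illustrative names, not part of the slides, and the keyword and comparison lists contain only the lexemes shown earlier.

```python
import re

# Hypothetical token table: one regular expression per token class.
# Order matters: "keyword" is listed before "identifier" so that "if" is a keyword.
TOKEN_PATTERNS = [
    ("keyword", r"if|else|for|while"),
    ("whitespace", r"[ \t\n]+"),
    ("comparison", r"==|!=|<|>"),
    ("number", r"[0-9]+(\.[0-9]+)?"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
]

def classify(lexeme):
    """Return the first token class whose pattern matches the whole lexeme."""
    for token, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None  # no token class describes this string

for lexeme in ["if", "i", "==", "0", "\t"]:
    print(repr(lexeme), "->", classify(lexeme))
```

This only classifies lexemes that have already been cut out of the input; finding where each lexeme ends is a separate problem.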


What would you do?

• Foo<Bar<Bazz>>

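• This is a nested template in C++.

• However, do you see any conflict?

• Foo<Bar<Bazz>>

• cin >> var

• In both lines the characters >> are adjacent, but in the first they close two template argument lists and in the second they form a single operator.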
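A small Python sketch makes the conflict concrete. It assumes a lexer that always prefers the longer operator; the pattern `TOKEN_RE` and the function `lex` are illustrative only and cover just the characters in these two lines.

```python
import re

# Alternatives are tried left to right, so ">>" is listed before ">"
# to make the longer operator win.
TOKEN_RE = re.compile(r"\s+|[A-Za-z_][A-Za-z0-9_]*|>>|<<|<|>")

def lex(text):
    """Split text into lexemes, dropping whitespace."""
    return [m.group() for m in TOKEN_RE.finditer(text) if not m.group().isspace()]

print(lex("Foo<Bar<Bazz>>"))  # ['Foo', '<', 'Bar', '<', 'Bazz', '>>']
print(lex("cin >> var"))      # ['cin', '>>', 'var']
```

Both inputs end up with a single `>>` lexeme, so character-level rules alone cannot tell the two uses apart.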


Alphabet, String, and Language

• Alphabet (Σ)

  • Any finite set of symbols.

• String over an alphabet

  • A finite sequence of symbols drawn from that alphabet.

• Language (𝐿)

  • Any countable set of strings over some fixed alphabet.

  • Formally, let S be a set of characters. A language over S is a set of strings of characters drawn from S.

Alphabet             Language
English characters   English sentences
ASCII                C programs


Operations on Languages

• Single character

  • ‘c’ = {"c"}

• Epsilon

  • ε = {""}

• Union

  • A + B = {s | s ∈ A or s ∈ B}

• Concatenation

  • AB = {ab | a ∈ A and b ∈ B}

• Iteration

  • A* = ⋃_{i≥0} A^i, where A^i = A…A (A concatenated i times) and A^0 = {""}
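These operations map directly onto Python sets of strings. The sketch below is illustrative; the function names are assumptions, and because A* is infinite whenever A contains a non-empty string, `star_up_to` only builds the finite part A^0 ∪ … ∪ A^n.

```python
from itertools import product

def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}

def power(a, i):
    """A^i: A concatenated with itself i times; A^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result

def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0 through A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result

print(sorted(star_up_to({"a"}, 3)))  # ['', 'a', 'aa', 'aaa']
```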


Example

• L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}

• L + D

  • set of strings of length one, each consisting of either one letter or one digit

  • A, g, 1, …

• LD

  • set of strings of length two, each consisting of one letter followed by one digit

  • c4, j8, y6, …

• L^4

  • set of all 4-letter strings

  • abcd, Test, WXYZ, …
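As a worked check of these three sets, the following Python sketch builds L and D from the standard `string` module and counts the results. The helper `in_L4` is a hypothetical membership test, used so that the 52^4 strings of L^4 need not be enumerated.

```python
import string

L = set(string.ascii_letters)  # the 52 letters A-Z and a-z
D = set(string.digits)         # the 10 digits 0-9

L_plus_D = L | D                            # union: strings of length one
LD = {l + d for l in L for d in D}          # concatenation: letter then digit
count_L4 = len(L) ** 4                      # size of L^4, computed rather than enumerated

def in_L4(s):
    """Membership test for L^4: exactly four characters, all letters."""
    return len(s) == 4 and all(c in L for c in s)

print(len(L_plus_D), len(LD), count_L4)     # 62 520 7311616
print(in_L4("abcd"), in_L4("1234"))         # True False
```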


Regular Expressions

• A regular expression describes a language by combining language operations (union, concatenation, iteration) over some alphabet.


Example

• Keyword: “if” or “else” or “for” or …

• keyword = ‘if’ + ‘else’ + ‘for’ + …


Examples

• Integer: a non-empty string of digits

• digit = ‘0’ + ‘1’ + … + ‘9’

• integer = digit digit*

• Definition

  • A*: zero or more occurrences of A

  • A+ = AA*: one or more occurrences of A (a postfix +, distinct from the infix union operator), so integer = digit+

  • A?: zero or one occurrence of A
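The equivalence digit digit* = digit+ can be spot-checked with Python's `re` module, whose postfix `*`, `+` and `?` have the meanings defined above. The pattern names and the optional-sign pattern `SIGNED` are assumptions added for illustration; the slides do not define a signed integer.

```python
import re

DIGIT = "[0-9]"
INTEGER_LONG = f"{DIGIT}{DIGIT}*"  # digit digit*
INTEGER_SHORT = f"{DIGIT}+"        # digit+
SIGNED = f"-?{DIGIT}+"             # hypothetical use of A?: an optional minus sign

def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

for s in ["0", "42", "007", "", "4a"]:
    # Both spellings of "integer" agree on every sample.
    print(repr(s), matches(INTEGER_LONG, s), matches(INTEGER_SHORT, s))
```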


Examples

• Identifier: strings of letters or digits, starting with a letter

• letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’

• digit = ‘0’ + ‘1’ + … + ‘9’

• identifier = letter (letter + digit)*


More Examples

• Phone number: (515)-294-8813

• Σ = digits ∪ {‘-’, ‘(’, ‘)’}

• area = digit^3

• exchange = digit^3

• phone = digit^4

• phone number = ‘(’ area ‘)-’ exchange ‘-’ phone
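The phone-number definition can be transcribed almost symbol for symbol into a Python regular expression. This is a sketch; the constant names mirror the slide, and `{3}` / `{4}` are Python's way of writing digit^3 and digit^4.

```python
import re

AREA = r"[0-9]{3}"      # area = digit^3
EXCHANGE = r"[0-9]{3}"  # exchange = digit^3
PHONE = r"[0-9]{4}"     # phone = digit^4

# phone number = '(' area ')-' exchange '-' phone
PHONE_NUMBER = re.compile(rf"\({AREA}\)-{EXCHANGE}-{PHONE}")

print(bool(PHONE_NUMBER.fullmatch("(515)-294-8813")))  # True
print(bool(PHONE_NUMBER.fullmatch("515-294-8813")))    # False: parentheses are required
```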


More Examples

• Email address: [email protected]

• Σ = letters ∪ {‘.’, ‘@’}

• name = letter+

• address = name ‘@’ name ‘.’ name
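A short Python sketch shows how narrow this definition is. The sample addresses are hypothetical, and the pattern follows the slide's alphabet exactly, so it is far stricter than real email syntax.

```python
import re

NAME = r"[A-Za-z]+"  # name = letter+
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}")  # address = name '@' name '.' name

# Hypothetical sample addresses, chosen to show what the definition accepts and rejects.
samples = ["user@example.com", "user@example", "user1@example.com", "a.b@example.com"]
for s in samples:
    print(s, bool(ADDRESS.fullmatch(s)))
```

Only the first sample matches: the others are missing the final name, contain a digit (not in Σ), or contain a second dot in a position the expression does not allow.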


An algorithm of lexical analysis

• Transition diagram

  • Flowchart with states and edges; each edge is labelled with characters; a certain subset of the states is marked as “final states.”

  • Transition from state to state proceeds along edges according to the next input character.

  • Every string that ends up at a final state is accepted.

  • If the automaton gets “stuck” (there is no transition for the current character), it is an error.

  • Transition diagrams can be easily translated to programs using if or case statements.


Implementation

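state0:                      // start: expect the first letter
    c = getchar();
    if (isalpha(c)) { token += c; goto state1; }
    error();

state1:                      // inside an identifier
    c = getchar();
    if (isalpha(c) || isdigit(c)) { token += c; goto state1; }
    if (isdelimiter(c)) goto state2;
    error();

state2:                      // final state: the identifier is complete
    return token;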
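The same three-state diagram can be written without goto as a loop over an explicit state variable. The Python sketch below is one possible rendering, not the course's implementation: the function name, the default delimiter set, and the choice to treat end of input as a delimiter are all assumptions.

```python
def scan_identifier(text, delimiters=" \t\n();=<>!"):
    """Return the identifier at the start of text, or raise ValueError."""
    state = 0   # 0 = start, 1 = inside identifier (state 2 is the return below)
    token = ""
    for c in text:
        is_letter = c.isascii() and c.isalpha()
        is_digit = c.isascii() and c.isdigit()
        if state == 0:
            if is_letter:
                token += c
                state = 1
            else:
                raise ValueError(f"identifier cannot start with {c!r}")
        else:  # state == 1
            if is_letter or is_digit:
                token += c
            elif c in delimiters:
                return token          # state 2: final state
            else:
                raise ValueError(f"unexpected character {c!r}")
    if state == 1:
        return token                  # assumption: end of input acts as a delimiter
    raise ValueError("empty input")

print(scan_identifier("total = 0"))  # total
```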


Finite automata

• Finite automata

  • Deterministic Finite Automata (DFAs)

  • Non-deterministic Finite Automata (NFAs)


Notation

• Given a string s and a regexp R, is s ∈ L(R)?

• There is variation in regular expression notation

  • Union: A + B ≡ A | B

  • Option: A + ε ≡ A?

  • Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z]

  • Excluded range: complement of [a-z] ≡ [^a-z]


Lexical Spec → Regular Expressions (1)

1. Write a regexp for the lexemes of each token

  • Number = digit+

  • Keyword = ‘if’ + ‘else’ + …

  • Identifier = letter (letter + digit)*

  • OpenPar = ‘(’

  • ClosePar = ‘)’

2. Construct R, matching all lexemes for all tokens

  • R = Keyword + Identifier + Number + … = R1 + R2 + R3 + …


Lexical Spec → Regular Expressions (2)

3. Let the input be x1…xn

  • For 1 ≤ i ≤ n, check whether x1…xi ∈ L(R)

4. If the check succeeds, then we know that x1…xi ∈ L(Rj) for some j

5. Remove x1…xi from the input and go to (3)

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;


Ambiguities

• What if x1…xi ∈ L(R) and also x1…xj ∈ L(R), with i ≠ j?

  • Possible rule: pick the longest possible string in L(R)

• What if x1…xi ∈ L(Rj) and also x1…xi ∈ L(Rk), with j ≠ k?

  • Possible rule: use the rule listed first
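The scanning loop and both tie-breaking rules can be combined in a short Python sketch. It is only one way to realise the procedure: the rule table `RULES`, the extra token names `assign` and `punct`, and the function `tokenize` are assumptions, and each rule is tried with Python's `re` module rather than by testing every prefix.

```python
import re

# Rules in priority order: on a tie in length, the rule listed first wins.
RULES = [
    ("keyword", re.compile(r"if|else")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number", re.compile(r"[0-9]+")),
    ("comparison", re.compile(r"==")),
    ("assign", re.compile(r"=")),           # hypothetical token, not named in the slides
    ("punct", re.compile(r"[();]")),        # hypothetical token covering ( ) ;
    ("whitespace", re.compile(r"[ \t\n]+")),
]

def tokenize(text):
    """Repeatedly remove the longest prefix matched by any rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (token name, end position) of the best match so far
        for name, regex in RULES:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps priority when lengths are equal.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
print([t for t in tokenize(source) if t[0] != "whitespace"])
```

With these rules, `==` is read as one comparison rather than two assignments (longest match), and `else` is a keyword rather than an identifier (first listed), while a string such as `ifx` is an identifier because it is longer than the keyword `if`.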


Finite Automata

• A finite automaton consists of

  • An input alphabet Σ

  • A set of states S

  • A start state n

  • A set of accepting states F ⊆ S

  • A set of transitions state → state, each labelled with an input symbol


Finite Automata

• Transition: s1 →a s2

• Is read: in state s1 on input “a” go to state s2

• If end of input and in an accepting state, accept

• Otherwise, reject


Finite Automata State Graphs


Simple examples

• A finite automaton that accepts only “1”

• A finite automaton accepting any number of 1’s followed by a single 0
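Both automata can be written down as transition dictionaries and run with a few lines of Python. The state names `A` and `B` are assumptions, since the state graphs are not reproduced in this transcript.

```python
def dfa_accepts(transitions, start, accepting, text):
    """Run a DFA given as {(state, symbol): next_state}."""
    state = start
    for symbol in text:
        if (state, symbol) not in transitions:
            return False          # stuck: no transition for this symbol
        state = transitions[(state, symbol)]
    return state in accepting     # accept only if input ends in an accepting state

# Accepts only "1": A --1--> B, with B accepting.
ONLY_ONE = {("A", "1"): "B"}

# Accepts any number of 1's followed by a single 0:
# A --1--> A, A --0--> B, with B accepting.
ONES_THEN_ZERO = {("A", "1"): "A", ("A", "0"): "B"}

print(dfa_accepts(ONLY_ONE, "A", {"B"}, "1"))           # True
print(dfa_accepts(ONES_THEN_ZERO, "A", {"B"}, "1110"))  # True
print(dfa_accepts(ONES_THEN_ZERO, "A", {"B"}, "100"))   # False
```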


And Another Example

• Alphabet {0,1}

• What language does this recognize?

  • (1*0(0+1?|1))+


Epsilon Moves

• Machine can move from state A to state B without reading input


Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)

  • One transition per input per state

  • No ε-moves

• Nondeterministic Finite Automata (NFA)

  • Can have multiple transitions for one input in a given state

  • Can have ε-moves


Execution of Finite Automata

• A DFA can take only one path through the state graph

  • Completely determined by input

• NFAs can choose

  • Whether to make ε-moves

  • Which of multiple transitions for a single input to take


Acceptance of NFAs

• An NFA can get into multiple states

• Rule: an NFA accepts an input if some sequence of choices leads to a final state once the whole input has been read

• Input: 100
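One way to apply this rule is to track the set of all states the NFA could be in after each symbol. The Python sketch below does so for a small hypothetical NFA (the slide's own state graph is not reproduced here): it accepts strings over {0,1} that end in 1, and uses both a nondeterministic choice on input 1 and an ε-move.

```python
EPS = ""  # label used in this sketch for ε-moves

def eps_closure(states, delta):
    """All states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, start, accepting, text):
    """Track every state the NFA could be in; accept if any is final."""
    current = eps_closure({start}, delta)
    for symbol in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, symbol), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)

# Hypothetical NFA: s0 loops on 0 and 1, may also move to s1 on 1, and s1 --ε--> s2.
DELTA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", EPS): {"s2"},
}

print(nfa_accepts(DELTA, "s0", {"s2"}, "101"))  # True
print(nfa_accepts(DELTA, "s0", {"s2"}, "100"))  # False
```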


NFA vs. DFA

• NFAs and DFAs recognize the same set of languages (regular languages)

• DFAs are faster to execute

• DFA can be exponentially larger than NFA

• For a given language NFA can be simpler than DFA

(1*0(0|1)0*1?)+


Regular Expressions to NFA (1)

• For each kind of regexp, define an NFA

  • Notation: NFA for regexp M

• For ε

• For input a


Regular Expressions to NFA (2)

• For AB

• For A | B


Regular Expressions to NFA (3)

• For A*


Example: RegExp → NFA conversion

• Consider the regular expression (1|0)*1

• The NFA is
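The slide figures are not reproduced in this transcript, so the following Python sketch assumes the standard Thompson-style fragments for ε, a single symbol, AB, A | B and A*; the exact ε-edge layout may differ from the original diagrams. The class `Frag` and all function names are illustrative.

```python
import itertools

EPS = ""                      # label used in this sketch for ε-edges
_ids = itertools.count()      # source of fresh state numbers

class Frag:
    """NFA fragment: one start state, one accepting state, a list of edges."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def _pair():
    return next(_ids), next(_ids)

def epsilon():
    s, a = _pair()
    return Frag(s, a, [(s, EPS, a)])

def symbol(ch):
    s, a = _pair()
    return Frag(s, a, [(s, ch, a)])

def concat(x, y):             # AB: link A's accepting state to B's start
    return Frag(x.start, y.accept, x.edges + y.edges + [(x.accept, EPS, y.start)])

def union(x, y):              # A | B: new start branches to both, both join a new end
    s, a = _pair()
    extra = [(s, EPS, x.start), (s, EPS, y.start), (x.accept, EPS, a), (y.accept, EPS, a)]
    return Frag(s, a, x.edges + y.edges + extra)

def star(x):                  # A*: allow skipping A and repeating A
    s, a = _pair()
    extra = [(s, EPS, x.start), (s, EPS, a), (x.accept, EPS, x.start), (x.accept, EPS, a)]
    return Frag(s, a, x.edges + extra)

def accepts(frag, text):
    """Simulate the fragment as an NFA on text."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for p, label, r in frag.edges:
                if p == q and label == EPS and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen
    current = closure({frag.start})
    for ch in text:
        current = closure({r for p, label, r in frag.edges if p in current and label == ch})
    return frag.accept in current

# (1|0)*1
nfa = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
print(accepts(nfa, "0101"), accepts(nfa, "10"))  # True False
```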


NFA → DFA

• Simulate the NFA

• Each state of the DFA

  • a non-empty subset of states of the NFA

• Start state

  • the set of NFA states reachable through ε-moves from the NFA start state

• Add a transition S →a S’ to the DFA iff

  • S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
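These rules amount to a worklist algorithm, sketched below in Python. The NFA used to exercise it is a hypothetical three-state machine for strings ending in 1, not the one from the slides, and the function names are assumptions.

```python
from collections import deque

EPS = ""  # label used in this sketch for ε-moves

def eps_closure(states, delta):
    """ε-closure of a set of NFA states, as a hashable frozenset."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(delta, start, accepting, alphabet):
    """Build DFA transitions whose states are non-empty sets of NFA states."""
    start_set = eps_closure({start}, delta)
    dfa_delta, seen, queue = {}, {start_set}, deque([start_set])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            T = eps_closure(moved, delta)
            if not T:
                continue              # only non-empty subsets become DFA states
            dfa_delta[(S, a)] = T     # the transition S --a--> T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    dfa_accepting = {S for S in seen if S & accepting}
    return start_set, dfa_delta, dfa_accepting

# Hypothetical NFA: s0 loops on 0 and 1, may move to s1 on 1, and s1 --ε--> s2.
NFA = {("s0", "0"): {"s0"}, ("s0", "1"): {"s0", "s1"}, ("s1", EPS): {"s2"}}
start, table, final = subset_construction(NFA, "s0", {"s2"}, "01")
for (S, a), T in table.items():
    print(sorted(S), a, "->", sorted(T))
```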


NFA → DFA: Example

S=ABCDH, T=FGHABCD, U=EGHIABCDI


Implementation

• A DFA can be implemented by a 2D table T

  • One dimension is “states”

  • Other dimension is “input symbol”

  • For every transition Si →a Sk define T[i,a] = k

• DFA “execution”

  • If in state Si on input a, read T[i,a] = k and move to state Sk

  • Very efficient: one table lookup per input character
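A table-driven run loop might look like the Python sketch below. The table encodes a small example DFA (any number of 1's followed by a single 0); the state numbering, the column map `SYMBOLS`, and the `DEAD` marker for missing transitions are assumptions of this sketch.

```python
SYMBOLS = {"0": 0, "1": 1}   # column index for each input symbol
DEAD = -1                    # marks "no transition" in the table

# Rows are states, columns are input symbols.
# State 0: start, still reading 1's. State 1: accepting, the single 0 was read.
T = [
    [1, 0],         # state 0: on '0' go to state 1, on '1' stay in state 0
    [DEAD, DEAD],   # state 1: no outgoing transitions
]
ACCEPTING = {1}

def run(text):
    """Execute the DFA: one table lookup per input character."""
    state = 0
    for ch in text:
        col = SYMBOLS.get(ch)
        if col is None:
            return False     # symbol outside the alphabet
        state = T[state][col]
        if state == DEAD:
            return False     # stuck
    return state in ACCEPTING

print(run("110"), run("100"))  # True False
```

The run loop never changes; a different DFA only needs a different table.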


Table Implementation of a DFA
