Lexical analysis - Iowa State...

Post on 11-May-2018

229 views 3 download

Transcript of Lexical analysis - Iowa State...

Lexical analysisCS440/540

Lexical Analysis

• Process: converting input string (source program) into substrings (tokens)

• Input: source program

• Output: a sequence of tokens

• Also called: lexer, tokenizer, scanner

Token and Lexeme

• Token: a syntactic category

• Lexeme: instance of the token

Token Sample lexemes

keyword if, else, for, while,…

whitespace ‘ ’, ‘\t’, ‘\n’, …

comparison <,>,==,!=,…

identifier total, score, name, …

number 1, 3.14159, 0, …

literal “Super nice cool compiler”, “ComS”, …

Basic design

1. Define a finite set of tokens.• Keyword, whitespace, identifier, …

2. Describe which strings belong to each token• Keyword: “if” or “else” or “for” or …

• whitespace: non-empty sequence of blanks, newlines, and tabs

• identifier: strings of letters or digits, starting with a letter

Analysis example

if (i == j)

z = 0;

else

z = 1;

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Identifier: ?

• Keyword: ?

• Comparison: ?

• Number: ?

• Whitespace: ?

Analysis example

if (i == j)

z = 0;

else

z = 1;

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Identifier: i, j, z

• Keyword: if, else

• Comparison: ==

• Number: 0, 1

• Whitespace: ‘ ’, \t, \n

What would you do?

• Foo<Bar<Bazz>>

• This is nested templates in C++.

• However, do you see any conflict?

What would you do?

• Foo<Bar<Bazz>>

• This is nested templates in C++.

• However, do you see any conflict?

• Foo<Bar<Bazz>>• cin >> var

Alphabet, String, and Language

• Alphabet (Σ)• Any finite set of symbols.

• String over an alphabet• A finite sequence of symbols drawn from that alphabet.

• Language (𝐿)• Any countable set of strings over some fixed alphabet.

• Formally, Let S be a set of characters. A language over S is a set of strings of characters drawn from S.

Alphabet Language

English characters English sentences

ASCII C programs

Operations on Languages

• Single character• ′𝑐′ = {"c"}

• Epsilon• 𝜖 = {""}

• Union• 𝐴 + 𝐵 = {𝑠|𝑠 ∈ 𝐴 𝑜𝑟 𝑠 ∈ 𝐵}

• Concatenation• 𝐴𝐵 = {𝑎𝑏|𝑎 ∈ 𝐴 𝑎𝑛𝑑 𝑏 ∈ 𝐵}

• Iteration• 𝐴∗ =∪𝑖≥0 𝐴

𝑖 where 𝐴𝑖 = 𝐴… 𝑖 𝑡𝑖𝑚𝑒𝑠 …𝐴

Example

• 𝐿 = {𝐴, 𝐵, … , 𝑍, 𝑎, 𝑏, … , 𝑧}, 𝐷 = {0,1, … , 9}

• 𝐿 + 𝐷• set of letters and digits, each of which strings is either one letter or one digit

• 𝐴 , 𝑔 , 1 , …

• 𝐿𝐷• set of strings of length two, each consisting of one letter followed by one digit

• 𝑐4 , 𝑗8 , 𝑦6 ,…

• 𝐿4

• set of all 4-letter strings

• 1234 , 7416 , 2592 ,…

Regular Expressions

• Describing the language by a combination of language operations of some alphabet.

Example

• Keyword• “if” or “else” or “for” or …

• keyword = ?

Example

• Keyword• “if” or “else” or “for” or …

• keyword = ‘if’ + ‘else’ + ‘for’ + …

Examples

• Integer• non-empty string of digits

• digit = ‘0’ + ‘1’ + … + ‘9’

• integer = ?

Examples

• Integer• non-empty string of digits

• digit = ‘0’ + ‘1’ + … + ‘9’

• integer = digit digit*

• Definition• A*: zero or more of the preceding element

• A+=AA*: one or more of the preceding element• integer = digit+

• A?: zero or one of the preceding element

Examples

• Identifier• Strings of letters or digits, starting with a letter

• letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’

• digit = ‘0’ + ‘1’ + … + ‘9’

• identifier = ?

Examples

• Identifier• Strings of letters or digits, starting with a letter

• letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’

• digit = ‘0’ + ‘1’ + … + ‘9’

• identifier = letter (letter + digit)*

More Examples

• Phone number• (515)-294-8813

• Σ =?

• 𝑎𝑟𝑒𝑎 =?

• 𝑒𝑥𝑐ℎ𝑎𝑛𝑔𝑒 =?

• 𝑝ℎ𝑜𝑛𝑒 =?

• phone number = ?

More Examples

• Phone number• (515)-294-8813

• Σ = 𝑑𝑖𝑔𝑖𝑡𝑠 ∪ {−, , }

• 𝑎𝑟𝑒𝑎 = 𝑑𝑖𝑔𝑖𝑡3

• 𝑒𝑥𝑐ℎ𝑎𝑛𝑔𝑒 = 𝑑𝑖𝑔𝑖𝑡3

• 𝑝ℎ𝑜𝑛𝑒 = 𝑑𝑖𝑔𝑖𝑡4

• phone number = ‘(’area ‘)-’ exchange ‘-’ phone

More Examples

• email address• weile@iastate.edu

• Σ =?

• 𝑛𝑎𝑚𝑒 =?

• address = ?

More Examples

• email address• weile@iastate.edu

• Σ = 𝑙𝑒𝑡𝑡𝑒𝑟𝑠 ∪ {. ,@}

• 𝑛𝑎𝑚𝑒 = 𝑙𝑒𝑡𝑡𝑒𝑟+

• address = name ‘@’ name ‘.’ name

An algorithm of lexical analysis

• Transition diagram• Flowchart with states and edges; each edge is labelled with characters;

certain subset of states are marked as “final states.”• Transition from state to state proceeds along edges according to the next

input character.• Every string that ends up at a final state is accepted.• If get “stuck”, there is no transition for a given character, it is an error.• Transition diagrams can be easily translated to programs using if or case

statements

Implementation

state0:

c = getchar();

if (isalpha(c)) token += c; goto state1;

error();

state1:

c = getchar();

if (isalpha(c) || isdigit(c)) token += c; goto state1;

if (isdelimiter(c)) goto state2;

error();

state2:

return(token);

Finite automata

• Finite automata• Deterministic Finite Automata (DFAs)

• Non-deterministic Finite Automata (NFAs)

Notation

• Given a string s and a regxp R, is 𝑠 ∈ 𝐿(𝑅)

• There is variation in regular expression notation• Union: A + B ≡ A | B

• Option: A + ε ≡ A?

• Range: ‘a’+’b’+…+’z’ ≡ [a-z]

• Excluded range: complement of [a-z] ≡ [^a-z]

Lexical Spec Regular Expressions (1)

1. Write a rexp for the lexemes of each token• Number = digit+

• Keyword = ‘if’ + ‘else’ + …

• Identifier = letter (letter + digit)*

• OpenPar = ‘(‘

• ClosePar = ‘)’

2. Construct R, matching all lexemes for all tokens• R = Keyword + Identifier + Number + …

• = R1 + R2 + R3 + …

Lexical Spec Regular Expressions (2)

3. Let input be x1…xn

• For 1 ≤ i ≤ n check

• x1…xi ∈ L(R)

4. If success, then we know that• x1…xi ∈ L(Rj) for some j

5. Remove x1…xi from input and go to (3)

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Ambiguities

• What if x1…xi∈L(R) and also x1…xj∈L(R)?• note that i ≠ j

• Possible rule• pick longest possible string in L(R)

• What if x1…xi∈L(Rj) and also x1…xi∈L(Rk)?• note that j ≠ k

• Possible rule• use the listed first

Finite Automata

• A finite automaton consists of• An input alphabet Σ

• A set of states S

• A start state n

• A set of accepting states F ⊆ S

• A set of transitions state →input state

Finite Automata

• Transition• s1 →a s2

• Is read:• In state s1 on input “a” go to state s2

• If end of input and in accepting state accept

• Otherwise reject

Finite Automata State Graphs

Simple examples

• A finite automaton that accepts only “1”

• A finite automaton accepting any number of 1’s followed by a single 0

And Another Example

• Alphabet {0,1}

• What language does this recognize?

And Another Example

• Alphabet {0,1}

• What language does this recognize?• (1*0(0+1?|1))+

Epsilon Moves

• Machine can move from state A to state B without reading input

Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)• One transition per input per state

• No ε-moves

• Nondeterministic Finite Automata (NFA)• Can have multiple transitions for one input in a given state

• Can have ε-moves

Execution of Finite Automata

• A DFA can take only one path through the state graph• Completely determined by input

• NFAs can choose• Whether to make ε-moves

• Which of multiple transitions for a single input to take

Acceptance of NFAs

• An NFA can get into multiple states

• Rule: NFA accepts if it can get to a final state

• Input: 100

NFA vs. DFA

• NFAs and DFAs recognize the same set of languages (regular languages)

• DFAs are faster to execute

• DFA can be exponentially larger than NFA

• For a given language NFA can be simpler than DFA

(1*0(0|1)0*1?)+

Regular Expressions to NFA (1)

• For each kind of rexp, define an NFA• Notation: NFA for rexp M

• For ε

• For input a

Regular Expressions to NFA (2)

• For AB

• For A | B

Regular Expressions to NFA (3)

• For A*

Example: RegExp NFA conversion

• Consider the regular expression• (1|0)*1

• The NFA is

Example: RegExp NFA conversion

• Consider the regular expression• (1|0)*1

• The NFA is

NFA DFA

• Simulate the NFA

• Each state of DFA• a non-empty subset of states of the NFA

• Start state• the set of NFA states reachable through ε-moves from NFA start state

• Add a transition S →a S’ to DFA iff• S’ is the set of NFA states reachable from any state in S after seeing the input

a, considering ε-moves as well

NFA DFA: Example

NFA DFA: Example

S=ABCDH, T=FGHABCD, U=EGHIABCDI

Implementation

• A DFA can be implemented by a 2D table T• One dimension is “states”

• Other dimension is “input symbol”

• For every transition Si →a Sk define T[i,a] = k

• DFA “execution”• If in state Si and input a, read T[i,a] = k and skip to state Sk

• Very efficient

Table Implementation of a DFA

Table Implementation of a DFA