A Simple Lexical AnalyzerConvenient utility subprograms:
getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass
addChar - puts the character from nextChar into the place the lexeme is being accumulated, lexeme
lookup - determines whether the string in lexeme is a reserved word (returns a code)
Programming Language(程式語言) Chapter 4
Lexical and Syntax Analysis (詞彙與語法分析)
Topics
Introduction
Lexical Analysis (詞彙分析)
The Parsing Problem (剖析問題)
Recursive-Descent Parsing (遞迴下降剖析)
Bottom-Up Parsing (由下往上剖析)
Syntax Analyzers
Syntax analyzers, or parsers (剖析器), are nearly based on a formal description (正規化表示) of the syntax of programs
The most commonly used syntax-description formalism is context-free grammars or BNF
Using BNF to Describe Syntax
Provides a clear and concise (清楚與簡潔) syntax description
The parser can be based directly (直接) on the BNF
Parsers based on BNF are easy to maintain (維護)
Syntax Analysis
The syntax analysis portion of a language processor nearly always consists of two parts:
The lexical analyzer (詞彙分析器) deals with small-scale language constructs, such as names (名稱) and numeric literals (數字字母)
The syntax analyzer (語法分析器) deal with the large-scale language constructs, such as expressions (表示式), statements (敘述), and program units (程式單元)
Reasons to Separate Lexical and Syntax Analysis (分別詞彙與語法分析的理由)
Simplicity (簡單)
Less complex approaches can be used for lexical analysis; separating them simplifies (簡化) the parser
Efficiency (效率)
Separation allows optimization (最佳化) of the lexical analyzer
Portability (可攜)
Parts of the lexical analyzer is somewhat platform-dependent (平台依賴), but the syntax analyzer can be platform-independent (平台獨立)
Lexical Analysis
A lexical analyzer is a pattern matcher (樣式匹配器) for character strings
A lexical analyzer is a “front-end” for the parser
Identifies (識別) substrings of the source program that belong together - lexemes (詞彙)
Lexemes match a character pattern, which is associated with a lexical category (詞彙的類型) called a token (符號、符記,字彙分析– 基本單位 )
value is a lexeme; its token may be IDENT
The Lexemes and Tokens
result = oldsum – value / 100; Lexemes (詞彙)
result
=
oldsum
-
value
/
100
; Tokens (符號、符記)
IDENT
ASSIGN_OP
IDENT
SUBTRACT_OP
IDENT
DIVISION_OP
INT_LIT
SEMICOLON
Lexical Analysis
The lexical analyzer is usually a function that is called by the parser when it needs the next token
Three approaches to building a lexical analyzer:
Write a formal description (正規描述) of the tokens and use a software tool that automatically generates a lexical analyzer
Design a state transition diagram (狀態轉變圖) that describes the token patterns (符號樣式) and write a program that implements the state diagram
Design a state transition diagram that describes the token patterns and hand-construct a table-driven (表格驅動) implementation of the state diagram
Lexical Analysis
A lexical analyzer that recognizes only program names (名稱), reserved word (保留字), and integer literals (數字字母)
Reserved words (保留字) and identifiers (識別符) can be recognized together (rather than having a part of the diagram for each reserved word)
Use a table lookup to determine whether a possible identifier is in fact a reserved word
State Diagram
A state diagram to recognize names, reserved words,
and integer literals
A Simple Lexical Analyzer
Implementation (by C language)
int lex() {
getChar();
switch (charClass) {
case LETTER:
addChar();
getChar();
while (charClass == LETTER || charClass == DIGIT)
{
addChar();
getChar();
}
return lookup(lexeme);
break;
…
A Simple Lexical Analyzer
…
case DIGIT:
addChar();
getChar();
while (charClass == DIGIT) {
addChar();
getChar();
}
return INT_LIT;
break;
} /* End of switch */
} /* End of function lex */
A Simple Lexical Analyzer
Convenient utility subprograms:
getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass
addChar - puts the character from nextChar into the place the lexeme is being accumulated, lexeme
lookup - determines whether the string in lexeme is a reserved word (returns a code)
The Parsing Problem
Goals of the parser
Check the input program to determine whether it is syntactically (在語句構造上) correct, and produce a diagnostic (診斷) message and recover (修復) when an error is found
Produce either a complete parse tree, or least trace the structure of the complete parse tree, for syntactically correct input
Two Categories of Parsers
Top-down
Produce the parse tree, beginning at the root
Order is that of a leftmost derivation (最左邊衍生)
Traces or builds the parse tree in preorder (前序)
Bottom-up
Produce the parse tree, beginning at the leaves
Order is that of the reverse (相反) of a rightmost derivation (最右邊衍生)
Parsers look only one token ahead (先前) in the input
Top-Down Parsers
Top-down Parsers
Given a sentential form, xA , the parser must choose the correct A-rule to get the next sentential (句子的) form in the leftmost derivation, using only the first token produced by A
The most common top-down parsing algorithms
Recursive descent (遞迴下降) - a coded implementation (BNF)
LL algorithms - table driven implementation
Bottom-Up Parsers
Bottom-up parsers
Given a right sentential form, , determine what substring of is the right-hand side (RHS) of the rule in the grammar that must be reduced to produce the previous sentential form in the right derivation
The most common bottom-up parsing algorithms
The LR family
The Complexity (複雜度) of Parsing
Parsers that work for any unambiguous (不含糊的, 清楚的) grammar are complex (複雜) and inefficient (無效率) ( O(n3), where n is the length of the input )
Compilers use parsers that only work for a subset of all unambiguous grammars, but do it in linear time ( O(n), where n is the length of the input )
Recursive-Descent (遞迴下降) Parsing
A recursive-descent parser has a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal
EBNF is ideally (觀念上) suited (符合) for being the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals
Comments