CSE 331 Compiler Design
Lecture 01
Sharun Akter Khushbu
Lecturer, CSE, DIU
What is a Compiler?
Diagram: Source program → Compiler → Target program; the target program then maps Input → Output.
A compiler translates a program written in one language (the source language) into an equivalent program in another language (the target language).
Interpreter
Diagram: Source program + Input → Interpreter → Output.
An interpreter executes the operations of the source program directly on its input, instead of producing a target program.
Hybrid Compiler
Diagram: Source program → Translator → Intermediate program (byte codes for Java); Intermediate program + Input → Virtual Machine → Output.
Some Java compilers, called just-in-time compilers, translate the intermediate program into machine code just before the program starts, for faster processing.
Bootstrapping
Developing a system using itself, e.g., writing a compiler for a language in that same language.
A Language Processing System
Source Program
→ Preprocessor → Modified Source Program
→ Compiler → Target Assembly Program
→ Assembler → Relocatable Machine Code
→ Linker/Loader (with Library Files and Relocatable Object Files) → Target Machine Code
Structure of a Compiler
Phases: Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator
The Symbol Table Manager and the Error Handler interact with all phases. The analysis phases (up to intermediate code generation) form the front end; the code optimizer and code generator form the back end.
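As a worked illustration (the standard textbook example), a statement such as position := initial + rate * 60 passes through the phases roughly as follows:

lexical analysis    →  token stream: id1 := id2 + id3 * 60
syntax analysis     →  a syntax tree with := at the root, + beneath it, and * beneath +
semantic analysis   →  type checking inserts a conversion: id3 * inttofloat(60)
intermediate code   →  t1 := inttofloat(60)
                       t2 := id3 * t1
                       t3 := id2 + t2
                       id1 := t3
code optimization   →  t1 := id3 * 60.0
                       id1 := id2 + t1
code generation     →  target machine instructions (loads, a multiply, an add, a store)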
Compiler Construction Tools
1. Scanner generators – generate lexical analyzers
2. Parser generators – generate syntax analyzers
3. Syntax-directed translation engines – generate collections of procedures that walk the parse tree.
4. Automatic code generators
5. Data-flow engines
The Role of the Lexical Analyzer
Roles
Primary role: Scan a source program (a string) and break it up into small, meaningful units, called tokens.
Example: position := initial + rate * 60;
Transform into meaningful units: identifiers, constants, operators, and punctuation.
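For instance, the example statement above might be reported to the parser as the following token stream (the token names here are illustrative only):
  id(position)  assign_op  id(initial)  add_op  id(rate)  mul_op  num(60)  semicolon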
Other roles:
Removal of comments
Case conversion
Removal of white spaces
Interpretation of compiler directives or pragmas: for instance, in Turbo Pascal, {$R+} means range checking is enabled.
Communication with symbol table: Store information regarding an identifier in the symbol table. Not advisable in cases where scopes can be nested.
Preparation of output listing: Keep track of source program, line numbers, and correspondences between error messages and line numbers.
Why separate LA from parser?
Simpler design of both LA and parser
More efficient compiler
More portable compiler
Tokens
Examples of Tokens
Operators = + − > ( { := == <>
Keywords if while for int double
Numeric literals 43 6.035 -3.6e10 0x13F3A
Character literals ‘a’ ‘~’ ‘\’’
String literals “3.142” “aBcDe” “\”
Examples of non-tokens
White space space(‘ ’) tab(‘\t’) eoln(‘\n’)
Comments /*this is not a token*/
Interaction of Lexical Analyzer and Parser
Diagram: Source program → Lexical analyzer → token → Parser. The parser requests each token by calling nexttoken(), and both components communicate with the symbol table.
How it works
The lexical analyzer performs certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
Sometimes, lexical analyzers are divided into two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
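A minimal C sketch of the scanning step (a), assuming C-style /* ... */ comments only; a production scanner would also handle other comment forms and report unterminated comments:

#include <stdio.h>

/* Return the first character that is neither whitespace nor inside
   a C-style comment: a sketch of the "scanning" process above. */
static int skip_blanks_and_comments(FILE *in) {
    int c = getc(in);
    for (;;) {
        if (c == ' ' || c == '\t' || c == '\n') {
            c = getc(in);                  /* compact consecutive whitespace */
        } else if (c == '/') {
            int d = getc(in);
            if (d != '*') {                /* '/' starts an ordinary token */
                ungetc(d, in);
                return c;
            }
            int prev = 0;                  /* delete the comment body */
            while ((c = getc(in)) != EOF && !(prev == '*' && c == '/'))
                prev = c;
            c = getc(in);                  /* continue after the comment */
        } else {
            return c;                      /* first character of next lexeme */
        }
    }
}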
Two issues in lexical analysis:
1. How to specify tokens (patterns)?
2. How to recognize the tokens, given a token specification (i.e., how to implement the nexttoken() routine)? A sketch follows this list.
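One possible hand-written nexttoken() is sketched below in C. The token codes and the very coarse classification are hypothetical; in practice, reserved words would be recognized by looking up scanned identifiers in a table:

#include <ctype.h>
#include <stdio.h>

enum token { TOK_EOF, TOK_ID, TOK_NUM, TOK_PUNCT };  /* hypothetical codes */

int nexttoken(FILE *in) {
    int c = getc(in);
    while (c == ' ' || c == '\t' || c == '\n')   /* skip token separators */
        c = getc(in);
    if (c == EOF)
        return TOK_EOF;
    if (isalpha(c)) {                 /* identifier: letter (letter|digit)* */
        while (isalnum(c = getc(in)))
            ;                         /* consume the rest of the identifier */
        ungetc(c, in);
        return TOK_ID;
    }
    if (isdigit(c)) {                 /* integer constant: digit+ */
        while (isdigit(c = getc(in)))
            ;
        ungetc(c, in);
        return TOK_NUM;
    }
    return TOK_PUNCT;                 /* operators and misc. symbols */
}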
How to specify tokens:
All the basic elements in a language must be tokens so that they can be recognized.
There are not many types of tokens in a typical programming language: constant, identifier, reserved word, operator, and miscellaneous symbol.
#include <stdio.h>
int main() {
    int i, j;
    for (i = 0; i < 50; i++)
        printf("i = %d", i);
}
Types of tokens in C++:
Constants:
  char constants: 'a'
  string constants: "i = %d"
  int constants: 50
  floating-point constants
Identifiers: i, j, counter, …
Reserved words: int, for, … (strictly, main is an identifier, not a reserved word)
Operators: +, =, ++, /, …
Misc. symbols: (, ), {, }, …
Tokens are specified by regular expressions.
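For example, common token patterns can be written as regular expressions along these lines (the exact notation varies by textbook):
  letter → A | B | … | Z | a | b | … | z
  digit  → 0 | 1 | … | 9
  id     → letter ( letter | digit )∗
  num    → digit digit∗   (i.e., digit+)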
Lexical Analysis vs Parsing
There are a number of reasons why the analysis portion of a compiler is
normally separated into lexical analysis and parsing (syntax analysis) phases.
Simplicity of design is the most important consideration.
The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer.
Compiler efficiency is improved.
A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
Compiler portability is enhanced.
Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
Token: a certain classification of entities of a program.
Four kinds of tokens appeared in the previous example: identifiers, operators, constants, and punctuation.
Lexeme: a specific instance of a token; used to differentiate tokens.
For instance, both position and initial belong to the identifier class; however, each is a different lexeme.
The lexical analyzer may return a token type to the parser, but it must also keep track of “attributes” that distinguish one lexeme from another.
Examples of attributes: Identifiers: string, Numbers: value
Attributes are used during semantic checking and code generation. They are not needed during parsing.
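A sketch of one way a lexical analyzer might represent a token together with its attribute (the type and field names here are hypothetical):

enum token_type { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_TIMES, TOK_SEMI };

struct token {
    enum token_type type;     /* the token class returned to the parser */
    union {
        const char *name;     /* identifiers: the lexeme (or a symbol-table entry) */
        double      value;    /* numbers: the numeric value */
    } attr;                   /* attribute, used in semantic checking and codegen */
};

/* e.g., for the earlier example: {TOK_ID, .attr.name = "position"},
   {TOK_ASSIGN}, {TOK_ID, .attr.name = "initial"}, ... */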
Patterns: rules describing how tokens are specified in a program.
Patterns are needed because a language can contain infinitely many possible strings; they cannot all be enumerated explicitly.
Formal mechanisms are used to represent these patterns. Formalism helps in describing precisely (i) which strings belong to the language, and (ii) which do not.
Patterns also form the basis for developing tools that can automatically determine whether a string belongs to a language.
How are patterns specified?
Using a meta-language called regular expressions.
Alphabet: a finite set of symbols, denoted Σ.
Sentence or word: a string of symbols drawn from the alphabet.
Empty string: denoted ε, the string of length 0.
Language: any set of strings defined over an alphabet. From the lexical analyzer's point of view, this language denotes the set of all tokens in a programming language.
Define the following operators over sets of strings:
1. Union: L ∪ U
   S = L ∪ U = {s | (s ∈ L) ∨ (s ∈ U)}
2. Concatenation: LU or L.U
   S = LU = {st | (s ∈ L) ∧ (t ∈ U)}
3. Kleene closure: L∗, the set of all strings formed by concatenating zero or more strings from L, including ε:
   S = L∗ = ⋃ i≥0 L^i = L^0 ∪ L^1 ∪ L^2 ∪ ⋯, where L^0 = {ε}
4. Positive closure: L+
   S = L+ = LL∗ (= ⋃ i≥1 L^i)
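For example, with L = {a, b} and U = {c}:
  L ∪ U = {a, b, c}
  LU    = {ac, bc}
  L∗    = {ε, a, b, aa, ab, ba, bb, aaa, …}
  L+    = LL∗ = every string in L∗ except ε (since ε ∉ L)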
Regular expression: a notation for defining the sets of tokens that normally occur in programming languages.
For each regular expression r, there is a corresponding set of strings L(r), the language denoted by r. Such a set is also called a regular set.