CSE 331 Compiler Design
Lecture 01
Sharun Akter Khushbu
Lecturer, CSE, DIU
What is a Compiler?
Diagram: Source program → Compiler → Target program; the target program then maps Input → Output.
A compiler translates a program written in one language (the source language) into an equivalent program in another language (the target language).
Interpreter
Diagram: Source program + Input → Interpreter → Output.
An interpreter executes the operations of the source program directly on its input, instead of producing a target program.
Hybrid Compiler
Diagram: Source program → Translator → Intermediate program (byte codes for Java); Intermediate program + Input → Virtual Machine → Output.
Some Java compilers, called just-in-time compilers, translate the intermediate program into machine code just before the program starts, for faster processing.
Bootstrapping
Developing a system using itself, e.g., writing a compiler for a language in that same language.
A Language Processing System
Source Program
→ Preprocessor → Modified Source Program
→ Compiler → Target Assembly Program
→ Assembler → Relocatable Machine Code
→ Linker/Loader (with Library Files and Relocatable Object Files) → Target Machine Code
Structure of a Compiler
Phases: Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator
The Symbol Table Manager and the Error Handler interact with all phases. The analysis phases (up to intermediate code generation) form the front end; the code optimizer and code generator form the back end.
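As a worked illustration (the standard textbook example), a statement such as position := initial + rate * 60 passes through the phases roughly as follows:

lexical analysis    →  token stream: id1 := id2 + id3 * 60
syntax analysis     →  a syntax tree with := at the root, + beneath it, and * beneath +
semantic analysis   →  type checking inserts a conversion: id3 * inttofloat(60)
intermediate code   →  t1 := inttofloat(60)
                       t2 := id3 * t1
                       t3 := id2 + t2
                       id1 := t3
code optimization   →  t1 := id3 * 60.0
                       id1 := id2 + t1
code generation     →  target machine instructions (loads, a multiply, an add, a store)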
Compiler Construction Tools
1. Scanner generators – generate lexical analyzers
2. Parser generators – generate syntax analyzers
3. Syntax-directed translation engines – generate collections of procedures that walk the parse tree.
4. Automatic code generators
5. Data-flow engines
The Role of the Lexical Analyzer
Roles
Primary role: Scan a source program (a string) and break it up into small, meaningful units, called tokens.
Example: position := initial + rate * 60;
Transform into meaningful units: identifiers, constants, operators, and punctuation.
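For instance, the example statement above might be reported to the parser as the following token stream (the token names here are illustrative only):
  id(position)  assign_op  id(initial)  add_op  id(rate)  mul_op  num(60)  semicolon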
Other roles:
Removal of comments
Case conversion
Removal of white spaces
Interpretation of compiler directives or pragmas: for instance, in Turbo Pascal, {$R+} means range checking is enabled.
Communication with symbol table: Store information regarding an identifier in the symbol table. Not advisable in cases where scopes can be nested.
Preparation of output listing: Keep track of source program, line numbers, and correspondences between error messages and line numbers.
Why separate LA from parser?
Simpler design of both LA and parser
More efficient compiler
More portable compiler
Tokens
Examples of Tokens
Operators = + − > ( { := == <>
Keywords if while for int double
Numeric literals 43 6.035 -3.6e10 0x13F3A
Character literals ‘a’ ‘~’ ‘\’’
String literals “3.142” “aBcDe” “\”
Examples of non-tokens
White space space(‘ ’) tab(‘\t’) eoln(‘\n’)
Comments /*this is not a token*/
Interaction of Lexical Analyzer and Parser
Diagram: Source program → Lexical analyzer → token → Parser. The parser requests each token by calling nexttoken(), and both components communicate with the symbol table.
How it works
The lexical analyzer performs certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
Sometimes, lexical analyzers are divided into two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
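A minimal C sketch of the scanning step (a), assuming C-style /* ... */ comments only; a production scanner would also handle other comment forms and report unterminated comments:

#include <stdio.h>

/* Return the first character that is neither whitespace nor inside
   a C-style comment: a sketch of the "scanning" process above. */
static int skip_blanks_and_comments(FILE *in) {
    int c = getc(in);
    for (;;) {
        if (c == ' ' || c == '\t' || c == '\n') {
            c = getc(in);                  /* compact consecutive whitespace */
        } else if (c == '/') {
            int d = getc(in);
            if (d != '*') {                /* '/' starts an ordinary token */
                ungetc(d, in);
                return c;
            }
            int prev = 0;                  /* delete the comment body */
            while ((c = getc(in)) != EOF && !(prev == '*' && c == '/'))
                prev = c;
            c = getc(in);                  /* continue after the comment */
        } else {
            return c;                      /* first character of next lexeme */
        }
    }
}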
Two issues in lexical analysis:
1. How to specify tokens (patterns)?
2. How to recognize the tokens, given a token specification (i.e., how to implement the nexttoken() routine)? A sketch follows this list.
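One possible hand-written nexttoken() is sketched below in C. The token codes and the very coarse classification are hypothetical; in practice, reserved words would be recognized by looking up scanned identifiers in a table:

#include <ctype.h>
#include <stdio.h>

enum token { TOK_EOF, TOK_ID, TOK_NUM, TOK_PUNCT };  /* hypothetical codes */

int nexttoken(FILE *in) {
    int c = getc(in);
    while (c == ' ' || c == '\t' || c == '\n')   /* skip token separators */
        c = getc(in);
    if (c == EOF)
        return TOK_EOF;
    if (isalpha(c)) {                 /* identifier: letter (letter|digit)* */
        while (isalnum(c = getc(in)))
            ;                         /* consume the rest of the identifier */
        ungetc(c, in);
        return TOK_ID;
    }
    if (isdigit(c)) {                 /* integer constant: digit+ */
        while (isdigit(c = getc(in)))
            ;
        ungetc(c, in);
        return TOK_NUM;
    }
    return TOK_PUNCT;                 /* operators and misc. symbols */
}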
How to specify tokens:
All the basic elements in a language must be tokens so that they can be recognized.
There are not many types of tokens in a typical programming language: constant, identifier, reserved word, operator, and miscellaneous symbol.
#include <stdio.h>
int main() {
    int i, j;
    for (i = 0; i < 50; i++)
        printf("i = %d", i);
}
Types of tokens in C++:
Constants:
  char constants: 'a'
  string constants: "i = %d"
  int constants: 50
  floating-point constants
Identifiers: i, j, counter, …
Reserved words: int, for, … (strictly, main is an identifier, not a reserved word)
Operators: +, =, ++, /, …
Misc. symbols: (, ), {, }, …
Tokens are specified by regular expressions.
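For example, common token patterns can be written as regular expressions along these lines (the exact notation varies by textbook):
  letter → A | B | … | Z | a | b | … | z
  digit  → 0 | 1 | … | 9
  id     → letter ( letter | digit )∗
  num    → digit digit∗   (i.e., digit+)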
Lexical Analysis vs Parsing
There are a number of reasons why the analysis portion of a compiler is
normally separated into lexical analysis and parsing (syntax analysis) phases.
Simplicity of design is the most important consideration.
The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer.
Compiler efficiency is improved.
A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
Compiler portability is enhanced.
Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
Token: a certain classification of entities of a program.
Four kinds of tokens appeared in the previous example: identifiers, operators, constants, and punctuation.
Lexeme: a specific instance of a token; used to differentiate tokens.
For instance, both position and initial belong to the identifier class; however, each is a different lexeme.
The lexical analyzer may return a token type to the parser, but it must also keep track of “attributes” that distinguish one lexeme from another.
Examples of attributes: Identifiers: string, Numbers: value
Attributes are used during semantic checking and code generation. They are not needed during parsing.
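A sketch of one way a lexical analyzer might represent a token together with its attribute (the type and field names here are hypothetical):

enum token_type { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_TIMES, TOK_SEMI };

struct token {
    enum token_type type;     /* the token class returned to the parser */
    union {
        const char *name;     /* identifiers: the lexeme (or a symbol-table entry) */
        double      value;    /* numbers: the numeric value */
    } attr;                   /* attribute, used in semantic checking and codegen */
};

/* e.g., for the earlier example: {TOK_ID, .attr.name = "position"},
   {TOK_ASSIGN}, {TOK_ID, .attr.name = "initial"}, ... */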
Patterns: rules describing how tokens are specified in a program.
Patterns are needed because a language can contain infinitely many possible strings; they cannot all be enumerated explicitly.
Formal mechanisms are used to represent these patterns. Formalism helps in describing precisely (i) which strings belong to the language, and (ii) which do not.
Patterns also form the basis for developing tools that can automatically determine whether a string belongs to a language.
How are patterns specified?
Using a meta-language called regular expressions.
Alphabet: a finite set of symbols, denoted Σ.
Sentence or word: a string of symbols drawn from the alphabet.
Empty string: denoted ε, the string of length 0.
Language: any set of strings defined over an alphabet. From the lexical analyzer's point of view, this language denotes the set of all tokens in a programming language.
Define the following operators over sets of strings:
1. Union: L ∪ U
   S = L ∪ U = {s | (s ∈ L) ∨ (s ∈ U)}
2. Concatenation: LU or L.U
   S = LU = {st | (s ∈ L) ∧ (t ∈ U)}
3. Kleene closure: L∗, the set of all strings formed by concatenating zero or more strings from L, including ε:
   S = L∗ = ⋃ i≥0 L^i = L^0 ∪ L^1 ∪ L^2 ∪ ⋯, where L^0 = {ε}
4. Positive closure: L+
   S = L+ = LL∗ (= ⋃ i≥1 L^i)
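For example, with L = {a, b} and U = {c}:
  L ∪ U = {a, b, c}
  LU    = {ac, bc}
  L∗    = {ε, a, b, aa, ab, ba, bb, aaa, …}
  L+    = LL∗ = every string in L∗ except ε (since ε ∉ L)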
Regular expression: a notation for defining the sets of tokens that normally occur in programming languages.
For each regular expression r, there is a corresponding set of strings L(r), the language denoted by r. Such a set is also called a regular set.