171 lines
8.4 KiB
Plaintext
171 lines
8.4 KiB
Plaintext
|
Compilers and How They Work: An Overview
|
|||
|
|
|||
|
Lou Morgan
|
|||
|
Madison IBM-PC User's Group
|
|||
|
|
|||
|
What are compilers and how do they work? Many computer
|
|||
|
users ask themselves this question after the programming
|
|||
|
bug has bitten them. To most people, a compiler is a
|
|||
|
"black box program" which takes source code written in some
|
|||
|
high-level language, such as FORTRAN, BASIC, Pascal, or C,
|
|||
|
and translates (compiles) it into a language the computer
|
|||
|
can understand and execute. Compilers take source code and
|
|||
|
transform it into virtual machine language. With the IBM
|
|||
|
PC, this virtual language is 8088 machine code.
|
|||
|
|
|||
|
Compilers vs. Interpreters
|
|||
|
|
|||
|
Computers cannot understand English words and grammar.
|
|||
|
Even the highly structured words and sentences of
|
|||
|
programming languages must be translated before a computer
|
|||
|
can understand them. The compiler or interpreter must look
|
|||
|
up each "word" of your programming language in a kind of
|
|||
|
dictionary (or lexicon) and, in a series of steps,
|
|||
|
translate it into machine code. Each word initiates a
|
|||
|
separate logical task.
|
|||
|
|
|||
|
An interpreter translates one line of source code at a time
|
|||
|
into machine code, and then executes it. Debugging and
|
|||
|
testing is relatively fast and easy in interpreted
|
|||
|
languages, since the entire program doesn't have to be
|
|||
|
reprocessed each time a change is made. The BASIC(A).COM
|
|||
|
program is an interpreter. Interpreted programs run much
|
|||
|
slower than compiled programs, because they must be
|
|||
|
translated each time they are run. Programmers often test
|
|||
|
and debug their programs using an interpreter and then
|
|||
|
compile them for production use.
|
|||
|
|
|||
|
How Compilers Work
|
|||
|
|
|||
|
Most compilers convert programs in three steps. Each step
|
|||
|
is called a pass. A particular compiler may have one
|
|||
|
program per pass, or may combine two or three steps in a
|
|||
|
single program. For a very complex language, a step may be
|
|||
|
so difficult to perform that it is broken up into many
|
|||
|
smaller steps. Regardless of how many passes or programs
|
|||
|
are required, the compiler performs only three main
|
|||
|
functions: first, lexical analysis; second, syntax
|
|||
|
analysis; and third, code generation. During each pass of
|
|||
|
the compiler, the source code moves closer to becoming
|
|||
|
virtual machine language (or whatever language the compiler
|
|||
|
is designed to generate).
|
|||
|
|
|||
|
|
|||
|
Lexical Analysis
|
|||
|
|
|||
|
In the first pass of the compiler, the source code is
|
|||
|
passed through a lexical analyzer, which converts the
|
|||
|
source code to a set of tokens. A token is generally a
|
|||
|
number representing some keyword in the language. A
|
|||
|
compiler has a unique number for each keyword (i.e. IF,
|
|||
|
WHILE, END), and each arithmetic or logical operator (i.e.
|
|||
|
+, -, *, AND, OR, etc.). Numbers are represented by a
|
|||
|
token which indicates that what follows it should be
|
|||
|
interpreted as a number. The tokens put the programming
|
|||
|
language into a form that can be checked for proper
|
|||
|
structure and order.
|
|||
|
|
|||
|
The other important task of the lexical analyzer is to
|
|||
|
build a symbol table. This is a table of all the
|
|||
|
identifiers (variable names, procedures, and constants)
|
|||
|
used in the program. When an identifier is first
|
|||
|
recognized by the analyzer, it is inserted into the symbol
|
|||
|
table, along with information about its type, where it is
|
|||
|
to be stored, and so forth. This information is used in
|
|||
|
subsequent passes of the compiler.
|
|||
|
|
|||
|
Syntax Analysis
|
|||
|
|
|||
|
After the lexical analyzer translates a program into tokens
|
|||
|
of keywords, variables, constants, symbols and logical
|
|||
|
operators, the compiler makes its next pass. To describe
|
|||
|
what happens during this function, I will briefly explain
|
|||
|
grammars.
|
|||
|
|
|||
|
Grammars. Like any language, programming languages have a
|
|||
|
set of rules governing the structure of the program. Each
|
|||
|
different computer language has its own grammar which makes
|
|||
|
it unique. Some grammars are complex (PL/I) and others are
|
|||
|
relatively easy (Pascal). The programmer must observe all
|
|||
|
the structural rules of a language to make logical sense to
|
|||
|
the computer. The next step of the compiling process,
|
|||
|
parsing, checks to be sure all the rules were followed.
|
|||
|
|
|||
|
Parsing. The parsing routines of a compiler check to see
|
|||
|
that the program is written correctly (according to the
|
|||
|
language rules). The parser reads in the tokens generated
|
|||
|
by the lexical analyzer and compares them to the set
|
|||
|
grammar of the programming language. If the program
|
|||
|
follows the rules of the language, then it is syntactically
|
|||
|
correct. When the parser encounters a mistake, it issues a
|
|||
|
warning or error message and tries to continue. Some
|
|||
|
|
|||
|
parsers try to correct a faulty program, others do not.
|
|||
|
When the parser reaches the end of the token stream, it
|
|||
|
will tell the compiler that either the program is
|
|||
|
grammatically correct and compiling can continue or the
|
|||
|
program contains too many errors and compiling must be
|
|||
|
aborted. If the program is grammatically correct, the
|
|||
|
parser will call for semantic routines.
|
|||
|
|
|||
|
Semantic Routines. The semantic routines of a compiler
|
|||
|
perform two tasks: checking to make sure that each series
|
|||
|
of tokens will be understood by the computer when it is
|
|||
|
fully translated to machine code, and converting the series
|
|||
|
of tokens one step closer to machine code. The first task
|
|||
|
takes a series of tokens, called a production, and checks
|
|||
|
it to see if it makes sense. For example, a production may
|
|||
|
be correct as far as the parser is concerned, but the
|
|||
|
semantic routines check whether the variables have been
|
|||
|
declared, and are of the right type, etc. If the
|
|||
|
production makes sense, the semantic routine reduces the
|
|||
|
production for the next phase of compilation, code
|
|||
|
generation. Most of the code for the compiler lies here in
|
|||
|
the semantic routines and thus takes up a majority of the
|
|||
|
compilation time.
|
|||
|
|
|||
|
Summary. Two major routines comprise syntax analysis: the
|
|||
|
parsing routine and the semantic routine. The parser
|
|||
|
checks for the correct order of the tokens and then calls
|
|||
|
the semantic routines to check whether the series of tokens
|
|||
|
(a production) will make sense to the computer. The
|
|||
|
semantic routine then reduces the production another step
|
|||
|
toward complete translation to machine code.
|
|||
|
|
|||
|
Code Generation
|
|||
|
|
|||
|
The code generation process determines how fast the code
|
|||
|
will run and how large it will be. The first part of code
|
|||
|
generation involves optimization, and the second involves
|
|||
|
actual machine code generation.
|
|||
|
|
|||
|
Optimization. In this step, the compiler tries to make the
|
|||
|
intermediate code generated by the semantic routines more
|
|||
|
efficient. This process can be very slow and may not be
|
|||
|
able to improve the code much. Because of this, many
|
|||
|
compilers don't include optimizers, and, if they do, they
|
|||
|
look only for areas that are easy to optimize.
|
|||
|
|
|||
|
Code Generation. This process takes the intermediate code
|
|||
|
produced by the optimizer (or semantic routines if the
|
|||
|
compiler has no optimizer) and generates virtual machine
|
|||
|
code, which in our case is 8088 machine code. It is this
|
|||
|
part of the compilation phase that is machine dependent.
|
|||
|
Each type of computer has an operating system that
|
|||
|
processes virtual machine code differently; therefore, the
|
|||
|
code generator must be different for each type of
|
|||
|
computer. The choice of instructions for the fastest
|
|||
|
execution and smallest code size are made at this point,
|
|||
|
according to the machine's operating system. Each code
|
|||
|
generator is designed specifically for the machine and
|
|||
|
operating system the final code will run on.
|
|||
|
|
|||
|
If the program is free from syntactical errors, code
|
|||
|
generation should take place without any problem. When the
|
|||
|
code generator is finished, the code produced will be in
|
|||
|
8088 machine code, but the format of the code is not yet
|
|||
|
executable. It is in a format (an .OBJ file in our case)
|
|||
|
that is ready to go to a linker, which will create an
|
|||
|
executable *.EXE or *.COM file from the machine code the
|
|||
|
compiler has generated.
|
|||
|
|