171 lines
8.4 KiB
Plaintext
171 lines
8.4 KiB
Plaintext
Compilers and How They Work: An Overview
|
||
|
||
Lou Morgan
|
||
Madison IBM-PC User's Group
|
||
|
||
What are compilers and how do they work? Many computer
|
||
users ask themselves this question after the programming
|
||
bug has bitten them. To most people, a compiler is a
|
||
"black box program" which takes source code written in some
|
||
high-level language, such as FORTRAN, BASIC, Pascal, or C,
|
||
and translates (compiles) it into a language the computer
|
||
can understand and execute. Compilers take source code and
|
||
transform it into virtual machine language. With the IBM
|
||
PC, this virtual language is 8088 machine code.
|
||
|
||
Compilers vs. Interpreters
|
||
|
||
Computers cannot understand English words and grammar.
|
||
Even the highly structured words and sentences of
|
||
programming languages must be translated before a computer
|
||
can understand them. The compiler or interpreter must look
|
||
up each "word" of your programming language in a kind of
|
||
dictionary (or lexicon) and, in a series of steps,
|
||
translate it into machine code. Each word initiates a
|
||
separate logical task.
|
||
|
||
An interpreter translates one line of source code at a time
|
||
into machine code, and then executes it. Debugging and
|
||
testing is relatively fast and easy in interpreted
|
||
languages, since the entire program doesn't have to be
|
||
reprocessed each time a change is made. The BASIC(A).COM
|
||
program is an interpreter. Interpreted programs run much
|
||
slower than compiled programs, because they must be
|
||
translated each time they are run. Programmers often test
|
||
and debug their programs using an interpreter and then
|
||
compile them for production use.
|
||
|
||
How Compilers Work
|
||
|
||
Most compilers convert programs in three steps. Each step
|
||
is called a pass. A particular compiler may have one
|
||
program per pass, or may combine two or three steps in a
|
||
single program. For a very complex language, a step may be
|
||
so difficult to perform that it is broken up into many
|
||
smaller steps. Regardless of how many passes or programs
|
||
are required, the compiler performs only three main
|
||
functions: first, lexical analysis; second, syntax
|
||
analysis; and third, code generation. During each pass of
|
||
the compiler, the source code moves closer to becoming
|
||
virtual machine language (or whatever language the compiler
|
||
is designed to generate).
|
||
|
||
|
||
Lexical Analysis
|
||
|
||
In the first pass of the compiler, the source code is
|
||
passed through a lexical analyzer, which converts the
|
||
source code to a set of tokens. A token is generally a
|
||
number representing some keyword in the language. A
|
||
compiler has a unique number for each keyword (i.e. IF,
|
||
WHILE, END), and each arithmetic or logical operator (i.e.
|
||
+, -, *, AND, OR, etc.). Numbers are represented by a
|
||
token which indicates that what follows it should be
|
||
interpreted as a number. The tokens put the programming
|
||
language into a form that can be checked for proper
|
||
structure and order.
|
||
|
||
The other important task of the lexical analyzer is to
|
||
build a symbol table. This is a table of all the
|
||
identifiers (variable names, procedures, and constants)
|
||
used in the program. When an identifier is first
|
||
recognized by the analyzer, it is inserted into the symbol
|
||
table, along with information about its type, where it is
|
||
to be stored, and so forth. This information is used in
|
||
subsequent passes of the compiler.
|
||
|
||
Syntax Analysis
|
||
|
||
After the lexical analyzer translates a program into tokens
|
||
of keywords, variables, constants, symbols and logical
|
||
operators, the compiler makes its next pass. To describe
|
||
what happens during this function, I will briefly explain
|
||
grammars.
|
||
|
||
Grammars. Like any language, programming languages have a
|
||
set of rules governing the structure of the program. Each
|
||
different computer language has its own grammar which makes
|
||
it unique. Some grammars are complex (PL/I) and others are
|
||
relatively easy (Pascal). The programmer must observe all
|
||
the structural rules of a language to make logical sense to
|
||
the computer. The next step of the compiling process,
|
||
parsing, checks to be sure all the rules were followed.
|
||
|
||
Parsing. The parsing routines of a compiler check to see
|
||
that the program is written correctly (according to the
|
||
language rules). The parser reads in the tokens generated
|
||
by the lexical analyzer and compares them to the set
|
||
grammar of the programming language. If the program
|
||
follows the rules of the language, then it is syntactically
|
||
correct. When the parser encounters a mistake, it issues a
|
||
warning or error message and tries to continue. Some
|
||
|
||
parsers try to correct a faulty program, others do not.
|
||
When the parser reaches the end of the token stream, it
|
||
will tell the compiler that either the program is
|
||
grammatically correct and compiling can continue or the
|
||
program contains too many errors and compiling must be
|
||
aborted. If the program is grammatically correct, the
|
||
parser will call for semantic routines.
|
||
|
||
Semantic Routines. The semantic routines of a compiler
|
||
perform two tasks: checking to make sure that each series
|
||
of tokens will be understood by the computer when it is
|
||
fully translated to machine code, and converting the series
|
||
of tokens one step closer to machine code. The first task
|
||
takes a series of tokens, called a production, and checks
|
||
it to see if it makes sense. For example, a production may
|
||
be correct as far as the parser is concerned, but the
|
||
semantic routines check whether the variables have been
|
||
declared, and are of the right type, etc. If the
|
||
production makes sense, the semantic routine reduces the
|
||
production for the next phase of compilation, code
|
||
generation. Most of the code for the compiler lies here in
|
||
the semantic routines and thus takes up a majority of the
|
||
compilation time.
|
||
|
||
Summary. Two major routines comprise syntax analysis: the
|
||
parsing routine and the semantic routine. The parser
|
||
checks for the correct order of the tokens and then calls
|
||
the semantic routines to check whether the series of tokens
|
||
(a production) will make sense to the computer. The
|
||
semantic routine then reduces the production another step
|
||
toward complete translation to machine code.
|
||
|
||
Code Generation
|
||
|
||
The code generation process determines how fast the code
|
||
will run and how large it will be. The first part of code
|
||
generation involves optimization, and the second involves
|
||
actual machine code generation.
|
||
|
||
Optimization. In this step, the compiler tries to make the
|
||
intermediate code generated by the semantic routines more
|
||
efficient. This process can be very slow and may not be
|
||
able to improve the code much. Because of this, many
|
||
compilers don't include optimizers, and, if they do, they
|
||
look only for areas that are easy to optimize.
|
||
|
||
Code Generation. This process takes the intermediate code
|
||
produced by the optimizer (or semantic routines if the
|
||
compiler has no optimizer) and generates virtual machine
|
||
code, which in our case is 8088 machine code. It is this
|
||
part of the compilation phase that is machine dependent.
|
||
Each type of computer has an operating system that
|
||
processes virtual machine code differently; therefore, the
|
||
code generator must be different for each type of
|
||
computer. The choice of instructions for the fastest
|
||
execution and smallest code size are made at this point,
|
||
according to the machine's operating system. Each code
|
||
generator is designed specifically for the machine and
|
||
operating system the final code will run on.
|
||
|
||
If the program is free from syntactical errors, code
|
||
generation should take place without any problem. When the
|
||
code generator is finished, the code produced will be in
|
||
8088 machine code, but the format of the code is not yet
|
||
executable. It is in a format (an .OBJ file in our case)
|
||
that is ready to go to a linker, which will create an
|
||
executable *.EXE or *.COM file from the machine code the
|
||
compiler has generated.
|
||
|