EVERYTHING YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS

This document has been created to provide the net.community with some
detailed information about mathematical coprocessors for the Intel 80x86 CPU
family. It may also help to answer some of the FAQs (frequently asked
questions) about this topic. The primary focus of this document is on
80387-compatible chips, but there is also some information on the other chips
in the 80x87 family and the Weitek family of coprocessors. Care was taken to
make the information included as accurate as possible. If you think you have
discovered erroneous information in this text, or think that a certain detail
needs to be clarified, or want to suggest additions, feel free to contact me
at:

   S_JUFFA@IRAVCL.IRA.UKA.DE

or at my SnailMail address:

   Norbert Juffa
   Wielandtstr. 14
   7500 Karlsruhe 1
   Germany


This is the fifth version of this document (dated 01-13-93) and I'd like
to thank those who have helped improve it by commenting on the previous
versions:

   Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM), Peter Forsberg
   (peter@vnet.ibm.com), Richard Krehbiel (richk@grevyn.com), Arto
   Viitanen (av@cs.uta.fi), Jerry Whelan (guru@stasi.bradley.edu),
   Eric Johnson (johnson%camax01@uunet.UU.NET), Warren Ferguson
   (ferguson@seas.smu.edu), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg
   (tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John
   Levine (johnl@iecc.cambridge.ma.us), David Hough (dgh@validgh.com),
   Duncan Murdoch (dmurdoch@mast.QueensU.CA), Benjamin Eitan
   (benny.iil.intel.com)

A very special thanks goes to David Ruggiero (osiris@halcyon.halcyon.com),
who did a great job editing and formatting this article. Thanks David!


Contents of this document
-------------------------

 1) What are math coprocessors?
 2) How PC programs use a math coprocessor
 3) Which applications benefit from a math coprocessor
 4) Potential performance gains with a math coprocessor
 5) How various math coprocessors work
 6) Coprocessor emulator software
 7) Installing a math coprocessor
 8) Detailed description and specifications for all available math
    coprocessor chips
 9) Finding out which coprocessor you have (the COMPTEST program)
10) Current coprocessor prices and purchasing advice
11) The coprocessor benchmark programs (performance comparisons of
    available math coprocessors using various CPUs)
12) Clock-cycle timings for each coprocessor instruction
13) Accuracy tests and IEEE-754 conformance for various coprocessors
14) Accuracy of transcendental function calculations for various coprocessors
15) Compatibility tests with Intel's 387DX / the SMDIAG program
16) References (literature)
17) Addresses of manufacturers of math coprocessors
18) Appendix A: Test programs for partial compatibility and accuracy checks
19) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP



===========================
What are math coprocessors?
===========================

A coprocessor in the traditional sense is a processor, separate from the main
CPU, that extends the capabilities of a CPU in a transparent manner. This
means that from the program's (and programmer's) point of view, the CPU and
coprocessor together look like a single, unified machine.

The 80x87 family of math coprocessors (also known as MCPs [Math
CoProcessors], NDPs [Numerical Data Processors], NPXs [Numerical Processor
eXtensions], or FPUs [Floating-Point Units], or simply "math chips") are
typical examples of such coprocessors. The 80x86 CPUs, with the exception of
the 80486 (which has a built-in FPU), can only handle 8, 16, or 32 bit
integers as their basic data types. However, many PC-based applications
require the use of not only integers, but also floating-point numbers. Simply
put, floating-point numbers enable a binary representation of not only
integers, but also fractional values over a wide range. Floating-point
numbers are commonly used in scientific applications, where very small
(e.g., Planck's constant) and very large numbers (e.g., speed of light) must
be accurately expressed. But floating-point numbers are also useful for
business applications such as computing interest, and in the geometric
calculations inherent in CAD/CAM processing.

Because the instruction sets of all 80x86 CPUs directly support only integers
and calculations upon integers, floating-point numbers and operations on them
must be programmed indirectly by using series of CPU integer instructions.
This means that computations involving floating-point numbers are far slower
than ordinary integer calculations. And this is where the 80x87
coprocessors come in: adding an 80x87 to an 80x86-based system augments the
CPU architecture with eight floating-point registers, five additional data
types and over 70 additional instructions, all designed to deal directly with
floating-point numbers as a basic data type. This removes the 'penalty' for
floating-point computations, and greatly increases overall system performance
for applications which depend heavily on these calculations.

In addition to being able to quickly execute load/store operations on
floating-point numbers, the 80x87 coprocessors can directly perform all the
basic arithmetic operations on them. Besides "knowing" how to add, subtract,
multiply and divide floating-point numbers, they can also operate on them to
perform comparisons, square roots, transcendental functions (such as logarithms
and sine/cosine/tangent), and compute their absolute value and remainder.

Like most things in life, floating-point arithmetic has been standardized.
The relevant standard (to which I will refer quite often in this document) is
the "IEEE-754 Standard for Binary Floating-Point Arithmetic" [10,11]. The
standard specifies numeric formats, value sets and how the basic arithmetic
(+,-,*,/,sqrt, remainder) has to work. All the coprocessors covered in this
document claim full or at least partial compliance with the IEEE-754
standard.



=================================================
How PC programs use 80x87 and Weitek coprocessors
=================================================

The basic data type used by all 80x87 coprocessors is an 80-bit long
floating-point number. This data type (called "temporary real" or "double
extended precision") can directly represent numbers which range in size
between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932
including denormal numbers) where '^' denotes the power operator. (For those
familiar with floating-point formats, this format has 64 mantissa bits, 15
exponent bits and 1 sign bit, for a total of 80 bits.) This format provides
a precision of about 19 decimal places. 80x87s can also handle additional
data types that are converted to/from the internal format upon being loaded
into or stored from the coprocessor. These include 16 bit, 32 bit, and 64 bit
integers as well as a BCD (binary coded decimal) data type occupying
10 bytes and providing 18 decimal digits.

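To make the layout of the 80-bit format a bit more concrete, here is a small
Python sketch that decodes a 10-byte double extended value into an exact
fraction. The names decode_extended and encode_fields are made up for this
illustration, the little-endian byte order is the one the 80x87 uses for
operands in memory, and infinities and NaNs are deliberately left out.

```python
from fractions import Fraction

EXP_BIAS = 16383  # bias of the 15-bit exponent field


def decode_extended(raw: bytes) -> Fraction:
    """Decode a 10-byte little-endian double extended number (finite only)."""
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    bits = int.from_bytes(raw, "little")
    mantissa = bits & ((1 << 64) - 1)     # 64 mantissa bits, integer bit stored
    exponent = (bits >> 64) & 0x7FFF      # 15 exponent bits
    sign = -1 if (bits >> 79) & 1 else 1  # 1 sign bit
    if exponent == 0x7FFF:
        raise ValueError("infinity or NaN is not handled in this sketch")
    # Denormals (exponent field 0) share the scale of the smallest normal.
    scale = max(exponent, 1) - EXP_BIAS
    return sign * Fraction(mantissa, 1 << 63) * Fraction(2) ** scale


def encode_fields(sign: int, exponent: int, mantissa: int) -> bytes:
    """Hypothetical helper: pack raw field values into 10 bytes."""
    return ((sign << 79) | (exponent << 64) | mantissa).to_bytes(10, "little")


one = decode_extended(encode_fields(0, 0x3FFF, 1 << 63))
largest = decode_extended(encode_fields(0, 0x7FFE, (1 << 64) - 1))
smallest_normal = decode_extended(encode_fields(0, 1, 1 << 63))
smallest_denormal = decode_extended(encode_fields(0, 0, 1))
```

The last three values correspond to the limits quoted above: largest is about
1.19*10^4932, smallest_normal about 3.36*10^-4932, and smallest_denormal
about 3.65*10^-4951. Exact fractions are used because these magnitudes do not
fit into an ordinary double-precision number.
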
The 80x87 also supports two additional floating-point types. The short real
data type (also called "single-precision") has 32 bits that split into 23
mantissa bits, 8 exponent bits and a sign bit. By using the "hidden bit"
technique, the effective length of the mantissa is increased to 24 bits. (The
hidden bit technique exploits the fact that for normalized floating-point
numbers, the mantissa m always is in the range 1 <= m < 2. Since the first
mantissa bit represents the integer part of the mantissa, it is always set
for normalized numbers, and therefore need not be stored, as it is guaranteed
to always be 1.) The IEEE single-precision format provides a precision of
about 6-7 decimal places and can represent numbers between 1.17*10^-38 and
3.40*10^38 (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long
real, or double-precision, data type has 64 bits, consisting of 52 mantissa
bits, 11 exponent bits, and the sign bit. It provides 15-16 decimal digits of
precision and can handle numbers from 2.22*10^-308 to 1.79*10^308
(4.94*10^-324 to 1.79*10^308 including denormal numbers). (This format also
uses the hidden bit technique to provide effectively 53 mantissa bits.)

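The hidden bit technique is easiest to see by taking a single-precision
number apart. The following Python sketch (the function names are
hypothetical, chosen only for this example) splits a value into its three
fields using the standard struct module and then rebuilds it, adding the
implicit leading 1 for normalized numbers.

```python
import struct


def split_single(x: float):
    """Return (sign, biased exponent, stored 23-bit fraction) of x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def rebuild_single(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild the value from its fields (finite numbers only)."""
    if exponent == 0:
        # Denormal: no hidden bit, fixed exponent of -126.
        mantissa, scale = fraction / 2**23, -126
    else:
        # Normalized: the hidden bit supplies the leading 1 of the mantissa.
        mantissa, scale = 1 + fraction / 2**23, exponent - 127
    return (-1) ** sign * mantissa * 2.0**scale


fields = split_single(0.15625)   # 0.15625 = 1.25 * 2^-3
value = rebuild_single(*fields)  # gives back 0.15625
```

Only the fractional part .25 of the mantissa 1.25 is stored in the 23-bit
field; the leading 1 is supplied when the number is rebuilt. With an exponent
field of 0 there is no hidden bit, which is what makes denormal numbers down
to 2^-149 (about 1.40*10^-45) representable.
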
The eight registers in the 80x87 are organized in a stack-like manner which
takes some time getting used to if one programs the coprocessor directly in
assembly language. However, nowadays the compilers or interpreters for most
high level languages (HLLs) can give a programmer easy access to the
coprocessor's data types and use its instructions, so there is not much
need to deal directly with the rather unusual architecture of the 80x87.


The architecture of the Weitek chips differs significantly from the 80x87.
Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors in
that they do not transparently extend the CPU architecture; rather, they
could be described as highly-specialized, memory-mapped IO devices. But as
the term "coprocessor" has been traditionally used for these chips, they will
be referred to as such here.

The Weitek coprocessors have a RISC-like architecture which has been tuned
for maximum performance. Only a small instruction set has been implemented in
the chip, but each instruction executes at a very high speed (usually only a
few clock cycles each). Instructions available include load/store, add,
subtract, subtract reverse, multiply, multiply and negate, multiply and
accumulate, multiply and take absolute value, divide reverse, negate,
absolute value, compare/test, convert fix/float, and square root. In contrast
to the 80x87 family, the Weitek Abacus does not support a double extended
format, has no built-in transcendental functions, and does not support
denormals. The resources required to implement such features have instead
been devoted to implementing the basic arithmetic operations as fast as
possible.

While the 80x87 coprocessors perform all internal calculations in double
extended precision and therefore have about the same performance for single
and double-precision calculations, the Weitek features explicit single and
double-precision operations. For applications that require only
single-precision operations, the Weitek can therefore provide very high
performance, as single-precision operations are about twice as fast as their
double-precision counterparts. Also, since the Weitek Abacus has more
registers than the 80x87 coprocessors (31 versus 8), values can be kept in
registers more often and have to be loaded from memory less frequently. This
also leads to performance gains.

The Weitek's register file consists of 31 32-bit registers, each one capable
of holding an IEEE single-precision number. Pairs of consecutive
single-precision registers can also be used as 64-bit IEEE double-precision
registers; thus there are 15 double-precision registers. The Weitek register
file is organized conventionally, like the register file in the 80386, and
does not use the special stack-like organization of the 80x87 coprocessors.

To the main CPU, the Weitek Abacus appears as a 64 KB block of memory
starting at physical address 0C0000000h. Each address in this range
corresponds to a coprocessor instruction. Accessing a specified memory
location within this block with a MOV instruction causes the corresponding
Weitek instruction to be executed. (The instructions have been cleverly
assigned to memory locations in such a way that loads to consecutive
coprocessor registers can make use of the 386/486 MOVS string instruction.)
This memory-mapped interface is much faster than the IO-oriented protocol
that is used to couple the CPU to an 80287 or 80387 coprocessor. The Weitek's
memory block can actually be assigned to any logical address using the MMU
(memory management unit) in the 386/486's protected and virtual modes. This
also means that the Weitek Abacus *cannot* be used in the real mode of those
processors, since its physical starting address (0C0000000h) is not within
the 1 MByte address range and the MMU is inoperable in real mode. However,
DOS programs can make use of the Weitek by using a DOS extender or a memory
manager (such as QEMM or EMM386) that runs in protected/virtual mode itself
and can therefore map the Weitek's memory block to any desired location in
the 1 MByte address range.

Typically the FS segment register is then set up to point to the Weitek's
memory block. On the 80486, this technique has severe drawbacks, as using the
FS: prefix takes an additional clock cycle, thereby nearly halving the
performance of the 4167. Most DOS-based compilers exhibit this problem, so
the only way around it is to code in assembly language [75]. The Weitek
Abacus 3167 and 4167 are also supported by the UNIX operating system [33].



==========================================================
Which application programs benefit from a math coprocessor
==========================================================

According to the Intel 387DX User's Guide, there are more than 2100
commercial programs that can make use of a 387-compatible coprocessor. Every
program that uses floating-point arithmetic somewhere and contains the
instructions to support an 80x87 or Weitek chip can gain speed by installing
one. However, the speedup will vary from program to program (and even within
the same program) depending on how computation-intensive the program or
operation within the program is. Typical applications that benefit from the
use of a math coprocessor are:

   - CAD programs (AutoCAD, VersaCAD, GenericCAD)
   - Spreadsheet programs (Lotus 1-2-3, Excel, Quattro, Wingz)
   - Business graphics programs (Arts&Letters, Freedom of Press, Freelance)
   - Mathematical analysis and statistical programs (Mathematica, TKSolver,
     SPSS/PC, Statgraphics)
   - Database programs (dBase IV, FoxBase, Paradox, Revelation)

Note that for spreadsheets and databases, a coprocessor only helps if some
kind of floating-point computation is performed; this is true more often for
spreadsheets than for databases. Also note that the speed of many programs
depends quite heavily on factors such as the speed of the graphics adapter
(CAD) or the disk performance (databases), so the computational performance
is only a (small) part of the total performance of the application. There are
some programs that won't run without a coprocessor, among them AutoCAD (R10
and later) and Mathematica.

Most GUIs (graphical user interfaces) such as Microsoft Windows or the OS/2
Presentation Manager do *not* gain additional speed from using a
*mathematical* coprocessor, since their graphics operations only use integer
arithmetic [71]. They *will* benefit from a graphics board with a graphics
"coprocessor" that speeds up certain common graphics operations such as
BitBlt or line drawing. A few GUIs used on PCs, such as X-Windows, use a
certain amount of floating-point arithmetic for operations such as arc
drawing. However, the use of floating-point operations in X-Windows seems to
have decreased significantly in versions after X11R3, so the overall
performance impact of a coprocessor is small [72]. Applications running under
any GUI may take advantage of a math coprocessor, of course (for example,
Microsoft Excel running under Windows).

While support for 80x87 coprocessors is very common in application programs,
the Weitek Abacus coprocessors do not enjoy such widespread support. Due to
their higher price, only a few high-end PCs have been equipped with Weitek
coprocessors. Some machines, such as IBM's PS/2 series, do not even have
sockets to accommodate them. Therefore, most of the programs that support
these coprocessors are also high-end products, like AutoCAD and Versacad-386.



==============================================
Potential performance gains with a coprocessor
==============================================

The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX
coprocessor has a demonstration program that shows the speedup of certain
application programs when run with the Intel coprocessor versus a system with
no coprocessor:

   Application          Time w/o 387    Time w/387    Speedup

   Arts&Letters            87.0 sec       34.8 sec      150%
   Quattro Pro              8.0 sec        4.0 sec      100%
   Wingz                   17.9 sec        9.1 sec       97%
   Mathematica            420.2 sec      337.0 sec       25%


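The speedup column in the table above appears to follow the usual definition:
the ratio of the two run times minus one, expressed as a percentage. The
short Python sketch below recomputes the column from the timings given; the
function name speedup_percent is simply a label chosen for this example.

```python
def speedup_percent(time_without: float, time_with: float) -> float:
    """Percentage by which the run with a coprocessor is faster."""
    return (time_without / time_with - 1.0) * 100.0


# Timings in seconds, taken from the Intel demonstration program table.
intel_demo = {
    "Arts&Letters": (87.0, 34.8),
    "Quattro Pro": (8.0, 4.0),
    "Wingz": (17.9, 9.1),
    "Mathematica": (420.2, 337.0),
}

speedups = {
    name: round(speedup_percent(t_without, t_with))
    for name, (t_without, t_with) in intel_demo.items()
}

for name, percent in speedups.items():
    print(f"{name:<14}{percent:>4}%")
```

A speedup of 100% therefore means that the program runs twice as fast with
the coprocessor, not that its run time drops to zero.
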
The following table is an excerpt from [70]:

   Application          Time w/o 387    Time w/387    Speedup

   Corel Draw             471.0 sec      416.0 sec       13%
   Freedom Of Press       163.0 sec       77.0 sec      112%
   Lotus 1-2-3            257.0 sec       43.0 sec      498%


The following table is an excerpt from [25]:

   Application          Time w/o 387    Time w/387    Speedup

   Design CAD, Test 1      98.1 sec       50.0 sec       96%
   Design CAD, Test 2      75.3 sec       35.0 sec      115%
   Excel, Test 1            9.2 sec        6.8 sec       35%
   Excel, Test 2           12.6 sec        9.3 sec       35%


Note that coprocessor performance also depends on the motherboard, or more
specifically, the chipset used on the motherboard. In [34] and [35]
identically configured motherboards using different 386 chipsets were tested.
Among other tests, a coprocessor benchmark based on a fractal computation was
run and its execution time recorded. The following tables, which show that
coprocessor performance varies with the chipset, have been copied from these
articles in abridged form:

                   Cyrix                                      Cyrix
   chipset         387+                 chipset               83D87

   Opti,  40 MHz   24.57 sec   97.0%    PC-Chips, 33 MHz      26.97 sec   93.0%
   Elite, 40 MHz   24.46 sec   97.4%    UMC,      33 MHz      27.69 sec   90.5%
   ACT,   40 MHz   23.84 sec  100.0%    Headland, 33 MHz      25.08 sec  100.0%
   Forex, 40 MHz   23.84 sec  100.0%    Eteq,     33 MHz      27.38 sec   91.6%


This shows that performance of the same coprocessor can vary by up to ~10%
depending on the chipset used on your board, at least for 386 motherboards
(similar numbers for 286, 386SX, and 486 are, unfortunately, not available).
The benchmarks for this article were run on a motherboard with the Forex
chipset, one of the fastest 386 chipsets available, and not only with respect
to floating-point performance [35].



==================================
How various math coprocessors work
==================================

In any 80x86 system with an 80x87 math coprocessor, CPU instructions and
coprocessor instructions are executed concurrently. This means that the CPU
can execute CPU instructions while the coprocessor executes a coprocessor
instruction at the same time. The concurrency is restricted somewhat by the
fact that the CPU has to aid the coprocessor in certain operations. As the
CPU and the coprocessor are fed from the same instruction stream and both
may operate on the same data, there has to be a synchronizing mechanism
between the CPU and the coprocessor.


The 8087
--------
In 8086/8088 systems with 8087 coprocessors, both chips look at every opcode
coming in from the bus. To do this, both chips have the same BIU (bus
interface unit) and the 8086 BIU sends the status signals of its prefetch
queue to the 8087 BIU. This ensures that both processors always decode the
same instructions in parallel. Since all coprocessor instructions start with
the bit pattern 11011, it is easy for the 8087 to ignore all other
instructions. Likewise the CPU ignores all coprocessor instructions, unless
they access memory. In this case, the CPU computes the address of the LSB
(least significant byte) of the memory operand and does a dummy read. The
8087 then takes the data from the data bus. If more than one memory access is
needed to load a memory operand, the 8087 requests the bus from the CPU,
generates the consecutive addresses of the operand's bytes and fetches them
from the data bus. After completing the operation, the 8087 hands bus control
back to the CPU. Since the 8087 and the CPU are hooked up to the same
synchronous bus, they must run at the same speed. This means that with the
8087, only synchronous operation of CPU and coprocessor is possible.

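The test for the bit pattern 11011 mentioned above amounts to looking at the
upper five bits of the first opcode byte. As a small Python illustration (the
names are invented for this example and do not describe the actual decoding
hardware):

```python
ESC_PREFIX = 0b11011  # upper five bits shared by all coprocessor opcodes


def is_coprocessor_opcode(first_byte: int) -> bool:
    """True if the first opcode byte begins with the bit pattern 11011."""
    return (first_byte >> 3) == ESC_PREFIX


# Collect every byte value that is recognized as a coprocessor opcode.
escape_bytes = [b for b in range(256) if is_coprocessor_opcode(b)]
```

The list escape_bytes contains exactly the eight values D8h through DFh; every
other first byte, including the WAIT opcode 9Bh, belongs to a CPU
instruction.
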
Another 8087 coprocessor instruction can only be started if the previous one
has been completed in the NEU (numerical execution unit) of the 8087. To
prevent the 8086 from decoding a new coprocessor instruction while the 8087
is still executing the previous coprocessor instruction, a coding mechanism
is employed: All 8087-capable compilers and assemblers automatically
generate a WAIT instruction before each coprocessor instruction. The WAIT
instruction tests the CPU's /TEST pin and suspends execution until its input
becomes "LOW". In all 8086/8087 systems, the 8086 /TEST pin is connected to
the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it
forces its BUSY pin "HIGH"; thus, the WAIT opcode preceding the coprocessor
instruction stops the CPU until any still-executing coprocessor instruction
has finished.

The same synchronization is used before the CPU accesses data that was
written by the coprocessor. A WAIT instruction after any coprocessor
instruction that writes to memory causes the CPU to stop until the
coprocessor has completed transfer of the data to memory, after which the CPU
can safely access it.


The 80287
---------
The 80287 coprocessor-CPU interface is totally different from the 8087
design. Since the 80286 implements memory protection via an MMU based on
segmentation, it would have been much too expensive to duplicate the whole
memory protection logic on the coprocessor, which an interface solution
similar to the 8087 would have required. Instead, in an 80286/80287 system,
the CPU fetches and stores all opcodes and operands for the coprocessor.
Information is then passed through the CPU ports F8h-FFh. (As these ports are
accessible under program control, care must be taken in user programs not to
accidentally perform write operations to them, as this could corrupt data in
the math coprocessor.)

The 8086/8087 combination can be characterized as a cooperation of partners
with equal rights, while the 80286/287 is more a master-slave relationship.
This makes synchronization easier, since the complete instruction and data
flow of the coprocessor goes through the CPU. Before executing most
coprocessor instructions, the 80286 tests its /BUSY pin, which is tied to the
287 coprocessor and signals if the 80287 is still executing a previous
coprocessor instruction or has encountered an exception. The 80286 then waits
until the /BUSY signal goes "low" before loading the next coprocessor
instruction into the 80287. Therefore, a WAIT instruction before every
coprocessor instruction is not required. These WAITs are permissible, but not
necessary, in 80287 programs. The second form of WAIT synchronization (after
the coprocessor has written a memory operand) *is* still necessary on 286/287
systems.

The execution unit of the 80287 is practically identical to that of the 8087;
that is, nearly all coprocessor instructions execute in the same number of
clock cycles on both coprocessors. However, due to the additional overhead of
the 80287's CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz
80286/80287 combination can have lower floating-point performance than an
8086/8087 system running at the same speed. Additionally, older 286 boards
were often configured to run the coprocessor at only 2/3 the speed of the
CPU, making use of the ability of the 80287 to run asynchronously: The 80287
has a CKM pin that causes the incoming system clock to be divided by three
for the coprocessor if it is tied to ground. The 80286 always divides the
system clock by two internally, hence the final ratio of 2/3. However, when
the CKM (ClocK Mode) pin is tied high on the 80287, it does not divide the
CLK input. This feature has been exploited by the makers of coprocessor speed
sockets. These sockets tie CKM high and supply their own CLK signal with a
built-in oscillator, thereby allowing the 80287 or compatible to run at a
much higher speed than the CPU. With an IIT or Cyrix 287 one can have a 20
MHz coprocessor running with an 8 MHz 80286! Note, however, that the
floating-point performance of such a configuration does not scale linearly
with the coprocessor clock, since all the data has to be passed through the
much slower CPU. If the coprocessor executes mostly simple instructions (such
as addition and multiplication), doubling the coprocessor clock to 20 MHz in
a 10 MHz system does not show any performance increase at all [24].

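The clock arithmetic behind the 2/3 ratio can be written down in a few lines
of Python. This is only a sketch of the dividers described above; the
function names are invented, and the 16 MHz system clock is an assumed
example value (the clock an "8 MHz" 80286 would need, given its internal
divide-by-two).

```python
from fractions import Fraction


def cpu_clock(system_clk_mhz: Fraction) -> Fraction:
    """The 80286 always divides the incoming system clock by two."""
    return system_clk_mhz / 2


def coprocessor_clock(clk_input_mhz: Fraction, ckm_high: bool) -> Fraction:
    """80287: CLK input is used undivided if CKM is high, else divided by 3."""
    return clk_input_mhz if ckm_high else clk_input_mhz / 3


system_clk = Fraction(16)  # assumed system clock of an "8 MHz" 286 board
cpu = cpu_clock(system_clk)                          # 8 MHz
fpu = coprocessor_clock(system_clk, ckm_high=False)  # 16/3, about 5.33 MHz
ratio = fpu / cpu                                    # 2/3

# Speed socket: CKM tied high, the socket's own oscillator drives CLK.
fpu_socket = coprocessor_clock(Fraction(20), ckm_high=True)  # 20 MHz
```

With CKM tied to ground the coprocessor in this example runs at only about
5.33 MHz next to an 8 MHz CPU, while the speed socket case decouples the two
clocks completely.
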
The Intel 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of
a 387 coprocessor, but are pin-compatible with the original 287. These chips
divide the system clock by two internally, as opposed to three in the
original 80287. Since the 80286 also divides the system clock by two, they
usually run synchronously with respect to the CPU, although they can also be
run asynchronously.


The 80387
---------
The coprocessor interface in 80386/80387 systems is very similar to the one
found in 286/287 systems. However, to prevent corruption of the coprocessor's
contents by programming errors, the IO ports 800000F8h-800000FFh are used,
which are not accessible to programs. The CPU/coprocessor interface has been
optimized and uses full 32-bit transfers; the interface overhead has been
reduced to about 14-20 clock cycles. For some operations on the 387 'clones'
that take less than about 16 clock cycles to complete, this overhead
effectively limits the execution rate of coprocessor instructions. The only
sensible solution to provide even higher floating-point performance was to
integrate the CPU and coprocessor functionality onto the same chip, which
is exactly what Intel did with the 80486 CPU. The FPU in the 486 also benefits
from the instruction pipelining and from the on-chip cache.



=====================
Coprocessor emulators
=====================

In the absence of a coprocessor, floating-point calculations are often
performed by a software package that simulates its operations. Such a program
is called a coprocessor emulator. Simulating the coprocessor has the
advantage for application programs that identical code can be generated for
use with either the coprocessor or the emulator, so that it's possible to
write programs that run on any system without regard to whether a coprocessor
is present or not. Whether the program will use an actual coprocessor or
software emulating it can easily be determined at run-time by detecting the
presence or absence of the coprocessor chip.

Two approaches to interface an 80x87 emulator to programs are common. The
first method makes use of the fact that all coprocessor instructions start
with the same five-bit pattern 11011. Thus the first byte of a coprocessor
instruction will be in the range D8-DF hexadecimal. In addition, coprocessor
instructions usually are preceded by a WAIT instruction (opcode 9Bh) which is
one byte long (the reason for doing this has been described in the previous
chapter dealing with the operating details of the 80x87). One common approach
is to replace the WAIT instruction and the first byte of the coprocessor
instruction with one of eight interrupt instructions; the remaining bytes
of the coprocessor instruction are left unchanged. Interrupts 34 to 3B
hexadecimal are used for this emulation technique. (Note that the sequences
9B D8 ... 9B DF can be easily converted to the interrupt instructions CD 34
... CD 3B by simple addition and subtraction of constants.) The compiler or
assembler initially produces code that contains these appropriate interrupt
calls instead of the coprocessor instructions. If a hardware coprocessor is
detected at run-time, the emulator interrupts point to a short routine that
converts the interrupt calls back to coprocessor instructions (yes, this
is known as "self-modifying code"). If no coprocessor is found, the interrupts
point to the emulation package, which examines the byte(s) following the
interrupt instruction to determine which floating-point operation to perform.
This method is used by many compilers, including those from Microsoft and
Borland. It works with every 80x86 CPU from the 8086/8088 on.

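The "simple addition and subtraction of constants" works out as follows:
9Bh + 32h = CDh for the first byte, and D8h - A4h = 34h for the second. The
Python sketch below shows only this byte arithmetic; the function names are
made up, and a real run-time library would of course patch the code in memory
from within its interrupt handler rather than work on byte strings.

```python
WAIT = 0x9B  # opcode of the WAIT instruction
INT = 0xCD   # opcode of the INT instruction


def to_emulator_call(code: bytes) -> bytes:
    """Turn the two bytes 9B D8..DF into the interrupt call CD 34..3B."""
    wait, esc = code
    if wait != WAIT or not 0xD8 <= esc <= 0xDF:
        raise ValueError("not a WAIT + coprocessor opcode pair")
    return bytes([wait + 0x32, esc - 0xA4])


def to_coprocessor_instruction(code: bytes) -> bytes:
    """Inverse fixup: turn CD 34..3B back into 9B D8..DF."""
    int_op, vector = code
    if int_op != INT or not 0x34 <= vector <= 0x3B:
        raise ValueError("not an emulator interrupt call")
    return bytes([int_op - 0x32, vector + 0xA4])


patched = to_emulator_call(bytes([0x9B, 0xD9]))   # CD 35
restored = to_coprocessor_instruction(patched)    # 9B D9 again
```

Because both forms occupy exactly two bytes, the fixup can be done in place
without moving any of the surrounding code.
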
The second method to interface an emulator is only available on 286/386/486
|
|||
|
machines. If the emulation bit in the machine status word of these processors
|
|||
|
is set, the processors will generate an interrupt 7 whenever a coprocessor
|
|||
|
instruction is encountered. The vector for this interrupt will have been set
|
|||
|
up to point at an emulation package that decodes the instruction and performs
|
|||
|
the desired operation. This approach has the advantage that the emulator
|
|||
|
doesn't have to be included in the program code, but can be loaded once (as a
|
|||
|
TSR or device driver) and then used by every program that requires a
|
|||
|
coprocessor. Emulation via interrupt 7 is transparent, which means that
|
|||
|
programs containing coprocessor instructions execute just as if a coprocessor
were present, only slower. This approach is taken by the public domain EM87
|
|||
|
emulator, the shareware program Q387, and the commercial Franke387 emulator,
|
|||
|
for example. Even programs that require a coprocessor to run, such as AutoCAD,
are 'fooled' into believing that a coprocessor is present by emulators using
INT 7.
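
As a rough sketch of the flag involved in this second method: the emulation
bit (EM) is bit 2 of the machine status word. The Python below only models
the bit manipulation; on a real machine the status word is read and written
with the SMSW and LMSW instructions, and the helper names here are
hypothetical.

```python
MSW_PE = 0x0001  # bit 0: protection enable
MSW_MP = 0x0002  # bit 1: monitor coprocessor
MSW_EM = 0x0004  # bit 2: emulate coprocessor
MSW_TS = 0x0008  # bit 3: task switched


def enable_emulation(msw):
    """Return the status word with the EM bit set."""
    return (msw | MSW_EM) & 0xFFFF


def disable_emulation(msw):
    """Return the status word with the EM bit cleared."""
    return msw & ~MSW_EM & 0xFFFF


def emulation_enabled(msw):
    """True if coprocessor instructions are redirected to INT 7."""
    return bool(msw & MSW_EM)


msw = enable_emulation(0x0000)
print(hex(msw), emulation_enabled(msw))  # 0x4 True
```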
|
|||
|
|
|||
|
Operating systems such as OS/2 2.0 and Windows 3.1 provide coprocessor
|
|||
|
emulations using INT 7 automatically if they do not find a coprocessor to be
|
|||
|
installed. The emulator in Windows doesn't seem to be very fast, as people
|
|||
|
who have ported their Turbo Pascal programs from the TP 6.0 DOS compiler
|
|||
|
(using the emulation built into the TP 6.0 run-time library) to the TPW 1.5
|
|||
|
Windows compiler (using MS Windows' emulator) have noticed. Slowdowns of as
|
|||
|
much as a factor of five have been reported [79].
|
|||
|
|
|||
|
The size of the emulator used by TP 6.0 is about 9.5 KB, while EM87 occupies
|
|||
|
about 15.8 KB as a TSR, and Franke387 uses about 13.4 KB as a device driver.
|
|||
|
Note that Franke387 and especially EM87 model a real coprocessor much more
|
|||
|
closely than Turbo Pascal's emulator does. In particular, EM87 supports
|
|||
|
denormal numbers, precision control, and rounding control. The emulator in TP
|
|||
|
6.0 does not implement these features. The version of Franke387 tested (V2.4)
|
|||
|
supports denormals in single and double-precision, but not double extended
|
|||
|
precision, and it supports precision control, but not rounding control.
|
|||
|
The recently introduced shareware program Q387 only runs on 386, 386SX, 486SX
|
|||
|
and compatible processors. The program loads completely into extended memory
|
|||
|
and uses about 330 KB. To enable INT 7 trapping to a service routine in
|
|||
|
extended memory it needs to run with a memory manager (e.g. EMM386, QEMM,
|
|||
|
or 386MAX). The huge size of the program stems from the fact that it was
|
|||
|
solely optimized for speed, assuming that extended memory is a cheap resource.
|
|||
|
Presumably it uses large tables to speed computations. Intel's E80287 program
|
|||
|
is supposed to be a 100% exact emulation of the 80287 coprocessor [44]. Note
|
|||
|
that the more closely a real coprocessor is modelled by the emulator, the
|
|||
|
slower the emulator runs and the larger the code for the emulator gets.
|
|||
|
|
|||
|
|
|||
|
Relative execution times of coprocessor vs. software emulators
|
|||
|
for selected coprocessor instructions
|
|||
|
|
|||
|
                    Intel 387DX   TP 6.0 Emulator   EM87 Emulator

FADD ST, ST(0)                1                26             104
FDIV [DWord]                  1                22             136
FXAM                          1                10              73
FYL2X                         1                33             102
FPATAN                        1                36             110
F2XM1                         1                38             110
|
|||
|
|
|||
|
|
|||
|
|
|||
|
The following table is an excerpt from [44]:
|
|||
|
|
|||
|
                    Intel 80287   Intel E80287 Emulator

FADD ST, ST(0)                1                      42
FDIV [DWord]                  1                     266
FXAM                          1                     139
FYL2X                         1                      99
FPATAN                        1                     153
F2XM1                         1                      41
|
|||
|
|
|||
|
|
|||
|
|
|||
|
The following has been adapted from [43] and merged with my own
|
|||
|
data:
|
|||
|
|
|||
|
                    Intel 8087   TP 6.0 Emul. (8086)   Intel Emul. (8086)

FADD ST, ST(0)               1                    20                   94
FDIV [DWord]                 1                    22                   82
FPTAN                        1                    18                  144
F2XM1                        1                     6                  171
FSQRT                        1                    44                  544
|
|||
|
|
|||
|
|
|||
|
|
|||
|
One of the reasons emulators are so slow is that they are often designed to
|
|||
|
run with every CPU from the 8086/8088 on upwards. This is the case with the
|
|||
|
emulators built into the compiler libraries of the Turbo Pascal 6.0 (also
|
|||
|
used by Turbo C/C++) and Microsoft C 6.0 compiler (probably also used in
|
|||
|
other Microsoft products) and is also true for the EM87 emulator in the
|
|||
|
public domain. By using code that can run on an 8086/8088, these emulators
|
|||
|
forego the speed advantage offered by the additional instructions and
|
|||
|
architectural enhancements (such as 32-bit registers) of the more advanced
|
|||
|
Intel 80x86 processors. A notable exception to this is the Franke387
|
|||
|
emulator, a commercial emulator that is also sold as shareware. It uses 386-
|
|||
|
specific 32-bit code and only runs on 386/386SX/486SX computers.
|
|||
|
|
|||
|
Besides being slow, coprocessor emulators have other drawbacks when compared
|
|||
|
with real coprocessors. Most of the emulators do not support the additional
|
|||
|
instructions that the 387-compatible coprocessors offer over the 80287.
|
|||
|
Often, some of the low-level stack-manipulating instructions like FDECSTP are
|
|||
|
not emulated. For example, [76] lists the coprocessor instructions not
|
|||
|
emulated by Microsoft's emulator (included in the MS-C and MS-FORTRAN
|
|||
|
libraries) as follows:
|
|||
|
|
|||
|
FCOS       FRSTOR     FSINCOS    FXTRACT
FDECSTP    FSAVE      FUCOM
FINCSTP    FSETPM     FUCOMP
FPREM1     FSIN       FUCOMPP
|
|||
|
|
|||
|
Additionally, some parts of the coprocessor architecture, like the status
|
|||
|
register, are often emulated only partially or not at all. Some emulators do not
|
|||
|
conform to the IEEE-754 standard in their implementation of the basic
|
|||
|
arithmetic functions, while the hardware coprocessors do. Also, they
|
|||
|
sometimes lack the support for denormals (a special class of floating-point
|
|||
|
numbers) although it is required by the standard. Not all the 80x87 emulators
|
|||
|
support rounding control and precision control, also features required by
|
|||
|
IEEE-754. Most of these omissions are aimed at making the emulator faster and
|
|||
|
smaller. Because of the performance gap and these other shortcomings of
|
|||
|
coprocessor emulators, a real coprocessor is a must for anybody planning to
|
|||
|
do some serious computations. (At today's prices, this shouldn't pose much of
|
|||
|
a problem to anybody!)
|
|||
|
|
|||
|
Nhuan Doduc (ndoduc@framentec.fr) has tested a number of standalone
|
|||
|
coprocessor emulators for PCs, among them the two emulators, EM87 and
|
|||
|
Franke387 V2.4, already mentioned. He found Franke387 to be the best in terms
|
|||
|
of reliability, speed, and accuracy.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
=============================
|
|||
|
Installing a math coprocessor
|
|||
|
=============================
|
|||
|
|
|||
|
Usually, installing a coprocessor doesn't pose much of a problem, as every
|
|||
|
coprocessor comes with installation instructions and a diagnostic disk that
|
|||
|
lets you check its correct operation after installation. In addition, the
|
|||
|
user manuals of most computers have a section on coprocessor installation.
|
|||
|
|
|||
|
1) Make sure to buy the right coprocessor for your system. An 8087 works
|
|||
|
together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or
|
|||
|
compatible works with an 80286 CPU. (There are also some old 386
|
|||
|
motherboards that accept an 80287 coprocessor, but they usually also
|
|||
|
provide a socket for the 387; given today's pricing, it makes no sense
|
|||
|
not to get a 387 for these systems.) An 80387, 387DX or compatible
|
|||
|
coprocessor is for 386-based systems, as is the Intel RapidCAD. 387
|
|||
|
coprocessors also work with the Cyrix 486DLC CPU (which, despite its
|
|||
|
name, does not include an FPU). Similarly, the 387SX or compatible
|
|||
|
coprocessor goes into systems whose CPU is a 386SX or Cyrix 486SLC.
|
|||
|
|
|||
|
The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC
|
|||
|
socket in the system; this is *not* the same socket used by an 80387 or
|
|||
|
compatible chip, and some computers, such as IBM's PS/2s, don't have
|
|||
|
this socket. The Weitek Abacus 4167 works together with the 486 and
|
|||
|
requires a special 142-pin socket to be present.
|
|||
|
|
|||
|
2) Always install a coprocessor that's rated at the same clock speed as the
|
|||
|
CPU. For example, in a 40 MHz 386 system using an AMD Am386-40, install
|
|||
|
a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, C&T 38700DX-40,
|
|||
|
IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its specified
|
|||
|
frequency rating may cause it to produce false results, which you might
|
|||
|
fail to recognize as such. (I have personally experienced this problem
|
|||
|
with a Cyrix 83D87-33 that I tried to push to 40 MHz. It passed all the
|
|||
|
diagnostic benchmarks on the Cyrix diagnostic disk and the tests of some
|
|||
|
commercial system test programs. However, I found it to fail the
|
|||
|
Whetstone and Linpack benchmarks, which include accuracy checks.)
|
|||
|
Although there is usually no problem with overheating when pushing a
|
|||
|
coprocessor over the specified maximum frequency rating, be warned that
|
|||
|
operation of a coprocessor above the maximum ratings stated by the
|
|||
|
manufacturer may make its operation unreliable.
|
|||
|
|
|||
|
Some 386 boards allow the coprocessor to be clocked differently than the
|
|||
|
CPU. This is called "asynchronous operation" and allows you, for
|
|||
|
example, to run the coprocessor at 33 MHz while the CPU runs at 40 MHz.
|
|||
|
Of the currently available math coprocessors, only the Intel 80387 and
|
|||
|
387DX support asynchronous operation. The 387-compatible "clones" from
|
|||
|
Cyrix, C&T, IIT and ULSI always run at the full speed of the CPU, even
|
|||
|
if you have set up your motherboard for asynchronous operation.
|
|||
|
|
|||
|
3) Once you've got the correct coprocessor for your system you can start
|
|||
|
the actual installation process. Turn off the computer's power switch
|
|||
|
and unplug the power cord from the wall outlet, remove the case, and
|
|||
|
locate the math coprocessor socket. This socket is always located right
|
|||
|
next to the main CPU, which can be identified by the printing on top of
|
|||
|
the chip. (It's also usually one of the biggest chips on the board.) The
|
|||
|
8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on
|
|||
|
each of the longer sides. The 387SX PLCC socket is a square socket that
|
|||
|
has 17 vertical connector strips on the 'wall' of each side. The 387 PGA
|
|||
|
socket is square and has two rows of pin holes on each side. The EMC
|
|||
|
socket for the Weitek 3167 is similar but has three rows of holes on
|
|||
|
each side. The PGA socket for the Weitek 4167 is also square with three
|
|||
|
rows of holes on each side. If you can't find the math coprocessor
|
|||
|
socket, consult your owner's manual, your computer dealer, or a
|
|||
|
knowledgeable friend.
|
|||
|
|
|||
|
If you are installing the Intel RapidCAD chipset in a 386 system, you
|
|||
|
will have to remove the 386 CPU first. Intel provides an easy-to-use
|
|||
|
chip extractor and a storage box for the 386 chip for this purpose. Just
|
|||
|
follow the instructions in the RapidCAD installation manual.
|
|||
|
|
|||
|
On many systems, the motherboard is supported only at a small number of
|
|||
|
points. Since considerable force is required to insert a pin grid chip
|
|||
|
like the 80387, RapidCAD, or Weitek Abacus 3167 into its socket, the
|
|||
|
board may bend quite a lot due to the insertion pressure. This could
|
|||
|
cause cracks in the board's conductive traces that may render it
|
|||
|
intermittently or completely inoperable. Damage done to the board in
|
|||
|
this way is usually not covered by the computer's warranty! Therefore,
|
|||
|
it may be a good idea to first check how much the board bends by
|
|||
|
pressing on the math coprocessor socket with your finger. If you find it
|
|||
|
to bend easily, try to put something under the board directly beneath
|
|||
|
the coprocessor socket. If this is impossible, as it is in many desktop
|
|||
|
cases, consider removing the whole motherboard from the case, and
|
|||
|
placing it on a hard, flat surface free of static electricity. (You will
|
|||
|
also have to do this if your system's CPU and coprocessor socket are on
|
|||
|
a separate card rather than on the motherboard, as is typical in many
|
|||
|
modular systems.)
|
|||
|
|
|||
|
Be sure you are properly grounded before you remove the coprocessor from
|
|||
|
its antistatic box, as even a tiny jolt of static electricity can ruin
|
|||
|
the coprocessor. Make sure you do not touch the pins on the bottom of
|
|||
|
the chip.
|
|||
|
|
|||
|
Check the pins and make sure none are bent; if some are, you can
|
|||
|
*carefully* straighten them with needle-nose pliers or tweezers.
|
|||
|
|
|||
|
4) Match the coprocessor's orientation with the orientation of the socket.
|
|||
|
Correct orientation of the coprocessor is absolutely essential, because
|
|||
|
if you insert it the wrong way it may be damaged.
|
|||
|
|
|||
|
8087 and 287 coprocessors have a notch on one of the shorter sides of their
|
|||
|
rectangular DIL package that should be matched with the notch of the
|
|||
|
coprocessor socket. Usually the 286 CPU and the 287 coprocessor are
|
|||
|
placed alongside each other and both have the same orientation (that
|
|||
|
is, their respective notches point in the same direction). 387SX
|
|||
|
coprocessors feature a white dot or similar mark that matches with some
|
|||
|
sort of marking on the socket. 387 coprocessors have a bevelled corner
|
|||
|
that is also marked with a white dot or similar marking. This should be
|
|||
|
matched with the bevelled or otherwise marked corner of the socket. If
|
|||
|
your system has only a large EMC socket and you are installing a 387 in
|
|||
|
it, you will leave one row of pin holes free on each side of the chip.
|
|||
|
|
|||
|
Once you have found the correct orientation, place the chip over the
|
|||
|
socket and make sure all pins are correctly aligned with their
|
|||
|
respective holes. Press firmly and evenly on the chip -- you may have to
|
|||
|
press hard to seat the coprocessor all the way. Again, make sure your
|
|||
|
motherboard does not bend more than slightly under the insertion
|
|||
|
pressure. For 8087, 287, and 387 coprocessors it is normal that the
|
|||
|
coprocessor does not go all the way in; about one millimeter (1/25 inch)
|
|||
|
of space is usually left between the socket and the bottom of the
|
|||
|
coprocessor chip. (This allows the insertion of an extraction device
|
|||
|
should it become necessary to remove the chip. Note that the
|
|||
|
construction of the 387SX's PLCC socket makes it next-to-impossible to
|
|||
|
remove the coprocessor once fully inserted, as the top of the chip is
|
|||
|
level with the socket's 'walls'.)
|
|||
|
|
|||
|
5) Check your computer's manual for the proper position of any jumpers or
|
|||
|
switches that need to be set to tell the system it now has a coprocessor
|
|||
|
(and possibly, which kind it has). Put the cover back on the system
|
|||
|
unit, reconnect the power, and turn on your computer. Depending on your
|
|||
|
system's BIOS, you may now have to run a setup or configuration program
|
|||
|
to enable the coprocessor. Finally, run the programs supplied on the
|
|||
|
diagnostic disk (included with your coprocessor) to check for its
|
|||
|
correct operation.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
=================================================================
|
|||
|
Descriptions of available coprocessors, CPU+FPU (as of 01-11-93):
|
|||
|
=================================================================
|
|||
|
|
|||
|
Intel 8087
|
|||
|
|
|||
|
[43] This was the first coprocessor that Intel made available for the
|
|||
|
80x86 family. It was introduced in 1980 and therefore does not have full
|
|||
|
compatibility with the IEEE-754 standard for floating-point arithmetic
|
|||
|
(which was finally released in 1985). It complements the 8088 and 8086
|
|||
|
CPUs and can also be interfaced to the 80188 and 80186 processors.
|
|||
|
|
|||
|
The 8087 is implemented using NMOS. It comes in a 40-pin CERDIP (ceramic
|
|||
|
dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10
|
|||
|
MHz (8087-1) versions. Power consumption is rated at max. 2400 mW [42].
|
|||
|
|
|||
|
A neat trick to enhance the processing power of the 8087 for
|
|||
|
computations that use only the basic arithmetic operations (+,-,*,/) and
|
|||
|
do not require high precision is to set the precision control to single-
|
|||
|
precision. This gives one a performance increase of up to 20%. For
|
|||
|
details about programming the precision control, see program PCtrl in
|
|||
|
appendix A.
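
For orientation, the precision control (PC) field occupies bits 8 and 9 of
the coprocessor control word: 00 selects single-precision (24-bit
mantissa), 10 double-precision (53-bit), and 11 double extended precision
(64-bit). The following Python sketch only shows how the field is
manipulated in a 16-bit control word value; it is not the PCtrl program
from appendix A, and the helper names are made up. On the actual hardware
the word would be read with FSTCW and written back with FLDCW.

```python
PC_SHIFT = 8               # PC field starts at bit 8
PC_MASK = 0b11 << PC_SHIFT  # bits 8 and 9

PRECISION = {
    "single": 0b00,    # 24-bit mantissa
    "double": 0b10,    # 53-bit mantissa
    "extended": 0b11,  # 64-bit mantissa
}


def set_precision_control(cw, precision):
    """Return control word cw with the PC field set as requested."""
    return (cw & ~PC_MASK & 0xFFFF) | (PRECISION[precision] << PC_SHIFT)


def get_precision_control(cw):
    """Return the name of the precision selected in control word cw."""
    field = (cw & PC_MASK) >> PC_SHIFT
    for name, bits in PRECISION.items():
        if bits == field:
            return name
    return "reserved"


cw = 0x037F  # example value with extended precision selected
print(hex(set_precision_control(cw, "single")))  # 0x7f
```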
|
|||
|
|
|||
|
With the help of an additional chip, the 8087 can in theory be
|
|||
|
interfaced to an 80186 CPU [36]. The 80186 was used in some PCs (e.g.
|
|||
|
from Philips, Siemens) in the 1982/1983 time frame, but with IBM's
|
|||
|
introduction of the 80286-based AT in 1984, it soon lost all
|
|||
|
significance for the PC market.
|
|||
|
|
|||
|
|
|||
|
Intel 80187
|
|||
|
|
|||
|
The 80187 is a rather new coprocessor designed to support the 80C186
|
|||
|
embedded controller (a CMOS version of the 80186 CPU; see above). It was
|
|||
|
introduced in 1989 and implements the complete 80387 instruction set. It
|
|||
|
is available in a 40 pin CERDIP (ceramic dual inline package) and a 44
|
|||
|
pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation.
|
|||
|
Power consumption is rated at max. 675 mW for the 12.5 MHz version and
|
|||
|
max. 780 mW for the 16 MHz version [37].
|
|||
|
|
|||
|
|
|||
|
Intel 80287
|
|||
|
|
|||
|
[44] This is the original Intel coprocessor for the 80286, introduced in
|
|||
|
1983. It uses the same internal execution unit as the 8087 and therefore
|
|||
|
has the same speed (actually, it is sometimes slower due to additional
|
|||
|
overhead in CPU-coprocessor communication). As with the 8087, it does
|
|||
|
not provide full compatibility with the IEEE-754 floating point standard
|
|||
|
released in 1985.
|
|||
|
|
|||
|
The 80287 was manufactured in NMOS technology, and is packaged in a 40-
|
|||
|
pin CERDIP (ceramic dual inline package). There are 6 MHz, 8 MHz, and 10
|
|||
|
MHz versions. Power consumption can be estimated to be the same as that
|
|||
|
for the 8087, which is 2400 mW max.
|
|||
|
|
|||
|
The 80287 has been replaced in the Intel 80x87 family with its faster
|
|||
|
successor, the CMOS-based Intel 287XL, which was introduced in 1990 (see
|
|||
|
below). There may still be a few of the old 80287 chips on the market,
|
|||
|
however.
|
|||
|
|
|||
|
|
|||
|
Intel 80287XL
|
|||
|
|
|||
|
This chip is Intel's second-generation 287, first introduced in 1990.
|
|||
|
Since it is based on the 80387 coprocessor core, it features full IEEE
|
|||
|
754 compatibility and faster instruction execution. Intel claims about
|
|||
|
50% faster operation than the 80287 for typical benchmark tests such as
|
|||
|
Whetstone [45]. Comparison with benchmark results for the AMD 80C287,
|
|||
|
which is identical to the Intel 80287, supports this claim [1]: The Intel
|
|||
|
287XL performed 66% faster than the AMD 80C287 on a fractal benchmark
|
|||
|
and 66% faster on the Whetstone benchmark in these tests. Whetstone
|
|||
|
results from [46] show the Intel 287XL at 12.5 MHz to perform 552
|
|||
|
kWhets/sec as opposed to the AMD 80C287's 289 kWhets/sec, a 91%
|
|||
|
performance increase. A benchmark using the MathPak program showed the
|
|||
|
Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0
|
|||
|
sec.) [26]. Since the 287XL has all the additional instructions and
|
|||
|
enhancements of a 387, most software automatically identifies it as an
|
|||
|
80387-compatible coprocessor and therefore can make use of extra 387-
|
|||
|
only features, such as the FSIN and FCOS instructions.
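
The percentages quoted above follow directly from the raw figures. A small
Python sketch of the arithmetic (function names are illustrative only):
for throughput figures such as kWhets/sec, the speedup is the ratio of new
to old rate, while for run times it is the ratio of old to new time.

```python
def speedup_from_rates(new_rate, old_rate):
    """Percent speedup when higher numbers are better (e.g. kWhets/sec)."""
    return (new_rate / old_rate - 1.0) * 100.0


def speedup_from_times(new_time, old_time):
    """Percent speedup when lower numbers are better (e.g. seconds)."""
    return (old_time / new_time - 1.0) * 100.0


# Whetstone: Intel 287XL (552 kWhets/sec) vs. AMD 80C287 (289 kWhets/sec)
print(round(speedup_from_rates(552, 289)))   # 91

# MathPak: Intel 287XL (6.9 sec) vs. Intel 80287 (11.0 sec)
print(round(speedup_from_times(6.9, 11.0)))  # 59
```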
|
|||
|
|
|||
|
The 287XL is manufactured in CMOS and therefore uses much less power
|
|||
|
than the older NMOS-based 80287. At 12.5 MHz, the power consumption is
|
|||
|
rated at max. 675 mW, about 1/4 of the 80287 power consumption. The
|
|||
|
287XL is available in either a 40-pin CERDIP (ceramic dual inline
|
|||
|
package) or a 44 pin PLCC (plastic leaded chip carrier). (This latter
|
|||
|
version is called the 287XLT and intended mainly for laptop use.) The
|
|||
|
287XL is rated for speeds of up to 12.5 MHz.
|
|||
|
|
|||
|
|
|||
|
AMD 80C287
|
|||
|
|
|||
|
This chip, manufactured by Advanced Micro Devices (AMD), is an exact
|
|||
|
clone of the old Intel 80287, and was first brought to market by AMD in
|
|||
|
1989. It contains the original microcode of the 80287 and is therefore
|
|||
|
100% compatible with it. However, as the name indicates, the 80C287 is
|
|||
|
manufactured in CMOS and therefore uses less power than an equivalent
|
|||
|
Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW
|
|||
|
or slightly less than that of the Intel 80287XL [27]. There is also
|
|||
|
another version called AMD 80EC287 that uses an 'intelligent' power save
|
|||
|
feature to reduce the power consumption below 80C287 levels. Tests at
|
|||
|
10.7 MHz show typical power consumption for the 80EC287 to be at 30 mW,
|
|||
|
compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and
|
|||
|
1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally
|
|||
|
suited for low power laptop systems.
|
|||
|
|
|||
|
The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. (I have
|
|||
|
only seen it being offered in 10 MHz and 12 MHz versions, however.) At
|
|||
|
about US$ 50, it is currently the cheapest coprocessor available. Note
|
|||
|
that it provides less performance than the newer Intel 287XL (see
|
|||
|
above). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs
|
|||
|
(dual inline package) and as 44 pin PLCC (plastic leaded chip carrier).
|
|||
|
|
|||
|
Due to recent legal battles with Intel over the right to use the 287
|
|||
|
microcode, which AMD lost, AMD may have to discontinue this product
|
|||
|
(disclaimer: I am not a legal expert).
|
|||
|
|
|||
|
|
|||
|
Cyrix 82S87
|
|||
|
|
|||
|
This 80287-compatible chip was developed from the Cyrix 83D87 (Cyrix's
|
|||
|
80387 'clone') and has been available since 1991. It complies completely
|
|||
|
with the IEEE-754 standard for floating-point arithmetic and features
|
|||
|
nearly total compatibility with Intel's coprocessors, including
|
|||
|
implementation of the full Intel 80387 instruction set. It implements
|
|||
|
the transcendental functions with the same degree of accuracy and the
|
|||
|
superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the
|
|||
|
fastest [1] and most accurate 287 compatible coprocessor available.
|
|||
|
Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5
|
|||
|
MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87
|
|||
|
chips manufactured after 1991 use the internals of the Cyrix 387+, which
|
|||
|
succeeds the original 83D87 [73].
|
|||
|
|
|||
|
The 82S87 is a fully static CMOS design with very low power requirements
|
|||
|
that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the
|
|||
|
82S87 to consume about the same amount of power as the AMD 80C287 (see
|
|||
|
above). The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded
|
|||
|
chip carrier) compatible with the pinout of the Intel 287XLT and
|
|||
|
ideally suited for laptop use.
|
|||
|
|
|||
|
|
|||
|
IIT 2C87
|
|||
|
|
|||
|
This chip was the first 80287 clone available, introduced to the market
|
|||
|
in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87
|
|||
|
implements the full 80387 instruction set [38]. Tests I ran on the 3C87
|
|||
|
seem to indicate that it is not fully compatible with the IEEE-754
|
|||
|
standard for floating-point arithmetic (see below for details), so it
|
|||
|
can be assumed that the 2C87 also fails these tests (as it presumably
|
|||
|
uses the same core as the 3C87).
|
|||
|
|
|||
|
The IIT 2C87 provides extra functions not available on any other 287
|
|||
|
chip [38]. It has 24 user-accessible floating-point registers organized
|
|||
|
into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2)
|
|||
|
allow switching from one bank to another. (Transfers between registers
|
|||
|
in different banks are not supported, however, so this feature by itself
|
|||
|
is of limited usefulness. Also, there seems to be only one status
|
|||
|
register (containing the stack top pointer), so it has to be manually
|
|||
|
loaded and stored when switching between banks with a different number
|
|||
|
of registers in use [40].) The register banks' main purpose is to aid
|
|||
|
the fourth additional instruction the 2C87 has (F4X4), which does a full
|
|||
|
multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-
|
|||
|
graphics applications [39]. The built-in matrix multiply speeds this
|
|||
|
operation up by a factor of 6 to 8 when compared to a programmed
|
|||
|
solution according to the manufacturer [38]. Tests show the speed-up to
|
|||
|
be indeed in this range [40]. For the 3C87, I measured the execution
|
|||
|
time of F4X4 to be about 280 clock cycles; the execution time on the
|
|||
|
2C87 should be somewhat larger - I estimate it to be around 310 clock
|
|||
|
cycles due to the higher CPU-NDP communication overhead in instruction
|
|||
|
execution in 286/287 systems (~45-50 clock cycles) compared with 386/387
|
|||
|
systems (~16-20 clock cycles). As desirable as the F4X4 instruction may
|
|||
|
seem, however, there are very few applications that make use of it when
|
|||
|
an IIT coprocessor is detected at run time (among them Schroff
|
|||
|
Development's Silver Screen and Evolution Computing's Fast-CAD 3-D
|
|||
|
[25]).
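
For readers unfamiliar with the operation, here is what a multiply of a 4x4
matrix by a 4x1 vector computes, written as a plain "programmed solution"
in Python. This shows only the mathematics (16 multiplications and 12
additions); it says nothing about how the operands are laid out in the IIT
register banks, which is not covered here. The function name and the
sample data are made up for the example.

```python
def mat4_mul_vec4(m, v):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-element vector."""
    return [sum(m[row][col] * v[col] for col in range(4))
            for row in range(4)]


# A typical 3D-graphics use: translating the point (1, 2, 3) by
# (10, 20, 30) using homogeneous coordinates.
translate = [
    [1, 0, 0, 10],
    [0, 1, 0, 20],
    [0, 0, 1, 30],
    [0, 0, 0, 1],
]
point = [1, 2, 3, 1]

print(mat4_mul_vec4(translate, point))  # [11, 22, 33, 1]
```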
|
|||
|
|
|||
|
The 2C87 is available for speeds of up to 20 MHz. It is implemented in
|
|||
|
an advanced CMOS process and has therefore a low power consumption of
|
|||
|
typically about 500 mW [38].
|
|||
|
|
|||
|
|
|||
|
Intel 80387
|
|||
|
|
|||
|
This chip was the first generation of coprocessors designed specifically
|
|||
|
for the Intel 80386 CPU. It was introduced in 1986, about one year after
|
|||
|
the 80386 was brought to market. Early 386 systems were therefore
|
|||
|
equipped with both an 80287 and an 80387 socket. The 80386 does work with
|
|||
|
an 80287, but the numerical performance is hardly adequate for such a
|
|||
|
system.
|
|||
|
|
|||
|
The 80387 has itself since been superseded by the Intel 387DX, which was
quietly introduced in 1989 (see below). You might find it when acquiring
|
|||
|
an older 386 machine, though. The old 80387 is about 20% slower than the
|
|||
|
newer 387DX.
|
|||
|
|
|||
|
The 80387 is packaged in a 68-pin ceramic PGA, and was manufactured
|
|||
|
using Intel's older 1.5 micron CHMOS III technology, giving it moderate
|
|||
|
power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW
|
|||
|
typical), at 20 MHz max. 1550 mW (950 mW typical), and at 25 MHz max.
|
|||
|
1950 mW (1250 mW typical) [60].
|
|||
|
|
|||
|
|
|||
|
Intel 387DX
|
|||
|
|
|||
|
The 387DX is the second-generation Intel 387; it was quietly introduced
|
|||
|
to replace the original 80387 in 1989. This version is done in a more
|
|||
|
advanced CMOS process which enables the coprocessor to run at a maximum
|
|||
|
frequency of 33 MHz (the 80387 was limited to a maximum frequency of 25
|
|||
|
MHz). The 387DX is also about 20% faster than the 80387 on the average
|
|||
|
for the same clock frequency. For a 386/387 system operating at 29 MHz
|
|||
|
the Whetstone benchmark (compiled with the highly optimizing Metaware
|
|||
|
High-C V1.6) runs at 2377 kWhetstones/sec for the 80387 and at 2693
|
|||
|
kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation
|
|||
|
programmed in assembly language, the 387DX performance was 28% higher
|
|||
|
than the performance of the 80387. The transcendental functions have
|
|||
|
also been sped up from the 80387 to the 387DX. In the Savage benchmark
|
|||
|
(again, compiled with Metaware High-C V1.6 and running on a 29 MHz
|
|||
|
system), the 80387 evaluated 77600 function calls/second, while the
|
|||
|
387DX evaluated 97800 function calls/second, a 26% increase [7]. Some
|
|||
|
instructions have been sped up a lot more than the average 20%. For
|
|||
|
example, the performance of the FBSTP instruction has increased by a
|
|||
|
factor of 3.64.
|
|||
|
|
|||
|
The Intel 387DX (and its predecessor 80387) are the only 387
|
|||
|
coprocessors that support asynchronous operation of CPU and coprocessor.
|
|||
|
The 387 consists of a bus interface unit and a numerical execution unit.
|
|||
|
The bus interface unit always runs at the speed of the CPU clock
|
|||
|
(CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc,
|
|||
|
the numerical execution unit runs at the same speed as the bus interface
|
|||
|
unit. If CKM is tied to ground, the numerical execution unit runs at the
|
|||
|
speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor
|
|||
|
clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10.
|
|||
|
For example, for a 20 MHz 386, the Intel 387DX could be clocked from
|
|||
|
12.5 MHz to 28 MHz via the NUMCLK2 input. (On the Cyrix 83D87, Cyrix
|
|||
|
387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These
|
|||
|
coprocessors are therefore not capable of asynchronous operation and
|
|||
|
always run at the speed of the CPU.)
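
The permitted clock range for asynchronous operation follows from the two
ratio limits given above. A short Python sketch of the arithmetic, using
exact fractions to avoid rounding surprises (the function names are
illustrative, and clock rates are given as the nominal CPU and coprocessor
frequencies in MHz):

```python
from fractions import Fraction

MIN_RATIO = Fraction(10, 16)  # lowest allowed NUMCLK2 : CPUCLK2
MAX_RATIO = Fraction(14, 10)  # highest allowed NUMCLK2 : CPUCLK2


def numclk_range_mhz(cpu_mhz):
    """Return (lowest, highest) coprocessor clock for a given CPU clock."""
    cpu = Fraction(str(cpu_mhz))
    return float(cpu * MIN_RATIO), float(cpu * MAX_RATIO)


def async_clock_ok(cpu_mhz, fpu_mhz):
    """True if the clock pair lies within the allowed ratio range."""
    ratio = Fraction(str(fpu_mhz)) / Fraction(str(cpu_mhz))
    return MIN_RATIO <= ratio <= MAX_RATIO


print(numclk_range_mhz(20))    # (12.5, 28.0)
print(async_clock_ok(40, 33))  # True
```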
|
|||
|
|
|||
|
The Intel 387DX is manufactured using Intel's advanced low power CHMOS
|
|||
|
IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW
|
|||
|
typical), at 25 MHz max. 1050 mW (625 mW typical), and at 33 MHz max.
|
|||
|
1250 mW (750 mW typical) [59].
|
|||
|
|
|||
|
|
|||
|
Intel 387SX
|
|||
|
|
|||
|
This is the coprocessor paired with the Intel 386SX CPU. The 386SX is an
|
|||
|
Intel 80386 with a 16-bit, rather than 32-bit, data path. This reduces
|
|||
|
(somewhat) the costs to build a 386SX system as compared to a full 32-
|
|||
|
bit design required by a 386DX. (The 386SX's main *marketing* purpose
|
|||
|
was to replace the 80286 CPU, which was being sold more cheaply by other
|
|||
|
manufacturers [such as AMD], and which Intel subsequently stopped
|
|||
|
producing.) Due to the 16-bit data path, the 386SX is slower than the
|
|||
|
386DX and offers about the same speed as an 80286 at the same clock
|
|||
|
frequency for 16-bit applications. But as the 386SX is a complete 80386
|
|||
|
internally, it also offers the possibility of running 32-bit applications
|
|||
|
and supports the virtual 8086 mode (used for example by Windows' 386
|
|||
|
enhanced mode).
|
|||
|
|
|||
|
The 387SX has all the features of the Intel 80387, including support for
asynchronous operation of CPU and coprocessor (see Intel 387DX
|
|||
|
information, above). Due to the 16 bit data path between the CPU and the
|
|||
|
coprocessor, the 387SX is a bit slower than an 80387 operating at the
|
|||
|
same frequency. In addition, the 387SX is based on the core of the
|
|||
|
original 80387, which executes instructions slower than the second
|
|||
|
generation 387DX.
|
|||
|
|
|||
|
The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package
|
|||
|
and is available in 16 MHz and 20 MHz versions. (Coprocessors for faster
|
|||
|
386SX systems based on the Am386SX CPU are available from IIT, Cyrix,
|
|||
|
and ULSI.) Power consumption for the 387SX at 16 MHz is max. 1250 mW
|
|||
|
(740 mW typical); for the 20 MHz version it is max. 1500 mW (1000 mW
|
|||
|
typical) [62].
|
|||
|
|
|||
|
|
|||
|
Intel 387SL

This coprocessor is designed for use in systems that contain an Intel
386SL as the CPU. The 386SL is directly derived from the 386SX. It is a
static CHMOS IV design with very low power requirements that is intended
to be used in notebook and laptop computers. It features an integrated
cache controller, a programmable memory controller, and hardware support
for expanded memory according to the LIM EMS 4.0 standard. The 387SL,
introduced in early 1992, has been designed to accompany the 386SL in
machines with low power consumption and to replace the 387SX for this
purpose. It features advanced power-saving mechanisms. It is based on
the 387DX core, rather than on the older and slower 80387 core (which is
used by the 387SX).


IIT 3C87

This IIT chip was introduced in 1989, about the same time as the Cyrix
83D87. Both coprocessors are faster than Intel's 387DX coprocessor. The
IIT 3C87 also provides extra functions not available on any other 387
chip [38]. It has 24 user-accessible floating-point registers organized
into three register banks. Three additional instructions (FSBP0, FSBP1,
FSBP2) allow switching from one bank to another. (Transfers between
registers in different banks are not supported, however, so this feature
by itself is of limited usefulness. Also, there seems to be only one
status register [containing the stack top pointer], so it has to be
manually loaded and stored when switching between banks with a different
number of registers in use [40].) The register banks' main purpose is to
aid the fourth additional instruction the 3C87 has (F4X4), which does a
full multiply of a 4x4 matrix by a 4x1 vector, an operation common in
3D-graphics applications [39]. According to the manufacturer, the
built-in matrix multiply speeds this operation up by a factor of 6 to 8
compared to a programmed solution [38], and tests show the speed-up to
be indeed in this range [40]. I measured the F4X4 to execute in about
280 clock cycles, during which time it executes 16 multiplications and
12 additions. Elsewhere, IIT states that the built-in matrix multiply
speeds up the matrix-by-vector multiply by a factor of 3 compared with a
programmed solution [39]. The results for my own TRNSFORM benchmark
support the latter claim (see results below), showing a performance
increase by a factor of about 2.5. This makes matrix multiplies on the
IIT 3C87 nearly as fast as on an Intel 486 at the same clock frequency.
As desirable as the F4X4 instruction may seem, however, there are very
few applications that make use of it when an IIT coprocessor is detected
at run time (among them Schroff Development's Silver Screen and
Evolution Computing's FastCAD 3D [25]).
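
To make the operation count concrete, here is a minimal Python sketch of
the 4x4 matrix by 4x1 vector product that F4X4 is described as
performing. It is an illustration only: the function name mat4_vec4 and
the operation counters are hypothetical, and nothing here reflects how
the 3C87 organizes the computation internally.

```python
def mat4_vec4(m, v):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-element vector.

    Returns the result vector together with the number of
    multiplications and additions that were performed.
    """
    muls = adds = 0
    result = []
    for row in m:
        acc = row[0] * v[0]          # first product of the row
        muls += 1
        for a, x in zip(row[1:], v[1:]):
            acc += a * x             # three more products, three adds
            muls += 1
            adds += 1
        result.append(acc)
    return result, muls, adds


# A translation by (10, 20, 30) in homogeneous coordinates, the kind of
# transform that is common in 3D graphics.
translate = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]
point = [1.0, 2.0, 3.0, 1.0]
moved, muls, adds = mat4_vec4(translate, point)
print(moved, muls, adds)
```

The counters come out as 16 multiplications and 12 additions, matching
the figures quoted above. At about 280 clock cycles for these 28
arithmetic operations, F4X4 averages roughly 10 clocks per operation.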

These IIT-specific instructions also work correctly when using a Chips &
Technologies 38600DX or a Cyrix 486DLC CPU, which are both marketed as
faster replacements for the Intel 386DX CPU.

Tests I ran with the IEEETEST program show that the 3C87 is not fully
compatible with the IEEE-754 standard for floating-point arithmetic,
although the manufacturer claims otherwise. It is indeed possible that
the reported errors are due to personal interpretations of the standard
by the program's author that have been incorporated into IEEETEST and
that the standard also supports the different interpretation chosen by
IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST
have become something of an industry standard [66], and Intel's 387,
486, and RapidCAD chips pass the test without a single failure, so the
fact that the IIT 3C87 fails some of the tests indicates that it is not
fully compatible with the Intel 387 coprocessor. My tests also show that
the IIT 3C87 does not support denormals for the double extended format.
It is not entirely clear whether the IEEE standard mandates support for
extended-precision denormals, as the IEEE-754 document explicitly
mentions only single- and double-precision denormals. Missing support
for denormals is not a critical issue for most applications, but there
are some programs for which support of denormals is at the very least
quite helpful [41]. In any case, failure of the 3C87 to support
extended-precision denormal numbers does represent an incompatibility
with the Intel 387 and 486 chips.
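
What support for denormals provides can be illustrated with a short
Python sketch. It uses IEEE double precision (Python's float) rather
than the 80-bit double extended format discussed above, since the latter
is not directly accessible from standard Python; the principle of
gradual underflow is the same. The variable names are hypothetical, and
the sketch assumes a platform that does not flush denormals to zero.

```python
import sys

# Smallest positive normalized double (2**-1022).
min_normal = sys.float_info.min

# Two distinct values whose difference is smaller than min_normal.
a = 1.5 * min_normal
b = 1.0 * min_normal
diff = a - b   # 0.5 * min_normal, representable only as a denormal

print(diff > 0.0)         # is the difference still nonzero?
print(diff < min_normal)  # is it below the normalized range?
```

With denormal support, two unequal numbers always have a nonzero
difference. Without it, a result like diff would be flushed to zero even
though a and b differ, which is the kind of case where programs benefit
from denormals.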

The 3C87 is implemented in an advanced CMOS process and has low power
requirements, typically about 600 mW. Like the 387 'clones' from Cyrix
and ULSI, the 3C87 does not support asynchronous operation of the CPU
and the coprocessor, but always runs at the full speed of the CPU. It is
available in 16, 20, 25, 33, and 40 MHz versions.


IIT 3C87SX

This is the version of the IIT 3C87 that is intended for use with
Intel's 386SX or AMD's Am386SX CPU, and is functionally equivalent to
the IIT 3C87. Due to the 16-bit data path between the CPU and the
coprocessor in a 386SX-based system, coprocessor instructions will
execute somewhat more slowly than on the 3C87. At present, the IIT
3C87SX is the only 387SX coprocessor that is offered at speeds of 16,
20, 25, and 33 MHz. (I have read that Cyrix has also announced an
83S87-33, but haven't seen it being offered yet.) The 3C87SX is packaged
in a 68-pin PLCC.


Cyrix FasMath 83D87

This chip was introduced in 1989, only shortly after the coprocessors
from IIT. It has been found to be the fastest 387-compatible coprocessor
in several benchmark comparisons [1,7,68,69]. It also came out as the
fastest coprocessor in my own tests (see benchmark results below).
Although the Cyrix 83D87 provides up to 50% more performance than the
Intel 387DX in benchmark comparisons, the speed advantage over other
387-compatible coprocessors in real applications is usually much
smaller, because coprocessor instructions represent only a small part of
the total application code. For example, in a test using the program
3D-Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1].

Besides being the fastest 387 coprocessor, the 83D87 also offers the
most accurate transcendental function results of all coprocessors
tested (see test results below). The new "387+" version of the 83D87,
available since November 1991, even surpasses the level of accuracy of
the original 83D87 design. Note that the name 387+ is used in European
distribution only. In other parts of the world, the new chip still goes
by the name 83D87.

Unlike Intel's coprocessors, which use the CORDIC [18,19] algorithm to
compute the transcendental functions, Cyrix uses polynomial and rational
approximations to the functions. In the past the CORDIC method has been
popular since it requires only shifts and adds, which made it relatively
easy to implement a reasonably fast algorithm. Recently, the cost of
implementing fast floating-point hardware multipliers has dropped
significantly (due to the availability of VLSI), making the use of
polynomial and rational approximations superior to CORDIC for the
generation of transcendental functions [61]. The Cyrix 83D87 uses a fast
array multiplier, making its transcendental functions faster than those
of any other 387-compatible coprocessor. It also uses 75 bits for the
mantissa in intermediate calculations (as opposed to 68 bits on other
coprocessors), making its transcendental functions more accurate than
those of any other coprocessor or FPU (see results below).
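
The difference between the two approaches can be sketched in Python. The
first function is a textbook rotation-mode CORDIC for sine and cosine,
which needs only shifts, adds, and a small table of arctangents. The
second evaluates a short Taylor polynomial with Horner's rule, which
needs one multiplication per coefficient. Both are illustrative sketches
and not the algorithms actually used in the Intel or Cyrix chips: the
iteration count, the use of a Taylor series (rather than a tuned
polynomial or rational approximation), and all names are assumptions
chosen for the example.

```python
import math

N_ITER = 32  # CORDIC iteration count (an assumption for this sketch)

# Rotation angles atan(2**-i) and the constant that compensates for
# the growth of the vector length over all iterations.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = 1.0
for i in range(N_ITER):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta):
    """Rotation-mode CORDIC for |theta| <= pi/2; returns (sin, cos).

    The multiplications by 2**-i stand in for the right shifts that a
    fixed-point hardware implementation would use.
    """
    x, y, z = GAIN, 0.0, theta
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN_TABLE[i]
    return y, x


# Taylor coefficients of sin(t)/t as a polynomial in t**2, up to t**12.
SIN_COEFFS = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(7)]


def poly_sin(theta):
    """Polynomial approximation of sin, evaluated with Horner's rule."""
    t2 = theta * theta
    acc = 0.0
    for c in reversed(SIN_COEFFS):
        acc = acc * t2 + c   # one multiply and one add per coefficient
    return theta * acc


angle = 0.5
print(cordic_sin_cos(angle)[0], poly_sin(angle), math.sin(angle))
```

The CORDIC loop gains roughly one bit of accuracy per iteration, whereas
the polynomial here needs only seven multiply-add steps. Once a fast
hardware multiplier is available, that difference in step count is what
tips the balance toward polynomial and rational approximations.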

The 83D87 and its successor, the 387+, are the 387 'clones' with the
highest degree of compatibility with the Intel 387DX. A few minor
software and hardware incompatibilities have been documented by Cyrix
[12]. The software differences are caused by some bugs present in the
387DX that Cyrix fixed in the 83D87. Unlike the Intel 387DX, the 83D87
(and all other 387-compatible chips as well) does not support
asynchronous operation of CPU and coprocessor. There were also problems
in the past with the CPU-coprocessor communications, causing the 83D87
to occasionally hang on some machines. The reason behind this was that
Cyrix shaved off a wait state in the communication protocol, which
caused a communications breakdown between the CPU and the 83D87 for some
systems running at 25 MHz or faster. (One notable example of this
behavior was the Intel 302 board.) There were also problems with boards
based on early revisions of the OPTI chipset. These problems are only
rarely encountered with the current generation of 386 motherboards, and
it is possible that they have been entirely eliminated in the 387+, the
successor to the 83D87.

To reduce power consumption, the 83D87 incorporates advanced
power-saving features. Those portions of the coprocessor that are not
needed are automatically shut down. If no coprocessor instructions are
being executed, *all* parts except the bus interface unit are shut down
[12]. Maximum power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW,
while typical power consumption at this clock frequency is 500 mW [15].


Cyrix EMC87

This coprocessor is basically a special version of the Cyrix 83D87,
introduced in 1990. In addition to the normal 387 operating mode, in
which coprocessor-CPU communication is handled through reserved IO
ports, it also offers a memory-mapped mode of operation similar to the
operation principle of the Weitek Abacus. Like the Weitek chip, the
EMC87 occupies a block of memory starting at physical address C0000000h
(the Abacus occupies a memory block of 64 KB, while the EMC87 uses only
4 KB [77]). It can therefore only be accessed in the protected or
virtual modes of the 386 CPU. DOS programs can access the EMC87 with the
help of DOS extenders or memory managers like EMM386 which run in
protected/virtual mode themselves. To implement the memory-mapped
interface, the usual 80x87 architecture has been slightly expanded with
three additional registers and eleven additional instructions that can
only be used if the memory-mapped mode is enabled.

Using this special mode of the EMC87 provides a significant speed
advantage. The traditional 387 CPU-coprocessor interface via IO ports
has an overhead of about 14-20 clock cycles. Since the Cyrix 83D87
executes some operations like addition and multiplication in much less
time, its performance is actually limited by the CPU-coprocessor
interface. Since the memory-mapped mode has much less overhead, it
allows all coprocessor instructions to be executed at full speed with no
penalty.

Originally, Cyrix claimed support for the fast memory-mapped mode of the
EMC87 from a number of software vendors (including Borland and
Microsoft). However, only very few applications make use of it, among
them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP FORTRAN-386
compiler, Metaware's High-C compiler version 1.6 and newer, and
Intusoft's Spice [63,73]. Part of the problem in supporting the
memory-mapped mode is that the application must reserve one of the
general-purpose registers of the CPU to use memory-mapped mode
instructions that access memory.

(Note that the EMC87 is *not* compatible with Weitek's Abacus
coprocessor. They both use the same CPU interface technique [memory
mapping], but while the EMC87 uses the standard 387 instruction set, the
Weitek Abacus coprocessors use an entirely different instruction set of
their own.)

Since the EMC87 also provides the standard 386/387 CPU interface via IO
ports, it can be used just like any other 387-compatible coprocessor and
delivers the same performance as the Cyrix 83D87 in this mode. The EMC87
even allows mixed use of memory-mapped and traditional instructions in
the same code. Cyrix has also implemented some additional instructions
in the EMC87 that are also available in the 387-compatible mode:
FRICHOP, FRINT2, and FRINEAR. These instructions enable rounding to
integer without setting the rounding mode by manipulating the
coprocessor control word, and are intended to make life easier for
compiler writers.
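
Rounding to an integer under a fixed rule, independent of any global
rounding mode, can be illustrated in Python. The text above does not
spell out which rule each instruction applies. The mapping used in the
comments below (FRICHOP: chop toward zero, FRINT2: nearest with halfway
cases away from zero, FRINEAR: nearest with halfway cases to even) is an
assumption based on the mnemonics, and the function names are
hypothetical.

```python
import math


def round_chop(x):
    """Round toward zero (assumed behaviour of FRICHOP)."""
    return float(math.trunc(x))


def round_half_away(x):
    """Round to nearest, ties away from zero (assumed for FRINT2)."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def round_half_even(x):
    """Round to nearest, ties to even (assumed for FRINEAR)."""
    return float(round(x))


for value in (2.5, -2.5, 3.5, 2.7, -2.7):
    print(value, round_chop(value), round_half_away(value),
          round_half_even(value))
```

On a standard 80x87, the rounding rule used by FRNDINT is taken from the
rounding-control field of the control word, so a compiler that needs,
say, a chopped result has to save, modify, and restore the control word
around the operation. Dedicated instructions avoid that sequence.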

In a test, the EMC87 at 33 MHz ran the single-precision Whetstone
benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had a
speed of only 5049 kWhetstones/sec, an increase of 50.6% [63]. In
another test, the EMC87 ran a fractal computation at twice the speed of
the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third
test found the EMC87's overall performance to be 20% higher than the
performance of the Cyrix 83D87 [65].

The Cyrix FasMath EMC87 has also been marketed as Cyrix AutoMATH; the
two chips are identical. Unlike the Cyrix 83D87, which fits into the
68-pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and
requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that
not all boards have such a socket (IBM's PS/2s being a notable
exception). The EMC87 is available in 25 and 33 MHz versions. Maximum
power consumption at 33 MHz is 2000 mW.

Cyrix currently appears to be phasing out the EMC87.


Cyrix FasMath 387+

This chip is the second-generation successor to the Cyrix 83D87. (The
name "387+" is only used for European distribution; in other parts of
the world, it goes by the original 83D87 designation.) According to a
source within Cyrix [73], the 387+ was designed to make a smaller (and
thus cheaper to manufacture) coprocessor chip that could also be pushed
to higher frequencies than the original chip: the 387+ is available in
versions of up to 40 MHz, whereas the original 83D87 could go no faster
than 33 MHz.

The Cyrix 387+ is ideally suited to be used with Cyrix's 486DLC CPU,
which is a 486SX-compatible replacement chip for the Intel 386DX.
Indeed, Cyrix sells upgrade kits consisting of a 486DLC CPU and a
Cyrix 387+.

In my tests, I found the Cyrix 387+ to be about 5 to 10 percent
*slower* than the Cyrix 83D87 overall. Some instructions, however, such
as the square root (FSQRT), now run at only half the speed at which they
ran in the 83D87, and most transcendental functions show about a 40%
drop in performance compared to their 83D87 averages (see performance
results, below). On the other hand, I did find the transcendental
functions on the 387+ to be a bit *more* accurate than those implemented
in the 83D87. The new design uses a slower hardware multiplier that
needs six clock cycles to multiply the floating-point mantissa of an
internal precision number, while the multiplier in the 83D87 takes only
four clock cycles to accomplish the same task. Since the transcendental
functions in Cyrix math coprocessors are generated by polynomial and
rational approximations, this slows them down significantly.

The divide/square root logic has also been changed from the 83D87
design. The original design used an algorithm that could generate both
the quotient and square root, so the execution times for these
instructions were nearly identical. The algorithm chosen for the
division in the 387+ doesn't allow the square root to be taken so
easily, so the square root takes nearly twice as long.

In the 387+, the available argument range for the FYL2XP1 instruction
has been extended, from the usual range -1+sqrt(2)/2..sqrt(2)/2 that is
found on all 80x87 coprocessors, to include all floating-point numbers.
Also, four additional instructions have been implemented: FRICHOP
(opcode DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC), and FTSTP
(opcode D9 E6).
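
For orientation, FYL2XP1 computes y * log2(x + 1). A rough Python
equivalent is sketched below using math.log1p, which serves the same
purpose as the instruction: staying accurate for x close to zero, where
forming x + 1 explicitly would lose the small value of x. The function
name fyl2xp1 is a hypothetical stand-in, and the sketch works in double
precision rather than the coprocessor's extended format.

```python
import math


def fyl2xp1(y, x):
    """Return y * log2(x + 1) without forming x + 1 explicitly."""
    return y * math.log1p(x) / math.log(2.0)


x = 1e-18
naive = math.log2(1.0 + x)   # 1.0 + x rounds to 1.0, so x is lost
careful = fyl2xp1(1.0, x)    # keeps the contribution of x
print(naive, careful)
```

The naive form returns zero for such a small x, while the log1p-based
form returns a small positive result close to x / ln(2).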


Cyrix FasMath 83S87

The 83S87 is the SX version of the Cyrix 83D87. Just as the 83D87 is the
fastest 387-compatible coprocessor, the Cyrix 83S87 is the fastest of
the 387SX-compatible coprocessors [1], and it also provides the most
accurate transcendental functions. 83S87 chips manufactured after 1991
use the internals of the Cyrix 387+, the successor to the original 83D87
(see above) [73]. The Cyrix 83S87 is ideally suited to be used with the
Cyrix Cx486SLC CPU, a 486SX-compatible CPU which is a replacement chip
for the Intel 386SX CPU.

The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20, and
25 MHz versions. Due to the advanced power-saving features of the Cyrix
coprocessor, the typical power consumption of the 20 MHz version is only
about 350 mW [67].


ULSI Math*Co 83C87

The ULSI 83C87 is an 80387-compatible coprocessor first introduced in
early 1991, well after the IIT 3C87 and Cyrix 83D87 appeared. Like other
387 clones, it is somewhat faster than the Intel 387DX, particularly in
its basic arithmetic functions. The transcendental functions, however,
show only a slight speed improvement over the Intel 387DX (see benchmark
results below).

In my tests, the ULSI had the least accurate transcendental functions
of all tested coprocessors. However, the maximum relative error is still
within the limits set by Intel, so this is probably an issue for only a
very few applications. The ULSI 83C87 shows some minor flaws in the
tests for IEEE-754 compatibility, but this, too, is probably unimportant
under typical operating conditions. ULSI claims that the program
IEEETEST, which was used to test for IEEE compatibility, contains many
personal interpretations of the IEEE standard by the program's author
and states that there is no ANSI-certified IEEE-754 compliance test.
While this may be true, it is also a fact that the IEEE test vectors
used in IEEETEST are a de facto industry standard, and that Intel's 387,
486, and RapidCAD chips pass the test without a single failure, as do
the coprocessors from Cyrix. Since the ULSI Math*Co 83C87 fails some of
the tests, it is certainly less than 100% compatible with Intel's chips,
although this will likely make little or no difference in typical
operating conditions. (It is interesting to note that a ULSI 83S87
manufactured in 92/17 showed fewer errors in the IEEETEST test run [74]
than the ULSI 83C87, manufactured in 91/48, that I used in my original
test. This indicates that ULSI might have applied some quick fixes to
newer revisions of their math coprocessors.)

The ULSI 83C87 is not compatible with IEEE-754 in that it does not
implement the "precision control" feature. While all the internal
operations of 80x87 coprocessors are usually performed with the maximum
precision available (double-extended precision with 64 mantissa bits),
the 80x87 architecture also offers the possibility to force lower
precision to be used for the basic arithmetic functions (add, subtract,
multiply, divide, and square root). This feature is required by IEEE-754
for all coprocessors that cannot store results *directly* to a single-
or double-precision location. Since 80x87 coprocessors lack this storage
capability, they all implement precision control to provide correctly
rounded single- and double-precision results according to the
floating-point standard; the ULSI chips are the exception. For programs
that make use of precision control (e.g., Interactive UNIX), correct
implementation of the feature may be essential for correct arithmetic
results.
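
The effect of precision control can be imitated in Python by rounding
the result of every basic operation to single precision. This is only a
sketch: to_single is a hypothetical helper that round-trips a value
through the IEEE single format using the struct module, and Python's own
arithmetic is carried out in double precision rather than in the 80x87's
double extended format.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_with_single_rounding(values):
    """Add values, rounding to single precision after every addition,
    as precision control set to single precision would."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total


values = [0.1] * 10
print(sum(values))                       # each step rounded to double
print(sum_with_single_rounding(values))  # each step rounded to single
```

The two sums differ in their trailing digits. A chip that ignores the
precision-control setting always rounds to the full mantissa width, so a
program that relies on the setting to obtain correctly rounded single-
or double-precision results could see different last digits on it.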

Like other non-Intel 387 compatibles, the 83C87 does not support
asynchronous operation of the CPU and the coprocessor. This means that
the 83C87 always runs at the full speed of the CPU. It is available in
20, 25, 33, and 40 MHz versions. The ULSI is produced in low-power CMOS;
power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz
it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625
mW typical), and at 40 MHz it is max. 1500 mW (750 mW typical) [58]. The
83C87 is packaged in a 68-pin ceramic PGA.

ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc.,
will replace the coprocessor up to three times free of charge should it
ever fail to function properly.


ULSI Math*Co 83S87

This chip is the SX version of the ULSI 83C87, for use in systems with
an Intel 386SX or an AMD Am386SX CPU. It is functionally equivalent to
the 83C87. To aid low-power laptop designs, the ULSI 83S87 features an
advanced power-saving design with a sleep mode and a standby mode with
only minimal power requirements. Power consumption under normal
operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW
typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25
MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.


C&T SuperMATH 38700DX

Produced by Chips & Technologies, this is the latest entry into the
387-compatible marketplace. Originally announced in October 1991, it was
apparently not available to end users before the third quarter of 1992,
at least here in Germany. My tests show that its compatibility with
Intel products is very good, even for the more arcane features of the
387DX, and comparable to the coprocessors from Cyrix. Like these chips,
it passes the IEEETEST program without a single failure. It passes, of
course, all tests in Chips & Technologies' own compatibility test
program, SMDIAG. However, some of the tests (the transcendental
functions) in this program are selected in such a way that the C&T 38700
passes while the Cyrix 83D87 or Intel RapidCAD fail, so they are not
very useful. (There is also a 'bug' in the test for FSCALE that hides a
true bug in the C&T 38700.) My tests show that the accuracy of the
transcendental functions on the C&T 38700DX varies. Overall, accuracy of
the transcendentals is slightly better than on the Intel 387DX.

In my own speed tests (see below) and those reported in [1], the C&T
38700DX showed performance at about 90-100% the level of the Cyrix
83D87, which is the 387 clone with the highest performance. For
floating-point-intensive benchmarks, the C&T 38700DX provides up to 50%
more computational performance than the Intel 387DX. However, as with
all other 387-compatible coprocessors, the speed advantage over the
Intel 387DX is far less significant in real applications.

The SuperMATH 38700DX is implemented in 1.2 micron CMOS with on-chip
power management, which makes for low power consumption. The 38700DX is
packaged in a 68-pin ceramic PGA (pin grid array) and is available in
speeds of 16, 20, 25, 33, and 40 MHz.


C&T 38700SX

This chip is the SX version of the 38700DX and is compatible with the
Intel 387SX. It provides performance comparable to a Cyrix 83S87 [1],
the 387SX clone with the highest performance. Compatibility with the
Intel 387SX is very good and on par with the high degree of
compatibility found in the Cyrix 83S87.

The 38700SX has low power consumption. It is packaged in a 68-pin PLCC
(plastic leaded chip carrier) and is available in speeds of 16, 20, and
25 MHz.


Intel RapidCAD

The RapidCAD is not a coprocessor, strictly speaking, although it is
marketed as one. Rather, it is a full replacement for an 80386 CPU:
basically, an Intel 486DX CPU chip without the internal cache and with a
standard 386 pinout. RapidCAD is delivered as a set of two chips.
RapidCAD-1 goes into the 386 socket and contains the CPU and FPU.
RapidCAD-2 goes into the coprocessor (387) socket and contains a simple
PAL whose only purpose is to generate the FERR signal normally generated
by a coprocessor. (This is needed by the motherboard circuitry to
provide 287-compatible coprocessor exception handling in 386/387
systems.) The RapidCAD instruction set is compatible with the 386, so it
doesn't have any newer, 486-specific instructions like BSWAP. However,
since the RapidCAD CPU core is very similar to the 80486 CPU core, most
of the register-to-register instructions execute in the same number of
clock cycles as on the 486.

RapidCAD's use of the standard 386 bus interface causes instructions
that access memory to execute at about the same speed as on the 386. The
integer performance on the RapidCAD is definitely limited by the low
memory bandwidth provided by this interface (2 clock cycles per bus
cycle) and the lack of an internal cache. CPU instructions often execute
faster than they can be fetched from memory, even with a big and fast
external cache. Therefore, the integer performance of the RapidCAD
exceeds that of a 386 by *at most* 35%. This value was derived by
running some programs that use mostly register-to-register operations
and few memory accesses, and is supported by the SPEC ratings that Intel
reports for the 386-33 and the RapidCAD-33: while the 386-33 has a
SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase.
(Note that these tests used the old [1989] SPEC benchmark suite.)

While CPU and integer instructions often execute in one clock cycle on
the RapidCAD, floating-point operations always take more than seven
clock cycles. They are therefore rarely slowed down by the low-bandwidth
386 bus interface. My tests show a 70%-100% performance increase for
floating-point intensive benchmarks over a 386-based system using the
Intel 387DX math coprocessor. This is consistent with the SPECfp rating
reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85%
increase. This means that a system that uses the RapidCAD is faster than
*any* 386/387 combination, regardless of the type of 387 used, whether
an Intel 387DX or a faster 387 clone. The diagnostic disk for the
RapidCAD also gives some application performance data for the RapidCAD
compared to the Intel 387DX:

Application              Time w/ 387DX    Time w/ RapidCAD    Speedup

AutoCAD 11                      52 sec              32 sec        63%
AutoShade/Renderman            180 sec             108 sec        67%
Mathematica (Windows)          139 sec             103 sec        35%
SPSS/PC+ 4.01                   17 sec              14 sec        21%
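
The speedup column can be reproduced from the two timing columns as
(time with 387DX / time with RapidCAD - 1) * 100. The short Python
sketch below does this for the figures in the table; the variable and
function names are hypothetical, and rounding half up to a whole percent
is an assumption about how the published figures were rounded.

```python
import math

# (application, seconds with 387DX, seconds with RapidCAD)
timings = [
    ("AutoCAD 11", 52, 32),
    ("AutoShade/Renderman", 180, 108),
    ("Mathematica (Windows)", 139, 103),
    ("SPSS/PC+ 4.01", 17, 14),
]


def speedup_percent(t_old, t_new):
    """Percentage speedup, rounded half up to a whole percent."""
    return math.floor((t_old / t_new - 1.0) * 100.0 + 0.5)


speedups = {name: speedup_percent(a, b) for name, a, b in timings}
for name, pct in speedups.items():
    print(f"{name:24s}{pct:3d}%")
```

Under these assumptions the computed values agree with the Speedup
column of the table.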

RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed
through different channels than the other Intel math coprocessors, and I
have therefore been unable to obtain a data sheet for it. [78] gives the
typical power consumption of the 33 MHz RapidCAD as 3500 mW, which is
the same as for the 33 MHz 486DX. The RapidCAD-1 chip gets quite hot
when operating. Therefore, I recommend extra cooling for it (see the
paragraph below on the 486 for details). The RapidCAD-1 is packaged in a
132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a
68-pin PGA like an 80387 coprocessor.


Intel 486DX

The Intel 486DX is, of course, not solely a coprocessor. This chip,
first introduced by Intel in 1989, functionally combines the CPU (a
heavily-pipelined implementation of the 386 architecture) with an
enhanced 387 (the chip's floating-point unit, FPU) and 8 KB of unified
on-chip code/data cache. (This description is necessarily simplified;
for a detailed hardware description, see [52].) The 486DX offers about
two to three times the integer performance of a 386 at the same clock
frequency, while floating-point performance is about three to four times
as high as the Intel 387DX at the same clock rate [29]. Since the FPU is
on the same chip as the CPU, the considerable communication overhead
between CPU and coprocessor in a 386/387 system is eliminated, letting
FPU instructions run at the full speed permitted by the implementation.
The FPU also takes advantage of the on-chip cache and the highly
pipelined execution unit. The concurrent execution of CPU and
coprocessor instructions typical for 80x86/80x87 systems still exists on
the 486, but some FPU instructions like FSIN have nearly no concurrency
with CPU instructions, indicating that they make heavy use of both CPU
and FPU resources [53, 1].

Besides its higher performance, the 486 FPU provides more accurate
transcendental functions than the 387DX coprocessor, according to my
tests (see below). To achieve better interrupt latency, FPU instructions
with long execution times have been made abortable if an interrupt
occurs during their execution.

Due to the considerable amount of heat produced by these chips, and
|
|||
|
taking into consideration the slow air flow provided by the fan in
|
|||
|
garden-variety PC tower cases, I recommend an extra fan directly above
|
|||
|
the CPU for safer operation. If you measure the surface temperature of
|
|||
|
an 486DX after some time of operation in a normal tower case without
|
|||
|
extra cooling, you may well come up with something like 80-90 degrees
|
|||
|
Celsius (that is 175-195 degrees Fahrenheit for those not familiar with
|
|||
|
metric units) [54,55]. You don't need the well known (and expensive)
|
|||
|
IceCap[tm] to effectively cool your CPU; a simple fan mounted directly
|
|||
|
above the CPU can bring the temperature of the chip down to about 50-60
|
|||
|
degrees Celsius (120-140 degrees Fahrenheit), depending on the room
|
|||
|
temperature and the temperature within the PC case (which depends on the
|
|||
|
total power dissipation of all the components and the cooling provided
|
|||
|
by the fan in the system's power supply). According to a rule of thumb
derived from the Arrhenius equation, lowering the temperature by 10 degrees
Celsius slows down chemical reactions by roughly a factor of two, so lowering the
|
|||
|
temperature of your CPU by 30 degrees should prolong the life of the
|
|||
|
device by a factor of eight, due to the slower ageing process. If you
|
|||
|
are reluctant to add a fan to your system because of the additional
|
|||
|
noise, settle for a low-noise fan like those available from the German
|
|||
|
manufacturer Papst (this is not meant to be an advertisement; I am just
|
|||
|
the happy owner of such a fan, and have no other connections to the
|
|||
|
firm).
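
The arithmetic behind the rule of thumb above can be sketched in a few lines
of Python. This is only an illustration: it assumes the "factor of two per
10 degrees Celsius" rule holds over the whole temperature range, and the
function names are made up for this example.

```python
def lifetime_factor(temp_drop_celsius, doubling_step=10.0):
    """Estimated factor by which device life is prolonged.

    Assumes (rule of thumb only) that every `doubling_step` degrees
    Celsius of cooling halves the rate of the ageing reactions.
    """
    return 2.0 ** (temp_drop_celsius / doubling_step)


def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


# Cooling the chip from about 85 to about 55 degrees Celsius is a 30 degree drop.
print(lifetime_factor(30.0))          # 8.0
print(celsius_to_fahrenheit(80.0))    # 176.0
print(celsius_to_fahrenheit(90.0))    # 194.0
```

Three halvings of the reaction rate give the factor of eight quoted in the
text; the Fahrenheit figures in the text are the rounded results of the same
conversion.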
|
|||
|
|
|||
|
The 486DX comes in a 168 pin ceramic PGA (pin grid array). It is
|
|||
|
available in 25 MHz and 33 MHz versions. Since the end of 1991, a 50 MHz
|
|||
|
version has also been available, manufactured using the CHMOS V process (the
25 MHz and 33 MHz versions are produced using the CHMOS IV process). Maximum
|
|||
|
power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500
|
|||
|
mW for the 33 MHz version (3500 mW typical), and 5000 mW (3875 mW
|
|||
|
typical) for the 50 MHz chip.
|
|||
|
|
|||
|
|
|||
|
Intel 486DX2
|
|||
|
|
|||
|
The 486DX2 represents the latest generation of Intel CPUs. The "DX2"
|
|||
|
suffix (instead of simply DX) is meant to be an indicator that these are
|
|||
|
clock-doubled versions of the basic CPU. A normal 486DX operates at the
|
|||
|
frequency provided by the incoming clock signal. A 486DX2 instead
|
|||
|
generates a new clock signal from the incoming clock by means of a PLL
|
|||
|
(phase locked loop). In the DX2, this clock signal has twice the
|
|||
|
frequency of the incoming clock, hence the name clock-doubler. All
|
|||
|
internal parts of the 486DX2 (cache, CPU core, and FPU) run at this
|
|||
|
higher frequency; only the bus interface runs at the normal (undoubled)
|
|||
|
speed. Using this technique, an Intel 486DX2-50 can run on an unmodified
|
|||
|
motherboard designed for 25 MHz operation. Since motherboards which run
|
|||
|
at 50 MHz are much harder to design and build than those for 25 MHz,
|
|||
|
this makes a 486DX2-50 system cheaper than an 'equivalent' 486DX-50
|
|||
|
system.
|
|||
|
|
|||
|
For all operations that don't access off-chip resources (e.g., register
|
|||
|
operations), a 486DX2-50 provides exactly the same performance as a
|
|||
|
486DX-50, and twice the performance of a 486DX-25. However, since the
|
|||
|
main memory in a 486DX2-50 system still operates at 25 MHz, all
|
|||
|
instructions involving memory accesses are potentially slower than in a
|
|||
|
486DX-50 system, whose memory also (presumably) runs at 50 MHz. The
|
|||
|
internal cache of the 486 mitigates this problem somewhat, but overall
|
|||
|
performance of a 486DX2-50 is still lower than that of a 486DX-50.
|
|||
|
Intel's documentation [32] shows this drop to be quite small, although
|
|||
|
it is highly dependent upon the particular application.
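
A very rough way to picture this trade-off is to split a workload into cycles
spent inside the chip and cycles spent on the bus. The Python sketch below
does just that. It is a toy model, not Intel's method: it ignores the on-chip
cache, wait states, and any overlap between core and bus activity, and the
workload numbers and the function name are hypothetical.

```python
def run_time_us(core_cycles, bus_cycles, core_mhz, bus_mhz):
    """Run time in microseconds for a workload split into core and bus cycles.

    Toy model: core cycles run at the internal clock, bus cycles at the
    external clock, and the two never overlap.
    """
    return core_cycles / core_mhz + bus_cycles / bus_mhz


# Hypothetical workload: 90% of the cycles stay on-chip, 10% go to the bus.
workload = dict(core_cycles=900_000, bus_cycles=100_000)

dx25 = run_time_us(**workload, core_mhz=25, bus_mhz=25)    # 486DX-25
dx50 = run_time_us(**workload, core_mhz=50, bus_mhz=50)    # 486DX-50
dx2_50 = run_time_us(**workload, core_mhz=50, bus_mhz=25)  # 486DX2-50

for name, t in (("486DX-25", dx25), ("486DX2-50", dx2_50), ("486DX-50", dx50)):
    print(f"{name:10s} {t:8.0f} us   speedup over DX-25: {dx25 / t:.2f}")
```

With no bus cycles at all, the model gives identical times for the 486DX2-50
and the 486DX-50; the more bus cycles a program needs, the closer the DX2
falls toward the 486DX-25.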
|
|||
|
|
|||
|
The truly wonderful thing about the 486DX2 is that it allows easy
|
|||
|
upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely
|
|||
|
pin-compatible with the 486DX: you just need to take out the 486DX and plug
|
|||
|
in the new 486DX2. Note that power consumption of the 486DX2-50 equals
|
|||
|
that of the 486DX-50 (4000 mW typical, 4750 mW max.), and that the
|
|||
|
486DX2-66 exceeds this by about 25% (4875 mW typical, 6000 mW max.).
|
|||
|
These chips get *really* hot in a standard PC case with no extra
|
|||
|
cooling, even if they come with an attached heat sink by default. (See
|
|||
|
the discussion above for more detailed information on this problem and
|
|||
|
possible solutions.)
|
|||
|
|
|||
|
|
|||
|
Intel 487SX
|
|||
|
|
|||
|
The 487SX is the math coprocessor intended for use in 486SX systems. The
|
|||
|
486SX is basically a 486DX without the floating-point unit (FPU) [48,
|
|||
|
50]. (Originally Intel sold 486DXs with a defective FPU as 486SXs but it
|
|||
|
has now completely removed the FPU part from the 486SX mask for mass
|
|||
|
production.) The introduction of the 486SX in 1991 has been viewed by
|
|||
|
many as a marketing 'trick' by Intel to take market share from the 386
|
|||
|
based systems once AMD became successful with their Am386. (AMD has
|
|||
|
taken as much as 40% of the 386 market due to some superior features
|
|||
|
such as higher clock frequency, lower power consumption, fully static
|
|||
|
design, and availability of a 3V version). A 486SX at 20 MHz delivers
|
|||
|
a bit less integer performance than a 40 MHz Am386.
|
|||
|
|
|||
|
To add floating-point capabilities to a 486SX based system, it would
|
|||
|
seem to be easiest to swap the 486SX for a 486DX, which includes the FPU
|
|||
|
on-chip. However, Intel has prevented this easy solution by giving the
|
|||
|
486SX a slightly different pinout [48, 51]. Since only three pins are
|
|||
|
assigned differently, clever board manufacturers have come out with
|
|||
|
boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU
|
|||
|
socket and by doing so provide a clean upgrade path. A set of three
|
|||
|
jumpers ensures correct signal assignment to the changed pins for either
|
|||
|
CPU type. To upgrade 486SX systems without this feature, you are forced
|
|||
|
to buy a 487SX and install it in the "Performance Upgrade Socket"
|
|||
|
(present in most systems).
|
|||
|
|
|||
|
Once the 487SX was available, it was quickly discovered that it is just a
|
|||
|
normal 486DX with a slightly different pinout [49]. Technically
|
|||
|
speaking, the solution Intel chose was the only practical way to provide
|
|||
|
a 486SX system with the high level of floating-point performance the
|
|||
|
486DX offers. The CPU and FPU must be on the same chip; otherwise, the
|
|||
|
FPU cannot make use of the CPU's internal cache and there would be
|
|||
|
considerable overhead in CPU-FPU communication (similar to a 386/387
|
|||
|
system), nullifying most of the arithmetic speedups over the 387. That
|
|||
|
the 486SX, 487SX, and 486DX are *not* pin-compatible seems to be purely
|
|||
|
for marketing reasons.
|
|||
|
|
|||
|
To upgrade a 486SX based system, Intel also offers the OverDrive chip,
|
|||
|
which is just the same as a 487SX with internal clock doubling. It also
|
|||
|
goes into the motherboard's "Performance Upgrade Socket". The OverDrive
|
|||
|
roughly doubles the performance of a 486SX/487SX based system. (For an
|
|||
|
explanation of clock doubling, see the description of the Intel 486DX2
|
|||
|
above.)
|
|||
|
|
|||
|
Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX
|
|||
|
system, so the 486SX could be removed once the 487SX is installed. Since
|
|||
|
the shutdown is logical, not electrical, the 486SX still draws power when
used with the 487SX, although it is inoperative. As with the 486SX,
|
|||
|
the 487SX is currently available in 20 MHz and 25 MHz versions. At 20
|
|||
|
MHz, the 487SX has a power consumption of max. 4000 mW (3250 mW
|
|||
|
typical). It is available in a 169 pin ceramic PGA (pin grid array).
|
|||
|
|
|||
|
|
|||
|
Weitek 1167
|
|||
|
|
|||
|
This math coprocessor was the predecessor of the Weitek Abacus 3167. It
|
|||
|
was actually a small printed circuit board with three chips mounted on
|
|||
|
it. In contrast to the Weitek 3167, the 1167 did not have a square root
|
|||
|
instruction; instead, the square root function was computed by means of
|
|||
|
a subroutine in the Weitek transcendental function library. However, the
|
|||
|
1167 did have a mode in which it supported denormal numbers. (The Weitek
|
|||
|
3167 and 4167 only implement the 'fast' mode, in which denormals are not
|
|||
|
supported.) Overall performance of the 1167 is slightly less than that
|
|||
|
of the Weitek 3167.
|
|||
|
|
|||
|
|
|||
|
Weitek 3167
|
|||
|
|
|||
|
The 3167 was introduced by Weitek in 1989 and provided the fastest
|
|||
|
floating-point performance possible on a 386 based system at that time.
|
|||
|
The 3167 is not a real coprocessor, strictly speaking, but rather a
|
|||
|
memory-mapped peripheral device. The architecture of the 3167 was
|
|||
|
optimized for speed wherever possible. Besides using the faster memory-mapped
interface to the CPU (the 80x87 uses I/O ports), it does not
|
|||
|
support many of the features of the 80x87 coprocessors, allowing all of
|
|||
|
the chip's resources to be concentrated on the fast execution of the
|
|||
|
basic arithmetic operations. (For a more detailed description of the
|
|||
|
Weitek 3167, see the first chapter of this document.)
|
|||
|
|
|||
|
In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the
|
|||
|
performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167
|
|||
|
the Whetstone benchmark ran at 7574 kWhetstones/sec, compared with
3743 kWhetstones/sec for the Intel 387DX. (Note, however, that these
|
|||
|
are single-precision results and that the Weitek 3167's performance
|
|||
|
would drop to about half the stated rate for double-precision, while the
|
|||
|
value for the Intel 387DX would change very little.) In any case, before
|
|||
|
the advent of the Intel RapidCAD, the Weitek 3167 usually outperformed
|
|||
|
all 387-compatible coprocessors, even for double-precision operations
|
|||
|
[63,65,69]. For typical applications, the advantage of the Weitek 3167
|
|||
|
over the 387 clones is much smaller. In a benchmark test using
|
|||
|
AutoDesk's 3D-Studio the Weitek 3167 performed at 123% of the Intel
|
|||
|
387DX's performance compared with 106% for the Cyrix FasMath 83D87 and
|
|||
|
118% for the Intel RapidCAD.
|
|||
|
|
|||
|
The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into an
|
|||
|
EMC socket (provided in most 386-based systems). It does *not* fit into
|
|||
|
the normal 68-pin PGA socket intended for a 387 coprocessor.
|
|||
|
|
|||
|
To get the best of both worlds, one might want to use a Weitek 3167 and
|
|||
|
a 387 compatible coprocessor in the same system. These coprocessors can
|
|||
|
coexist in the same system without problems; however, most 386-based
|
|||
|
systems contain only one coprocessor socket, usually of the EMC
|
|||
|
(extended math coprocessor) type. Thus, you can install either a 387
|
|||
|
coprocessor or a Weitek 3167, but not both at the same time. There *are*
|
|||
|
small daughter boards available that plug into the EMC socket and
|
|||
|
provide two sockets, an EMC and a standard coprocessor socket.
|
|||
|
|
|||
|
At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At
|
|||
|
33 MHz, max. power consumption is 2250 mW.
|
|||
|
|
|||
|
|
|||
|
Weitek 4167
|
|||
|
|
|||
|
The 4167 is a memory-mapped coprocessor that has the same architecture
|
|||
|
as the 3167; it is designed to provide 486-based systems with the
|
|||
|
highest floating-point performance available. It executes coprocessor
|
|||
|
instructions at three to four times the speed of the Weitek 3167.
|
|||
|
Although it is up to 80% faster than the Intel 486 in some benchmarks
|
|||
|
[1,69], the performance advantage for real applications is probably more
|
|||
|
like 10%. The introduction of the 486DX2 processors has more or less
|
|||
|
obliterated the need for a Weitek 4167, since the DX2 CPUs provide the
|
|||
|
same performance as the Weitek, as well as the additional features the
|
|||
|
80x87 architecture has that the Weitek does not.
|
|||
|
|
|||
|
The Weitek 4167 is packaged in a 142-pin PGA package that is only
|
|||
|
slightly smaller than the 486's package. At 25 MHz, it has a max. power
|
|||
|
consumption of 2500 mW [32].
|
|||
|
|
|||
|
|
|||
|
|
|||
|
======================================
|
|||
|
Finding out which coprocessor you have
|
|||
|
======================================
|
|||
|
|
|||
|
If you are interested in programming techniques which allow the detection and
|
|||
|
differentiation of the coprocessors described above, I refer you to my
|
|||
|
COMPTEST program. COMPTEST reliably detects the type and clock frequency of
|
|||
|
the CPU and coprocessor installed in your machine. The current version is
|
|||
|
CTEST257.ZIP, with future versions to be called CTEST258, CTEST259 and so on.
|
|||
|
COMPTEST can correctly identify all of the coprocessors described above, with
|
|||
|
the exception of the Weitek chips, for which the detection mechanism is less
reliable.
|
|||
|
|
|||
|
COMPTEST is in the public domain and comes with complete source code. It is
|
|||
|
available via anonymous ftp from garbo.uwasa.fi and additional ftp sites that
|
|||
|
mirror garbo.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
================================================
|
|||
|
Current coprocessor prices and purchasing advice
|
|||
|
================================================
|
|||
|
|
|||
|
Due to mid-1992 price slashing by Cyrix (and subsequently, Intel) for 387
|
|||
|
coprocessors, prices have dropped significantly for all 287 and 387
|
|||
|
compatibles, with hardly any price difference between manufacturers. 387DX
|
|||
|
compatible coprocessors typically sell for ~US$ 80 for all speeds except for
|
|||
|
40 MHz versions, which are typically ~US$ 90. 387SX compatible coprocessors
|
|||
|
sell for ~US$ 70, regardless of speed, with the exception of the 33 MHz
|
|||
|
versions, which are ~US$ 80. The Intel 287XL sells for ~US$ 90, while the
|
|||
|
IIT 2C87 and Cyrix 82S87 each sell for about US$ 60. 8087s may be more
|
|||
|
expensive, the price of an 8087-10 being ~US$ 150. I purchased the Intel
|
|||
|
RapidCAD for US$ 300 and haven't seen it offered for a better price. I see the
|
|||
|
Weitek Abacus 3167-33 being offered for US$ 230 and the 4167-33 being offered
|
|||
|
for US$ 850. The Intel 486SX OverDrive is available for ~US$ 570 for the 20 MHz
|
|||
|
version, while the Intel 486DX2-50 costs ~US$ 650. This price information
|
|||
|
reflects the price situation as of 01-11-93; prices can be expected to drop
|
|||
|
slightly in the near future.
|
|||
|
|
|||
|
|
|||
|
Which coprocessor should you buy?
|
|||
|
---------------------------------
|
|||
|
Several computer magazines have published application-level performance
|
|||
|
comparisons for various 387 coprocessors and Weitek's Abacus 3167 and 4167
|
|||
|
chips [1,25,68,70]. Applications tested included AutoCAD R11, RenderStar,
|
|||
|
Quattro Pro, Lotus 1-2-3, and AutoDesk's 3D-Studio. For most tests,
|
|||
|
performance improvements for the 387 clones over Intel's 387DX were small to
|
|||
|
marginal, the clones running the applications no more than 5-15% faster than
|
|||
|
the Intel 387DX. In the test of 3D-Studio, one of the few programs that
|
|||
|
directly supports the Weitek Abacus, the Weitek 3167 improved performance by
|
|||
|
23% over an Intel 387DX and the 4167 improved performance by 10% over the
|
|||
|
486DX [1].
|
|||
|
|
|||
|
If you have a demand for high floating-point performance, you should consider
|
|||
|
buying a full 486-based system, rather than a 386-based system with an
|
|||
|
additional coprocessor. Consider: A 386/33 MHz motherboard currently sells for
|
|||
|
~US$ 270; together with the coprocessor, the cost totals ~US$ 350. A 486/33 MHz
|
|||
|
ISA motherboard sells for US$ 650. While this means that the 486 system is 85%
|
|||
|
more expensive than the 386/387 system, it also provides 100% more integer
|
|||
|
and floating-point performance (twice the performance), giving it better
|
|||
|
price/performance for math-intensive applications. As prices for 486 chips
|
|||
|
fall in the future, the price difference between these two systems should
|
|||
|
become even smaller.
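
The price/performance comparison above can be checked with a short
calculation. The Python sketch below uses only the prices quoted in this
section and the assumption, also taken from the text, that the 486 delivers
twice the performance of the 386/387 combination; the variable and function
names are made up for this example.

```python
def percent_more(new, old):
    """How many percent `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0


price_386_387 = 350.0   # US$, 386/33 motherboard plus coprocessor
price_486 = 650.0       # US$, 486/33 ISA motherboard
perf_386_387 = 1.0      # relative performance, 386/387 taken as baseline
perf_486 = 2.0          # "twice the performance" according to the text

extra_cost = percent_more(price_486, price_386_387)
extra_perf = percent_more(perf_486, perf_386_387)
perf_per_dollar_ratio = (perf_486 / price_486) / (perf_386_387 / price_386_387)

print(f"486 system costs {extra_cost:.0f}% more")
print(f"486 system performs {extra_perf:.0f}% better")
print(f"performance per dollar, 486 vs. 386/387: {perf_per_dollar_ratio:.2f}")
```

The extra cost comes out at roughly 86%, in line with the figure of about 85%
given above, while performance per dollar is slightly in favor of the 486.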
|
|||
|
|
|||
|
If you want to push your 386-based system to its maximum floating-point
|
|||
|
performance and can't switch to a 486, I recommend the Intel RapidCAD
|
|||
|
chipset. It is both faster [1] and cheaper than installing a Weitek Abacus
|
|||
|
3167 in a 386 system, which used to be the highest performing combination
|
|||
|
before the RapidCAD was introduced.
|
|||
|
|
|||
|
In a similar vein, the introduction of the Intel 486DX2 clock-doubler chips
|
|||
|
has obliterated the need for a Weitek 4167 to get maximum floating-point
|
|||
|
performance out of a 486-based system. A 486DX2-66 performs at or above the
|
|||
|
performance level of a 33 MHz Weitek 4167, even if the latter uses single-
|
|||
|
precision rather than double-precision. The 486DX2-66 is rated by Intel at
|
|||
|
24700 double-precision kWhetstones/sec and 3.1 double-precision Linpack
|
|||
|
MFLOPS. (Of course, these benchmarks used the highest performance compilers
|
|||
|
available. But even with a Turbo Pascal 6.0 program, I managed to squeeze 1.6
|
|||
|
double-precision MFLOPS out of the 486DX2-66 for the LLL benchmark [for a
|
|||
|
description of these benchmarks, see the paragraph on benchmarks below].)
|
|||
|
Although I haven't yet seen 486DX2-66 processors being offered to end users
|
|||
|
for upgrade purposes, I recommend the 486DX2-66 to those who need the highest
|
|||
|
floating-point performance and are planning to buy a new PC. The price
|
|||
|
difference between a 33 MHz 486DX motherboard and a 486DX2-66 motherboard is
|
|||
|
around US$ 450, well below the price for the Weitek Abacus 4167.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
============================================================
|
|||
|
The benchmark programs / Coprocessor performance comparisons
|
|||
|
============================================================
|
|||
|
|
|||
|
The performance statistics below were put together with the help of four
|
|||
|
widely-known numeric benchmarks and two benchmarks developed by me. Three
|
|||
|
Pascal programs, one FORTRAN program, and two assembly language programs were
|
|||
|
used. The assembly language programs were linked with Borland's Turbo Pascal
|
|||
|
6.0 for library support, especially to include the coprocessor emulator of
|
|||
|
the TP 6.0 run-time library. The Pascal programs were compiled with Turbo
|
|||
|
Pascal 6.0, a non-optimizing compiler that produces 16-bit code. The FORTRAN
|
|||
|
program was compiled using Microsoft's FORTRAN 5.0, an optimizing compiler
|
|||
|
that generates 16-bit code. All programs use double-precision variables
|
|||
|
(except PEAKFLOP and SAVAGE, which use double extended precision).
|
|||
|
|
|||
|
Note that the use of a highly optimizing compiler producing 32-bit code can
|
|||
|
give much higher performance for some benchmarks. For example, Intel rates
|
|||
|
the 33 MHz 386/387DX at 3290 kWhetstones/sec and 0.4 double-precision LINPACK
|
|||
|
MFLOPS [28,29], and it rates the Intel 486 at 12300 kWhetstones/sec and 1.6
|
|||
|
double-precision LINPACK MFLOPS [30]. The compilers used in these benchmarks
|
|||
|
run by the chip's manufacturer are the ones that give the highest performance
|
|||
|
available, and sell in the US$ 1000+ price range. Some of them may even be
|
|||
|
experimental or prereleased versions not available to the general public. The
|
|||
|
relative performance of one coprocessor to another can and does vary greatly
|
|||
|
depending on the code generated by compilers. Non-optimizing compilers tend
|
|||
|
to generate a high percentage of operations which access variables in memory,
|
|||
|
while optimizing compilers produce code that contains many operations
|
|||
|
involving registers. Thus it is quite possible that coprocessor A beats
|
|||
|
coprocessor B running benchmark Z if compiled with compiler C, but B beats A
|
|||
|
when the same benchmark is compiled using compiler D.
|
|||
|
|
|||
|
All benchmarks in this overview were run from floppy under a 'bare-bones'
MS-DOS 5.0 without the CONFIG.SYS and AUTOEXEC.BAT files. This ensured that
no TSR or other program unnecessarily stole computing resources from the
benchmarks.
|
|||
|
|
|||
|
|
|||
|
Description of benchmarks
|
|||
|
-------------------------
|
|||
|
PEAKFLOP is the kernel of a fractal computation. It consists mainly of a
|
|||
|
tight loop written in assembly code and fine-tuned to give maximum
|
|||
|
performance. The whole program fits nicely into even a very small CPU cache.
|
|||
|
All variables are held in the CPU's and coprocessor's registers, so the only
|
|||
|
memory access is for opcode fetches. The main loop contains three
|
|||
|
multiplications and five additions/subtractions; this ratio is fairly
|
|||
|
typical for other floating-point intensive programs as well. Due to the
|
|||
|
nature of this program, its MFLOPS rate is unlikely to be exceeded by any
|
|||
|
program that calculates anything useful; thus the name PEAKFLOP. You will
|
|||
|
find the source code for PEAKFLOP in appendix B.
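
To give an idea of what such a kernel looks like, here is a minimal Python
sketch. It is NOT the PEAKFLOP source (that is the assembly code in appendix
B). It assumes that the fractal in question is of the Mandelbrot type,
z -> z*z + c, because that iteration can be written with exactly three
multiplications and five additions/subtractions per pass; the function name
and the constant are made up for this example.

```python
FLOPS_PER_PASS = 8   # 3 multiplications + 5 additions/subtractions


def mandelbrot_iterations(cx, cy, max_iter=100):
    """Count iterations of z -> z*z + c until |z|^2 exceeds 4 (assumed kernel)."""
    x = 0.0
    y = 0.0
    for i in range(max_iter):
        x2 = x * x               # multiplication 1
        y2 = y * y               # multiplication 2
        if x2 + y2 > 4.0:        # addition 1 (escape test)
            return i
        xy = x * y               # multiplication 3
        x = x2 - y2 + cx         # subtraction and addition (2 and 3)
        y = xy + xy + cy         # additions 4 and 5 (doubling done by adding)
    return max_iter


passes = mandelbrot_iterations(-0.5, 0.0, max_iter=1000)
print(passes, "passes,", passes * FLOPS_PER_PASS, "floating-point operations")
```

A MFLOPS figure follows by dividing the operation count by the run time in
microseconds. In a tuned assembly version, all of the temporaries above would
live in coprocessor registers, which is what keeps memory accesses down to
opcode fetches.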
|
|||
|
|
|||
|
TRNSFORM multiplies an array of 8191 vectors with a 3D-transformation matrix
|
|||
|
(a 4x4 matrix). Each vector consists of four double-precision values.
|
|||
|
Multiplying vectors with a matrix is a typical operation in the manipulation
|
|||
|
(e.g. rotation) of 3D objects which are made up from many vectors describing
|
|||
|
the object. This benchmark stresses addition and multiplication as well as
|
|||
|
memory access. For each vector, 16 multiplications and 12 additions are used,
|
|||
|
and about 256 KB of data is accessed during the benchmark run.
|
|||
|
|
|||
|
For the IIT 3C87, a special version of TRNSFORM was written that makes use of
|
|||
|
the special F4X4 instruction available on that coprocessor. F4X4 does a full
|
|||
|
multiplication of a 4x4 matrix by a 4x1 vector in a single instruction.
|
|||
|
TRNSFORM is implemented as an optimized assembler program linked with the
|
|||
|
Turbo Pascal 6.0 library. The full source code can be found in appendix B.
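
The operation that TRNSFORM times can be sketched in Python as follows. This
is only an illustration of the arithmetic involved (the benchmark itself is
the assembly program in appendix B); the function name and the sample matrix
are made up for this example.

```python
def transform(matrix, vectors):
    """Multiply each 4-component vector by a 4x4 matrix (row-major lists)."""
    result = []
    for v in vectors:
        out = []
        for row in matrix:
            # 4 multiplications and 3 additions per output component
            out.append(row[0] * v[0] + row[1] * v[1] + row[2] * v[2] + row[3] * v[3])
        result.append(out)
    return result


# Sample matrix: a translation by (2, 3, 4) in homogeneous coordinates.
translate = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(transform(translate, [[1.0, 1.0, 1.0, 1.0]]))   # [[3.0, 4.0, 5.0, 1.0]]

N_VECTORS = 8191
flops_per_vector = 4 * 4 + 4 * 3          # 16 multiplications + 12 additions
total_flops = N_VECTORS * flops_per_vector
data_bytes = N_VECTORS * 4 * 8            # four 8-byte doubles per vector
print(total_flops, "operations,", data_bytes, "bytes of vector data")
```

The vector array occupies 262112 bytes, just under 256 KB, which matches the
amount of data quoted above.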
|
|||
|
|
|||
|
LLL is short for Lawrence Livermore Loops [21], a set of kernels taken from
|
|||
|
real floating-point intensive programs. Some of these loops are vectorizable,
|
|||
|
but since we don't deal with vector processors here, this doesn't matter. For
|
|||
|
this test, LLL was adapted from the FORTRAN original [20] to Turbo Pascal
|
|||
|
6.0. By variable overlaying (similar to FORTRAN's EQUIVALENCE statement),
|
|||
|
memory allocation for data was reduced to 64 KB, so all data fits into a
|
|||
|
single 64 KB segment. The older version of LLL, which contains 14 loops, is
used here. There also exists a newer, more elaborate version consisting of 24
|
|||
|
kernels. The kernels in LLL exercise only multiplication and addition. The
|
|||
|
MFLOPS rate reported is the average of the MFLOPS rate of all 14 kernels.
|
|||
|
All floating-point variables in the programs are of type DOUBLE.
|
|||
|
|
|||
|
Both LLL and Whetstone results (see below) are reported as returned by my
|
|||
|
COMPTEST test program, in which they have been included as a measure of
|
|||
|
coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal
|
|||
|
6.0 with all 'optimizations' on and using my own run-time library, which
|
|||
|
gives higher performance than the one included with TP 6.0. My library is
|
|||
|
available as TPL60N18.ZIP from garbo.uwasa.fi and ftp sites that mirror this
|
|||
|
site.
|
|||
|
|
|||
|
Linpack [5] is a well known floating-point benchmark that also heavily
|
|||
|
exercises the memory system. Linpack operates on large matrices and takes up
|
|||
|
about 570 KB in the version used for this test. This is about the largest
|
|||
|
program size a pure DOS system can accommodate. Linpack was originally
|
|||
|
designed to estimate performance of BLAS, a library of FORTRAN subroutines
|
|||
|
that handles various vector and matrix operations. Note that vendors are
|
|||
|
free to supply optimized (e.g., assembly language) versions of BLAS. Linpack
|
|||
|
uses two routines from BLAS which are thought to be typical of the matrix
|
|||
|
operations used by BLAS. Both routines only use addition/subtraction and
|
|||
|
multiplication. The FORTRAN source code for Linpack can be obtained from
|
|||
|
the automated mail server netlib@ornl.gov. Linpack was compiled using MS
|
|||
|
FORTRAN 5.0 in the HUGE memory model (which can handle data structures
|
|||
|
larger than 64 KB) and with compiler switches set for maximum optimization.
|
|||
|
All floating-point variables in the program are of the DOUBLE type. Linpack
|
|||
|
performs the same test repeatedly. The number reported is the maximum MFLOPS
|
|||
|
rate returned by Linpack. Linpack MFLOPS ratings for a great number of
|
|||
|
machines are contained in [6]. This PostScript document is also available
|
|||
|
from netlib@ornl.gov.
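
The text above does not name the two BLAS routines. As an illustration of the
kind of operation involved, the Python sketch below shows DAXPY (y := a*x + y),
the routine most commonly associated with Linpack; that this is one of the two
routines meant here is an assumption. The sketch also shows how a MFLOPS
rating is derived from an operation count and a run time. Timings obtained
this way mostly measure the Python interpreter, not the FPU.

```python
import time


def daxpy(alpha, x, y):
    """Return alpha*x + y element by element (one multiply, one add each)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def mflops(flop_count, seconds):
    """Convert an operation count and a run time into a MFLOPS rating."""
    return flop_count / seconds / 1.0e6


n = 100_000
x = [1.0] * n
y = [2.0] * n

start = time.perf_counter()
result = daxpy(3.0, x, y)
elapsed = max(time.perf_counter() - start, 1.0e-9)   # guard against a zero reading

# Two floating-point operations (multiply and add) per vector element.
print(f"{mflops(2 * n, elapsed):.2f} MFLOPS (interpreter overhead included)")
```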
|
|||
|
|
|||
|
Whetstone [2,3,4] is a synthetic benchmark based upon statistics collected
|
|||
|
about the use of certain control and data structures in programs written in
|
|||
|
high level languages. Based on these statistics, it tries to mirror a
|
|||
|
'typical' HLL program. Whetstone performance is expressed by how many
|
|||
|
hypothetical 'whetstone' instructions are executed per second. It was
|
|||
|
originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and Linpack,
|
|||
|
Whetstone not only uses addition and multiplication but exercises all basic
|
|||
|
arithmetic operations as well as some transcendental functions. Whetstone
|
|||
|
performance depends on the speed of the CPU as well as on the coprocessor,
|
|||
|
while PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU.
|
|||
|
|
|||
|
There exist both old and new versions of Whetstone. Note that results from
|
|||
|
the two versions can differ by as much as 20% for the same test configuration.
|
|||
|
For this test, the new version in Pascal from [3] was used. It was compiled
|
|||
|
with Turbo Pascal 6.0 and my own library (see above) with all 'optimizations'
|
|||
|
on. All computations are performed using the DOUBLE type.
|
|||
|
|
|||
|
SAVAGE tests the performance of transcendental function evaluation. It is
|
|||
|
basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt
|
|||
|
functions are combined in a single expression. While sin, cos, arctan, and
|
|||
|
sqrt can be evaluated directly with a single 387 coprocessor instruction
|
|||
|
each, ln and exp need additional preprocessing for argument reduction and
|
|||
|
result conversion. According to [14], the Savage benchmark was devised by
|
|||
|
Bill Savage, and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore
|
|||
|
Front Parkway, Rockaway Beach, NY 11693, USA. Usually, Savage is programmed
|
|||
|
to make 250,000 passes through the loop. Here only 10,000 loops are executed
|
|||
|
for a total of 60,000 transcendental function evaluations. The result is
|
|||
|
expressed in function evaluations per second. SAVAGE source code was taken
|
|||
|
from [7] and compiled with Turbo Pascal 6.0 and my own run-time library
|
|||
|
(see above).
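
For readers who have not seen it, the Savage loop can be sketched in Python as
shown below. The exact expression used in [7] is not reproduced in this
document, so the commonly quoted form a := tan(arctan(exp(ln(sqrt(a*a))))) + 1
is assumed here, with tan written as sin/cos so that the six functions named
above each appear once per pass.

```python
import math


def savage(passes=10_000):
    """Run the (assumed) Savage expression `passes` times, starting from 1."""
    a = 1.0
    for _ in range(passes):
        t = math.atan(math.exp(math.log(math.sqrt(a * a))))
        a = math.sin(t) / math.cos(t) + 1.0   # tan(t) + 1
    return a


passes = 10_000
result = savage(passes)
evaluations = passes * 6    # sqrt, ln, exp, arctan, sin, cos in every pass

print("function evaluations:", evaluations)         # 60000
print("result:", result)
print("deviation from exact value:", abs(result - (passes + 1)))
```

In exact arithmetic each pass simply adds one, so the result should equal the
number of passes plus one. The deviation from that value is often used as a
rough indication of the accuracy of the transcendental functions, while the
number of evaluations divided by the run time gives the rate reported in the
tables.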
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Benchmark results using the Intel 386DX CPU and various coprocessors
|
|||
|
--------------------------------------------------------------------
|
|||
|
|
|||
|
My benchmark results for 387 coprocessors, coprocessor emulators and the
|
|||
|
Intel RapidCAD and Intel 486 CPUs, using the programs described above, on
|
|||
|
an Intel 386DX system:
|
|||
|
|
|||
|
|
|||
|
33.3 MHz          PEAKFLOP  TRNSFORM     LLL  Linpack  Whetstone    Savage
                    MFLOPS    MFLOPS  MFLOPS   MFLOPS  kWhet/sec  Func/sec

Intel 386DX WITH:
EM87 emulator       0.0070    0.0040  0.0050   0.0050         26       418  ##
Franke387 emu.      0.0307    0.0246  0.0194   0.0179        137      3335  $$
TP/MS-FORT emu      0.0263    0.0227  0.0167   0.0158        133      3160  %%
Q387 emulator       0.0920    0.0664  0.0305   0.0304        251      4796  ((
Intel 387DX         0.7647    0.6004  0.3283   0.2676       2046     43860
ULSI 83C87          1.0097    0.6609  0.3239   0.2598       2089     47431
IIT 3C87            0.8455    0.5957  0.3198   0.2646       2203     49020
IIT 3C87,4X4        0.8455    1.4334  0.3198   0.2646       2203     49020  @@
C&T 38700           0.9455    0.6907  0.3338   0.2700       2376     62565
Cyrix 387+          0.9286    0.6806  0.3293   0.2669       2435     66890
Cyrix EMC87         1.0400    0.6628  0.3352   0.2808       2540     71685  //

Intel RapidCAD      1.8572    1.5798  0.6072   0.4533       3953     72464
Intel 486DX         2.0800    1.7779  0.9387   0.6682       5143     82192
|
|||
|
|
|||
|
|
|||
|
|
|||
|
40 MHz            PEAKFLOP  TRNSFORM     LLL  Linpack  Whetstone    Savage
                    MFLOPS    MFLOPS  MFLOPS   MFLOPS  kWhet/sec  Func/sec

Intel 386DX WITH:
EM87 emulator       0.0084    0.0080  0.0060   0.0060         31       502  ##
Franke387 emu.      0.0369    0.0295  0.0233   0.0215        164      4002  $$
TP/MS-FORT emu      0.0316    0.0273  0.0200   0.0190        160      3794  %%
Q387 emulator       0.1103    0.0798  0.0365   0.0364        301      5758  ((
Intel 387DX         0.9204    0.7212  0.3932   0.3211       2428     52677
ULSI 83C87          1.2093    0.7936  0.3890   0.3120       2528     56926
IIT 3C87            1.0196    0.7145  0.3834   0.3179       2663     58766
IIT 3C87,4X4        1.0196    1.7244  0.3834   0.3179       2663     58766  @@
C&T 38700           1.0722    0.7908  0.4007   0.3222       2837     74906
Cyrix 387+          1.1305    0.8162  0.3945   0.3208       2946     80322
Cyrix EMC87         1.2381    0.7963  0.4025   0.3324       3061     86083  //

Intel RapidCAD      2.2128    1.8931  0.7377   0.5432       4810     86957
Intel 486DX         2.4762    2.1335  1.1110   0.8204       6195     98522
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Benchmark results using the Cyrix 486DLC CPU and various coprocessors
|
|||
|
---------------------------------------------------------------------
|
|||
|
|
|||
|
The Cyrix 486DLC is the latest entry into the market of 386DX replacement
|
|||
|
processors. It features an Intel 486SX-compatible instruction set, a 1 KB on-
|
|||
|
chip cache, and a 16x16 bit hardware multiplier. The RISC-like execution unit
|
|||
|
of the 486DLC executes many instructions in a single clock cycle. The
|
|||
|
hardware multiplier multiplies 16-bit quantities in 3 clock cycles, as
|
|||
|
compared to 12-25 cycles on a standard Intel 386DX. This is especially useful
|
|||
|
in address calculations (code from non-optimizing compilers may contain many
|
|||
|
MUL instructions for array accesses) and for software floating-point
|
|||
|
arithmetic. The 1 KB cache helps the 486DLC to overcome some of the
|
|||
|
limitations of the 386 bus interface, and although its hit rate averages only
|
|||
|
about 65% under normal program conditions, a 5-15% overall performance
|
|||
|
increase can usually be seen for both integer and floating-point-intensive
|
|||
|
applications when it is enabled.
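
How a modest hit rate turns into a modest overall gain can be sketched with
two textbook formulas: the average cost of a memory access, and an
Amdahl-style estimate of the overall speedup. In the Python sketch below, only
the 65% hit rate comes from the text; the cycle counts and the fraction of run
time spent on memory accesses are hypothetical values chosen purely for
illustration, and the function names are made up.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Average cycles per memory access for a given cache hit rate."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


def overall_speedup(memory_fraction, cycles_without_cache, cycles_with_cache):
    """Amdahl-style speedup when only the memory-access part gets faster."""
    remaining = (1.0 - memory_fraction) + memory_fraction * (
        cycles_with_cache / cycles_without_cache
    )
    return 1.0 / remaining


HIT_RATE = 0.65          # average hit rate quoted in the text
HIT_CYCLES = 1.0         # hypothetical: access served by the on-chip cache
MISS_CYCLES = 2.0        # hypothetical: access that has to go out on the bus
MEMORY_FRACTION = 0.3    # hypothetical: share of run time spent on memory accesses

avg = average_access_cycles(HIT_RATE, HIT_CYCLES, MISS_CYCLES)
speedup = overall_speedup(MEMORY_FRACTION, MISS_CYCLES, avg)

print(f"average access cost: {avg:.2f} cycles")
print(f"estimated overall gain: {(speedup - 1.0) * 100.0:.1f}%")
```

Changing the hypothetical parameters moves the estimate considerably, which is
one reason why the measured gains vary so much from one application to
another.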
|
|||
|
|
|||
|
The 486DLC's internal cache is a unified data/instruction write-through type,
|
|||
|
and can be configured as either a direct mapped or a 2-way set associative
|
|||
|
cache. For compatibility reasons, the cache is disabled after a processor
|
|||
|
reset and must be enabled with the help of a small routine provided by
|
|||
|
Cyrix. Cyrix has also defined some additional cache control signals for some
|
|||
|
of the 486DLC pins, intended to improve communication between the on-chip
|
|||
|
cache and an external cache. Current 386 systems ignore these signals, since
|
|||
|
they are not defined for the standard Intel 386DX. However, future systems
|
|||
|
designed with the 486DLC in mind may take advantage of them for increased
|
|||
|
performance.

In existing 386 systems, DMA transfers (e.g., by a SCSI controller or a
soundcard) may cause the 486DLC's entire on-chip cache to be flushed, since
no other means exist to enforce consistency between the cache contents and
main memory. This reduces the performance of the 486DLC in these cases. The
486DLC on-chip cache does, however, allow specification of up to four
non-cacheable regions, which is particularly useful if your system has
memory-mapped peripherals (e.g., a Weitek coprocessor).

Although I successfully ran my test programs on the Cyrix chip with all
coprocessors, not all of them work well with the 486DLC in all circumstances.
The IIT 3C87, the Cyrix 83D87 (chips manufactured prior to November 1991),
and the Cyrix EMC87 should not be used with the 486DLC, since they may cause
the computer to lock up if the FSAVE and FRSTOR instructions are used. (These
instructions are typically used in protected mode multitasking environments
to save and restore the coprocessor state for each task. Note that Microsoft
Windows also fits this description.) According to Cyrix, this problem occurs
only with first revision 486DLCs (sample chips) and is fixed on newer ones.
To be on the safe side, I recommend using the Cyrix 387+ with the 486DLC,
both for assured compatibility and for best performance. Note that 387+ is a
'Europe only' name and that this chip is called 83D87 elsewhere, just like
the old version. You need to get an 83D87 produced after about October 1991
to guarantee that it works correctly with any 486DLC; the same caveat applies
to the Cyrix 486SLC and the Cyrix 83S87. If you already have a Cyrix
coprocessor, use my COMPTEST program to find out whether you have a 'new' or
'old' coprocessor. COMPTEST is available as CTEST257.ZIP via anonymous ftp
from garbo.uwasa.fi (in the /systest directory) and other ftp servers that
mirror garbo.

The Cyrix 486DLC is currently the 386 'clone' with the highest integer
performance. With the internal cache enabled, integer performance of the
486DLC can be up to 80% higher than that of an Intel 386DX at the same clock
frequency, with the average speed gain for most applications being about 35%.
Floating-point applications are typically accelerated by about 15%-30% when
using a Cyrix 486DLC (with its cache enabled) instead of the Intel 386DX.


33.3 MHz         PEAKFLOP TRNSFORM      LLL  Linpack  Whetstone    Savage
                   MFLOPS   MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec
Cyrix 486DLC
(cache off) WITH:
EM87 emulator      0.0089   0.0082   0.0062   0.0063         31       472  ##
Franke387 emu.     0.0402   0.0324   0.0258   0.0240        184      4807  $$
TP/MS-FORT emu     0.0346   0.0288   0.0206   0.0212        173      4401  %%
Q387 emulator      0.1214   0.0810   0.0368   0.0382        320      6020  ((
Intel 387DX        0.8455   0.6552   0.3659   0.3033       2249     48780
ULSI 83C87         1.1818   0.7543   0.3752   0.3026       2381     53476
IIT 3C87           0.9541   0.6609   0.3653   0.3036       2476     55814
IIT 3C87,4X4       0.9541   1.4988   0.3653   0.3036       2476     55814  @@
C&T 38700          1.1183   0.7644   0.3796   0.3087       2703     73350
Cyrix 387+         1.1305   0.7445   0.3727   0.3060       2731     81967
Cyrix EMC87        1.2236   0.7593   0.3823   0.3144       2908     88889  //

Intel RapidCAD     1.8572   1.5798   0.6072   0.4533       3953     72464
Intel 486DX        2.0800   1.7779   0.9387   0.6682       5143     82192



40.0 MHz         PEAKFLOP TRNSFORM      LLL  Linpack  Whetstone    Savage
                   MFLOPS   MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec
Cyrix 486DLC
(cache off) WITH:
EM87 emulator      0.0107   0.0098   0.0075   0.0075         37       567  ##
Franke387 emu.     0.0488   0.0392   0.0311   0.0288        223      5808  $$
TP/MS-FORT emu     0.0416   0.0345   0.0246   0.0253        208      5284  %%
Q387 emulator      0.1463   0.0973   0.0442   0.0458        384      7237  ((
Intel 387DX        1.0196   0.7880   0.4375   0.3644       2712     58479
ULSI 83C87         1.4247   0.9064   0.4506   0.3630       2868     64171
IIT 3C87           1.1556   0.7963   0.4399   0.3611       2988     66964
IIT 3C87,4X4       1.1556   1.7916   0.4399   0.3611       2988     66964  @@
C&T 38700          1.3333   0.9210   0.4548   0.3708       3254     88106
Cyrix 387+         1.3507   0.8958   0.4477   0.3754       3297     98361
Cyrix EMC87        1.4648   0.9136   0.4548   0.3773       3505    106572  //

Intel RapidCAD     2.2128   1.8931   0.7377   0.5432       4810     86957
Intel 486DX        2.4762   2.1335   1.1110   0.8204       6195     98522



33.3 MHz         PEAKFLOP TRNSFORM      LLL  Linpack  Whetstone    Savage
                   MFLOPS   MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec
Cyrix 486DLC
(cache on) WITH:
EM87 emulator      0.0099   0.0089   0.0068   0.0069         35       550  ##
Franke387 emu.     0.0462   0.0362   0.0288   0.0265        205      5445  $$
TP/MS-FORT emu     0.0410   0.0330   0.0234   0.0241        198      5339  %%
Q387 emulator      0.1344   0.0902   0.0389   0.0403        339      6241  ((
Intel 387DX        0.8525   0.6552   0.3941   0.3279       2332     49834
ULSI 83C87         1.2093   0.7543   0.4068   0.3270       2478     57197
IIT 3C87           0.9720   0.6609   0.3959   0.3295       2579     57252
IIT 3C87,4X4       0.9720   1.5087   0.3959   0.3295       2579     57252  @@
C&T 38700          1.1305   0.7644   0.4126   0.3343       2839     75949
Cyrix 387+         1.1429   0.7445   0.4023   0.3310       2866     85349
Cyrix EMC87        1.2381   0.7593   0.4150   0.3412       3051     93897  //

Intel RapidCAD     1.8572   1.5798   0.6072   0.4533       3953     72464
Intel 486DX        2.0800   1.7779   0.9387   0.6682       5143     82192



40.0 MHz         PEAKFLOP TRNSFORM      LLL  Linpack  Whetstone    Savage
                   MFLOPS   MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec
Cyrix 486DLC
(cache on) WITH:
EM87 emulator      0.0118   0.0107   0.0082   0.0082         42       659  ##
Franke387 emu.     0.0565   0.0438   0.0350   0.0313        248      6585  $$
TP/MS-FORT emu     0.0491   0.0395   0.0279   0.0296        238      6408  %%
Q387 emulator      0.1610   0.1084   0.0470   0.0484        407      7509  ((
Intel 387DX        1.0297   0.7880   0.4748   0.3937       2801     59821
ULSI 83C87         1.4445   0.9028   0.4891   0.3926       2976     65789
IIT 3C87           1.1686   0.7963   0.4734   0.3916       3096     68729
IIT 3C87,4X4       1.1686   1.8057   0.4734   0.3916       3096     68729  @@
C&T 38700          1.3685   0.9173   0.4958   0.4012       3401     91185
Cyrix 387+         1.3867   0.8958   0.4887   0.3962       3448    102564
Cyrix EMC87        1.4857   0.9100   0.4959   0.4091       3676    112360  //

Intel RapidCAD     2.2128   1.8931   0.7377   0.5432       4810     86957
Intel 486DX        2.4762   2.1335   1.1110   0.8204       6195     98522



Benchmark results using the C&T 38600DX CPU and various coprocessors
--------------------------------------------------------------------

The Chips&Technologies 38600DX CPU is marketed as a 100% compatible
replacement for the Intel 386DX CPU. Unlike AMD's Am386, which uses microcode
that is identical to the Intel 386DX's, the C&T 38600DX uses microcode
developed independently by C&T using "clean-room" techniques. C&T even
included the 386DX's "undocumented" LOADALL386 instruction in the
instruction set to provide full compatibility with the 386DX. In my tests,
however, I observed that the 38600DX has severe problems with the
CPU-coprocessor communication, which causes the floating-point performance
to drop below that of the Intel 386DX/Intel 387DX for most programs. This
problem exists with all available 387-compatible coprocessors (ULSI 83C87,
IIT 3C87, Cyrix EMC87, Cyrix 83D87, Cyrix 387+, C&T 38700, Intel 387DX). A
net.acquaintance also did tests with the 38600DX and arrived at similar
results. He contacted C&T and they said that they were aware of the problem.

Some instructions execute faster on the C&T 38600DX than on the 386DX, giving
an average speedup of 5-10% for integer applications. C&T also produces a
38605DX CPU that includes a 512 byte instruction cache and provides a further
performance increase. However, the 38605DX needs a bigger socket (144-pin
PGA) and is therefore *not* pin-compatible with the 386DX. Tests using the
38600DX were run at 33.3 MHz, as a 40 MHz version was not available as of
09-17-92 and running the 33 MHz chip version at 40 MHz locked up the machine
frequently. Unfortunately, tests using the Intel 387DX consistently locked up
in the TRNSFORM benchmark when run at 33.3 MHz. It ran fine at 20 MHz, and
the results were scaled to show expected performance at 33.3 MHz.
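
The scaling mentioned above amounts to multiplying a result measured at one
clock frequency by the ratio of the two frequencies. The sketch below assumes
that performance grows linearly with clock frequency (which ignores memory
wait states); the function name is hypothetical and the input figure is a
made-up placeholder, not a measurement from this document.

```python
def scale_to_clock(result, measured_mhz, target_mhz):
    """Linearly scale a benchmark result from one clock frequency to another."""
    return result * target_mhz / measured_mhz

# Hypothetical example: 0.3400 MFLOPS measured at 20 MHz, scaled to 33.3 MHz.
scaled = scale_to_clock(0.3400, 20.0, 33.3)
print(f"{scaled:.4f} MFLOPS")
```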


33.3 MHz         PEAKFLOP TRNSFORM      LLL  Linpack  Whetstone    Savage
                   MFLOPS   MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec

C&T 38600DX WITH:
Intel 387DX        0.7376   0.5620   0.3337   0.2636       2066     45489
ULSI 83C87         0.5226   0.4690   0.3236   0.2654       2087     43228
IIT 3C87           0.7879   0.5762   0.3397   0.2674       2263     51195
IIT 3C87,4X4       0.7879   0.6181   0.3397   0.2674       2263     51195  @@
C&T 38700          0.5977   0.5572   0.3463   0.2681       2338     63966
Cyrix 387+         0.5896   0.5508   0.3438   0.2673       2375     66741

Intel RapidCAD     1.8572   1.5798   0.6072   0.4533       3953     72464
Intel 486          2.0800   1.7779   0.9387   0.6682       5143     82192


For comparison:

                 PEAKFLOP TRNSFORM      LLL  Linpack  Whetstone    Savage
                   MFLOPS   MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec

i486DX2-66         4.1601   3.4227   1.6531   1.3010      10655    163934
i486DX2-50         3.0589   2.6665   1.2537   0.9744       7962    123203
i387, 20 MHz       0.2253   0.3271   0.1434   0.1171        952     21739  ++
i387DX, 20 MHz     0.3567   0.4444   0.1484   0.1161       1034     24155  &&
i80287, 5 MHz      0.0281   0.0310   0.0242   0.0222        150      3261  !!
i8087,9.54 MHz     0.0636   0.0705   0.0321   0.0219        234      5782  **



Benchmark notes and footnotes
-----------------------------

Hardware configuration for test of 387 coprocessors with C&T 38600DX, Intel
386DX, Cyrix 486DLC, and Intel RapidCAD CPUs:

   System A: Motherboard with Forex chip set, 128 KB CPU Cache, 8 MB RAM


Hardware configuration for test of 486 FPU (extra fan for 40 MHz operation):

   System B: Motherboard with SIS chip set, 256 KB CPU Cache, 8 MB RAM


##  EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that
    loads as a TSR. 80286, 80386, and 486SX systems without a coprocessor
    raise an INT 7 trap upon encountering a coprocessor instruction; EM87
    uses these traps to catch the instructions and emulate them. Whetstone
    and Savage benchmarks for this test were compiled with the original
    TP 6.0 library, as EM87 chokes on the 387-specific FSIN and FCOS
    instructions used in my own library if a 387 is detected. Obviously
    EM87 identifies itself as a 387, but it has no support for 387-specific
    instructions.

$$  Franke387 is a commercial 387 emulator that is also available in a
    shareware version. For this test, shareware version V2.4 was used.
    Franke387, unlike many other emulators, supports all 387 instructions.
    It is loaded as a device driver and uses INT 7 to trap coprocessor
    instructions.

((  Q387 is an emulator that is distributed as a shareware program by
    Quickware of Austin, Texas. As the name implies, this emulator uses
    386-specific code and supports the full 387 instruction set. The
    program is about 330 kByte in size and loads completely into extended
    memory, using absolutely no DOS memory. It is loaded as a TSR and
    requires an EMM (expanded memory manager) to be present. The emulation
    uses the INT 7 mechanism. The version of Q387 used was 3.0a.

%%  These benchmarks were run using the built-in coprocessor emulators of
    the TP 6.0 (for Savage, LLL, Whetstone, TRNSFORM, PEAKFLOP) and the MS
    FORTRAN 5.0 (for Linpack) run-time libraries. The libraries were forced
    not to use a coprocessor by means of the environment settings NO87=NC
    and 87=N.

@@  The 3C87-specific F4X4 instruction was used in the vector transformation
    benchmark.

//  The EMC87 was used in the 387-compatible mode only. The faster memory-
    mapped mode was *not* used. Times should therefore be identical to the
    Cyrix 83D87.

++  Older motherboard with no chip set (discrete logic), no CPU cache, 16 MB
    RAM

&&  System A, CPU cache disabled via extended set-up, turbo-switch set to
    half speed (that is, 20 MHz)

!!  80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM. Due to the
    fast CPU used here, performance figures are somewhat higher than can be
    expected for an 80286/287 combination, except for the PEAKFLOP benchmark,
    which is basically coprocessor limited.

**  8086/8087 system with 640 KB RAM


Benchmark results for Weitek coprocessors
-----------------------------------------

Since neither a Weitek coprocessor nor a compiler that generates code for the
Weitek chips was available to me, performance data for the Weitek Abacus is
given here according to [31,32] and scaled to show performance of a 33 MHz
system. The benchmarks were compiled using highly-optimizing 32-bit
compilers.

                          Single Prec.    Double Prec.    Double Prec.

                          3167    4167    3167    4167     387     486

Linpack MFLOPS             1.8     5.0     0.8     3.2     0.4     1.6
Whetstone kWhet/sec       7470   22700    4900   14000    3290   12300

Note that for the Intel coprocessors, running programs in single vs.
double-precision doesn't provide much of a performance advantage since all
internal calculations are always done in extended precision. Using Weitek
coprocessors, however, performance nearly doubles in single-precision mode.
For double-precision calculations using only basic arithmetic, the Weitek
Abacus can at most provide performance at twice the level of the respective
Intel coprocessor (387/486) at the same clock speed.


Comparison of floating-point performance [30,32]

single-precision

                         Weitek 4167-33     Intel 486-33  Intel 486DX2-66

Linpack MFLOPS                      5.0              1.8              3.5
Whetstone kWhet/sec               22700            12700            25500


double-precision

                         Weitek 4167-33     Intel 486-33  Intel 486DX2-66

Linpack MFLOPS                      3.5              1.6              3.1
Whetstone kWhet/sec               14000            12300            24700



=============================================================================
Clock-cycle timings for coprocessor instructions on various coprocessor chips
=============================================================================

The following tables list the speed of various coprocessor instructions,
measured in clock cycles, as captured by my program 87TIMES. Error is +/- one
clock cycle, except for the Intel 80287. Times for the 80287 were determined
on a system with a 20 MHz 80386 and a 5 MHz Intel 80287. Therefore, times may
differ from a genuine 80286/287 system, especially for those instructions
that access an operand in memory. Since the times are stated as the number of
coprocessor clock cycles used, the faster 386, which executes four clock
cycles in the time the 80287 executes one, may decrease memory access times
as seen by the coprocessor.

The CPU used in testing the 387 coprocessors was an Intel 386DX. Note that
due to the improved coprocessor interface of the Cyrix 486DLC, the execution
time of most coprocessor instructions drops by 2-3 clock cycles when used
with this CPU.
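
Cycle counts translate into execution times by dividing by the clock
frequency. The helper below is a small sketch (the function name is
hypothetical); it uses the 20-cycle figure listed for FADD ST,ST(1) on the
Intel 387DX in this section's timing table and assumes a 33.3 MHz coprocessor
clock.

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a clock-cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz

fadd_us = cycles_to_microseconds(20, 33.3)   # FADD ST,ST(1) on a 387DX
print(f"{fadd_us:.2f} microseconds per instruction")
print(f"{1 / fadd_us:.2f} million such instructions per second at best")
```

The second figure is an upper bound for back-to-back register additions only;
real programs also spend time on loads, stores, and CPU instructions, which
is why the benchmark results are far below it.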


                Intel    Intel  Cyrix  Cyrix    C&T   ULSI    IIT  Intel  Intel
                 i486 RapidCAD  83D87   387+  38700  83C87   3C87  387DX  80387

FLD1                4        3     14     14     14     18     24     23     26
FLDZ                4        3     14     14     14     18     24     23     31
FLDPI               7        8     14     15     14     18     24     38     45
FLDLG2              7        8     14     14     14     18     24     33     45
FLDL2T              7        8     14     14     14     19     24     38     45
FLDL2E              7        8     14     14     14     19     24     38     45
FLDLN2              7        8     14     14     14     19     24     38     45
FLD ST(0)           4        4     14     14     14     14     24     20     21
FST ST(1)           3        4     14     14     14     14     19     18     22
FSTP ST(0)          4        4     14     14     14     15     19     19     22
FSTP ST(1)          4        4     15     15     14     15     19     20     22
FLD ST(1)           4        4     14     14     14     14     24     18     21
FXCH ST(1)          4        4     14     20     14     19     24     24     27
FILD [Word]        12       16     33     37     32     42     38     47     62
FILD [DWord]        8       11     26     26     21     32     28     35     45
FILD [QWord]        9       15     30     30     25     36     32     34     54
FLD [DWord]         3        5     26     26     21     23     28     20     25
FLD [QWord]         3        7     30     30     25     27     32     24     35
FLD [TByte]         5       11     46     46     46     46     47     46     57
FBLD [TByte]       83       90     66     86    106    146    197     71    278
FIST [Word]        31       31     37     40     37     42     51     69     90
FIST [DWord]       29       30     35     40     35     40     49     66     84
FST [DWord]         7        7     35     37     32     40     33     37     40
FST [QWord]         8        9     43     43     39     47     40     45     51
FISTP [Word]       32       32     42     40     37     43     46     70     90
FISTP [DWord]      31       31     40     40     35     41     50     67     87
FISTP [QWord]      29       29     44     44     42     48     56     73     92
FSTP [DWord]        8        8     38     36     32     41     35     38     43
FSTP [QWord]        9        9     46     43     39     48     42     46     49
FSTP [TByte]        8        8     50     45     49     50     48     53     58
FBSTP [TByte]     170      172     98     98    114    129    218    144    533
FINIT              17       31     15     16     15     15     16     16     25
FCLEX               7       20     15     16     16     16     16     16     25
FCHS                7        8     14     15     14     14     19     30     33
FABS                5        5     14     15     14     14     19     30     33
FXAM               12       13     14     15     14     14     19     39     43
FTST                5        5     19     25     14     24     24     34     38
FSTENV             67       82    125    125    124    132    124    159    165
FLDENV             44       59    106    106    112    120    106    119    129
FSAVE             181      169    355    355    374    361    376    469    511
FRSTOR            130      203    358    358    385    372    371    420    456
FSTSW [mem]         4        5     14     14     14     14     14     14     17
FSTSW AX            3        4     12     12     11     11     11     11     14
FSTCW [mem]         4        5     14     14     13     13     13     14     18
FLDCW [mem]         4       11     26     26     31     32     27     32     36
FADD ST,ST(0)       8        9     19     20     19     19     24     24     32
FADD ST,ST(1)       9        9     19     20     19     18     24     20     32
FADD ST(1),ST      10       10     19     20     19     18     24     24     37
FADDP ST(1),ST     11       11     19     19     19     16     24     25     37
FADD [DWord]        9       10     25     28     22     23     23     21     34
FADD [QWord]        9       10     32     32     26     27     27     25     38
FIADD [Word]       20       21     34     34     33     40     40     52     80
FIADD [DWord]      20       21     27     28     27     30     30     37     61
FSUB ST(1),ST      10       10     19     20     19     19     24     24     38
FSUBR ST(1),ST      9       10     19     22     19     19     24     27     38
FSUBRP ST(1),ST    10       10     19     19     22     20     24     25     38
FSUB [DWord]       11       12     27     28     27     23     29     27     32
FSUB [QWord]       11       12     32     32     31     27     33     26     44
FISUB [Word]       21       21     34     34     34     40     40     52     80
FISUB [DWord]      21       22     27     28     27     29     30     40     60
FMUL ST,ST(1)      16       17     19     25     24     24     29     38     57
FMUL ST(1),ST      16       17     19     24     24     24     29     40     62
FMULP ST(1),ST     17       17     19     24     24     25     29     40     58
FIMUL [Word]       22       23     40     40     37     46     46     52     80
FIMUL [DWord]      22       23     27     28     27     36     35     45     68
FMUL [DWord]       11       12     27     28     27     28     29     25     45
FMUL [QWord]       14       15     32     32     31     32     33     37     61
FDIV ST,ST(0)      73       74     26     40     59     54     54     89    100
FDIV ST,ST(1)      73       74     36     45     59     54     54     77    100
FDIV ST(1),ST      73       74     36     45     59     55     54     78    102
FDIVR ST(1),ST     73       74     36     45     59     54     54     77    102
FDIVRP ST(1),ST    73       74     36     44     59     55     54     76    106
FIDIV [Word]       84       85     52     58     75     76     76    105    141
FIDIV [DWord]      84       85     45     46     65     65     65    101    123
FDIV [DWord]       73       74     45     46     63     56     59     77    101
FDIV [QWord]       73       74     50     50     67     60     63     78    103
FSQRT (0.0)        25       25     19     19     14     19     24     29     37
FSQRT (1.0)        83       84     36     74     54     89     59    109    132
FSQRT (L2T)        86       87     36     74     54     89     59    104    137
FXTRACT (L2T)      17       17     19     19     19     28     79     53     72
FSCALE (PI,5)      30       30     36     24     24     49     79     59     82
FRNDINT (PI)       31       31     19     29     24     34     29     49     82
FPREM (99,PI)      58       59     54     99     44     54     49     79     96
FPREM1(99,PI)      90       91     54     99     44     59     54    104    121
FCOM                5        6     15     20     19     25     19     29     32
FCOMP               6        6     15     19     19     25     19     30     33
FCOMPP              7        7     15     19     19     25     19     31     40
FICOM [Word]       16       17     34     34     33     46     34     58     76
FICOM [DWord]      16       16     21     28     21     35     23     45     57
FCOM [DWord]        5        6     21     28     22     23     23     27     34
FCOM [QWord]        5        8     27     32     25     27     27     31     39
FSIN (0.0)         24       24     14     99     14     19     24     39     43
FSIN (1.0)        310      313    114    164    144    494    219    509    596
FSIN (PI)          88       89    118    189     64     64    214    134    152
FSIN (LG2)        292      295     72     89    139    454    184    449    531
FSIN (L2T)        299      302    123    179    164    469    214    454    536
FCOS (0.0)         24       24     19    159     14     19     24     34     42
FCOS (1.0)        302      305     84    104    139    489    214    459    547
FCOS (PI)          88       89    154    254     64     64    224    199    232
FCOS (LG2)        300      303    108    149    139    454    194    504    583
FCOS (L2T)        307      310    159    239    164    469    224    509    601
FSINCOS (0.0)      25       25     14     19     19     18     34     38     55
FSINCOS (1.0)     353      356    124    174    254    493    419    538    636
FSINCOS (PI)      105      106    162    263     79     68    424    228    277
FSINCOS (LG2)     340      343    119    159    249    458    359    533    627
FSINCOS (L2T)     347      350    168    248    274    473    424    538    646
FPTAN (0.0)        25       25     14     19     19     18     29     38     46
FPTAN (1.0)       266      269    119    149    184    538    309    323    396
FPTAN (PI)        145      146    134    228    104    108    304    168    211
FPTAN (LG2)       244      246     94    129    179    498    274    298    363
FPTAN (L2T)       247      249    139    219    204    513    304    298    365
FPATAN (0.0)       38       39     19     24     19     20     29     95     93
FPATAN (1.0)      294      298    124    159     29    375    604    360    433
FPATAN (PI)       304      308    139    188    279    360    424    375    472
FPATAN (LG2)      290      293    128    154    269    365    379    375    448
FPATAN (L2T)      304      308    144    189    274    359    424    375    468
F2XM1 (0.0)        25       25     14     14     14     19     24     34     37
F2XM1 (LN2)       209      211     89    119    169    394    284    299    348
F2XM1 (LG2)       204      206     78    104    159    379    284    294    337
FYL2X (1.0)        60       61     36     39     24     75     94    115    127
FYL2X (PI)        294      297    108    163    249    450    359    395    504
FYL2X (LG2)       311      314    108    159    249    460    339    410    518
FYL2X (L2T)       293      296    108    164    249    439    359    390    501
FYL2XP1 (LG2)     334      337     99    169    234    460    284    435    538



                                 80386 +   80386 +   80386 +   80386 +
                  Intel  Intel      Q387 Franke387    TP 6.0      EM87
                   8087  80287  Emulator  Emulator  Emulator  Emulator

FLD1                 26     55        51       481       422      1626
FLDZ                 21     53        39       480       416      1646
FLDPI                26     55        51       486       443      1626
FLDLG2               26     56        51       486       423      1626
FLDL2T               26     55        51       486       440      1626
FLDL2E               26     53        52       486       423      1626
FLDLN2               26     55        52       486       441      1626
FLD ST(0)            31     55        57       493       362      1851
FST ST(1)            26     54        61       489       355      1931
FSTP ST(0)           26     54        46       507       358      2115
FSTP ST(1)           21     55        66       507       356      2116
FLD ST(1)            26     55        54       493       362      1852
FXCH ST(1)           21     57        80       497       486      2187
FILD [Word]          58     90       122       667       712      2259
FILD [DWord]         64     74       121       608       812      2164
FILD [QWord]         74     93       179       652       707      2971
FLD [DWord]          49     44       106       633       473      2077
FLD [QWord]          54     57       118       641       524      2336
FLD [TByte]          59     45       102       607       492      2063
FBLD [TByte]        309    310       736      2019      1512     17827
FIST [Word]          79     72       143       854       766      2418
FIST [DWord]         84     80       136       865       518      2325
FST [DWord]          89     85       124       686       441      2200
FST [QWord]          99     92       135       703       516      2481
FISTP [Word]         79     80       154       864       794      2620
FISTP [DWord]        79     81       144       879       541      2523
FISTP [QWord]        88     75       184       904       916      3226
FSTP [DWord]         89     75       133       713       467      2400
FSTP [QWord]         93     72       142       732       538      2678
FSTP [TByte]         49     21       111       685       467      2124
FBSTP [TByte]       528    472      1124      3305      1555     27013
FINIT                11     10      1079       742       641      1369
FCLEX                11     10        48       440       323       912
FCHS                 21     54        45       460       354      1744
FABS                 21     54        43       456       349      1738
FXAM                 21     54        72       481       380      1551
FTST                 51     75        70       585       386      2721
FSTENV               54     57       827       928       519      2104
FLDENV               48     50       780      1125       450      1631
FSAVE               214    244      3929      1949       976      2749
FRSTOR              209    227      2901      2182       657      2225
FSTSW [mem]          28     10        87       516       401      1189
FSTSW AX            N/A     55        57       451       N/A       N/A
FSTCW [mem]          28     10        74       506       359      1167
FLDCW [mem]          19     47        91       524       437      1584
FADD ST,ST(0)        86    128       136       643       706      2805
FADD ST,ST(1)        85    116       146       707       808      3093
FADD ST(1),ST        92    131       157       664       812      3146
FADDP ST(1),ST       92    129       164       704       799      3143
FADD [DWord]        105    122       221       874       969      3139
FADD [QWord]        115    122       232       888      1021      3396
FIADD [Word]        115    122       238       940      1211      3330
FIADD [DWord]       125    122       239       882      1297      3215
FSUB ST(1),ST        88    130       171       738       817      3156
FSUBR ST(1),ST       96    132       181       740       868      3004
FSUBRP ST(1),ST      99    132       193       733       805      3301
FSUB [DWord]        119    122       230       918      1018      3127
FSUB [QWord]        129    123       242       932      1070      3632
FISUB [Word]        115    123       268       977      1081      3802
FISUB [DWord]       125    125       289       940       980      4161
FMUL ST,ST(1)       145    151       297       810      1368      3924
FMUL ST(1),ST       145    151       296       817      1377      3962
FMULP ST(1),ST      148    168       304       840      1365      4164
FIMUL [Word]        132    151       384      1039      1517      4039
FIMUL [DWord]       141    151       383       980      1643      3976
FMUL [DWord]        125    123       345       948      1480      3445
FMUL [QWord]        175    192       387       991      1602      4416
FDIV ST,ST(0)       201    207       274       726      1536      9789
FDIV ST,ST(1)       203    218       299       808      1658     10332
FDIV ST(1),ST       207    214       299       825      1655     10342
FDIVR ST(1),ST      201    206       302       819      1806     10213
FDIVRP ST(1),ST     201    205       309       845      1803     10409
FIDIV [Word]        237    227       390       980      1779     11225
FIDIV [DWord]       246    227       411       944      1680     11572
FDIV [DWord]        229    226       352       893      1722     10577
FDIV [QWord]        236    227       391       993      1777     10829
FSQRT (0.0)          21     57        60       512       382      1755
FSQRT (1.0)         186    206       294      1106      2504     37836
FSQRT (L2T)         186    207       295      1398      2467     37925
FXTRACT (L2T)        51     56       155       726       571      3326
FSCALE (PI,5)        41     56        95       817       443      3194
FRNDINT (PI)         51     58       136       808       800      7092
FPREM (99,PI)        81    131       322      1696       941      4098
FPREM1(99,PI)       N/A    N/A       384      1625       N/A       N/A
FCOM                 56     75       155       582       483      2799
FCOMP                61     92       160       616       485      2983
FCOMPP               61     90       149       661       476      3198
FICOM [Word]         79     77       231       808       861      3654
FICOM [DWord]        89     77       231       750       964      3684
FCOM [DWord]         74     75       214       741       625      3643
FCOM [QWord]         74     76       205       754       667      3771
FSIN (0.0)          N/A    N/A       137       639       N/A       N/A
FSIN (1.0)          N/A    N/A       997      4640       N/A       N/A
FSIN (PI)           N/A    N/A       322      2488       N/A       N/A
FSIN (LG2)          N/A    N/A       978      3911       N/A       N/A
FSIN (L2T)          N/A    N/A      1005      3767       N/A       N/A
FCOS (0.0)          N/A    N/A       182       740       N/A       N/A
FCOS (1.0)          N/A    N/A       988      4777       N/A       N/A
FCOS (PI)           N/A    N/A       337      2557       N/A       N/A
FCOS (LG2)          N/A    N/A       976      4176       N/A       N/A
FCOS (L2T)          N/A    N/A      1001      3905       N/A       N/A
FSINCOS (0.0)       N/A    N/A       225       714       N/A       N/A
FSINCOS (1.0)       N/A    N/A      1841      6049       N/A       N/A
FSINCOS (PI)        N/A    N/A      1167      4091       N/A       N/A
FSINCOS (LG2)       N/A    N/A      1525      5640       N/A       N/A
FSINCOS (L2T)       N/A    N/A      1552      5405       N/A       N/A
FPTAN (0.0)          41     58        90       752      8381      2324
FPTAN (1.0)         581    582      1182      6366     10817     29824
FPTAN (PI)          606    587       292      4388     12410      2300
FPTAN (LG2)         516    513       883      5939     12502     26770
FPTAN (L2T)         576    586       954      5723     12483      2301
FPATAN (0.0)         41     55       123       616      1208     10578
FPATAN (1.0)        736    736       171      1426     13446     34208
FPATAN (PI)         206    207     11115      2835     13305     46903
FPATAN (LG2)        756    736     11077      2490     13319     41312
FPATAN (L2T)        206    204     11117      2922     13364     50149
F2XM1 (0.0)          16     56       102       563       723      1722
F2XM1 (LN2)         631    624       905      4178     11070     33823
F2XM1 (LG2)         611    585       890      4798     11116     32163
FYL2X (1.0)          56     57       136       961      1214      4327
FYL2X (PI)          946    961      1008      8987     12858     40148
FYL2X (LG2)        1081   1038      1035      8933     12748     46821
FYL2X (L2T)         926    886      1089      8982     12712     38986
FYL2XP1 (LG2)      1026   1037      1154     10485     11867     44708


Clock-cycle timings for floating-point operations on Weitek coprocessors
------------------------------------------------------------------------

The Weitek 3167 and 4167 coprocessors only implement the basic arithmetic
functions (add, subtract, multiply, divide, square root) in hardware;
transcendental functions are implemented by means of a software library
supplied by Weitek which uses the basic hardware instructions to approximate
the transcendental functions (using polynomial and rational approximations).
The clock cycle timings for the transcendental functions are average values,
since execution time can vary with the value of the argument. The speed of
transcendental functions for the 4167 is estimated based on the numbers in
[31,33], from which this timing information has been extracted.
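
As an illustration of how a transcendental function can be built from the
basic operations, the sketch below evaluates a Taylor polynomial for sine
with Horner's rule, using only additions and multiplications after a simple
range reduction. It is not Weitek's library code; the polynomial degree, the
reduction scheme, and the name poly_sin are assumptions chosen for clarity.

```python
import math

# Taylor coefficients of sin(r)/r in powers of r*r: 1 - r^2/3! + r^4/5! - ...
_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(8)]

def poly_sin(x):
    """Approximate sin(x) with a polynomial evaluated by add and multiply."""
    # Range reduction: x = r + k*pi with |r| <= pi/2, sin(x) = (-1)^k sin(r).
    k = round(x / math.pi)
    r = x - k * math.pi
    r2 = r * r
    acc = 0.0
    for c in reversed(_COEFFS):          # Horner's rule in r*r
        acc = acc * r2 + c
    result = r * acc
    return -result if k % 2 else result

for value in (0.0, 1.0, 2.5, -4.0):
    print(value, poly_sin(value), math.sin(value))
```

A production library would typically use minimax rather than Taylor
coefficients and a more careful range reduction, which is one reason the
execution time depends on the argument.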


                Single-precision    Double-precision

                  3167      4167      3167      4167

ABS                  3         2         3         2
NEG                  6         2         6         2
ADD                  6         2         6         2
SUB                  6         2         6         2
SUBR                 6         2         6         2
MUL                  6         2        10         3
DIVR                38        17        66        31
SQRT                60        17       118        31
SIN                146       ~50       292      ~100
COS                140       ~50       285      ~100
TAN                188       ~60       340      ~110
EXP                179       ~60       401      ~130
LOG                171       ~60       365      ~120
F->ASCII          1000       N/A      1700       N/A  //
ASCII->F          1100       N/A      1800       N/A  //

//  Rough average of the timings given for different numeric
    formats by Weitek. Note that these conversion routines
    do much more work than the FBLD and FBSTP instructions
    provided by the 80x87 coprocessors. FBLD and FBSTP are
    useful for conversion routines but quite a bit of additional
    code is needed for this purpose.
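
To illustrate why FBLD and FBSTP alone do not make a complete ASCII
conversion routine, the sketch below models the 10-byte packed BCD operand
that these instructions load and store: 18 decimal digits, two per byte with
the least significant byte first, plus a sign bit in the top byte. The
function names are hypothetical, and the code only models the data format in
software. Turning the digits into a formatted string (decimal point,
exponent, padding) is the additional work referred to above.

```python
def to_packed_bcd(value):
    """Encode an integer as an 80x87-style 10-byte packed BCD operand."""
    if abs(value) >= 10 ** 18:
        raise ValueError("packed BCD holds at most 18 decimal digits")
    digits = abs(value)
    out = bytearray(10)
    for i in range(9):
        low = digits % 10
        high = (digits // 10) % 10
        out[i] = (high << 4) | low       # two decimal digits per byte
        digits //= 100
    if value < 0:
        out[9] = 0x80                    # sign bit lives in the top byte
    return bytes(out)

def from_packed_bcd(data):
    """Decode a 10-byte packed BCD operand back into an integer."""
    value = 0
    for byte in reversed(data[:9]):
        value = value * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return -value if data[9] & 0x80 else value

encoded = to_packed_bcd(-1993)
print(encoded.hex())
print(from_packed_bcd(encoded))
```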



=============================================================================
Accuracy of calculations performed by a coprocessor / The IEEETEST program
=============================================================================

Among the 80x87 coprocessors, the IEEE-754 Standard for Binary Floating-Point
Arithmetic [10,11] was first fully implemented by Intel's 387 coprocessor [17].
Among other things, this means that the add, subtract, multiply, divide,
remainder, and square root operations always deliver the 'exact' result. By
'exact', the standard means that the coprocessor always delivers the machine
number closest to the real result, which may not always be representable
exactly in the available numeric format. The 80387 implements the single,
double, and double extended formats as specified in the IEEE standard, as
well as all functions required by it [17].
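
What 'closest machine number' means can be checked in software. The sketch
below uses exact rational arithmetic to confirm that, in IEEE double
precision with the default round-to-nearest mode, the computed quotient 1/3
is at least as close to the true value as either neighbouring machine number.
It assumes a host whose Python floats are IEEE doubles; the helper names are
hypothetical, and the neighbour search only handles positive finite values.

```python
import struct
from fractions import Fraction

def neighbours(x):
    """Return the doubles just below and above a positive finite double."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    below = struct.unpack("<d", struct.pack("<q", bits - 1))[0]
    above = struct.unpack("<d", struct.pack("<q", bits + 1))[0]
    return below, above

def is_nearest(computed, exact):
    """True if no neighbouring double is closer to the exact value."""
    err = abs(Fraction(computed) - exact)
    below, above = neighbours(computed)
    return (err <= abs(Fraction(below) - exact) and
            err <= abs(Fraction(above) - exact))

quotient = 1.0 / 3.0
print(is_nearest(quotient, Fraction(1, 3)))
print(is_nearest(neighbours(quotient)[1], Fraction(1, 3)))
```

The first check succeeds for the correctly rounded quotient; the second
fails, because the next larger double is farther from 1/3 than the quotient
itself.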

Note that earlier Intel coprocessors (the 8087 and the 80287) comply with a
draft version of the standard that differs from the final version. These
chips were developed before IEEE-754 was finally accepted in 1985. As with
the 80387, the basic arithmetic in the 8087 and the 80287 is 'exact' in the
sense that the computed result is always the machine number closest to the
real result. However, there are some differences regarding certain operands
like infinities, and some operations like the remainder are defined
differently than in the final version of the standard.

Some new instructions were introduced with the 80387, most notably the FSIN
and FCOS operations. The argument range for some transcendental functions has
also been extended [17]. Note that the IEEE-754 standard says nothing about
the quality of the implementation of transcendental functions like sin, cos,
tan, arctan, and log. Intel uses a modified CORDIC [18,19] technique to
compute the transcendental functions; Intel claims that the maximum error in
the 8087, 80287, and 80387 for all transcendental functions does not exceed
two bits in the mantissa of the double extended format, which features 64
mantissa bits for an overall accuracy of approximately 19 decimal places
[22,23]. This claim has been independently verified by a competing vendor
[13]. This means that at least 62 of the 64 mantissa bits returned as a
result by one of the transcendental function instructions are guaranteed to
be correct.
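
The relation between mantissa bits and decimal places quoted above follows
from log10(2): each bit is worth about 0.301 decimal digits. A quick check
(the function name is only illustrative):

```python
import math

def bits_to_decimal_digits(bits):
    """Number of decimal digits equivalent to the given number of bits."""
    return bits * math.log10(2)

print(f"64 bits: {bits_to_decimal_digits(64):.2f} digits")
print(f"62 bits: {bits_to_decimal_digits(62):.2f} digits")
```

So a two-bit error in a 64-bit mantissa costs a little over half a decimal
digit.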
|
|||
|
|
|||
|
The Weitek Abacus 3167 and 4167 coprocessors are 'mostly compatible' with
IEEE-754 [31,32,33]. They support the single-precision and double-precision
numeric formats described in the standard, as well as the four rounding modes
required by it. However, due to Weitek's desire for extremely high-speed
operation, some of the finer points of IEEE-754 have not been implemented.
One of the most notable omissions is support for denormal numbers; denormals
are always flushed to zero on Weitek chips.
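To make 'flushed to zero' concrete, here is a small Python model of the
behavior for single precision. It is only a sketch of the effect described
above, not Weitek's implementation; the function names are invented for this
example.

```python
import math
import struct

FLT_MIN_NORMAL = 2.0 ** -126   # smallest positive normal single-precision number

def to_single(x: float) -> float:
    """Round a Python float (an IEEE double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def flush_to_zero(x: float) -> float:
    """Model of a unit without denormal support: denormals become signed zero."""
    s = to_single(x)
    if s != 0.0 and abs(s) < FLT_MIN_NORMAL:
        return math.copysign(0.0, s)
    return s

if __name__ == "__main__":
    tiny = 4.60943116855005e-41          # a single-precision denormal
    print(to_single(tiny))               # kept (gradual underflow) with denormal support
    print(flush_to_zero(tiny))           # replaced by zero without it
```

With denormal support the tiny value survives with reduced precision; with
flush-to-zero, every result below 2**-126 in magnitude is replaced by zero.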
|
|||
|
|
|||
|
The 387 clone manufacturers all claim 100% compatibility with Intel's 80387,
so one would reasonably expect the same accuracy from their chips as from
Intel's. For example, the packaging of the IIT 3C87 states that "...the
requirements of ANSI/IEEE standards are fulfilled and exceeded". Cyrix states
that its 83D87 complies fully with the IEEE-754 standard [12], and in fact
ships diagnostic software with its coprocessors that includes the program
IEEETEST. This program is based on the IEEE test vectors from the PhD thesis
of Dr. Jerome T. Coonen [9]. A test using the IEEE test vectors has also been
included in the RUNDIAG program on the Intel RapidCAD diagnostic disk. Rather
than performing random tests, the test vectors check specific cases that may
be hard to get right. Each test vector specifies the operation to be
performed, the operands, precision and rounding mode to be used, and the
result (including flags set) to be expected according to the IEEE-754
standard.
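The way such a test vector is checked can be sketched in a few lines of
Python. The record layout and the names `Vector` and `check` are hypothetical
and chosen only for illustration; the actual file format used by IEEETEST is
not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vector:
    operation: str             # e.g. "+", "/", "V" (square root)
    operands: tuple            # input values
    precision: str             # "S", "D" or "E"
    rounding: str              # "nearest", "down", "up" or "chop"
    expected_value: float      # result demanded by the standard
    expected_flags: frozenset  # exception flags that must be raised

def check(vector, value, flags):
    """Return (mark, failures): '.' for a pass, 'F' plus the failure kinds otherwise."""
    failures = []
    # Note: a real harness compares bit patterns so that NaNs and signed
    # zeros are handled; plain '!=' is enough for this sketch.
    if value != vector.expected_value:
        failures.append("numeric")
    if frozenset(flags) != vector.expected_flags:
        failures.append("flag")
    return ("F" if failures else ".", failures)

if __name__ == "__main__":
    v = Vector("/", (1.0, 0.0), "D", "nearest", float("inf"),
               frozenset({"divide by zero"}))
    print(check(v, float("inf"), {"divide by zero"}))   # correct value and flags
    print(check(v, float("inf"), set()))                # right value, flag missing
```

A result can therefore fail numerically, fail only in its flags, or both.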
|
|||
|
|
|||
|
I ran IEEETEST on all the available coprocessors/FPUs. The Intel 486, Intel
RapidCAD, Intel 387, Intel 387DX, Cyrix 83D87, and the Cyrix 387+ passed with
no errors. The ULSI 83C87 showed some minor flaws in the FCOM, FDIV, FMUL,
and FSCALE operations, getting flag errors in about 1% of the tested cases,
but no computational errors. However, for the IIT 3C87, the IEEETEST program
showed flag *and* some computational errors (that is, wrong results) for all
tested operations except FXTRACT and FCHS. The Intel 8087 and 80287 show
numerous errors, but this is not surprising, since they do not comply with
IEEE-754 but with an earlier draft of that standard, so they do some things
differently than required by the final version of the standard. In particular,
the Intel 8087/80287 do not feature the IEEE-754 compliant comparison (FUCOM)
and remainder (FPREM1) instructions available on the Intel 80387 and newer
coprocessors, so IEEETEST uses the non-compliant FCOM and FPREM instructions
on these processors. The lack of an IEEE-754 compliant comparison instruction
also causes a good deal of the errors in the 'Next After' test.

Since IEEETEST is written in Turbo Pascal, it was recompiled with the $E+
switch to enable use of the coprocessor emulator built into the TP 6.0
library. Using the emulator, IEEETEST aborted in the following tests with a
division by zero error: 'Comparison', 'Division', 'Next After'. These tests
were removed from the suite and the remaining tests were performed. The public
domain emulator EM87 could be tested, but hung in the last test, which checks
the implementation of the remainder operation. This problem occurred because
EM87 incorrectly identifies itself as a 387-type coprocessor when run on an
80386. This causes the 387-specific FUCOM instruction to be used in the
'Comparison' and 'Next After' tests and the FPREM1 instruction to be used in
the 'Remainder' test. Apparently EM87 is not able to emulate these
instructions and therefore crashes upon trying to execute them. It is
interesting to note that in all other tests the error profile of EM87 matches
exactly that of the Intel 80287, so it can be assumed that EM87 is a very good
emulation of the 80287 when run on the 80286. The Franke387 V2.4 emulator
hangs in the following tests performed by IEEETEST: 'Division',
'Multiplication', 'Scalb', 'Remainder'. The cause of these failures is
unknown.
|
|||
|
|
|||
|
|
|||
|
This explanatory text is printed at the start of the IEEETEST program:

   JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his activities
   as a member of the floating-point working group that defined the IEEE
   754-1985 Standard for Binary Floating-Point Arithmetic. Appendix C of
   his thesis presents FPTEST, a Pascal program written by J Thomas and JT
   Coonen. IEEETEST is a port of FPTEST and runs on PCs whose math
   coprocessor accepts 80387-compatible floating-point instructions.

   IEEETEST reads test vectors from the file TESTVECS and compares the
   answer returned by the math coprocessor with the answer listed in the
   test vector. If these answers differ an 'F' is displayed, otherwise a
   '.'is displayed. Answers can differ due to two types of failures:
   numeric failures or flag failures. Numeric failures occur when the
   computed answer has the wrong value. Flag failures occur when the status
   (invalid operation, divide by zero, underflow, overflow, inexact) is
   incorrectly identified.

   TESTVECS is the concatenation of unmodified versions of all the test
   vectors distributed by UC Berkeley. The test data base is copyrighted by
   UC Berkeley (1985) and is being distributed with their permission.
   FPTEST and the test data base can be obtained by asking for 'IEEE-754
   Test Vector' from UC Berkeley, Electrical Engineering and Computer
   Science, Industrial Liaison Program, 479 Corey Hall, Berkeley, CA, 94720
   (415)643-6687.

   The initial version of this test data base for the proposed IEEE 754
   binary floating-point standard (draft 8.0) was developed for Zilog, Inc.
   and was donated to the floating-point working group for dissemination.
   Errors in or additions to the distributed data base should be reported
   to the agency of distribution, with copies to Zilog, Inc., 1315 Dell
   Avenue, Campbell, CA, 95008.
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for Intel 80387, Intel 387DX (manufactured 91/49), Intel 486,
C&T 38700 (manufactured 92/19), Cyrix 83D87, Cyrix 387+ (manufactured 92/11),
and Intel RapidCAD (manufactured 92/05):
----------------------------------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    216      0 |    0    0    0 |    0    0    0
Addition           +   |   3528      0 |    0    0    0 |    0    0    0
Comparison         C   |   4320      0 |    0    0    0 |    0    0    0
Copy Sign          @   |   1488      0 |    0    0    0 |    0    0    0
Division           /   |   4311      0 |    0    0    0 |    0    0    0
Fraction Part      F   |    624      0 |    0    0    0 |    0    0    0
Logb               L   |    960      0 |    0    0    0 |    0    0    0
Multiplication     *   |   3978      0 |    0    0    0 |    0    0    0
Negation           -   |    216      0 |    0    0    0 |    0    0    0
Next After         N   |   2832      0 |    0    0    0 |    0    0    0
Round to Integer   I   |    558      0 |    0    0    0 |    0    0    0
Scalb              S   |    948      0 |    0    0    0 |    0    0    0
Square Root        V   |    744      0 |    0    0    0 |    0    0    0
Subtraction        -   |   3528      0 |    0    0    0 |    0    0    0
Remainder          %   |   2984      0 |    0    0    0 |    0    0    0
Totals                 |  31235      0 |
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for ULSI 83C87 (manufactured 91/48):
----------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    216      0 |    0    0    0 |    0    0    0
Addition           +   |   3528      0 |    0    0    0 |    0    0    0
Comparison         C   |   4312      8 |    0    0    0 |    0    0    8
Copy Sign          @   |   1488      0 |    0    0    0 |    0    0    0
Division           /   |   4250     61 |    0    0    0 |   28   28    5
Fraction Part      F   |    624      0 |    0    0    0 |    0    0    0
Logb               L   |    960      0 |    0    0    0 |    0    0    0
Multiplication     *   |   3936     42 |    0    0    0 |   19   19    4
Negation           -   |    216      0 |    0    0    0 |    0    0    0
Next After         N   |   2828      4 |    0    0    0 |    0    0    4
Round to Integer   I   |    558      0 |    0    0    0 |    0    0    0
Scalb              S   |    930     18 |    0    0    0 |    6    6    6
Square Root        V   |    744      0 |    0    0    0 |    0    0    0
Subtraction        -   |   3528      0 |    0    0    0 |    0    0    0
Remainder          %   |   2984      0 |    0    0    0 |    0    0    0
Totals                 |  31102    133 |
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for ULSI 83S87 (manufactured 92/17)
(data kindly supplied by Bengt Ask, f89ba@efd.lth.se):
------------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    216      0 |    0    0    0 |    0    0    0
Addition           +   |   3528      0 |    0    0    0 |    0    0    0
Comparison         C   |   4320      0 |    0    0    0 |    0    0    0
Copy Sign          @   |   1488      0 |    0    0    0 |    0    0    0
Division           /   |   4296     15 |    0    0    0 |    5    5    5
Fraction Part      F   |    624      0 |    0    0    0 |    0    0    0
Logb               L   |    960      0 |    0    0    0 |    0    0    0
Multiplication     *   |   3966     12 |    0    0    0 |    4    4    4
Negation           -   |    216      0 |    0    0    0 |    0    0    0
Next After         N   |   2828      4 |    0    0    0 |    0    0    4
Round to Integer   I   |    558      0 |    0    0    0 |    0    0    0
Scalb              S   |    930     18 |    0    0    0 |    6    6    6
Square Root        V   |    744      0 |    0    0    0 |    0    0    0
Subtraction        -   |   3528      0 |    0    0    0 |    0    0    0
Remainder          %   |   2984      0 |    0    0    0 |    0    0    0
Totals                 |  31186     49 |
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for IIT 3C87 (manufactured 92/20):
--------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    200     16 |    0    0   16 |    0    0    0
Addition           +   |   3336    192 |    0    0  128 |    0    0   96
Comparison         C   |   4224     96 |    0    0   96 |    0    0    0
Copy Sign          @   |   1488      0 |    0    0    0 |    0    0    0
Division           /   |   4159    152 |    0    0  124 |    0    0  116
Fraction Part      F   |    600     24 |    0    0   24 |    0    0   24
Logb               L   |    960      0 |    0    0    0 |    0    0    0
Multiplication     *   |   3702    276 |    0    0  248 |    0    0  100
Negation           -   |    200     16 |    0    0   16 |    0    0    0
Next After         N   |   2248    584 |    0    0  584 |    0    0  168
Round to Integer   I   |    542     16 |    0    0    4 |    0    0   16
Scalb              S   |    874     74 |    5    5   44 |    8    8   20
Square Root        V   |    688     56 |    0    0   56 |    0    0   56
Subtraction        -   |   3336    192 |    0    0  128 |    0    0   96
Remainder          %   |   2844    140 |    0    0  140 |    0    0  116
Totals                 |  29401   1834 |
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for Intel 80287 run with a 80386 CPU and Intel 8087:
--------------------------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    216      0 |    0    0    0 |    0    0    0
Addition           +   |   2886    642 |   16   16  112 |  174  174  174
Comparison         C   |   3612    708 |  136  136  136 |  228  228  228
Copy Sign          @   |   1488      0 |    0    0    0 |    0    0    0
Division           /   |   3777    534 |   18   18   37 |  169  169  165
Fraction Part      F   |    552     72 |   24   24   24 |   24   24   24
Logb               L   |    900     60 |   12   12   12 |   20   20   20
Multiplication     *   |   2944   1034 |  105  105  197 |  303  303  231
Negation           -   |    216      0 |    0    0    0 |    0    0    0
Next After         N   |    516   2316 |  168  168  332 |  764  764  764
Round to Integer   I   |    546     12 |    0    0    0 |    4    4    4
Scalb              S   |    663    285 |   45   43   26 |  102   98   46
Square Root        V   |    720     24 |    4    4    4 |    8    8    8
Subtraction        -   |   2886    642 |   16   16  112 |  174  174  174
Remainder          %   |   1490   1494 |  432  432  288 |  342  342  230
Totals                 |  23412   7823 |
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for EM87 coprocessor emulator run on an Intel 386 CPU:
----------------------------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    216      0 |    0    0    0 |    0    0    0
Addition           +   |   2886    642 |   16   16  112 |  174  174  174
Comparison         C   |      0   4320 | 1324 1324 1324 | 1332 1332 1332
Copy Sign          @   |   1488      0 |    0    0    0 |    0    0    0
Division           /   |   3777    534 |   18   18   37 |  169  169  165
Fraction Part      F   |    552     72 |   24   24   24 |   24   24   24
Logb               L   |    900     60 |   12   12   12 |   20   20   20
Multiplication     *   |   2944   1034 |  105  105  197 |  303  303  231
Negation           -   |    216      0 |    0    0    0 |    0    0    0
Next After         N   |    348   2484 |  768  768  768 |  504  504  526
Round to Integer   I   |    546     12 |    0    0    0 |    4    4    4
Scalb              S   |    663    285 |   45   43   26 |  102   98   46
Square Root        V   |    720     24 |    4    4    4 |    8    8    8
Subtraction        -   |   2886    642 |   16   16  112 |  174  174  174
Remainder          %   | ######## not run since machine hangs #######
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for Franke387 coprocessor emulator run on an Intel 386:
-----------------------------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    152     64 |    0    0    8 |   24   24    8
Addition           +   |   1587   1941 |  178  178  722 |  508  508  616
Comparison         C   |   3696    624 |  208  208  208 |    4    4  108
Copy Sign          @   |   1200    288 |    0    0    0 |  144  144    0
Division           /   | ######## not run since machine hangs #######
Fraction Part      F   |    624      0 |    0    0    0 |    0    0    0
Logb               L   |    908     52 |    0    0   16 |   16   16    4
Multiplication     *   | ######## not run since machine hangs #######
Negation           -   |    152     64 |    0    0    8 |   24   24    8
Next After         N   |   1404   1420 |  404  404  596 |   80   80  172
Round to Integer   I   |    514     44 |    4    4   20 |    8    8   16
Scalb              S   | ######## not run since machine hangs #######
Square Root        V   |    569    175 |   14   31   54 |   28   48   72
Subtraction        -   |   1827   1701 |   98   98  642 |  452  452  576
Remainder          %   | ######## not run since machine hangs #######
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for Q387 coprocessor emulator run on an Intel 386:
------------------------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    104    112 |   42   38   16 |   24   24    0
Addition           +   |    911   2617 |  746  637  637 |  672  672  380
Comparison         C   |   3180   1140 |  380  380  380 |  108  108  108
Copy Sign          @   |    696    792 |  320  280    0 |  288  288    0
Division           /   |    900   3411 |  673  574  814 |  977  977  821
Fraction Part      F   |    348    276 |  154   82   40 |   24   24   24
Logb               L   |    656    304 |  136  100   36 |   24   24   12
Multiplication     *   |   1023   2955 |  759  663  857 |  670  670  442
Negation           -   |     86    130 |   44   38   32 |   24   24    0
Next After         N   |    464   2368 |  780  780  796 |  344  344  320
Round to Integer   I   |    273    285 |   95   74   52 |   72   72   68
Scalb              S   |    254    694 |  217  192  137 |  176  168  136
Square Root        V   |    128    616 |  192  180  147 |  196  196  188
Subtraction        -   |    911   2617 |  746  637  637 |  672  672  372
Remainder          %   |    558   2426 |  903  859  664 |  508  508  220
Totals                 |  10492  20743 |
|
|||
|
|
|||
|
|
|||
|
IEEETEST output for TP 6.0 coprocessor emulator:
------------------------------------------------

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                       |     TESTS     | numeric  TYPE OF FAILURE   flag
Operation         Code | Passed Failed |    S    D    E |    S    D    E
------------------------------------------------------------------------
Absolute Value     A   |    168     48 |   16   16   16 |   16    8    0
Addition           +   |   1877   1651 |  294  290  336 |  496  456  416
Comparison         C   | ## not run - program aborts with div-by-0 ##
Copy Sign          @   |   1392     96 |   48   48    0 |   48    0    0
Division           /   | ## not run - program aborts with div-by-0 ##
Fraction Part      F   |    588     36 |   12    0   24 |    0    0    0
Logb               L   |    888     72 |   24   24   24 |   12   12   12
Multiplication     *   |   2148   1830 |  332  310  528 |  520  360  352
Negation           -   |    160     48 |   16   16   16 |   16    8    0
Next After         N   | ## not run - program aborts with div-by-0 ##
Round to Integer   I   |    318    240 |    0    0    4 |   80   80   80
Scalb              S   |    564    384 |  108  100   76 |  112   88   56
Square Root        V   |    180    564 |  143  157  169 |   72   72  128
Subtraction        -   |   1877   1651 |  294  290  336 |  496  456  416
Remainder          %   |   1072   1912 |  652  672  524 |  336  288  216
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Additional accuracy and compatibility tests
-------------------------------------------

To complement the checks done by IEEETEST, I also wrote the short programs
DENORMTS, RCTRL, PCTRL in Turbo Pascal 6.0 that test the following
coprocessor functions:

   1. support for denormals in all precisions (single, double, extended)
   2. support for the four IEEE rounding modes (up, down, nearest, chop)
   3. support for precision control

Note that passing all tests is required for IEEE conformance, as well as for
100% compatibility with Intel's coprocessors. Precision control forces the
results of the FADD, FSUB, FMUL, FDIV, and FSQRT instructions to be rounded
to the specified precision (single, double, double extended). This feature is
provided to obtain compatibility with certain programming languages [17]. By
specifying lower precision, one effectively nullifies the advantages of
extended precision intermediate results.

The IEEE-754 standard for floating-point arithmetic demands that processors
and floating-point packages that cannot store the result of operations
*directly* to single and double precision locations must provide precision
control. The programs that test precision control and rounding control are
designed to return a different result for each of the modes for the same
sequence of operations.
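The effect of the two control settings can be modelled with exact rational
arithmetic. The Python sketch below is not a translation of RCTRL or PCTRL
(their source is in appendix A); it only shows why one and the same value
comes out differently under each precision and rounding mode. The names
`round_to`, `MANTISSA_BITS` and `ROUNDERS` are invented for this example, and
the exponent range of the formats is ignored.

```python
import math
from fractions import Fraction

MANTISSA_BITS = {"single": 24, "double": 53, "extended": 64}

ROUNDERS = {
    "nearest": round,       # round() on a Fraction sends halfway cases to even
    "down": math.floor,     # toward minus infinity
    "up": math.ceil,        # toward plus infinity
    "chop": math.trunc,     # toward zero
}

def round_to(x: Fraction, precision: str, mode: str) -> Fraction:
    """Round the exact value x to the mantissa width of the given precision."""
    if x == 0:
        return x
    bits = MANTISSA_BITS[precision]
    mag = abs(x)
    # Find the exponent e with 2**e <= |x| < 2**(e+1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)     # weight of the last mantissa bit
    return ROUNDERS[mode](x / ulp) * ulp

if __name__ == "__main__":
    third = Fraction(1, 3)
    for precision in MANTISSA_BITS:
        print(precision, float(round_to(third, precision, "nearest")))
    for mode in ROUNDERS:
        print(mode, float(round_to(-third, "single", mode)))
```

For a negative operand, chop agrees with up and differs from down, which is
one way a test program can tell the four rounding modes apart.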
|
|||
|
|
|||
|
The source code of the programs can be found in appendix A. The Intel 8087
and 80287 were not tested with DENORMTS since Turbo Pascal does not support
extended precision denormals on 8087/80287 processors, so the denormal test
fails anyway. (The 8087 and 80287 pass the RCTRL and PCTRL tests without
error, however.)
|
|||
|
|
|||
|
|
|||
|
Test Results for the Intel 387, Intel 387DX, Intel 486, Intel RapidCAD,
Cyrix 83D87, Cyrix 387+, C&T 38700, and the EM87 emulator (on an 80386 system):
-------------------------------------------------------------------------------

Precision Control   SINGLE     1.13311278820037842E+0000
                    DOUBLE     1.23456789006442125E+0000
                    EXTENDED   1.23456789012337585E+0000

Rounding Control    NEAREST   -1.23427629010100635E+0100
                    DOWN      -1.23427623555772409E+0100
                    UP        -1.23457760966801097E+0100
                    CHOP      -1.23397493540770643E+0100

Denormal support

SINGLE denormals supported
SINGLE denormal prints as:    4.60943116855005E-0041
Denormal should be printed as 4.60943...E-0041

DOUBLE denormals supported
DOUBLE denormal prints as:    8.75000000000016E-0311
Denormal should be printed as 8.75...E-0311

EXTENDED denormals supported
EXTENDED denormal prints as:  1.31640625000000E-4934
Denormal should be printed as 1.3164...E-4934
|
|||
|
|
|||
|
|
|||
|
Results for the ULSI 83C87:
---------------------------

Precision Control   SINGLE     1.23456789012337585E+0000
                    DOUBLE     1.23456789012337585E+0000
                    EXTENDED   1.23456789012337585E+0000

Rounding Control    NEAREST   -1.23427629010100635E+0100
                    DOWN      -1.23427623555772409E+0100
                    UP        -1.23457760966801097E+0100
                    CHOP      -1.23397493540770643E+0100

Denormal support

SINGLE denormals supported
SINGLE denormal prints as:    4.60943116855005E-0041
Denormal should be printed as 4.60943...E-0041

DOUBLE denormals supported
DOUBLE denormal prints as:    8.75000000000016E-0311
Denormal should be printed as 8.75...E-0311

EXTENDED denormals supported
EXTENDED denormal prints as:  1.31640625000000E-4934
Denormal should be printed as 1.3164...E-4934
|
|||
|
|
|||
|
|
|||
|
Results for the IIT 3C87:
-------------------------

Precision Control   SINGLE     1.13311278820037842E+0000
                    DOUBLE     1.23456789006442125E+0000
                    EXTENDED   1.23456789012337585E+0000

Rounding Control    NEAREST   -1.23427629010100635E+0100
                    DOWN      -1.23427623555772409E+0100
                    UP        -1.23457760966801097E+0100
                    CHOP      -1.23397493540770643E+0100

Denormal support

SINGLE denormals supported
SINGLE denormal prints as:    4.60943116855005E-0041
Denormal should be printed as 4.60943...E-0041

DOUBLE denormals supported
DOUBLE denormal prints as:    8.75000000000016E-0311
Denormal should be printed as 8.75...E-0311

EXTENDED denormals not supported
|
|||
|
|
|||
|
|
|||
|
Results for the Turbo Pascal 6.0 coprocessor emulator:
------------------------------------------------------

Precision Control   SINGLE     1.23456789012351396E+0000
                    DOUBLE     1.23456789012351396E+0000
                    EXTENDED   1.23456789012351396E+0000

Rounding Control    NEAREST   -1.23457766383395931E+0100
                    DOWN      -1.23457766383395931E+0100
                    UP        -1.23457766383395931E+0100
                    CHOP      -1.23457766383395931E+0100

Denormal support

SINGLE denormals not supported
DOUBLE denormals not supported
EXTENDED denormals not supported
|
|||
|
|
|||
|
|
|||
|
Results for the Q387 coprocessor emulator:
------------------------------------------

Precision Control   SINGLE     1.23456789012337614E+0000
                    DOUBLE     1.23456789012337614E+0000
                    EXTENDED   1.23456789012337614E+0000

Rounding Control    NEAREST   -1.23427621117212139E+0100
                    DOWN      -1.23427621117212139E+0100
                    UP        -1.23427621117212139E+0100
                    CHOP      -1.23427621117212139E+0100

Denormal support

SINGLE denormals not supported
DOUBLE denormals not supported
EXTENDED denormals not supported
|
|||
|
|
|||
|
|
|||
|
The test results show that the IIT 3C87 does not conform to the IEEE-754
floating-point standard in that it does not support denormals in double
extended precision. The ULSI 83C87 does not conform to that standard in that
it does not support precision control, but uses double extended precision for
all operations. Neither the TP 6.0 emulator nor the Q387 emulator supports
precision control, rounding control, or denormals of any precision. In
addition, their basic arithmetic operations do not seem to conform to the
IEEE standard, as the results of the test programs differ from the results
computed by a coprocessor in any mode.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
================================================
Accuracy of transcendental function calculations
================================================

With regard to the accuracy of transcendental functions, Cyrix claims that
the relative error of the transcendental functions on its 83D87 coprocessor
never exceeds 0.5 ULP of the double extended format [13] (ULP = Unit in the
Last Place, the numeric weight of the least significant mantissa bit). This
means that the maximum relative error is below 2**-64, while Intel's published
error limit for the 80387 is 2**-62. While Intel uses a modified CORDIC
algorithm [18,19] to compute the transcendental functions, Cyrix uses
rational approximations that utilize their chip's very fast array multiplier.
(For an explanation of why this approach is superior to CORDIC with today's
technology, see [61].) Also, Cyrix uses an internal 75 bit data path for the
mantissa [15], so intermediate computations in the generation of
transcendental function values will enjoy some additional accuracy over the
64 bits provided by the double extended format. Using 75 mantissa bits also
provides an advantage over other coprocessors like the Intel 387DX and ULSI
83C87 which use only a 68 bit mantissa data path [58,59].

Note that a maximum relative error of 0.5 ULP for the Cyrix coprocessor does
not mean that it returns the 'exact' result (machine number closest to
infinitely precise result) all the time. Consider the case where the
infinitely precise result of a transcendental function falls nearly halfway
between two machine numbers. A relative error of 0.5 ULP can cause the result
to be either of the numbers after rounding, depending on the direction of the
error. But the 83D87 should deliver results that never differ from the
'exact' result by more than one ULP. Also note that the claim of relative
error being below 0.5 ULPs is slightly exaggerated; 0.6 ULPs would be a more
realistic error limit. Imagine that the infinitely precise result for some
argument to a transcendental function was xxx..xxx1001... (where the xxx..xxx
represent the first 64 bits of the result), but that the coprocessor computes
the result as xxx..xxx0111 and then rounds this down to xxx..xxx0000. Then the
relative error is (1001b-0b)/10000b = 0.5625 ULPs.
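The arithmetic of this example can be replayed exactly in Python. The sketch
below treats the four bits that follow the 64th mantissa bit as a fraction of
one ULP; the helper name `tail` is invented for illustration.

```python
from fractions import Fraction

def tail(bits: str) -> Fraction:
    """Value, in ULPs, of the bits that follow the last kept mantissa bit."""
    return Fraction(int(bits, 2), 2 ** len(bits))

true_tail = tail("1001")        # infinitely precise result: 9/16 ULP above xxx..xxx
computed_tail = tail("0111")    # what the coprocessor computed internally: 7/16 ULP
delivered = round(computed_tail)          # rounds down, result is xxx..xxx + 0 ULP
error = abs(true_tail - delivered)        # distance from the true result
exact = round(true_tail)                  # correctly rounded: xxx..xxx + 1 ULP

if __name__ == "__main__":
    print(f"error of delivered result: {float(error)} ULP")
    print(f"delivered differs from 'exact' by {exact - delivered} ULP")
```

The delivered result is off by more than 0.5 ULP from the true value, yet it
differs from the correctly rounded result by exactly one ULP, in line with
the bound stated above.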
|
|||
|
|
|||
|
I tested some of the transcendental functions of the Cyrix 387+ and found the
relative error to be always below 0.6 ULPs. Cyrix also claims that its
transcendental functions satisfy the monotonicity criterion [13]. None of the
competitors makes this claim, which does not mean that the transcendental
functions on the other 387-compatibles are not monotonic as well.
Monotonicity means that for all x1 > x2, it always follows that f(x1) >=
f(x2) for an increasing function like sin on [0..pi/4]. Likewise, for a
decreasing function like cos on [0..pi/4], for all x1 > x2, it follows that
f(x1) <= f(x2).
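A sampled check of the monotonicity criterion might look like the following
Python sketch. It uses the host's math library rather than a coprocessor
instruction, and the helper name `is_monotonic` is invented here. A check on
sample points can only expose a violation; it cannot prove monotonicity,
which would require walking through adjacent machine numbers.

```python
import math

def is_monotonic(f, xs, increasing=True):
    """Check f(x1) <= f(x2) (or >= for a decreasing f) on consecutive sample points."""
    ys = [f(x) for x in sorted(xs)]
    pairs = zip(ys, ys[1:])
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)

# 1001 evenly spaced sample points on [0, pi/4]
grid = [i * (math.pi / 4) / 1000 for i in range(1001)]

if __name__ == "__main__":
    print("sin increasing on [0, pi/4]:", is_monotonic(math.sin, grid))
    print("cos decreasing on [0, pi/4]:", is_monotonic(math.cos, grid, increasing=False))
```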
|
|||
|
|
|||
|
As previously noted, the Weitek Abacus 3167 and 4167 coprocessors implement
only the basic arithmetic operations (add, subtract, negate, multiply,
divide, square root) in hardware. Transcendental functions are performed via
a software library provided by Weitek. For these library functions Weitek
claims a maximum relative error of 5 ULPs [31,33]. This means that the last
three bits in the mantissa of a double-precision result can be wrong. Note
that the Intel 387 and compatible math coprocessors generate the
transcendental functions with a small relative error with regard to the
*double extended precision* format. Thus, when rounded to double precision,
their function values are nearly always 'exact'. The problem of 'double
rounding' prevents them from being 'exact' in 100% of all cases. 387-type
coprocessors in general have superior accuracy when compared with Weitek's
coprocessors.
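The 'double rounding' problem can be shown with one carefully chosen value.
The Python sketch below works with exact fractions and a hypothetical helper
`round_nearest`; it rounds the same number once directly to 53 bits (double
precision) and once via 64 bits (double extended), and the two answers
differ.

```python
from fractions import Fraction

def round_nearest(x: Fraction, bits: int) -> Fraction:
    """Round positive x to a mantissa of `bits` bits, halfway cases to even."""
    # Find the exponent e with 2**e <= x < 2**(e+1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)
    return round(x / ulp) * ulp      # round() on a Fraction rounds half to even

# Slightly above the midpoint between the doubles 1 and 1 + 2**-52.
x = 1 + Fraction(1, 2**53) + Fraction(1, 2**70)

direct = round_nearest(x, 53)                       # one rounding: goes up
twice = round_nearest(round_nearest(x, 64), 53)     # via extended: ends up at 1

if __name__ == "__main__":
    print("rounded once :", float(direct))
    print("rounded twice:", float(twice))
```

The first rounding to 64 bits discards the small excess over the midpoint, so
the second rounding sees an exact tie and resolves it to the even neighbor.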
|
|||
|
|
|||
|
The test diskette distributed with early versions of the Cyrix 83D87
contained a program (TRANCK) that checks the accuracy of the transcendental
functions in the coprocessor against a more precise software arithmetic [16].
I used this program to compare the accuracy of the transcendental functions
on those 287/387/486 coprocessors/FPUs available to me. As TRANCK will not
accept negative numbers as interval limits, I tested each function on an
interval along the positive x-axis. The functions tested were F2XM1 (2**x-1),
FSIN (sine), FCOS (cosine), FPTAN (tangent), FPATAN (arctangent), FYL2X (y *
log2 (x)), and FYL2XP1 (y * log2 (x+1)). These are all the transcendental
functions implemented on the 80387. Note that the square root (FSQRT) is
*not* a transcendental function. For each function, 100,000 arguments were
evaluated, with the arguments uniformly distributed within the interval
tested.

The EM87 emulator could not be checked with TRANCK, since the multiple
precision package in TRANCK would always return with an error message
immediately. However, the Franke387 emulator could be tested.
|
|||
|
|
|||
|
|
|||
|
In the test results below, the following statistics are detailed:

%wrong        is the percentage of results that differ from the 'exact'
              result (infinitely precise result rounded to 64 bits)
ULP_hi        is the number of results where the returned result was
              greater than the 'exact' (correctly rounded) result by
              one ULP (the numeric weight of the last mantissa bit,
              2**-63 to 2**-64 depending on the size of the number).
ULPs_hi       is the number of results where the returned result was
              greater than the 'exact' result by two or more ULPs.
ULP_lo        is the number of results where the returned result was
              smaller than the 'exact' (correctly rounded) result by
              one ULP (the numeric weight of the last mantissa bit,
              2**-63 to 2**-64 depending on the size of the number).
ULPs_lo       is the number of results where the returned result was
              smaller than the 'exact' result by two or more ULPs.
max ULP err   is the maximum deviation of a returned result from the
              'exact' answer expressed in ULPs.
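How these statistics relate to each other can be sketched in Python. The
function below is not TRANCK's code; it assumes a list of signed deviations
(returned result minus 'exact' result, in whole ULPs) and derives the six
figures from it. The name `ulp_statistics` is invented for illustration.

```python
def ulp_statistics(deviations):
    """Summarize signed ULP deviations (returned minus 'exact') of a test run."""
    total = len(deviations)
    wrong = sum(1 for d in deviations if d != 0)
    return {
        "%wrong": 100.0 * wrong / total,
        "ULP_hi": sum(1 for d in deviations if d == 1),
        "ULPs_hi": sum(1 for d in deviations if d >= 2),
        "ULP_lo": sum(1 for d in deviations if d == -1),
        "ULPs_lo": sum(1 for d in deviations if d <= -2),
        "max ULP err": max(abs(d) for d in deviations),
    }

if __name__ == "__main__":
    sample = [0, 0, 1, -1, -2, 0, 3, 0, 0, 0]   # made-up deviations, ten trials
    for name, value in ulp_statistics(sample).items():
        print(f"{name:12} {value}")
```

With 100,000 trials per function, %wrong is simply the sum of the four
counts divided by 1,000. For example, the Intel 80387 SIN figures
2467 + 0 + 26392 + 13 = 28872 correspond to 28.872%.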
|
|||
|
|
|||
|
Test results for accuracy of transcendental functions for double extended
precision as returned by the program TRANCK. 100,000 trials per function:

Franke387 V2.4 emulator
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4       39.042   25301      708   13029        4        2
COS     0,pi/4       75.714   49827    25887       0        0        3
TAN     0,pi/4       76.976   14230    10029   24323    28394        9
ATAN    0,1          55.826   26028     1529   24044     4225        4
2XM1    0,0.5        96.717       0        0   47910    48807        5
YL2XP1  0,sqrt(2)-1  93.007     578        9   27416    65004        8
YL2X    0.1,10       62.252   16817     4712   37082     3641     2953
|
|
|||
|
|
|||
|
Microsoft's coprocessor emulator
(part of MS-C and MS-Fortran libraries)
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
COS     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
TAN     0,pi/4       40.828   27764     1520   11445       99        2
ATAN    0,1          32.307   18893      485   12530      299        2
2XM1    0,0.5        52.163    8585      189   37745     5644        3
YL2XP1  0,sqrt(2)-1  88.801    4714      916   14239    68932       11
YL2X    0.1,10       36.598   13813     3272   13866     5647       11
|
|||
|
|
|||
|
|
|||
|
INTEL 8087, 80287
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
COS     0,pi/4          N/A     N/A      N/A     N/A      N/A      N/A
TAN     0,pi/4       37.001   18756      524   17405      316        2
ATAN    0,1           9.666    6065        0    3601        0        1
2XM1    0,0.5        19.920       0        0   19920        0        1
YL2XP1  0,sqrt(2)-1   7.780     868        0    6912        0        1
YL2X    0.1,10        1.287     723        0     564        0        1
|
|||
|
|
|||
|
|
|||
|
INTEL 80387
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4       28.872    2467        0   26392       13        2
COS     0,pi/4       27.213   27169       35       9        0        2
TAN     0,pi/4       10.532     441        0   10091        0        1
ATAN    0,1           7.088    2386        0    4691        1        2
2XM1    0,0.5        32.024       0        0   32024        0        1
YL2XP1  0,sqrt(2)-1  22.611    3461        0   19150        0        1
YL2X    0.1,10       13.020    6508        0    6512        0        1
|
|||
|
|
|||
|
|
|||
|
INTEL 387DX
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4       28.873    2467        0   26393       13        2
COS     0,pi/4       27.121   27090       22       9        0        2
TAN     0,pi/4       10.711     457        0   10254        0        1
ATAN    0,1           7.088    2386        0    4691        1        2
2XM1    0,0.5        32.024       0        0   32024        0        1
YL2XP1  0,sqrt(2)-1  22.611    3461        0   19150        0        1
YL2X    0.1,10       13.020    6508        0    6512        0        1
|
|||
|
|
|||
|
|
|||
|
ULSI 83C87
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4       35.530    4989        6   30238      297        2
COS     0,pi/4       43.989   11193      675   31393      728        2
TAN     0,pi/4       48.539   18880     1015   26349     2295        3
ATAN    0,1          20.858      62        0   20796        0        1
2XM1    0,0.5        21.257       4        0   21253        0        1
YL2XP1  0,sqrt(2)-1  27.893    9446        0   18213      234        2
YL2X    0.1,10       13.603    9816        0    3787        0        1
|
|||
|
|
|||
|
|
|||
|
IIT 3C87
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4       18.650   11171        0    7479        0        1
COS     0,pi/4        7.700    3024        0    4676        0        1
TAN     0,pi/4       20.973    9681        0   11291        1        2
ATAN    0,1          19.280   13186        0    6094        0        1
2XM1    0,0.5        25.660   17570        0    8090        0        1
YL2XP1  0,sqrt(2)-1  45.830   23503     1896   19654      777        3
YL2X    0.1,10       10.888    5638      357    4845       48        3
|
|||
|
|
|||
|
|
|||
|
C&T 38700DX
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4        1.821    1272        0     549        0        1
COS     0,pi/4       23.358   12458        0   10901        0        1
TAN     0,pi/4       17.178   10725        0    6453        0        1
ATAN    0,1           9.359    7082        0    2277        0        1
2XM1    0,0.5        15.188    3039        0   12149        0        1
YL2XP1  0,sqrt(2)-1  19.497   12109        0    7388        0        1
YL2X    0.1,10       46.868     261        0   46607        0        1
|
|||
|
|
|||
|
|
|||
|
CYRIX 83D87
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4        1.554    1015        0     539        0        1
COS     0,pi/4        0.925     143        0     782        0        1
TAN     0,pi/4        4.147     881        0    3266        0        1
ATAN    0,1           0.656     229        0     427        0        1
2XM1    0,0.5         2.628    1433        0    1194        0        1
YL2XP1  0,sqrt(2)-1   3.242     825        0    2417        0        1
YL2X    0.1,10        0.931     256        0     675        0        1
|
|||
|
|
|||
|
CYRIX 387+
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4        1.486     864        0     622        0        1
COS     0,pi/4        2.072      12        0    2060        0        1
TAN     0,pi/4        0.602      63        0     539        0        1
ATAN    0,1           0.384      12        0     372        0        1
2XM1    0,0.5         1.985      27        0    1958        0        1
YL2XP1  0,sqrt(2)-1   3.662    1705        0    1957        0        1
YL2X    0.1,10        0.764     367        0     397        0        1
|
|||
|
|
|||
|
|
|||
|
INTEL RapidCAD, Intel 486
                                                                 max
funct.  interval     %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err

SIN     0,pi/4       16.991    1517        0   15474        0        1
COS     0,pi/4        9.003    7603        0    1400        0        1
TAN     0,pi/4       10.532     441        0   10091        0        1
ATAN    0,1           7.078    2386        0    4691        1        2
2XM1    0,0.5        32.025       0        0   32025        0        1
YL2XP1  0,sqrt(2)-1  21.800     533        0   21267        0        1
YL2X    0.1,10        3.894    1879        0    2015        0        1
|
|||
|
|
|||
|
|
|||
|
Discussion of the transcendental function tests
|
|||
|
-----------------------------------------------
|
|||
|
|
|||
|
The test results above indicate that none of the 80x87 compatibles exceeds
Intel's stated error bound of 3 ULPs for the transcendental functions.
However, some coprocessors are more accurate than others. Rating the
coprocessors according to the accuracy of their transcendental functions
gives the following list (highest accuracy first): Cyrix 387+, Cyrix 83D87,
Intel 486, Intel RapidCAD, Intel 80287(!), C&T 38700DX, Intel 387DX, Intel
80387, IIT 3C87, ULSI 83C87. The tests also show that the problems with
excessive inaccuracy of the transcendental functions in early versions of the
IIT coprocessors, with errors of up to 8 ULPs [8], have been corrected.
(According to [56], certain problems with the FPATAN instruction on the IIT
3C87 occurring under the UNIX version of AutoCAD were corrected in June,
1990.)
|
|||
|
|
|||
|
Considering the coprocessor emulators, the Franke387 has acceptable accuracy
for the FSIN, FCOS, and FPATAN instructions, taking into consideration that,
according to its documentation, Franke387 uses only 64 bits of precision for
the intermediate results, while coprocessors typically use 68 bits or more.
However, the larger errors in the FPTAN, F2XM1, FYL2XP1, and especially the
FYL2X operations show that the emulator doesn't use state-of-the-art
algorithms, which keep the error to a very few ULPs even if no extra-precise
intermediate results are available. Microsoft's emulator, meanwhile,
provides transcendental functions with rather good accuracy, except for the
logarithmic operations, which contain some minor flaws. The Q387 emulator,
which came out only recently and is the fastest emulator available, could
unfortunately not be tested, since it caused TRANCK to abort with a GP
(general protection) fault for every input that I tried.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
======================================================
|
|||
|
Intel 387DX compatibility testing / The SMDIAG program
|
|||
|
======================================================
|
|||
|
|
|||
|
Chips and Technologies has included the program SMDIAG on the V1.0 diagnostic
disk distributed with its SuperMATH 38700DX coprocessor. Its stated purpose
is to test the compatibility of the computational results and flag settings
returned by the C&T coprocessor with the Intel 387DX. However, the tests for
the transcendental functions seem to have been tweaked to let the C&T 38700DX
pass, while coprocessors like the Intel RapidCAD and the Cyrix 83D87 fail.
Also, SMDIAG shows failure in the FSCALE test for the Intel RapidCAD, Cyrix
83D87, Cyrix 387+, and ULSI 83C87, even though they return the correct result
according to Intel's documentation for the Intel 387DX (Intel's second-
generation 387), which is indeed returned by the 387DX. (SMDIAG apparently
expects the result returned by the original Intel 80387.)
|
|||
|
|
|||
|
Note that chip manufacturers often make quiet bug fixes, so it wouldn't be
surprising if somebody else, using chips from different production runs of
the same manufacturer, came up with results different from the ones below.
The Intel 387 alone seems to have been produced in four different versions
that can be told apart by software, and Cyrix, ULSI, and IIT have
manufactured at least two versions each of their coprocessors. (The
coprocessors I tested have the following manufacturing dates stamped on
them. Intel 387DX: 91/49, C&T 38700DX: 92/19, Cyrix 387+: 92/11, Intel
RapidCAD: 92/05, ULSI 83C87: 91/48, IIT 3C87: 92/20.)
|
|||
|
|
|||
|
Results of running the SMDIAG program on 387-compatible coprocessors
|
|||
|
(p = passed, f = failed)
|
|||
|
|
|||
|
                  Intel     Intel  Intel  Cyrix  Cyrix  IIT    ULSI   C&T
Test              RapidCAD  387DX  80387  387+   83D87  3C87   83C87  38700

 1 (fstore)       f         p      p      p      f      f      f      p      ##,%%
 2 (fiall)        p         p      p      p      p      p      f      p
 3 (faddsub)      p         p      p      p      p      p      p      p
 4 (faddsub_nr)   p         p      p      p      f      f      f      p      %%
 5 (faddsub_cp)   p         p      p      p      f      f      f      p      %%
 6 (faddsub_dn)   p         p      p      p      f      f      f      p      %%
 7 (faddsub_up)   p         p      p      p      f      f      f      p      %%,&&
 8 (fmul)         p         p      p      p      p      f      f      p
 9 (fdivn)        p         p      p      p      p      p      p      p
10 (fdiv)         p         p      p      p      p      p      f      p
11 (fxch)         p         p      p      p      p      p      p      p
12 (fyl2x)        p         p      p      f      f      f      f      p      ++
13 (fyl2xp1)      f         p      p      f      f      f      f      p      ++
14 (fsqrt)        p         p      p      p      p      p      p      p
15 (fsincos)      f         p      p      f      f      f      f      p      ++
16 (fptan)        p         p      p      f      p      f      f      p      ++
17 (fpatan)       p         p      p      f      f      f      f      p      ++
18 (f2xm1)        p         p      p      f      f      f      f      p      ++
19 (fscale)       f         f      p      f      f      f      f      p      **
20 (fcom1)        p         p      p      p      p      f      f      p
21 (fprem)        p         p      p      p      p      p      p      p
22 (misc1)        p         p      p      p      p      f      f      p
23 (misc3)        p         p      p      p      p      p      p      p
24 (misc4)        p         p      p      p      f      f      p      p      %%

failed modules:   4         1      0      7      12     16     17     0
|
|||
|
|
|||
|
|
|||
|
## The failure of the Intel RapidCAD is caused by the fact that
   it stores the value of BCD INDEFINITE differently from the
   Intel 387DX. It uses FFFFC000000000000000, while the 387DX uses
   FFFF8000000000000000. However, both encodings are valid according
   to Intel's documentation, which defines the BCD INDEFINITE as
   FFFFUUUUUUUUUUUUUUUU, where U is undefined. So failure of the
   RapidCAD to deliver the same answer as the 387DX is not an
   "error", just a very slight incompatibility.

** The FSCALE errors reported for the Intel 387DX, Intel RapidCAD,
   Cyrix 83D87, Cyrix 387+, and ULSI 83C87 are due to a single
   'wrong' result each returned by one of the FSCALE computations.
   SMDIAG expects the result returned by the first generation
   Intel 80387 (and, of course, the C&T 38700DX). However, this
   result is wrong according to Intel's documentation and the
   behavior was corrected in the second generation Intel 387DX.
   Therefore, the Intel RapidCAD, Cyrix 83D87, Cyrix 387+, and ULSI
   83C87 return the correct result compatible with the Intel 387DX.

%% Failures reported for the Cyrix 83D87 are due to the fact that it
   converts pseudodenormals contained in its registers to normalized
   numbers upon storing them to memory with the FSTP TBYTE PTR
   instruction. Intel's processors store pseudodenormals without
   'normalizing' them. This is an incompatibility, but not an error,
   because both encodings will evaluate to the same value should
   they be reused in a calculation.

&& Two of the failures reported for the Cyrix 83D87 are actual
   errors where the Cyrix 83D87 fails to deliver the correct result.

   1) control word = 0A7F (closure=proj., round=up, precision=53bit)
      ST(0) = 0001 ABCEF9876542101
      ST(1) = 0001 800000000345FFF
      instruction: FSUBRP ST(1), ST
      result should be: 0000 2BCEF987650EC800, status word = 3A30
      83D87 returns:    0000 3BCEF987650EC000, status word = 3830

   2) control word = 0A7F (closure=proj., round=up, precision=53bit)
      ST(0) = 0001 ABCEF9876542101
      ST(1) = 0001 800000000000000
      instruction: FSUB ST, ST(1)
      result should be: 0000 2BCEF98765432800, status word = 3A30
      83D87 returns:    0000 3BCEF98765432000, status word = 3830

++ The failures for the test of transcendental functions are caused
   by the tested coprocessor returning results that differ from the
   ones returned by the Intel 387DX. On the Cyrix 83D87, Cyrix 387+,
   and Intel RapidCAD, this is simply due to the improved accuracy
   these coprocessors provide over the Intel 387DX. The failures of
   the IIT 3C87 and ULSI 83C87 are mainly due to the lesser accuracy
   in the transcendental functions of these coprocessors, but for
   the IIT 3C87 an additional source of failures is its inability to
   handle extended-precision denormals.
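
The statement in note %% that a pseudodenormal and its 'normalized' form
evaluate to the same value can be checked with exact arithmetic. The
following Python sketch assumes the usual layout of the 80-bit extended
format (15-bit exponent with bias 16383, 64-bit significand with an explicit
integer bit); the function name extended_value and the sample significand
are made up for this example.

```python
from fractions import Fraction

BIAS = 16383  # exponent bias of the 80-bit extended format

def extended_value(exp_field, significand):
    """Exact value of a positive extended number from its two fields."""
    # An exponent field of 0 (denormals and pseudodenormals) is
    # interpreted as if it were 1.
    e = exp_field if exp_field != 0 else 1
    return Fraction(significand) * Fraction(2) ** (e - BIAS - 63)

sig = 0x8000000000000001          # integer bit (bit 63) set, arbitrary low bit
pseudo = extended_value(0, sig)   # pseudodenormal: exponent field 0
normal = extended_value(1, sig)   # same significand with exponent field 1
print(pseudo == normal)
```

Under these assumptions the two encodings differ only in the stored exponent
field, which is why a byte-for-byte comparison of the stored results reports
a mismatch although no numerical error has occurred.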
|
|||
|
|
|||
|
|
|||
|
Another compatibility issue that has been discussed on Usenet is the behavior
of the math coprocessors under protected-mode operating systems. I have seen
postings claiming that coprocessors from ULSI, IIT, and Cyrix locked up the
machine when a protected-mode operating system (several UNIX derivatives were
also mentioned) was run on them. However, there have also been reports that
several 486-based systems also have this problem, while others do not.
Therefore, I think most of these problems are caused by poor motherboard
design, especially wrong handling of error interrupts coming from the
coprocessor. There could also be bugs in the exception handlers of the
operating system.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
==========
|
|||
|
References
|
|||
|
==========
|
|||
|
|
|||
|
[1] Schnurer, G.: Zahlenknacker im Vormarsch. c't 1992, Heft 4, Seiten 170-
|
|||
|
186
|
|||
|
|
|||
|
[2] Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark. Computer Journal,
|
|||
|
Vol. 19, No. 1, 1976, pp. 43-49
|
|||
|
|
|||
|
[3] Wichmann, B.A.: Validation code for the Whetstone benchmark. NPL Report
|
|||
|
DITC 107/88, National Physical Laboratory, UK, March 1988
|
|||
|
|
|||
|
[4] Curnow, H.J.: Whither Whetstone? The Synthetic Benchmark after 15 Years.
|
|||
|
In: Aad van der Steen (ed.): Evaluating Supercomputers. London: Chapman
|
|||
|
and Hall 1990
|
|||
|
|
|||
|
[5] Dongarra, J.J.: The Linpack Benchmark: An Explanation. In: Aad van der
|
|||
|
Steen (ed.): Evaluating Supercomputers. London: Chapman and Hall 1990
|
|||
|
[6] Dongarra, J.J.: Performance of Various Computers Using Standard Linear
|
|||
|
Equations Software. Report CS-89-85, Computer Science Department,
|
|||
|
University of Tennessee, March 11, 1992
|
|||
|
|
|||
|
[7] Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test. Design &
|
|||
|
Elektronik 1990, Heft 13, Seiten 105-110
|
|||
|
|
|||
|
[8] Ungerer, B.: Sockelfolger. c't 1990, Heft 4, Seiten 162-163
|
|||
|
|
|||
|
[9] Coonen, J.T.: Contributions to a Proposed Standard for Binary Floating-
|
|||
|
Point Arithmetic. Ph.D. thesis, University of California, Berkeley, 1984
|
|||
|
|
|||
|
[10] IEEE: IEEE Standard for Binary Floating-Point Arithmetic. SIGPLAN
|
|||
|
Notices, Vol. 22, No. 2, 1985, pp. 9-25
|
|||
|
|
|||
|
[11] IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-
|
|||
|
1985. New York, NY: Institute of Electrical and Electronics Engineers
|
|||
|
1985
|
|||
|
|
|||
|
[12] FasMath 83D87 Compatibility Report. Cyrix Corporation, Nov. 1989 Order
|
|||
|
No. B2004
|
|||
|
|
|||
|
[13] FasMath 83D87 Accuracy Report. Cyrix Corporation, July 1990 Order No.
|
|||
|
B2002
|
|||
|
|
|||
|
[14] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990 Order No.
|
|||
|
B2004
|
|||
|
|
|||
|
[15] FasMath 83D87 User's Manual. Cyrix Corporation, June 1990 Order No.
|
|||
|
L2001-003
|
|||
|
|
|||
|
[16] Brent, R.P.: A FORTRAN multiple-precision arithmetic package. ACM
|
|||
|
Transactions on Mathematical Software, Vol. 4, No. 1, March 1978, pp.
|
|||
|
57-70
|
|||
|
|
|||
|
[17] 387DX User's Manual, Programmer's Reference. Intel Corporation, 1989
|
|||
|
Order No. 231917-002
|
|||
|
|
|||
|
[18] Volder, J.E.: The CORDIC Trigonometric Computing Technique. IRE
|
|||
|
Transactions on Electronic Computers, Vol. EC-8, No. 5, September 1959,
|
|||
|
pp. 330-334
|
|||
|
|
|||
|
[19] Walther, J.S.: A unified algorithm for elementary functions. AFIPS
|
|||
|
Conference Proceedings, Vol. 38, SJCC 1971, pp. 379-385
|
|||
|
|
|||
|
[20] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der IBM 3090E
|
|||
|
mit Vektoreinrichtung. Arbeitsbericht RRZK-8803, Regionales
|
|||
|
Rechenzentrum an der Universität zu Köln, Februar 1988
|
|||
|
|
|||
|
[21] McMahon, H.H.: The Livermore Fortran Kernels: A test of the numerical
|
|||
|
performance range. Technical Report UCRL-53745, Lawrence Livermore
|
|||
|
National Laboratory, USA, December 1986
|
|||
|
|
|||
|
[22] Nave, R.: Implementation of Transcendental Functions on a Numerics
|
|||
|
Processor. Microprocessing and Microprogramming, Vol. 11, No. 3-4,
|
|||
|
March-April 1983, pp. 221-225
|
|||
|
|
|||
|
[23] Yuen, A.K.: Intel's Floating-Point Processors. Electro/88 Conference
|
|||
|
Record, Boston, MA, USA, 10-12 May 1988, pp. 48/5-1 - 48/5-7
|
|||
|
|
|||
|
[24] Stiller, A.; Ungerer, B.: Ausgerechnet. c't 1990, Heft 1, Seiten 90-92
|
|||
|
|
|||
|
[25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
|
|||
|
1991, Seiten 214-237
|
|||
|
[26] Intel 80286 Hardware Reference Manual. Intel Corporation, 1987 Order
|
|||
|
No.210760-002
|
|||
|
|
|||
|
[27] AMD 80C287 80-bit CMOS Numeric Processor. Advanced Micro Devices, June
|
|||
|
1989 Order No. 11671B/0
|
|||
|
|
|||
|
[28] Intel RapidCAD(tm) Engineering CoProcessor Performance Brief. Intel
|
|||
|
Corporation, 1992
|
|||
|
|
|||
|
[29] i486(tm) Microprocessor Performance Report. Intel Corporation, April
|
|||
|
1990 Order No. 240734-001
|
|||
|
|
|||
|
[30] Intel486(tm) DX2 Microprocessor Performance Brief. Intel Corporation,
|
|||
|
March 1992 Order No. 241254-001
|
|||
|
|
|||
|
[31] Abacus 3167 Floating-Point Coprocessor Data Book. Weitek Corporation,
|
|||
|
July 1990 DOC No. 9030
|
|||
|
|
|||
|
[32] WTL 4167 Floating-Point Coprocessor Data Book. Weitek Corporation, July
|
|||
|
1989 DOC No. 8943
|
|||
|
|
|||
|
[33] Abacus Software Designer's Guide. Weitek Corporation, September 1989 DOC
|
|||
|
No. 8967
|
|||
|
|
|||
|
[34] Stiller, A.: Cache & Carry. c't 1992, Heft 6, Seiten 118-130
|
|||
|
|
|||
|
[35] Stiller, A.: Cache & Carry, Teil 2. c't 1992, Heft 7, Seiten 28-34
|
|||
|
|
|||
|
[36] Palmer, J.F.; Morse, S.P.: Die mathematischen Grundlagen der Numerik-
|
|||
|
Prozessoren 8087/80287. München: tewi 1985
|
|||
|
|
|||
|
[37] 80C187 80-bit Math Coprocessor Data Sheet. Intel Corporation, September
|
|||
|
1989 Order No. 270640-003
|
|||
|
|
|||
|
[38] IIT-2C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
|
|||
|
|
|||
|
[39] Engineering note 4x4 matrix multiply transformation. IIT, 1989
|
|||
|
|
|||
|
[40] Tscheuschner, E.: 4 mal 4 auf einen Streich. c't 1990, Heft 3, Seiten
|
|||
|
266-276
|
|||
|
|
|||
|
[41] Goldberg, D.: Computer Arithmetic. In: Hennessy, J.L.; Patterson, D.A.:
|
|||
|
Computer Architecture A Quantitative Approach. San Mateo, CA: Morgan
|
|||
|
Kaufmann 1990
|
|||
|
|
|||
|
[42] 8087 Math Coprocessor Data Sheet. Intel Corporation, October 1989, Order
|
|||
|
No. 205835-007
|
|||
|
|
|||
|
[43] 8086/8088 User's Manual, Programmer's and Hardware Reference. Intel
|
|||
|
Corporation, 1989 Order No. 240487-001
|
|||
|
|
|||
|
[44] 80286 and 80287 Programmer's Reference Manual. Intel Corporation, 1987
|
|||
|
Order No. 210498-005
|
|||
|
|
|||
|
[45] 80287XL/XLT CHMOS III Math Coprocessor Data Sheet. Intel Corporation,
|
|||
|
May 1990 Order No. 290376-001
|
|||
|
|
|||
|
[46] Cyrix FasMath(tm) 82S87 Coprocessor Data Sheet. Cyrix Corporation, 1991
|
|||
|
Document 94018-00 Rev. 1.0
|
|||
|
|
|||
|
[47] IIT-3C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
|
|||
|
|
|||
|
[48] 486(tm)SX(tm) Microprocessor/ 487(tm)SX(tm) Math CoProcessor Data Sheet.
|
|||
|
Intel Corporation, April 1991. Order No. 240950-001
|
|||
|
|
|||
|
[49] Schnurer, G.: Die große Verlade. c't 1991, Heft 7, Seiten 55-57
|
|||
|
|
|||
|
[50] Schnurer, G.: Eine 4 für alle. c't 1991, Heft 6, Seite 25
|
|||
|
|
|||
|
[51] Intel486(tm)DX Microprocessor Data Book. Intel Corporation, June 1991
|
|||
|
Order No. 240440-004
|
|||
|
|
|||
|
[52] i486(tm) Microprocessor Hardware Reference Manual. Intel Corporation,
|
|||
|
1990 Order No. 240552-001
|
|||
|
|
|||
|
[53] i486(tm) Microprocessor Programmer's Reference Manual. Intel
|
|||
|
Corporation, 1990 Order No. 240486-001
|
|||
|
|
|||
|
[54] Ungerer, B.: Kalte Hüte. c't 1992, Heft 8, Seiten 140-144
|
|||
|
|
|||
|
[55] Ungerer, B.: Heiße Sache. c't 1991, Heft 4, Seiten 104-108
|
|||
|
|
|||
|
[56] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
|
|||
|
1991, Seiten 214-237
|
|||
|
|
|||
|
[57] Niederkrüger, W.: Lebendige Vergangenheit. c't 1990, Heft 12, Seiten
|
|||
|
114-116
|
|||
|
|
|||
|
[58] ULSI Math*Co Advanced Math Coprocessor Technical Specification. ULSI
|
|||
|
System, 5/92, Rev. E
|
|||
|
|
|||
|
[59] 387(tm)DX Math CoProcessor Data Sheet. Intel Corporation, September
|
|||
|
1990. Order No. 240448-003
|
|||
|
|
|||
|
[60] 387(tm) Numerics Coprocessor Extension Data Sheet. Intel Corporation,
|
|||
|
February 1989. Order No. 231920-005
|
|||
|
|
|||
|
[61] Koren, I.; Zinaty, O.: Evaluating Elementary Functions in a Numerical
|
|||
|
Coprocessor Based on Rational Approximations. IEEE Transactions on
|
|||
|
Computers, Vol. C-39, No. 8, August 1990, pp. 1030-1037
|
|||
|
|
|||
|
[62] 387(tm) SX Math CoProcessor Data Sheet. Intel Corporation, November 1989
|
|||
|
Order No. 240225-005
|
|||
|
|
|||
|
[63] Frenkel, G.: Coprocessors Speed Numeric Operations. PC-Week, August 27,
|
|||
|
1990
|
|||
|
|
|||
|
[64] Schnurer, G.; Stiller, A.: Auto-Matt. c't 1991, Heft 10, Seiten 94-96
|
|||
|
|
|||
|
[65] Grehan, R.: FPU Face-Off. Byte, November 1990, pp. 194-200
|
|||
|
|
|||
|
[66] Tang, P.T.P.: Testing Computer Arithmetic by Elementary Number Theory.
|
|||
|
Preprint MCS-P84-0889, Mathematics and Computer Science Division,
|
|||
|
Argonne National Laboratory, August 1989
|
|||
|
|
|||
|
[67] Ferguson, W.E.: Selecting math coprocessors. IEEE Spectrum, July 1991,
|
|||
|
pp. 38-41
|
|||
|
|
|||
|
[68] Schnabel, J.: Viermal 387. Computer Persönlich 1991, Heft 22, Seiten
|
|||
|
153-156
|
|||
|
|
|||
|
[69] Hofmann, J.: Starke Rechenknechte. mc 1990, Heft 7, Seiten 64-67
|
|||
|
|
|||
|
[70] Woerrlein, H.; Hinnenberg, R.: Die Lust an der Power. Computer Live
|
|||
|
1991, Heft 10, Seiten 138-149
|
|||
|
|
|||
|
[71] email from Peter Forsberg (peterf@vnet.ibm.com), email from Alan Brown
|
|||
|
(abrown@Reston.ICL.COM)
|
|||
|
|
|||
|
[72] email from Eric Johnson (johnsone%camax01@uunet.UU.NET), email from
|
|||
|
Jerry Whelan (guru@stasi.bradley.edu), email from Arto Viitanen
|
|||
|
(av@cs.uta.fi), email from Richard Krehbiel (richk@grebyn.com)
|
|||
|
|
|||
|
[73] email from Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM)
|
|||
|
|
|||
|
[74] correspondence with Bengt Ask (f89ba@efd.lth.se)
|
|||
|
|
|||
|
[75] email from Thomas Hoberg (tmh@prosun.first.gmd.de)
|
|||
|
|
|||
|
[76] Microsoft Macro Assembler Programmer's Guide Version 6.0, Microsoft
|
|||
|
Corporation, 1991. Document No. LN06556-0291
|
|||
|
|
|||
|
[77] FasMath EMC87 User's Manual, Rev. 2. Cyrix Corporation, February 1991
|
|||
|
Order No. 90018-00
|
|||
|
|
|||
|
[78] Persson, C.: Die 32-Bit-Parade. c't 1992, Heft 9, Seiten 150-156
|
|||
|
|
|||
|
[79] email from Duncan Murdoch (dmurdoch@mast.QueensU.CA)
|
|||
|
|
|||
|
|
|||
|
|
|||
|
========================
|
|||
|
Manufacturers' addresses
|
|||
|
========================
|
|||
|
|
|||
|
Intel Corporation
|
|||
|
3065 Bowers Avenue
|
|||
|
Santa Clara, CA 95051
|
|||
|
USA
|
|||
|
|
|||
|
IIT Integrated Information Technology, Inc.
|
|||
|
2540 Mission College Blvd.
|
|||
|
Santa Clara, CA 95054
|
|||
|
USA
|
|||
|
|
|||
|
ULSI Systems, Inc.
|
|||
|
58 Daggett Drive
|
|||
|
San Jose, CA 95134
|
|||
|
USA
|
|||
|
|
|||
|
Chips & Technologies, Inc.
|
|||
|
3050 Zanker Road
|
|||
|
San Jose, CA 95134
|
|||
|
USA
|
|||
|
|
|||
|
Weitek Corporation
|
|||
|
1060 East Arques Avenue
|
|||
|
Sunnyvale, CA 94086
|
|||
|
USA
|
|||
|
|
|||
|
AMD Advanced Micro Devices, Inc.
|
|||
|
901 Thompson Place
|
|||
|
P.O.B. 3453
|
|||
|
Sunnyvale, CA 94088-3453
|
|||
|
USA
|
|||
|
|
|||
|
Cyrix Corporation
|
|||
|
P.O.B. 850118
|
|||
|
Richardson, TX 75085
|
|||
|
USA
|
|||
|
|
|||
|
|
|||
|
|
|||
|
===============================
|
|||
|
Appendix A: Test program source
|
|||
|
===============================
|
|||
|
|
|||
|
{$N+,E+}
PROGRAM PCtrl;

VAR B,c: EXTENDED;
    Precision, L: WORD;

PROCEDURE SetPrecisionControl (Precision: WORD);
(* This procedure sets the internal precision of the NDP. Available *)
(* precision values: 0 - 24 bits (SINGLE)                           *)
(*                   1 - n.a. (mapped to single)                    *)
(*                   2 - 53 bits (DOUBLE)                           *)
(*                   3 - 64 bits (EXTENDED)                         *)

VAR CtrlWord: WORD;

BEGIN {SetPrecisionControl}
   IF Precision = 1 THEN
      Precision := 0;
   Precision := Precision SHL 8; { make mask for PC field in ctrl word}
   ASM
      FSTCW   [CtrlWord]         { store NDP control word             }
      MOV     AX, [CtrlWord]     { load control word into CPU         }
      AND     AX, 0FCFFh         { mask out precision control field   }
      OR      AX, [Precision]    { set desired precision in PC field  }
      MOV     [CtrlWord], AX     { store new control word             }
      FLDCW   [CtrlWord]         { set new precision control in NDP   }
   END;
END; {SetPrecisionControl}

BEGIN {main}
   FOR Precision := 1 TO 3 DO BEGIN
      B := 1.2345678901234567890;
      SetPrecisionControl (Precision);
      FOR L := 1 TO 20 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 20 DO BEGIN
         B := B*B;
      END;
      SetPrecisionControl (3);   { full precision for printout }
      WriteLn (Precision, B:28);
   END;
END.
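
The bit manipulation performed by SetPrecisionControl can be followed with a
small decoder. The Python sketch below is an illustration only; the helper
names decode_control_word and set_precision are made up, and the starting
value 037Fh (the control word a 387 is assumed to hold after FINIT) is an
assumption for the example. It treats bits 8-9 of the control word as the
precision-control field and bits 10-11 as the rounding-control field.

```python
PC_NAMES = {0: "24 bits (SINGLE)", 1: "reserved",
            2: "53 bits (DOUBLE)", 3: "64 bits (EXTENDED)"}
RC_NAMES = {0: "nearest", 1: "down", 2: "up", 3: "chop"}

def decode_control_word(cw):
    """Return the precision and rounding settings encoded in cw."""
    pc = (cw >> 8) & 3     # precision control, bits 8-9
    rc = (cw >> 10) & 3    # rounding control, bits 10-11
    return PC_NAMES[pc], RC_NAMES[rc]

def set_precision(cw, precision):
    """Mirror the AND/OR sequence: clear bits 8-9, then OR in the field."""
    if precision == 1:     # value 1 is not available, map it to single
        precision = 0
    return (cw & 0xFCFF) | (precision << 8)

cw = 0x037F                # assumed control word after FINIT
for precision in (1, 2, 3):
    new_cw = set_precision(cw, precision)
    print(precision, format(new_cw, "04X"), decode_control_word(new_cw))

# The control word 0A7F quoted in the SMDIAG notes decodes the same way.
print(decode_control_word(0x0A7F))
```

The mask 0FCFFh clears exactly the two precision-control bits and leaves the
exception masks and the rounding mode untouched, which is why the procedure
can be called repeatedly without disturbing the other settings.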
|
|||
|
|
|||
|
|
|||
|
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
|||
|
|
|||
|
{$N+,E+}
PROGRAM RCtrl;

VAR B,c: EXTENDED;
    RoundingMode, L: WORD;


PROCEDURE SetRoundingMode (RCMode: WORD);
(* This procedure selects one of four available rounding modes *)
(* 0 - Round to nearest (default)                               *)
(* 1 - Round down (towards negative infinity)                   *)
(* 2 - Round up (towards positive infinity)                     *)
(* 3 - Chop (truncate, round towards zero)                      *)

VAR CtrlWord: WORD;

BEGIN
   RCMode := RCMode SHL 10;      { make mask for RC field in control word}
   ASM
      FSTCW   [CtrlWord]         { store NDP control word               }
      MOV     AX, [CtrlWord]     { load control word into CPU           }
      AND     AX, 0F3FFh         { mask out rounding control field      }
      OR      AX, [RCMode]       { set desired rounding mode in RC field}
      MOV     [CtrlWord], AX     { store new control word               }
      FLDCW   [CtrlWord]         { set new rounding control in NDP      }
   END;
END;

BEGIN
   FOR RoundingMode := 0 TO 3 DO BEGIN
      B := 1.2345678901234567890e100;
      SetRoundingMode (RoundingMode);
      FOR L := 1 TO 51 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 51 DO BEGIN
         B := -B*B;
      END;
      SetRoundingMode (0);       { round to nearest for printout }
      WriteLn (RoundingMode, B:28);
   END;
END.
|
|||
|
|
|||
|
|
|||
|
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
|||
|
|
|||
|
{$N+,E+}

PROGRAM DenormTs;

VAR E: EXTENDED;
    D: DOUBLE;
    S: SINGLE;

BEGIN
   WriteLn ('Testing support and printing of denormals');
   WriteLn;
   Write ('Coprocessor is: ');
   CASE Test8087 OF
      0: WriteLn ('Emulator');
      1: WriteLn ('8087 or compatible');
      2: WriteLn ('80287 or compatible');
      3: WriteLn ('80387 or compatible');
   END;
   WriteLn;
   S := 1.18e-38;
   S := S * 3.90625e-3;
   IF S = 0 THEN
      WriteLn ('SINGLE denormals not supported')
   ELSE BEGIN
      WriteLn ('SINGLE denormals supported');
      WriteLn ('SINGLE denormal prints as: ', S);
      WriteLn ('Denormal should be printed as 4.60943...E-0041');
   END;
   WriteLn;
   D := 2.24e-308;
   D := D * 3.90625e-3;
   IF D = 0 THEN
      WriteLn ('DOUBLE denormals not supported')
   ELSE BEGIN
      WriteLn ('DOUBLE denormals supported');
      WriteLn ('DOUBLE denormal prints as: ', D);
      WriteLn ('Denormal should be printed as 8.75...E-0311');
   END;
   WriteLn;
   E := 3.37e-4932;
   E := E * 3.90625e-3;
   IF E = 0 THEN
      WriteLn ('EXTENDED denormals not supported')
   ELSE BEGIN
      WriteLn ('EXTENDED denormals supported');
      WriteLn ('EXTENDED denormal prints as: ', E);
      WriteLn ('Denormal should be printed as 1.3164...E-4934');
   END;
END.
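
The idea of the test (start just above the smallest normalized number,
multiply by 2**-8, and see whether the result is flushed to zero) can be
reproduced on any machine with IEEE arithmetic. The Python sketch below is
an illustration under stated assumptions, not a port of DenormTs: the helper
to_single is made up for this example and emulates SINGLE by round-tripping
through the struct module, and the EXTENDED case is left out because Python
has no 80-bit type.

```python
import struct
import sys

def to_single(x):
    """Round a Python float (IEEE double) to IEEE single and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

SINGLE_MIN_NORMAL = 2.0 ** -126   # smallest normalized SINGLE

# 3.90625e-3 is exactly 2**-8, so the only rounding that can occur
# is the one that produces the denormal itself.
s = to_single(to_single(1.18e-38) * 3.90625e-3)
d = 2.24e-308 * 3.90625e-3

print("SINGLE denormal supported:", 0.0 < s < SINGLE_MIN_NORMAL)
print("SINGLE denormal value:    ", s)
print("DOUBLE denormal supported:", 0.0 < d < sys.float_info.min)
print("DOUBLE denormal value:    ", d)
```

A result of zero in either case would correspond to the 'denormals not
supported' branch of the Pascal program.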
|
|||
|
|
|||
|
|
|||
|
|
|||
|
====================================
|
|||
|
Appendix B: Benchmark program source
|
|||
|
====================================
|
|||
|
|
|||
|
|
|||
|
; FILE: APFELM4.ASM
; assemble with MASM /e APFELM4 or TASM /e APFELM4


CODE        SEGMENT BYTE PUBLIC 'CODE'
            ASSUME  CS: CODE

            PAGE    ,120

            PUBLIC  APPLE87;

APPLE87     PROC    NEAR
            PUSH    BP                     ; save caller's base pointer
            MOV     BP, SP                 ; make new frame pointer
            PUSH    DS                     ; save caller's data segment
            PUSH    SI                     ; save register
            PUSH    DI                     ;  variables
            LDS     BX, [BP+04]            ; pointer to parameter record
            FINIT                          ; init 80x87           FSP->R0
            FILD    WORD PTR  [BX+02]      ; maxrad               FSP->R7
            FLD     QWORD PTR [BX+08]      ; qmax                 FSP->R6
            FSUB    QWORD PTR [BX+16]      ; qmax-qmin            FSP->R6
            DEC     WORD PTR  [BX+04]      ; ymax-1
            FIDIV   WORD PTR  [BX+04]      ; (qmax-qmin)/(ymax-1) FSP->R6
            FSTP    QWORD PTR [BX+16]      ; save delta_q         FSP->R7
            FLD     QWORD PTR [BX+24]      ; pmax                 FSP->R6
            FSUB    QWORD PTR [BX+32]      ; pmax-pmin            FSP->R6
            DEC     WORD PTR  [BX+06]      ; xmax-1
            FIDIV   WORD PTR  [BX+06]      ; delta_p              FSP->R6
            MOV     AX, [BX]               ; save maxiter,[BX] needed for
            MOV     [BX+2], AX             ;  80x87 status now
            XOR     BP, BP                 ; y=0
            FLD     QWORD PTR [BX+08]      ; qmax                 FSP->R5
            CMP     WORD PTR  [BX+40], 0   ; fast mode on 8087 desired ?
            JE      yloop                  ; no, normal mode
            FSTCW   [BX]                   ; save NDP control word
            AND     WORD PTR  [BX], 0FCFFh ; set PCTRL = single-precision
            FLDCW   [BX]                   ; get back NDP control word
yloop:      XOR     DI, DI                 ; x=0
            FLD     QWORD PTR [BX+32]      ; pmin                 FSP->R4
xloop:      FLDZ                           ; j**2= 0              FSP->R3
            FLDZ                           ; 2ij = 0              FSP->R2
            FLDZ                           ; i**2= 0              FSP->R1
            MOV     CX, [BX+2]             ; maxiter
            MOV     DL, 41h                ; mask for C0 and C3 cond.bits
iteration:  FSUB    ST, ST(2)              ; i**2-j**2            FSP->R1
            FADD    ST, ST(3)              ; i**2-j**2+p = i      FSP->R1
            FLD     ST(0)                  ; duplicate i          FSP->R0
            FMUL    ST(1), ST              ; i**2                 FSP->R0
            FADD    ST, ST(0)              ; 2i                   FSP->R0
            FXCH    ST(2)                  ; 2*i*j                FSP->R0
            FADD    ST, ST(5)              ; 2*i*j+q = j          FSP->R0
            FMUL    ST(2), ST              ; 2*i*j                FSP->R0
            FMUL    ST, ST(0)              ; j**2                 FSP->R0
            FST     ST(3)                  ; save j**2            FSP->R0
            FADD    ST, ST(1)              ; i**2+j**2            FSP->R0
            FCOMP   ST(7)                  ; i**2+j**2 > maxrad?  FSP->R1
            FSTSW   [BX]                   ; save 80x87 cond.code FSP->R1
            TEST    BYTE PTR  [BX+1], DL   ; test carry and zero flags
            LOOPNZ  iteration              ; until maxiter if not diverg.
            MOV     DX, CX                 ; number of loops executed
            NEG     CX                     ; carry set if CX <> 0
            ADC     DX, 0                  ; adjust DX if no. of loops<>0

; plot point here (DI = X, BP = y, DX has the color)

            FSTP    ST(0)                  ; pop i**2             FSP->R2
            FSTP    ST(0)                  ; pop 2ij              FSP->R3
            FSTP    ST(0)                  ; pop j**2             FSP->R4
            FADD    ST,ST(2)               ; p=p+delta_p          FSP->R4
            INC     DI                     ; x:=x+1
            CMP     DI, [BX+6]             ; x > xmax ?
            JBE     xloop                  ; no, continue on same line
            FSTP    ST(0)                  ; pop p                FSP->R5
            FSUB    QWORD PTR [BX+16]      ; q=q-delta_q          FSP->R5
            INC     BP                     ; y:=y+1
            CMP     BP, [BX+4]             ; y > ymax ?
            JBE     yloop                  ; no, picture not done yet

groesser:   POP     DI                     ; restore
            POP     SI                     ;  register variables
            POP     DS                     ; restore caller's data segm.
            POP     BP                     ; restore caller's base pointer
            RET     4                      ; pop parameters and return
APPLE87     ENDP

CODE        ENDS

            END
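
For readers who do not want to trace the register stack, the inner loop of
APPLE87 can be written down in a high-level language. The Python sketch
below is an illustration only: the function name iterations is made up, and
it returns the number of iterations executed, whereas the assembler routine
derives the color from the remaining loop counter in CX. Like the assembler
code, it carries i**2, 2ij and j**2 from one iteration to the next and
compares i**2+j**2 directly against maxrad.

```python
def iterations(p, q, maxiter, maxrad):
    """Count iterations until i**2 + j**2 exceeds maxrad (at most maxiter)."""
    i_sq = two_ij = j_sq = 0.0        # the three FLDZ instructions
    for n in range(1, maxiter + 1):
        i = i_sq - j_sq + p           # FSUB ST, ST(2) / FADD ST, ST(3)
        j = two_ij + q                # FADD ST, ST(5)
        i_sq = i * i                  # FMUL ST(1), ST
        two_ij = 2.0 * i * j          # FADD ST, ST(0) / FMUL ST(2), ST
        j_sq = j * j                  # FMUL ST, ST(0)
        if i_sq + j_sq > maxrad:      # FCOMP ST(7), then test C0 and C3
            return n
    return maxiter

print(iterations(0.0, 0.0, 50, 4.0))  # origin never exceeds the bound
print(iterations(1.0, 0.0, 50, 4.0))  # leaves the bound after a few steps
```

Keeping the three products between iterations means that each pass needs
only three multiplications, which matches the three FMUL instructions in the
loop.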
|
|||
|
|
|||
|
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
UNIT Time;

INTERFACE

FUNCTION Clock: LONGINT;        { same as VMS; time in milliseconds }


IMPLEMENTATION

FUNCTION Clock: LONGINT; ASSEMBLER;
ASM
            PUSH    DS              { save caller's data segment }
            XOR     DX, DX          { initialize data segment to }
            MOV     DS, DX          { access ticker counter }
            MOV     BX, 46Ch        { offset of ticker counter in segm. }
            MOV     DX, 43h         { timer chip control port }
            MOV     AL, 4           { freeze timer 0 }
            PUSHF                   { save caller's int flag setting }
            STI                     { allow update of ticker counter }
            LES     DI, DS:[BX]     { read BIOS ticker counter }
            OUT     DX, AL          { latch timer 0 }
            LDS     SI, DS:[BX]     { read BIOS ticker counter }
            IN      AL, 40h         { read latched timer 0 lo-byte }
            MOV     AH, AL          { save lo-byte }
            IN      AL, 40h         { read latched timer 0 hi-byte }
            POPF                    { restore caller's int flag }
            XCHG    AL, AH          { correct order of hi and lo }
            MOV     CX, ES          { ticker counter 1 in CX:DI:AX }
            CMP     DI, SI          { ticker counter updated ? }
            JE      @no_update      { no }
            OR      AX, AX          { update before timer freeze ? }
            JNS     @no_update      { no }
            MOV     DI, SI          { use second }
            MOV     CX, DS          { ticker counter }
@no_update: NOT     AX              { counter counts down }
            MOV     BX, 36EDh       { load multiplier }
            MUL     BX              { W1 * M }
            MOV     SI, DX          { save W1 * M (hi) }
            MOV     AX, BX          { get M }
            MUL     DI              { W2 * M }
            XCHG    BX, AX          { AX = M, BX = W2 * M (lo) }
            MOV     DI, DX          { DI = W2 * M (hi) }
            ADD     BX, SI          { accumulate }
            ADC     DI, 0           { result }
            XOR     SI, SI          { load zero }
            MUL     CX              { W3 * M }
            ADD     AX, DI          { accumulate }
            ADC     DX, SI          { result in DX:AX:BX }
            MOV     DH, DL          { move result }
            MOV     DL, AH          { from DL:AX:BX }
            MOV     AH, AL          { to }
            MOV     AL, BH          { DX:AX:BH }
            MOV     DI, DX          { save result }
            MOV     CX, AX          { in DI:CX }
            MOV     AX, 25110       { calculate correction }
            MUL     DX              { factor }
            SUB     CX, DX          { subtract correction }
            SBB     DI, SI          { factor }
            XCHG    AX, CX          { result back }
            MOV     DX, DI          { to DX:AX }
            POP     DS              { restore caller's data segment }
END;


BEGIN
   Port [$43] := $34;           { need rate generator, not square wave }
   Port [$40] := 0;             { generator as prog. by some BIOSes }
   Port [$40] := 0;             { for timer 0 }
END. { Time }
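
The fixed-point arithmetic at the end of Clock is easier to follow in a
language with big integers. The Python sketch below is an illustration under
assumptions: the name clock_ms and its two arguments (the 32-bit BIOS tick
count and the 16-bit latched value of timer 0) are invented here, and all
port and memory access is left out. It reproduces only the arithmetic after
@no_update. The 48-bit time stamp, counted in timer clocks (about 0.838
microseconds each, assuming the standard 1.193182 MHz PC timer input), is
multiplied by M = 36EDh and shifted right by 24 bits. A small correction is
then subtracted, because M / 2**24 is slightly larger than the true number
of milliseconds per timer clock.

```python
M = 0x36ED           # multiplier from the listing (14061)
CORRECTION = 25110   # correction constant from the listing


def clock_ms(bios_ticks, latched):
    """Milliseconds from a 32-bit tick count and the latched timer 0 value.

    Hypothetical helper; mirrors only the arithmetic after @no_update.
    """
    w1 = ~latched & 0xFFFF                            # NOT AX: timer counts down
    stamp = ((bios_ticks & 0xFFFFFFFF) << 16) | w1    # 48 bits: W3:W2:W1
    ms = ((stamp * M) >> 24) & 0xFFFFFFFF             # DX:AX after the byte moves
    correction = ((ms >> 16) * CORRECTION) >> 16      # high word of 25110 * DX
    return (ms - correction) & 0xFFFFFFFF
```

For example, clock_ms(1, 0xFFFF) gives 54, which matches the roughly 54.9 ms
length of one BIOS timer tick after truncation.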
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
{$A+,B-,R-,I-,V-,N+,E+}
PROGRAM PeakFlop;

USES Time;

TYPE ParamRec = RECORD
                   MaxIter, MaxRad, YMax, XMax: WORD;
                   Qmax, Qmin, Pmax, Pmin: DOUBLE;
                   FastMod: WORD;
                   PlotFkt: POINTER;
                   FLOPS:   LONGINT;
                END;

VAR Param: ParamRec;
    Start: LONGINT;


{$L APFELM4.OBJ}

PROCEDURE Apple87 (VAR Param: ParamRec); EXTERNAL;


BEGIN
   WITH Param DO BEGIN
      MaxIter:= 50;
      MaxRad := 30;
      YMax   := 30;
      XMax   := 30;
      Pmin   :=-2.1;
      Pmax   := 1.1;
      Qmin   :=-1.2;
      Qmax   := 1.2;
      FastMod:= Word (FALSE);
      PlotFkt:= NIL;
      Flops  := 0;
   END;
   Start := Clock;
   Apple87 (Param);                   { executes 104002 FLOP }
   Start := Clock - Start;            { elapsed time in milliseconds }
   WriteLn ('Peak-MFLOPS: ', 104.002 / Start);
END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
; FILE: M4X4.ASM
;
; assemble with TASM /e M4X4 or MASM /e M4X4

CODE        SEGMENT BYTE PUBLIC 'CODE'

            ASSUME  CS:CODE

            PUBLIC  MUL_4x4
            PUBLIC  IIT_MUL_4x4


FSBP0       EQU     DB 0DBh, 0E8h         ; declare special IIT
FSBP1       EQU     DB 0DBh, 0EBh         ; instructions
FSBP2       EQU     DB 0DBh, 0EAh
F4X4        EQU     DB 0DBh, 0F1h


;---------------------------------------------------------------------
;
; MUL_4x4 multiplies a four-by-four matrix by an array of
; four-dimensional vectors. This operation is needed for 3D
; transformations in graphics data processing. There is one array for
; each component of a vector: one array contains all the x components,
; another contains all the y components, and so on. Each component is
; an 8-byte IEEE floating-point number. Two indices into the array of
; vectors are given. The first is the index of the vector that will be
; processed first, the second is the index of the vector processed
; last.
;
;---------------------------------------------------------------------

MUL_4x4     PROC    NEAR

AddrX       EQU     DWORD PTR [BP+24]     ; address of X component array
AddrY       EQU     DWORD PTR [BP+20]     ; address of Y component array
AddrZ       EQU     DWORD PTR [BP+16]     ; address of Z component array
AddrW       EQU     DWORD PTR [BP+12]     ; address of W component array
AddrT       EQU     DWORD PTR [BP+8]      ; addr. of 4x4 transform. mat.
F           EQU     WORD PTR [BP+6]       ; first vector to process
K           EQU     WORD PTR [BP+4]       ; last vector to process
RetAddr     EQU     WORD PTR [BP+2]       ; return address saved by call
SavdBP      EQU     WORD PTR [BP+0]       ; saved frame pointer
SavdDS      EQU     WORD PTR [BP-2]       ; caller's data segment

            PUSH    BP                    ; save TURBO-Pascal frame ptr
            MOV     BP, SP                ; new frame pointer
            PUSH    DS                    ; save TURBO-Pascal data segment

            MOV     CX, K                 ; final index
            SUB     CX, F                 ; final index - start index
            JNC     $ok                   ; must not
            JMP     $nothing              ; be negative
$ok:        INC     CX                    ; number of elements

            MOV     SI, F                 ; init offset into arrays
            SHL     SI, 1                 ; each
            SHL     SI, 1                 ; element
            SHL     SI, 1                 ; has 8 bytes

            LDS     DI, AddrT             ; addr. of transformation mat.
            FLD     QWORD PTR [DI]        ; load a[0,0] = R7
            FLD     QWORD PTR [DI+8]      ; load a[0,1] = R6

$mat_mul:   LES     BX, AddrX             ; addr. of x component array
            FLD     QWORD PTR ES:[BX+SI]  ; load x[a] = R5
            LES     BX, AddrY             ; addr. of y component array
            FLD     QWORD PTR ES:[BX+SI]  ; load y[a] = R4
            LES     BX, AddrZ             ; addr. of z component array
            FLD     QWORD PTR ES:[BX+SI]  ; load z[a] = R3
            LES     BX, AddrW             ; addr. of w component array
            FLD     QWORD PTR ES:[BX+SI]  ; load w[a] = R2

            FLD     ST(5)                 ; load a[0,0] = R1
            FMUL    ST, ST(4)             ; a[0,0] * x[a] = R1
            FLD     ST(5)                 ; load a[0,1] = R0
            FMUL    ST, ST(4)             ; a[0,1] * y[a] = R0
            FADDP   ST(1), ST             ; a[0,0]*x[a]+a[0,1]*y[a]=R1
            FLD     QWORD PTR [DI+16]     ; load a[0,2] = R0
            FMUL    ST, ST(3)             ; a[0,2] * z[a] = R0
            FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,2]*z[a]=R1
            FLD     QWORD PTR [DI+24]     ; load a[0,3] = R0
            FMUL    ST, ST(2)             ; a[0,3] * w[a] = R0
            FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,3]*w[a]=R1
            LES     BX, AddrX             ; get address of x vector
            FSTP    QWORD PTR ES:[BX+SI]  ; write new x[a]

            FLD     QWORD PTR [DI+32]     ; load a[1,0] = R1
            FMUL    ST, ST(4)             ; a[1,0] * x[a] = R1
            FLD     QWORD PTR [DI+40]     ; load a[1,1] = R0
            FMUL    ST, ST(4)             ; a[1,1] * y[a] = R0
            FADDP   ST(1), ST             ; a[1,0]*x[a]+a[1,1]*y[a]=R1
            FLD     QWORD PTR [DI+48]     ; load a[1,2] = R0
            FMUL    ST, ST(3)             ; a[1,2] * z[a] = R0
            FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,2]*z[a]=R1
            FLD     QWORD PTR [DI+56]     ; load a[1,3] = R0
            FMUL    ST, ST(2)             ; a[1,3] * w[a] = R0
            FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,3]*w[a]=R1
            LES     BX, AddrY             ; get address of y vector
            FSTP    QWORD PTR ES:[BX+SI]  ; write new y[a]

            FLD     QWORD PTR [DI+64]     ; load a[2,0] = R1
            FMUL    ST, ST(4)             ; a[2,0] * x[a] = R1
            FLD     QWORD PTR [DI+72]     ; load a[2,1] = R0
            FMUL    ST, ST(4)             ; a[2,1] * y[a] = R0
            FADDP   ST(1), ST             ; a[2,0]*x[a]+a[2,1]*y[a]=R1
            FLD     QWORD PTR [DI+80]     ; load a[2,2] = R0
            FMUL    ST, ST(3)             ; a[2,2] * z[a] = R0
            FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,2]*z[a]=R1
            FLD     QWORD PTR [DI+88]     ; load a[2,3] = R0
            FMUL    ST, ST(2)             ; a[2,3] * w[a] = R0
            FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,3]*w[a]=R1
            LES     BX, AddrZ             ; get address of z vector
            FSTP    QWORD PTR ES:[BX+SI]  ; write new z[a]

            FLD     QWORD PTR [DI+96]     ; load a[3,0] = R1
            FMULP   ST(4), ST             ; a[3,0] * x[a] = R5
            FLD     QWORD PTR [DI+104]    ; load a[3,1] = R1
            FMULP   ST(3), ST             ; a[3,1] * y[a] = R4
            FLD     QWORD PTR [DI+112]    ; load a[3,2] = R1
            FMULP   ST(2), ST             ; a[3,2] * z[a] = R3
            FLD     QWORD PTR [DI+120]    ; load a[3,3] = R1
            FMULP   ST(1), ST             ; a[3,3] * w[a] = R2
            FADDP   ST(1), ST             ; a[3,3]*w[a]+a[3,2]*z[a]=R3
            FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,1]*y[a]=R4
            FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,0]*x[a]=R5
            LES     BX, AddrW             ; get address of w vector
            FSTP    QWORD PTR ES:[BX+SI]  ; write new w[a]

            ADD     SI, 8                 ; new offset into arrays
            DEC     CX                    ; decrement element counter
            JZ      $done                 ; no elements left, done
            JMP     $mat_mul              ; transform next vector

$done:      FSTP    ST(0)                 ; clear
            FSTP    ST(0)                 ; FPU stack
$nothing:   POP     DS                    ; restore TP data segment
            POP     BP                    ; restore TP frame pointer
            RET     24                    ; pop parameters and return

MUL_4x4     ENDP


;---------------------------------------------------------------------
;
; IIT_MUL_4x4 multiplies a four-by-four matrix by an array of
; four-dimensional vectors. This operation is needed for 3D
; transformations in graphics data processing. There is one array for
; each component of a vector: one array contains all the x components,
; another contains all the y components, and so on. Each component is
; an 8-byte IEEE floating-point number. Two indices into the array of
; vectors are given. The first is the index of the vector that will be
; processed first, the second is the index of the vector processed
; last. This subroutine uses the special instructions only available
; on IIT coprocessors to provide fast matrix multiply capabilities,
; so make sure to use it only on IIT coprocessors.
;
;---------------------------------------------------------------------

IIT_MUL_4x4 PROC    NEAR

AddrX       EQU     DWORD PTR [BP+24]     ; address of X component array
AddrY       EQU     DWORD PTR [BP+20]     ; address of Y component array
AddrZ       EQU     DWORD PTR [BP+16]     ; address of Z component array
AddrW       EQU     DWORD PTR [BP+12]     ; address of W component array
AddrT       EQU     DWORD PTR [BP+8]      ; addr. of 4x4 transf. matrix
F           EQU     WORD PTR [BP+6]       ; first vector to process
K           EQU     WORD PTR [BP+4]       ; last vector to process
RetAddr     EQU     WORD PTR [BP+2]       ; return address saved by call
SavdBP      EQU     WORD PTR [BP+0]       ; saved frame pointer
SavdDS      EQU     WORD PTR [BP-2]       ; caller's data segment
Ctrl87      EQU     WORD PTR [BP-4]       ; caller's 80x87 control word

            PUSH    BP                    ; save TURBO-Pascal frame ptr
            MOV     BP, SP                ; new frame pointer
            PUSH    DS                    ; save TURBO-Pascal data seg.
            SUB     SP, 2                 ; make local variable
            FSTCW   [Ctrl87]              ; save 80x87 ctrl word
            LES     SI, AddrT             ; ptr to transformation matrix
            FINIT                         ; initialize coprocessor
            FSBP2                         ; set register bank 2
            FLD     QWORD PTR ES:[SI]     ; load a[0,0]
            FLD     QWORD PTR ES:[SI+32]  ; load a[1,0]
            FLD     QWORD PTR ES:[SI+64]  ; load a[2,0]
            FLD     QWORD PTR ES:[SI+96]  ; load a[3,0]
            FLD     QWORD PTR ES:[SI+8]   ; load a[0,1]
            FLD     QWORD PTR ES:[SI+40]  ; load a[1,1]
            FLD     QWORD PTR ES:[SI+72]  ; load a[2,1]
            FLD     QWORD PTR ES:[SI+104] ; load a[3,1]
            FINIT                         ; initialize coprocessor
            FSBP1                         ; set register bank 1
            FLD     QWORD PTR ES:[SI+16]  ; load a[0,2]
            FLD     QWORD PTR ES:[SI+48]  ; load a[1,2]
            FLD     QWORD PTR ES:[SI+80]  ; load a[2,2]
            FLD     QWORD PTR ES:[SI+112] ; load a[3,2]
            FLD     QWORD PTR ES:[SI+24]  ; load a[0,3]
            FLD     QWORD PTR ES:[SI+56]  ; load a[1,3]
            FLD     QWORD PTR ES:[SI+88]  ; load a[2,3]
            FLD     QWORD PTR ES:[SI+120] ; load a[3,3]

; transformation matrix loaded

            MOV     AX, F                 ; index of first vector
            MOV     DX, K                 ; index of last vector

            MOV     BX, AX                ; index 1st vector to process
            MOV     CL, 3                 ; component has 8 (2**3) bytes
            SHL     BX, CL                ; compute offset into arrays

            FINIT                         ; initialize coprocessor
            FSBP0                         ; set register bank 0

$mat_loop:  LES     SI, AddrW             ; addr. of W component array
            FLD     QWORD PTR ES:[SI+BX]  ; W component current vector
            LES     SI, AddrZ             ; addr. of Z component array
            FLD     QWORD PTR ES:[SI+BX]  ; Z component current vector
            LES     SI, AddrY             ; addr. of Y component array
            FLD     QWORD PTR ES:[SI+BX]  ; Y component current vector
            LES     SI, AddrX             ; addr. of X component array
            FLD     QWORD PTR ES:[SI+BX]  ; X component current vector
            F4X4                          ; mul 4x4 matrix by 4x1 vector
            INC     AX                    ; next vector
            MOV     DI, AX                ; next vector
            SHL     DI, CL                ; offset of vector into arrays

            FSTP    QWORD PTR ES:[SI+BX]  ; store X comp. of curr. vect.
            LES     SI, AddrY             ; address of Y component array
            FSTP    QWORD PTR ES:[SI+BX]  ; store Y comp. of curr. vect.
            LES     SI, AddrZ             ; address of Z component array
            FSTP    QWORD PTR ES:[SI+BX]  ; store Z comp. of curr. vect.
            LES     SI, AddrW             ; address of W component array
            FSTP    QWORD PTR ES:[SI+BX]  ; store W comp. of curr. vect.

            MOV     BX, DI                ; ofs nxt vect. in comp. arrays
            CMP     AX, DX                ; nxt vector past upper bound?
            JLE     $mat_loop             ; no, transform next vector
            FLDCW   [Ctrl87]              ; restore orig 80x87 ctrl word

            ADD     SP, 2                 ; get rid of local variable
            POP     DS                    ; restore TP data segment
            POP     BP                    ; restore TP frame pointer
            RET     24                    ; pop parameters and return
IIT_MUL_4x4 ENDP

CODE        ENDS

END
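
When porting or testing the two routines above, a slow but obvious reference
implementation is handy for comparing results. The Python sketch below is
such a reference under stated assumptions: the function name mul_4x4 and the
use of plain lists are choices made for this illustration, not part of
M4X4.ASM. It keeps the same data layout (one array per component) and the
same inclusive first/last index convention. Its order of summation can
differ from the assembly, so results may disagree in the last bit.

```python
def mul_4x4(x, y, z, w, t, first, last):
    """Transform vectors first..last (inclusive) in place by the 4x4 matrix t.

    x, y, z, w are separate component lists (one array per component);
    t is a list of four rows. Hypothetical reference helper.
    """
    for a in range(first, last + 1):
        old = (x[a], y[a], z[a], w[a])   # keep the inputs, as the FPU stack does
        new = [sum(t[r][c] * old[c] for c in range(4)) for r in range(4)]
        x[a], y[a], z[a], w[a] = new
```

With the test matrix of the Trnsform program below (entries 1 to 16, filled
row by row), the vector (1, 1, 1, 1) is mapped to the row sums
(10, 26, 42, 58). If last is smaller than first, nothing is transformed,
which matches the $nothing exit of MUL_4x4.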
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
{$N+,E+}

PROGRAM Trnsform;

USES Time;

CONST VectorLen = 8190;

TYPE  Vector    = ARRAY [0..VectorLen] OF DOUBLE;
      VectorPtr = ^Vector;
      Mat4      = ARRAY [1..4, 1..4] OF DOUBLE;

VAR   X:      VectorPtr;
      Y:      VectorPtr;
      Z:      VectorPtr;
      W:      VectorPtr;
      T:      Mat4;
      K:      INTEGER;
      L:      INTEGER;
      First:  INTEGER;
      Last:   INTEGER;
      Start:  LONGINT;
      Elapsed:LONGINT;

PROCEDURE MUL_4X4     (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4; First, Last: INTEGER); EXTERNAL;
PROCEDURE IIT_MUL_4X4 (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4; First, Last: INTEGER); EXTERNAL;

{$L M4X4.OBJ}

BEGIN
   WriteLn ('Test8087 = ', Test8087);
   New (X);
   New (Y);
   New (Z);
   New (W);
   FOR L := 1 TO VectorLen DO BEGIN
      X^ [L] := Random;
      Y^ [L] := Random;
      Z^ [L] := Random;
      W^ [L] := Random;
   END;
   X^ [0] := 1;
   Y^ [0] := 1;
   Z^ [0] := 1;
   W^ [0] := 1;
   FOR K := 1 TO 4 DO BEGIN
      FOR L := 1 TO 4 DO BEGIN
         T [K, L] := (K-1)*4 + L;
      END;
   END;
   First := 0;
   Last  := 8190;
   Start := Clock;
   MUL_4X4 (X, Y, Z, W, T, First, Last);
   { IIT_MUL_4X4 (X, Y, Z, W, T, First, Last); }
   Elapsed := Clock - Start;
   WriteLn ('Number of vectors: ', Last-First+1);
   WriteLn ('Time: ', Elapsed, ' ms');
   WriteLn ('Equivalent to ', (28.0*(Last-First+1)/1e6)/
            (Elapsed*1e-3):0:4, ' MFLOPS');
   WriteLn;
   WriteLn ('Last vector:');
   WriteLn;
   WriteLn (X^[Last]);
   WriteLn (Y^[Last]);
   WriteLn (Z^[Last]);
   WriteLn (W^[Last]);
END.
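
The factor 28.0 in the MFLOPS line of Trnsform comes from counting the
floating-point operations in MUL_4x4: each of the four result components
needs four multiplications and three additions. The short Python sketch
below spells out that calculation. The names FLOP_PER_VECTOR and mflops are
invented for this illustration, and the guard against a zero elapsed time
is an addition that the Pascal program does not have.

```python
FLOP_PER_VECTOR = 4 * (4 + 3)   # 4 rows, each 4 multiplies and 3 adds = 28


def mflops(first, last, elapsed_ms):
    """MFLOPS for transforming vectors first..last in elapsed_ms milliseconds."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed time too short to measure")
    vectors = last - first + 1
    return (FLOP_PER_VECTOR * vectors / 1e6) / (elapsed_ms * 1e-3)
```

For instance, transforming all 8191 vectors in 100 ms would correspond to
about 2.29 MFLOPS.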