EVERYTHING YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS

This document has been created to provide the net.community with some
detailed information about mathematical coprocessors for the Intel 80x86 CPU
family. It may also help to answer some of the FAQs (frequently asked
questions) about this topic. The primary focus of this document is on
80387-compatible chips, but there is also some information on the other chips
in the 80x87 family and the Weitek family of coprocessors. Care was taken to
make the information included as accurate as possible. If you think you have
discovered erroneous information in this text, or think that a certain detail
needs to be clarified, or want to suggest additions, feel free to contact me
at:

    S_JUFFA@IRAVCL.IRA.UKA.DE

or at my SnailMail address:

    Norbert Juffa
    Wielandtstr. 14
    7500 Karlsruhe 1
    Germany


This is the fifth version of this document (dated 01-13-93) and I'd like
to thank those who have helped improve it by commenting on the previous
versions:

    Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM), Peter Forsberg
    (peter@vnet.ibm.com), Richard Krehbiel (richk@grevyn.com), Arto
    Viitanen (av@cs.uta.fi), Jerry Whelan (guru@stasi.bradley.edu),
    Eric Johnson (johnson%camax01@uunet.UU.NET), Warren Ferguson
    (ferguson@seas.smu.edu), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg
    (tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John
    Levine (johnl@iecc.cambridge.ma.us), David Hough (dgh@validgh.com),
    Duncan Murdoch (dmurdoch@mast.QueensU.CA), Benjamin Eitan
    (benny.iil.intel.com)

A very special thanks goes to David Ruggiero (osiris@halcyon.halcyon.com),
who did a great job editing and formatting this article. Thanks David!


Contents of this document
-------------------------

 1) What are math coprocessors?
 2) How PC programs use a math coprocessor
 3) Which applications benefit from a math coprocessor
 4) Potential performance gains with a math coprocessor
 5) How various math coprocessors work
 6) Coprocessor emulator software
 7) Installing a math coprocessor
 8) Detailed description and specifications for all available math
    coprocessor chips
 9) Finding out which coprocessor you have (the COMPTEST program)
10) Current coprocessor prices and purchasing advice
11) The coprocessor benchmark programs (performance comparisons of
    available math coprocessors using various CPUs)
12) Clock-cycle timings for each coprocessor instruction
13) Accuracy tests and IEEE-754 conformance for various coprocessors
14) Accuracy of transcendental function calculations for various coprocessors
15) Compatibility tests with Intel's 387DX / the SMDIAG program
16) References (literature)
17) Addresses of manufacturers of math coprocessors
18) Appendix A: Test programs for partial compatibility and accuracy checks
19) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP



===========================
What are math coprocessors?
===========================

A coprocessor in the traditional sense is a processor, separate from the main
CPU, that extends the capabilities of a CPU in a transparent manner. This
means that from the program's (and programmer's) point of view, the CPU and
coprocessor together look like a single, unified machine.

The 80x87 family of math coprocessors (also known as MCPs [Math
CoProcessors], NDPs [Numerical Data Processors], NPXs [Numerical Processor
eXtensions], FPUs [Floating-Point Units], or simply "math chips") are
typical examples of such coprocessors. The 80x86 CPUs, with the exception of
the 80486 (which has a built-in FPU), can only handle 8-, 16-, or 32-bit
integers as their basic data types. However, many PC-based applications
require not only integers but also floating-point numbers. Simply put,
floating-point numbers provide a binary representation of not only integers,
but also fractional values over a wide range. A common use of floating-point
numbers is in scientific applications, where very small numbers (e.g.,
Planck's constant) and very large numbers (e.g., the speed of light) must
be accurately expressed. But floating-point numbers are also useful for
business applications such as computing interest, and in the geometric
calculations inherent in CAD/CAM processing.

Because the instruction sets of all 80x86 CPUs directly support only integers
and calculations upon integers, floating-point numbers and operations on them
must be programmed indirectly, using sequences of CPU integer instructions.
This makes floating-point computations far slower than ordinary integer
calculations. And this is where the 80x87 coprocessors come in: adding an
80x87 to an 80x86-based system augments the CPU architecture with eight
floating-point registers, five additional data types and over 70 additional
instructions, all designed to deal directly with floating-point numbers as a
basic data type. This removes the 'penalty' for floating-point computations,
and greatly increases overall system performance for applications which
depend heavily on these calculations.

In addition to being able to quickly execute load/store operations on
floating-point numbers, the 80x87 coprocessors can directly perform all the
basic arithmetic operations on them. Besides "knowing" how to add, subtract,
multiply and divide floating-point numbers, they can also operate on them to
perform comparisons, square roots, transcendental functions (such as
logarithms and sine/cosine/tangent), and compute their absolute value and
remainder.

Like most things in life, floating-point arithmetic has been standardized.
The relevant standard (to which I will refer quite often in this document) is
the "IEEE-754 Standard for Binary Floating-Point Arithmetic" [10,11]. The
standard specifies numeric formats, value sets and how the basic arithmetic
(+,-,*,/,sqrt, remainder) has to work. All the coprocessors covered in this
document claim full or at least partial compliance with the IEEE-754
standard.



=================================================
How PC programs use 80x87 and Weitek coprocessors
=================================================

The basic data type used by all 80x87 coprocessors is an 80-bit
floating-point number. This data type (called "temporary real" or "double
extended precision") can directly represent numbers which range in magnitude
between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932
including denormal numbers), where '^' denotes the power operator. (For those
familiar with floating-point formats, this format has 64 mantissa bits, 15
exponent bits and 1 sign bit, for a total of 80 bits.) This format provides
a precision of about 19 decimal places. 80x87s can also handle additional
data types, which are converted to the internal format when they are loaded
into the coprocessor and converted back when they are stored. These include
16-bit, 32-bit, and 64-bit integers as well as a BCD (binary coded decimal)
data type that occupies 10 bytes and provides 18 decimal digits.
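
To make the bit layout concrete, here is a minimal Python sketch that decodes
a 10-byte temporary real into an exact fraction. It assumes the usual
little-endian memory image (the 64-bit mantissa in the low eight bytes, sign
and exponent in the top two) and an exponent bias of 16383. The function name
decode_temp_real is made up for this illustration, and infinities and NaNs
are simply rejected rather than handled.

```python
from fractions import Fraction

EXPONENT_BIAS = 16383   # bias of the 15-bit exponent field
FRACTION_BITS = 63      # 64 mantissa bits, one of which is the integer bit


def decode_temp_real(raw: bytes) -> Fraction:
    """Decode a 10-byte little-endian temporary real into an exact Fraction."""
    if len(raw) != 10:
        raise ValueError("a temporary real occupies exactly 10 bytes")
    mantissa = int.from_bytes(raw[:8], "little")    # 64 bits, integer bit stored
    sign_exp = int.from_bytes(raw[8:], "little")    # 1 sign bit + 15 exponent bits
    sign = -1 if sign_exp & 0x8000 else 1
    exponent = sign_exp & 0x7FFF
    if exponent == 0x7FFF:
        raise ValueError("infinity or NaN is not handled in this sketch")
    # Denormals have an exponent field of 0 but are scaled like field value 1.
    scale = max(exponent, 1) - EXPONENT_BIAS - FRACTION_BITS
    return sign * mantissa * Fraction(2) ** scale


if __name__ == "__main__":
    one = bytes.fromhex("0000000000000080ff3f")      # mantissa 2^63, exponent 3FFFh
    minus_two_and_a_half = bytes.fromhex("00000000000000a000c0")
    print(decode_temp_real(one))
    print(decode_temp_real(minus_two_and_a_half))
```

Note that, unlike the single and double-precision formats, the integer bit of
the mantissa is stored explicitly here, which is why the mantissa can be used
as a plain 64-bit integer scaled by a power of two.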

The 80x87 also supports two additional floating-point types. The short real
data type (also called "single-precision") has 32 bits that split into 23
mantissa bits, 8 exponent bits and a sign bit. By using the "hidden bit"
technique, the effective length of the mantissa is increased to 24 bits. (The
hidden bit technique exploits the fact that for normalized floating-point
numbers, the mantissa m always is in the range 1 <= m < 2. Since the first
mantissa bit represents the integer part of the mantissa, it is always set
for normalized numbers and therefore need not be stored.) The IEEE
single-precision format provides a precision of about 6-7 decimal places and
can represent numbers between 1.17*10^-38 and 3.40*10^38 (1.40*10^-45 to
3.40*10^38 including denormal numbers). The long real, or double-precision,
data type has 64 bits, consisting of 52 mantissa bits, 11 exponent bits, and
the sign bit. It provides 15-16 decimal digits of precision and can handle
numbers from 2.22*10^-308 to 1.79*10^308 (4.94*10^-324 to 1.79*10^308
including denormal numbers). (This format also uses the hidden bit technique
to provide effectively 53 mantissa bits.)
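
The ranges quoted for the three floating-point formats follow directly from
the number of exponent and stored fraction bits. The Python sketch below
recomputes them with logarithms; it is an illustration, not part of any
coprocessor tool. The helper names are invented, the largest value is
approximated by the next power of two, and the extended format is entered
with 63 fraction bits because its 64-bit mantissa stores the integer bit
explicitly.

```python
import math


def power_of_two_as_decimal(exponent):
    """Approximate 2**exponent as (m, d), meaning m * 10**d with 1 <= m < 10."""
    log10_value = exponent * math.log10(2.0)
    d = math.floor(log10_value)
    return 10.0 ** (log10_value - d), d


def format_limits(exponent_bits, fraction_bits):
    """Return approximate magnitude limits of a binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    min_exponent = 1 - bias                      # exponent of smallest normal
    return {
        "smallest normal": power_of_two_as_decimal(min_exponent),
        "smallest denormal": power_of_two_as_decimal(min_exponent - fraction_bits),
        # The true maximum, (2 - 2**-fraction_bits) * 2**bias, lies just
        # below 2**(bias + 1); the difference is invisible at three digits.
        "largest (approx.)": power_of_two_as_decimal(bias + 1),
    }


FORMATS = {"single": (8, 23), "double": (11, 52), "extended": (15, 63)}

if __name__ == "__main__":
    for name, (exp_bits, frac_bits) in FORMATS.items():
        for label, (m, d) in format_limits(exp_bits, frac_bits).items():
            print(f"{name:9s} {label:18s} {m:.2f}*10^{d}")
```

The printed mantissas are rounded to two decimals, so the last digit can
differ by one from figures in the text that were truncated instead.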

The eight registers in the 80x87 are organized in a stack-like manner which
takes some getting used to if one programs the coprocessor directly in
assembly language. However, nowadays the compilers or interpreters for most
high level languages (HLLs) give a programmer easy access to the
coprocessor's data types and make use of its instructions, so there is not
much need to deal directly with the rather unusual architecture of the 80x87.


The architecture of the Weitek chips differs significantly from the 80x87.
Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors in
that they do not transparently extend the CPU architecture; rather, they
could be described as highly specialized, memory-mapped IO devices. But as
the term "coprocessor" has been traditionally used for these chips, they will
be referred to as such here.

The Weitek coprocessors have a RISC-like architecture which has been tuned
for maximum performance. Only a small instruction set has been implemented in
the chip, but each instruction executes at a very high speed (usually in only
a few clock cycles). Instructions available include load/store, add,
subtract, subtract reverse, multiply, multiply and negate, multiply and
accumulate, multiply and take absolute value, divide reverse, negate,
absolute value, compare/test, convert fix/float, and square root. In contrast
to the 80x87 family, the Weitek Abacus does not support a double extended
format, has no built-in transcendental functions, and does not support
denormals. The resources required to implement such features have instead
been devoted to making the basic arithmetic operations as fast as possible.

While the 80x87 coprocessors perform all internal calculations in double
extended precision and therefore have about the same performance for single
and double-precision calculations, the Weitek features explicit single and
double-precision operations. For applications that require only
single-precision operations, the Weitek can therefore provide very high
performance, as single-precision operations are about twice as fast as their
double-precision counterparts. Also, since the Weitek Abacus has more
registers than the 80x87 coprocessors (31 versus 8), values can be kept in
registers more often and have to be loaded from memory less frequently. This
also leads to performance gains.

The Weitek's register file consists of 31 32-bit registers, each one capable
of holding an IEEE single-precision number. Pairs of consecutive
single-precision registers can also be used as 64-bit IEEE double-precision
registers; thus there are 15 double-precision registers. The Weitek register
file is organized conventionally, like the register file of the 80386, and
not in the special stack-like manner of the 80x87 coprocessors.

To the main CPU, the Weitek Abacus appears as a 64 KB block of memory
starting at physical address 0C0000000h. Each address in this range
corresponds to a coprocessor instruction. Accessing a specified memory
location within this block with a MOV instruction causes the corresponding
Weitek instruction to be executed. (The instructions have been cleverly
assigned to memory locations in such a way that loads to consecutive
coprocessor registers can make use of the 386/486 MOVS string instruction.)
This memory-mapped interface is much faster than the IO-oriented protocol
that is used to couple the CPU to an 80287 or 80387 coprocessor. The Weitek's
memory block can actually be assigned to any logical address using the MMU
(memory management unit) in the 386/486's protected and virtual modes. This
also means that the Weitek Abacus *cannot* be used in the real mode of those
processors, since its physical starting address (0C0000000h) is not within
the 1 MByte address range and the MMU is inoperable in real mode. However,
DOS programs can make use of the Weitek by using a DOS extender or a memory
manager (such as QEMM or EMM386) that runs in protected/virtual mode itself
and can therefore map the Weitek's memory block to any desired location in
the 1 MByte address range.

Typically the FS segment register is then set up to point to the Weitek's
memory block. On the 80486, this technique has severe drawbacks, as using the
FS: prefix takes an additional clock cycle, thereby nearly halving the
performance of the 4167. Most DOS-based compilers exhibit this problem, so
the only way around it is to code in assembly language [75]. The Weitek
Abacus 3167 and 4167 are also supported by the UNIX operating system [33].



==========================================================
Which application programs benefit from a math coprocessor
==========================================================

According to the Intel 387DX User's Guide, there are more than 2100
commercial programs that can make use of a 387-compatible coprocessor. Every
program that uses floating-point arithmetic somewhere and contains the
instructions to support an 80x87 or Weitek chip can gain speed when one is
installed. However, the speedup will vary from program to program (and even
within the same program) depending on how computation-intensive the program
or operation within the program is. Typical applications that benefit from
the use of a math coprocessor are:

- CAD programs (AutoCAD, VersaCAD, GenericCAD)
- Spreadsheet programs (Lotus 1-2-3, Excel, Quattro, Wingz)
- Business graphics programs (Arts&Letters, Freedom of Press, Freelance)
- Mathematical analysis and statistical programs (Mathematica, TKSolver,
  SPSS/PC, Statgraphics)
- Database programs (dBase IV, FoxBase, Paradox, Revelation)

Note that for spreadsheets and databases, a coprocessor only helps if some
kind of floating-point computation is performed; this is true more often for
spreadsheets than for databases. Also note that the speed of many programs
depends quite heavily on factors such as the speed of the graphics adapter
(CAD) or the disk performance (databases), so the computational performance
is only a (small) part of the total performance of the application. There are
some programs that won't run without a coprocessor, among them AutoCAD (R10
and later) and Mathematica.

Most GUIs (graphical user interfaces) such as Microsoft Windows or the OS/2
Presentation Manager do *not* gain additional speed from using a
*mathematical* coprocessor, since their graphics operations only use integer
arithmetic [71]. They *will* benefit from a graphics board with a graphics
"coprocessor" that speeds up certain common graphics operations such as
BitBlt or line drawing. A few GUIs used on PCs, such as X-Windows, use a
certain amount of floating-point arithmetic for operations such as arc
drawing. However, the use of floating-point operations in X-Windows seems to
have decreased significantly in versions after X11R3, so the overall
performance impact of a coprocessor is small [72]. Applications running under
any GUI may take advantage of a math coprocessor, of course (for example,
Microsoft Excel running under Windows).

While support for 80x87 coprocessors is very common in application programs,
the Weitek Abacus coprocessors do not enjoy such widespread support. Due to
their higher price, only a few high-end PCs have been equipped with Weitek
coprocessors. Some machines, such as IBM's PS/2 series, do not even have
sockets to accommodate them. Therefore, most of the programs that support
these coprocessors are also high-end products, like AutoCAD and VersaCAD-386.



==============================================
Potential performance gains with a coprocessor
==============================================

The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX
coprocessor has a demonstration program that shows the speedup of certain
application programs when run with the Intel coprocessor versus a system with
no coprocessor:

Application          Time w/o 387    Time w/387    Speedup

Arts&Letters           87.0 sec       34.8 sec      150%
Quattro Pro             8.0 sec        4.0 sec      100%
Wingz                  17.9 sec        9.1 sec       97%
Mathematica           420.2 sec      337.0 sec       25%
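
The source does not spell out how the "Speedup" column is defined, but the
figures above are consistent with the percentage reduction in run time
expressed relative to the time with the coprocessor, i.e. (time without /
time with - 1) * 100. The short Python sketch below applies that assumed
formula to the four rows of the table; the function name is invented for
this illustration.

```python
def speedup_percent(time_without, time_with):
    """Speedup in percent, assuming speedup = (t_without / t_with - 1) * 100."""
    return (time_without / time_with - 1.0) * 100.0


# (time without 387, time with 387) in seconds, taken from the table above
TIMINGS = {
    "Arts&Letters": (87.0, 34.8),
    "Quattro Pro": (8.0, 4.0),
    "Wingz": (17.9, 9.1),
    "Mathematica": (420.2, 337.0),
}

if __name__ == "__main__":
    for name, (t_without, t_with) in TIMINGS.items():
        print(f"{name:<14}{speedup_percent(t_without, t_with):6.0f}%")
```

Under this definition a speedup of 100% means the program runs in half the
time, and 0% means the coprocessor makes no difference.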


The following table is an excerpt from [70]:

Application          Time w/o 387    Time w/387    Speedup

Corel Draw            471.0 sec      416.0 sec       13%
Freedom Of Press      163.0 sec       77.0 sec      112%
Lotus 1-2-3           257.0 sec       43.0 sec      498%


The following table is an excerpt from [25]:

Application          Time w/o 387    Time w/387    Speedup

Design CAD, Test1      98.1 sec       50.0 sec       96%
Design CAD, Test2      75.3 sec       35.0 sec      115%
Excel, Test 1           9.2 sec        6.8 sec       35%
Excel, Test 2          12.6 sec        9.3 sec       35%


Note that coprocessor performance also depends on the motherboard, or more
specifically, on the chipset used on the motherboard. In [34] and [35],
identically configured motherboards using different 386 chipsets were tested.
Among other tests, a coprocessor benchmark based on a fractal computation was
run and its execution time recorded. The following tables, which show how
coprocessor performance varies with the chipset, have been copied from these
articles in abridged form (the percentages are relative to the fastest
chipset in each group):

chipset           Cyrix 387+              chipset             Cyrix 83D87

Opti,  40 MHz     24.57 sec   97.0%       PC-Chips, 33 MHz    26.97 sec   93.0%
Elite, 40 MHz     24.46 sec   97.4%       UMC,      33 MHz    27.69 sec   90.5%
ACT,   40 MHz     23.84 sec  100.0%       Headland, 33 MHz    25.08 sec  100.0%
Forex, 40 MHz     23.84 sec  100.0%       Eteq,     33 MHz    27.38 sec   91.6%


This shows that performance of the same coprocessor can vary by up to ~10%
depending on the chipset used on your board, at least for 386 motherboards
(similar numbers for 286, 386SX, and 486 are, unfortunately, not available).
The benchmarks for this article were run on a motherboard with the Forex
chipset, one of the fastest 386 chipsets available, and not only with respect
to floating-point performance [35].



==================================
How various math coprocessors work
==================================

In any 80x86 system with an 80x87 math coprocessor, CPU instructions and
coprocessor instructions are executed concurrently. This means that the CPU
can execute CPU instructions while the coprocessor executes a coprocessor
instruction at the same time. The concurrency is restricted somewhat by the
fact that the CPU has to aid the coprocessor in certain operations. As the
CPU and the coprocessor are fed from the same instruction stream and both
may operate on the same data, there has to be a synchronizing mechanism
between the CPU and the coprocessor.


The 8087
--------
In 8086/8088 systems with 8087 coprocessors, both chips look at every opcode
coming in from the bus. To do this, both chips have the same BIU (bus
interface unit) and the 8086 BIU sends the status signals of its prefetch
queue to the 8087 BIU. This ensures that both processors always decode the
same instructions in parallel. Since all coprocessor instructions start with
the bit pattern 11011, it is easy for the 8087 to ignore all other
instructions. Likewise the CPU ignores all coprocessor instructions, unless
they access memory. In this case, the CPU computes the address of the LSB
(least significant byte) of the memory operand and does a dummy read. The
8087 then takes the data from the data bus. If more than one memory access is
needed to load a memory operand, the 8087 requests the bus from the CPU,
generates the consecutive addresses of the operand's bytes and fetches them
from the data bus. After completing the operation, the 8087 hands bus control
back to the CPU. Since the 8087 and the CPU are hooked up to the same
synchronous bus, they must run at the same speed. This means that with the
8087, only synchronous operation of CPU and coprocessor is possible.
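
The opcode test that both chips perform amounts to checking the top five bits
of the first opcode byte. A minimal Python sketch of that check follows; the
function name is invented for this illustration, and the real chips of
course do this in hardware.

```python
ESC_PATTERN = 0b11011   # top five bits shared by all coprocessor opcodes


def is_coprocessor_opcode(first_byte: int) -> bool:
    """Return True if the opcode byte starts with the bit pattern 11011."""
    return (first_byte >> 3) == ESC_PATTERN


if __name__ == "__main__":
    matching = [b for b in range(256) if is_coprocessor_opcode(b)]
    print([f"{b:02X}h" for b in matching])
```

Exactly eight byte values pass the test; the three low-order bits that
remain free, together with the following byte, select the actual operation.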

Another 8087 coprocessor instruction can only be started if the previous one
has been completed in the NEU (numerical execution unit) of the 8087. To
prevent the 8086 from decoding a new coprocessor instruction while the 8087
is still executing the previous coprocessor instruction, a coding mechanism
is employed: All 8087-capable compilers and assemblers automatically
generate a WAIT instruction before each coprocessor instruction. The WAIT
instruction tests the CPU's /TEST pin and suspends execution until its input
becomes "LOW". In all 8086/8087 systems, the 8086 /TEST pin is connected to
the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it
forces its BUSY pin "HIGH"; thus, the WAIT opcode preceding the coprocessor
instruction stops the CPU until any still-executing coprocessor instruction
has finished.

The same synchronization is used before the CPU accesses data that was
written by the coprocessor. A WAIT instruction after any coprocessor
instruction that writes to memory causes the CPU to stop until the
coprocessor has completed transfer of the data to memory, after which the CPU
can safely access it.


The 80287
---------
The 80287 coprocessor-CPU interface is totally different from the 8087
design. Since the 80286 implements memory protection via an MMU based on
segmentation, it would have been much too expensive to duplicate the whole
memory protection logic on the coprocessor, which an interface solution
similar to the 8087 would have required. Instead, in an 80286/80287 system,
the CPU fetches and stores all opcodes and operands for the coprocessor.
Information is then passed through the CPU ports F8h-FFh. (As these ports are
accessible under program control, care must be taken in user programs not to
accidentally perform write operations to them, as this could corrupt data in
the math coprocessor.)

The 8086/8087 combination can be characterized as a cooperation of partners
with equal rights, while the 80286/287 is more a master-slave relationship.
This makes synchronization easier, since the complete instruction and data
flow of the coprocessor goes through the CPU. Before executing most
coprocessor instructions, the 80286 tests its /BUSY pin, which is tied to the
287 coprocessor and signals if the 80287 is still executing a previous
coprocessor instruction or has encountered an exception. The 80286 then waits
until the /BUSY signal goes to "low" before loading the next coprocessor
instruction into the 80287. Therefore, a WAIT instruction before every
coprocessor instruction is not required. These WAITs are permissible, but not
necessary, in 80287 programs. The second form of WAIT synchronization (after
the coprocessor has written a memory operand) *is* still necessary on 286/287
systems.

The execution unit of the 80287 is practically identical to that of the 8087;
that is, nearly all coprocessor instructions execute in the same number of
clock cycles on both coprocessors. However, due to the additional overhead of
the 80287's CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz
80286/80287 combination can have lower floating-point performance than an
8086/8087 system running at the same speed. Additionally, older 286 boards
were often configured to run the coprocessor at only 2/3 the speed of the
CPU, making use of the ability of the 80287 to run asynchronously: The 80287
has a CKM (ClocK Mode) pin that causes the incoming system clock to be
divided by three for the coprocessor if it is tied to ground. The 80286
always divides the system clock by two internally, hence the final ratio of
2/3. However, when the CKM pin is tied high on the 80287, it does not divide
the CLK input. This feature has been exploited by the makers of coprocessor
speed sockets. These sockets tie CKM high and supply their own CLK signal
with a built-in oscillator, thereby allowing the 80287 or compatible to run
at a much higher speed than the CPU. With an IIT or Cyrix 287 one can have a
20 MHz coprocessor running with an 8 MHz 80286! Note, however, that the
floating-point performance of such a configuration does not scale linearly
with the coprocessor clock, since all the data has to be passed through the
much slower CPU. If the coprocessor executes mostly simple instructions (such
as addition and multiplication), doubling the coprocessor clock to 20 MHz in
a 10 MHz system does not show any performance increase at all [24].
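
The clock arithmetic described above can be written down in a few lines. The
following Python sketch is only an illustration of the divider ratios given
in the text; the function and parameter names are invented, and the speed
socket is modelled simply as an independent clock source.

```python
from fractions import Fraction

CPU_DIVIDER = 2            # the 80286 always divides the system clock by two
FPU_DIVIDER_CKM_LOW = 3    # an 80287 with CKM tied to ground divides by three


def fpu_clock_mhz(cpu_mhz, ckm_high=False, socket_clk_mhz=None):
    """Return the 80287 clock in MHz for an 80286 running at cpu_mhz.

    With CKM low, the coprocessor runs from the system clock divided by
    three. With CKM high, it runs undivided from whatever CLK signal it is
    given (here: the oscillator of a hypothetical speed socket).
    """
    if ckm_high:
        if socket_clk_mhz is None:
            raise ValueError("CKM high requires the externally supplied clock")
        return Fraction(socket_clk_mhz)
    system_clock = Fraction(cpu_mhz) * CPU_DIVIDER
    return system_clock / FPU_DIVIDER_CKM_LOW


if __name__ == "__main__":
    standard = fpu_clock_mhz(8)
    print(f"8 MHz 80286, CKM low: 80287 at {float(standard):.2f} MHz "
          f"(ratio {standard / 8})")
    print(f"8 MHz 80286, speed socket: 80287 at {fpu_clock_mhz(8, True, 20)} MHz")
```

As the text points out, the higher coprocessor clock obtained with a speed
socket does not translate into a proportional gain, because every operand
still passes through the slower CPU.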

The Intel 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of
a 387 coprocessor, but are pin-compatible with the original 287. These chips
divide the system clock by two internally, as opposed to three in the
original 80287. Since the 80286 also divides the system clock by two, they
usually run synchronously with respect to the CPU, although they can also be
run asynchronously.


The 80387
---------
The coprocessor interface in 80386/80387 systems is very similar to the one
found in 286/287 systems. However, to prevent corruption of the coprocessor's
contents by programming errors, the IO ports 800000F8h-800000FFh are used,
which are not accessible to programs. The CPU/coprocessor interface has been
optimized and uses full 32-bit transfers; the interface overhead has been
reduced to about 14-20 clock cycles. For some operations on the 387 'clones'
that take less than about 16 clock cycles to complete, this overhead
effectively limits the execution rate of coprocessor instructions. The only
sensible way to provide even higher floating-point performance was to
integrate the CPU and coprocessor functionality onto the same chip, which is
exactly what Intel did with the 80486 CPU. The FPU in the 486 also benefits
from the instruction pipelining and from the on-chip cache.



=====================
Coprocessor emulators
=====================

In the absence of a coprocessor, floating-point calculations are often
performed by a software package that simulates its operations. Such a program
is called a coprocessor emulator. Simulating the coprocessor has the
advantage for application programs that identical code can be generated for
use with either the coprocessor or the emulator, so that it's possible to
write programs that run on any system without regard to whether a coprocessor
is present or not. Whether the program will use an actual coprocessor or
software emulating it can easily be determined at run-time by detecting the
presence or absence of the coprocessor chip.

Two approaches to interface an 80x87 emulator to programs are common. The
first method makes use of the fact that all coprocessor instructions start
with the same five-bit pattern 11011. Thus the first byte of a coprocessor
instruction will be in the range D8-DF hexadecimal. In addition, coprocessor
instructions usually are preceded by a WAIT instruction (opcode 9Bh) which is
one byte long (the reason for doing this has been described in the previous
chapter dealing with the operating details of the 80x87). One common approach
is to replace the WAIT instruction and the first byte of the coprocessor
instruction with one out of eight interrupt instructions; the remaining bytes
of the coprocessor instruction are left unchanged. Interrupts 34 to 3B
hexadecimal are used for this emulation technique. (Note that the sequences
9B D8 ... 9B DF can be easily converted to the interrupt instructions CD 34
... CD 3B by simple addition and subtraction of constants.) The compiler or
assembler initially produces code that contains these appropriate interrupt
calls instead of the coprocessor instructions. If a hardware coprocessor is
detected at run-time, the emulator interrupts point to a short routine that
converts the interrupt calls back to coprocessor instructions (yes, this
is known as "self-modifying code"). If no coprocessor is found, the
interrupts point to the emulation package, which examines the byte(s)
following the interrupt instruction to determine which floating-point
operation to perform. This method is used by many compilers, including those
from Microsoft and Borland. It works with every 80x86 CPU from the 8086/8088
on.
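
The byte-level conversion between the two encodings can be sketched as
follows. This Python fragment is a toy model under stated assumptions, not
the code any particular compiler or run-time library uses: the function names
are invented, and it naively scans a byte string for the two-byte patterns,
whereas a real tool knows the instruction boundaries and a real run-time
fixup routine patches only the instruction that triggered the interrupt.

```python
WAIT_OPCODE = 0x9B        # one-byte WAIT instruction
INT_OPCODE = 0xCD         # first byte of an INT nn instruction
FIRST_ESC = 0xD8          # coprocessor opcodes occupy D8h..DFh
FIRST_EMU_INT = 0x34      # emulator interrupts occupy 34h..3Bh


def to_emulator_calls(code: bytes) -> bytes:
    """Rewrite each 'WAIT + coprocessor opcode' pair as an 'INT 34h..3Bh'."""
    out = bytearray(code)
    i = 0
    while i + 1 < len(out):
        if out[i] == WAIT_OPCODE and FIRST_ESC <= out[i + 1] <= FIRST_ESC + 7:
            out[i] = INT_OPCODE
            out[i + 1] = out[i + 1] - FIRST_ESC + FIRST_EMU_INT
            i += 2
        else:
            i += 1
    return bytes(out)


def to_coprocessor_code(code: bytes) -> bytes:
    """Inverse conversion, as performed when a hardware coprocessor is found."""
    out = bytearray(code)
    i = 0
    while i + 1 < len(out):
        if out[i] == INT_OPCODE and FIRST_EMU_INT <= out[i + 1] <= FIRST_EMU_INT + 7:
            out[i] = WAIT_OPCODE
            out[i + 1] = out[i + 1] - FIRST_EMU_INT + FIRST_ESC
            i += 2
        else:
            i += 1
    return bytes(out)


if __name__ == "__main__":
    wait_fadd = bytes([0x9B, 0xD8, 0xC0])        # WAIT, then FADD ST, ST(0)
    patched = to_emulator_calls(wait_fadd)
    print(patched.hex(" "), "->", to_coprocessor_code(patched).hex(" "))
```

Because both encodings are exactly two bytes long, the replacement can be
done in place without moving any of the surrounding code.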

The second method to interface an emulator is only available on 286/386/486
machines. If the emulation bit in the machine status word of these processors
is set, the processors will generate an interrupt 7 whenever a coprocessor
instruction is encountered. The vector for this interrupt will have been set
up to point at an emulation package that decodes the instruction and performs
the desired operation. This approach has the advantage that the emulator
doesn't have to be included in the program code, but can be loaded once (as a
TSR or device driver) and then used by every program that requires a
coprocessor. Emulation via interrupt 7 is transparent, which means that
programs containing coprocessor instructions execute just as if a coprocessor
were present, only slower. This approach is taken by the public domain EM87
emulator, the shareware program Q387, and the commercial Franke387 emulator,
for example. Even programs that require a coprocessor to run, like AutoCAD,
are 'fooled' into believing that a coprocessor is present by emulators using
INT 7.

Operating systems such as OS/2 2.0 and Windows 3.1 automatically provide
coprocessor emulation using INT 7 if they do not find a coprocessor
installed. The emulator in Windows doesn't seem to be very fast, as people
who have ported their Turbo Pascal programs from the TP 6.0 DOS compiler
(using the emulation built into the TP 6.0 run-time library) to the TPW 1.5
Windows compiler (using MS Windows' emulator) have noticed. Slowdowns of as
much as a factor of five have been reported [79].

The size of the emulator used by TP 6.0 is about 9.5 KB, while EM87 occupies
|
||
about 15.8 KB as a TSR, and Franke387 uses about 13.4 KB as a device driver.
|
||
Note that Franke387 and especially EM87 model a real coprocessor much more
|
||
closely than Turbo Pascal's emulator does. In particular, EM87 supports
|
||
denormal numbers, precision control, and rounding control. The emulator in TP
|
||
6.0 does not implement these features. The version of Franke387 tested (V2.4)
|
||
supports denormals in single and double-precision, but not double extended
|
||
precision, and it supports precision control, but not rounding control.
|
||
The recently introduced shareware program Q387 only runs on 386, 386SX, 486SX
|
||
and compatible processors. The program loads completely into extended memory
|
||
and uses about 330 KB. To enable INT 7 trapping to a service routine in
|
||
extended memory it needs to run with a memory manager (e.g. EMM386, QEMM,
|
||
or 386MAX). The huge size of the program stems from the fact that it was
|
||
solely optimized for speed, assuming that extended memory is a cheap resource.
|
||
Presumably it uses large tables to speed computations. Intel's E80287 program
|
||
is supposed to be a 100% exact emulation of the 80287 coprocessor [44]. Note
|
||
that the more closely a real coprocessor is modelled by the emulator, the
|
||
slower the emulator runs and the larger the code for the emulator gets.
|
||
|
||
|
||
Relative execution times of coprocessor vs. software emulators
|
||
for selected coprocessor instructions
|
||
|
||
                     Intel 387DX   TP 6.0 Emulator   EM87 Emulator

   FADD ST, ST(0)         1               26               104
   FDIV [DWord]           1               22               136
   FXAM                   1               10                73
   FYL2X                  1               33               102
   FPATAN                 1               36               110
   F2XM1                  1               38               110
|
||
|
||
|
||
|
||
The following table is an excerpt from [44]:
|
||
|
||
                     Intel 80287   Intel E80287 Emulator

   FADD ST, ST(0)         1                  42
   FDIV [DWord]           1                 266
   FXAM                   1                 139
   FYL2X                  1                  99
   FPATAN                 1                 153
   F2XM1                  1                  41
|
||
|
||
|
||
|
||
The following has been adapted from [43] and merged with my own
|
||
data:
|
||
|
||
                     Intel 8087   TP 6.0 Emul. (8086)   Intel Emul. (8086)

   FADD ST, ST(0)         1                20                   94
   FDIV [DWord]           1                22                   82
   FPTAN                  1                18                  144
   F2XM1                  1                 6                  171
   FSQRT                  1                44                  544
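The per-instruction ratios in the tables above do not translate directly into
program run times, because a program also executes ordinary CPU instructions.
A rough estimate, assuming that a fraction f of the run time WITH a
coprocessor is spent in coprocessor instructions and that every one of them
is slowed down by the same factor k, is

   slowdown = (1 - f) + f * k

The following sketch evaluates this. The fraction 0.5 is a made-up example
value, not a measurement, and real programs use a mix of instructions with
different ratios.

```python
def emulated_slowdown(fp_fraction, emulation_factor):
    """Estimate the overall slowdown of a program when its coprocessor
    instructions are emulated.

    fp_fraction      -- share of run time spent in coprocessor instructions
                        when a real coprocessor is installed (0.0 .. 1.0)
    emulation_factor -- how many times slower the emulator executes them
    """
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must lie between 0 and 1")
    return (1.0 - fp_fraction) + fp_fraction * emulation_factor


# Hypothetical program spending half its time in FADD-like instructions,
# using the TP 6.0 emulator ratio of 26 from the first table.
print(emulated_slowdown(0.5, 26))
```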
|
||
|
||
|
||
|
||
One of the reasons emulators are so slow is that they are often designed to
|
||
run with every CPU from the 8086/8088 on upwards. This is the case with the
|
||
emulators built into the compiler libraries of the Turbo Pascal 6.0 (also
|
||
used by Turbo C/C++) and Microsoft C 6.0 compilers (probably also used in
|
||
other Microsoft products) and is also true for the EM87 emulator in the
|
||
public domain. By using code that can run on an 8086/8088, these emulators
|
||
forego the speed advantage offered by the additional instructions and
|
||
architectural enhancements (such as 32-bit registers) of the more advanced
|
||
Intel 80x86 processors. A notable exception to this is the Franke387
|
||
emulator, a commercial emulator that is also sold as shareware. It uses 386-
|
||
specific 32-bit code and only runs on 386/386SX/486SX computers.
|
||
|
||
Besides being slow, coprocessor emulators have other drawbacks when compared
|
||
with real coprocessors. Most of the emulators do not support the additional
|
||
instructions that the 387-compatible coprocessors offer over the 80287.
|
||
Often, some of the low-level stack-manipulating instructions like FDECSTP are
|
||
not emulated. For example, [76] lists the coprocessor instructions not
|
||
emulated by Microsoft's emulator (included in the MS-C and MS-FORTRAN
|
||
libraries) as follows:
|
||
|
||
   FCOS       FRSTOR     FSINCOS    FXTRACT
   FDECSTP    FSAVE      FUCOM
   FINCSTP    FSETPM     FUCOMP
   FPREM1     FSIN       FUCOMPP
|
||
|
||
Additionally, some parts of the coprocessor architecture, like the status
|
||
register, are often emulated only partially or not at all. Some emulators do not
|
||
conform to the IEEE-754 standard in their implementation of the basic
|
||
arithmetic functions, while the hardware coprocessors do. Also, they
|
||
sometimes lack support for denormals (a special class of floating-point
|
||
numbers) although it is required by the standard. Not all the 80x87 emulators
|
||
support rounding control and precision control, also features required by
|
||
IEEE-754. Most of these omissions are aimed at making the emulator faster and
|
||
smaller. Because of the performance gap and these other shortcomings of
|
||
coprocessor emulators, a real coprocessor is a must for anybody planning to
|
||
do some serious computations. (At today's prices, this shouldn't pose much of
|
||
a problem to anybody!)
|
||
|
||
Nhuan Doduc (ndoduc@framentec.fr) has tested a number of standalone
|
||
coprocessor emulators for PCs, among them the two emulators, EM87 and
|
||
Franke387 V2.4, already mentioned. He found Franke387 to be the best in terms
|
||
of reliability, speed, and accuracy.
|
||
|
||
|
||
|
||
=============================
|
||
Installing a math coprocessor
|
||
=============================
|
||
|
||
Usually, installing a coprocessor doesn't pose much of a problem, as every
|
||
coprocessor comes with installation instructions and a diagnostic disk that
|
||
lets you check its correct operation after installation. In addition, the
|
||
user manuals of most computers have a section on coprocessor installation.
|
||
|
||
1) Make sure to buy the right coprocessor for your system. An 8087 works
|
||
together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or
|
||
compatible works with an 80286 CPU. (There are also some old 386
|
||
motherboards that accept an 80287 coprocessor, but they usually also
|
||
provide a socket for the 387; given today's pricing, it makes no sense
|
||
not to get a 387 for these systems.) An 80387, 387DX or compatible
|
||
coprocessor is for 386-based systems, as is the Intel RapidCAD. 387
|
||
coprocessors also work with the Cyrix 486DLC CPU (which, despite its
|
||
name, does not include an FPU). Similarly, the 387SX or compatible
|
||
coprocessor goes into systems whose CPU is a 386SX or Cyrix 486SLC.
|
||
|
||
The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC
|
||
socket in the system; this is *not* the same socket used by an 80387 or
|
||
compatible chip, and some computers, such as IBM's PS/2s, don't have
|
||
this socket. The Weitek Abacus 4167 works together with the 486 and
|
||
requires a special 142-pin socket to be present.
|
||
|
||
2) Always install a coprocessor that's rated at the same clock speed as the
|
||
CPU. For example, in a 40 MHz 386 system using an AMD Am386-40, install
|
||
a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, C&T 38700DX-40,
|
||
IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its specified
|
||
frequency rating may cause it to produce false results, which you might
|
||
fail to recognize as such. (I have personally experienced this problem
|
||
with a Cyrix 83D87-33 that I tried to push to 40 MHz. It passed all the
|
||
diagnostic benchmarks on the Cyrix diagnostic disk and the tests of some
|
||
commercial system test programs. However, I found it to fail the
|
||
Whetstone and Linpack benchmarks, which include accuracy checks.)
|
||
Although there is usually no problem with overheating when pushing a
|
||
coprocessor over the specified maximum frequency rating, be warned that
|
||
operation of a coprocessor above the maximum ratings stated by the
|
||
manufacturer may make its operation unreliable.
|
||
|
||
Some 386 boards allow the coprocessor to be clocked differently than the
|
||
CPU. This is called "asynchronous operation" and allows you, for
|
||
example, to run the coprocessor at 33 MHz while the CPU runs at 40 MHz.
|
||
Of the currently available math coprocessors, only the Intel 80387 and
|
||
387DX support asynchronous operation. The 387-compatible "clones" from
|
||
Cyrix, C&T, IIT and ULSI always run at the full speed of the CPU, even
|
||
if you have set up your motherboard for asynchronous operation.
|
||
|
||
3) Once you've got the correct coprocessor for your system you can start
|
||
the actual installation process. Turn off the computer's power switch
|
||
and unplug the power cord from the wall outlet, remove the case, and
|
||
locate the math coprocessor socket. This socket is always located right
|
||
next to the main CPU, which can be identified by the printing on top of
|
||
the chip. (It's also usually one of the biggest chips on the board). The
|
||
8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on
|
||
each of the longer sides. The 387SX PLCC socket is a square socket that
|
||
has 17 vertical connector strips on the 'wall' of each side. The 387 PGA
|
||
socket is square and has two rows of pin holes on each side. The EMC
|
||
socket for the Weitek 3167 is similar but has three rows of holes on
|
||
each side. The PGA socket for the Weitek 4167 is also square with three
|
||
rows of holes on each side. If you can't find the math coprocessor
|
||
socket, consult your owner's manual, your computer dealer, or a
|
||
knowledgeable friend.
|
||
|
||
If you are installing the Intel RapidCAD chipset in a 386 system, you
|
||
will have to remove the 386 CPU first. Intel provides an easy-to-use
|
||
chip extractor and a storage box for the 386 chip for this purpose. Just
|
||
follow the instructions in the RapidCAD installation manual.
|
||
|
||
On many systems, the motherboard is supported only at a small number of
|
||
points. Since considerable force is required to insert a pin grid chip
|
||
like the 80387, RapidCAD, or Weitek Abacus 3167 into its socket, the
|
||
board may bend quite a lot due to the insertion pressure. This could
|
||
cause cracks in the board's conductive traces that may render it
|
||
intermittently or completely inoperable. Damage done to the board in
|
||
this way is usually not covered by the computer's warranty! Therefore,
|
||
it may be a good idea to first check how much the board bends by
|
||
pressing on the math coprocessor socket with your finger. If you find it
|
||
to bend easily, try to put something under the board directly beneath
|
||
the coprocessor socket. If this is impossible, as it is in many desktop
|
||
cases, consider removing the whole motherboard from the case, and
|
||
placing it on a hard, flat surface free of static electricity. (You will
|
||
also have to do this if your system's CPU and coprocessor socket are on
|
||
a separate card rather than on the motherboard, as is typical in many
|
||
modular systems.)
|
||
|
||
Be sure you are properly grounded before you remove the coprocessor from
|
||
its antistatic box, as even a tiny jolt of static electricity can ruin
|
||
the coprocessor. Make sure you do not touch the pins on the bottom of
|
||
the chip.
|
||
|
||
Check the pins and make sure none are bent; if some are, you can
|
||
*carefully* straighten them with needle-nose pliers or tweezers.
|
||
|
||
4) Match the coprocessor's orientation with the orientation of the socket.
|
||
Correct orientation of the coprocessor is absolutely essential, because
|
||
if you insert it the wrong way it may be damaged.
|
||
|
||
8087 and 287 coprocessors have a notch on one of the shorter sides of their
|
||
rectangular DIL package that should be matched with the notch of the
|
||
coprocessor socket. Usually the 286 CPU and the 287 coprocessor are
|
||
placed alongside each other and both have the same orientation (that
|
||
is, their respective notches point in the same direction). 387SX
|
||
coprocessors feature a white dot or similar mark that matches with some
|
||
sort of marking on the socket. 387 coprocessors have a bevelled corner
|
||
that is also marked with a white dot or similar marking. This should be
|
||
matched with the bevelled or otherwise marked corner of the socket. If
|
||
your system has only a large EMC socket and you are installing a 387 in
|
||
it, you will leave one row of pin holes free on each side of the chip.
|
||
|
||
Once you have found the correct orientation, place the chip over the
|
||
socket and make sure all pins are correctly aligned with their
|
||
respective holes. Press firmly and evenly on the chip -- you may have to
|
||
press hard to seat the coprocessor all the way. Again, make sure your
|
||
motherboard does not bend more than slightly under the insertion
|
||
pressure. For 8087, 287, and 387 coprocessors it is normal that the
|
||
coprocessor does not go all the way in; about one millimeter (1/25 inch)
|
||
of space is usually left between the socket and the bottom of the
|
||
coprocessor chip. (This allows the insertion of an extraction device
|
||
should it become necessary to remove the chip. Note that the
|
||
construction of the 387SX's PLCC socket makes it next-to-impossible to
|
||
remove the coprocessor once fully inserted, as the top of the chip is
|
||
level with the socket's 'walls'.)
|
||
|
||
5) Check your computer's manual for the proper position of any jumpers or
|
||
switches that need to be set to tell the system it now has a coprocessor
|
||
(and possibly, which kind it has). Put the cover back on the system
|
||
unit, reconnect the power, and turn on your computer. Depending on your
|
||
system's BIOS, you may now have to run a setup or configuration program
|
||
to enable the coprocessor. Finally, run the programs supplied on the
|
||
diagnostic disk (included with your coprocessor) to check for its
|
||
correct operation.
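As noted under step 2, a diagnostic program that only exercises the chip is
not enough; what matters is whether the results are right. The following is
a minimal sketch of an accuracy check built from identities that must hold
for any correctly working floating-point unit. It is no substitute for
Whetstone, Linpack, or the vendor's diagnostics, and the sample count,
step size, and tolerance are arbitrary choices of mine.

```python
import math


def fpu_self_check(samples=1000, tol=1e-12):
    """Evaluate a few identities over a range of arguments and return a
    list of (test name, argument) pairs for every check that failed."""
    failures = []
    for i in range(1, samples + 1):
        x = i * 0.37
        # Square root followed by squaring must give back x.
        if abs(math.sqrt(x) ** 2 - x) > tol * x:
            failures.append(("sqrt", x))
        # sin^2 + cos^2 must equal 1.
        if abs(math.sin(x) ** 2 + math.cos(x) ** 2 - 1.0) > tol:
            failures.append(("sincos", x))
        # exp and log must be inverse functions.
        if abs(math.exp(math.log(x)) - x) > tol * x:
            failures.append(("explog", x))
    return failures


bad = fpu_self_check()
print("failures:", len(bad))
```

A non-empty result list on a freshly installed (or overclocked) coprocessor
would be a reason to distrust it, even if the diagnostic disk reports no
problems.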
|
||
|
||
|
||
|
||
=================================================================
|
||
Descriptions of available coprocessors, CPU+FPU (as of 01-11-93):
|
||
=================================================================
|
||
|
||
Intel 8087
|
||
|
||
[43] This was the first coprocessor that Intel made available for the
|
||
80x86 family. It was introduced in 1980 and therefore does not have full
|
||
compatibility with the IEEE-754 standard for floating-point arithmetic
|
||
(which was finally released in 1985). It complements the 8088 and 8086
|
||
CPUs and can also be interfaced to the 80188 and 80186 processors.
|
||
|
||
The 8087 is implemented using NMOS. It comes in a 40-pin CERDIP (ceramic
|
||
dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10
|
||
MHz (8087-1) versions. Power consumption is rated at max. 2400 mW [42].
|
||
|
||
A neat trick to enhance the processing power of the 8087 for
|
||
computations that use only the basic arithmetic operations (+,-,*,/) and
|
||
do not require high precision is to set the precision control to single-
|
||
precision. This gives one a performance increase of up to 20%. For
|
||
details about programming the precision control, see program PCtrl in
|
||
appendix A.
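For readers who just want to see which bits are involved: precision control
occupies bits 8 and 9 of the 80x87 control word. The sketch below only does
the bit arithmetic on a control word value; it is not the PCtrl program, and
the names are my own. On a real system the resulting value would be loaded
with FLDCW. The starting value 037Fh is used purely as an example.

```python
PC_MASK = 0x0300      # precision-control field, bits 8-9
PC_SINGLE = 0x0000    # 24-bit significand
PC_DOUBLE = 0x0200    # 53-bit significand
PC_EXTENDED = 0x0300  # 64-bit significand


def with_precision(control_word, pc):
    """Return 'control_word' with its precision-control field set to 'pc',
    leaving the exception masks and rounding control untouched."""
    if pc not in (PC_SINGLE, PC_DOUBLE, PC_EXTENDED):
        raise ValueError("invalid precision-control setting")
    return (control_word & 0xFFFF & ~PC_MASK) | pc


cw = with_precision(0x037F, PC_SINGLE)
print("%04Xh" % cw)
```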
|
||
|
||
With the help of an additional chip, the 8087 can in theory be
|
||
interfaced to an 80186 CPU [36]. The 80186 was used in some PCs (e.g.
|
||
from Philips, Siemens) in the 1982/1983 time frame, but with IBM's
|
||
introduction of the 80286-based AT in 1984, it soon lost all
|
||
significance for the PC market.
|
||
|
||
|
||
Intel 80187
|
||
|
||
The 80187 is a rather new coprocessor designed to support the 80C186
|
||
embedded controller (a CMOS version of the 80186 CPU; see above). It was
|
||
introduced in 1989 and implements the complete 80387 instruction set. It
|
||
is available in a 40 pin CERDIP (ceramic dual inline package) and a 44
|
||
pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation.
|
||
Power consumption is rated at max. 675 mW for the 12.5 MHz version and
|
||
max. 780 mW for the 16 MHz version [37].
|
||
|
||
|
||
Intel 80287
|
||
|
||
[44] This is the original Intel coprocessor for the 80286, introduced in
|
||
1983. It uses the same internal execution unit as the 8087 and therefore
|
||
has the same speed (actually, it is sometimes slower due to additional
|
||
overhead in CPU-coprocessor communication). As with the 8087, it does
|
||
not provide full compatibility with the IEEE-754 floating point standard
|
||
released in 1985.
|
||
|
||
The 80287 was manufactured in NMOS technology, and is packaged in a 40-
|
||
pin CERDIP (ceramic dual inline package). There are 6 MHz, 8 MHz, and 10
|
||
MHz versions. Power consumption can be estimated to be the same as that
|
||
for the 8087, which is 2400 mW max.
|
||
|
||
The 80287 has been replaced in the Intel 80x87 family with its faster
|
||
successor, the CMOS-based Intel 287XL, which was introduced in 1990 (see
|
||
below). There may still be a few of the old 80287 chips on the market,
|
||
however.
|
||
|
||
|
||
Intel 80287XL
|
||
|
||
This chip is Intel's second-generation 287, first introduced in 1990.
|
||
Since it is based on the 80387 coprocessor core, it features full IEEE
|
||
754 compatibility and faster instruction execution. Intel claims about
|
||
50% faster operation than the 80287 for typical benchmark tests such as
|
||
Whetstone [45]. Comparison with benchmark results for the AMD 80C287,
|
||
which is identical to the Intel 80287, supports this claim [1]: The Intel
|
||
287XL performed 66% faster than the AMD 80C287 on a fractal benchmark
|
||
and 66% faster on the Whetstone benchmark in these tests. Whetstone
|
||
results from [46] show the Intel 287XL at 12.5 MHz to perform 552
|
||
kWhets/sec as opposed to the AMD 80C287's 289 kWhets/sec, a 91%
|
||
performance increase. A benchmark using the MathPak program showed the
|
||
Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0
|
||
sec.) [26]. Since the 287XL has all the additional instructions and
|
||
enhancements of a 387, most software automatically identifies it as an
|
||
80387-compatible coprocessor and therefore can make use of extra 387-
|
||
only features, such as the FSIN and FCOS instructions.
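The percentages quoted above come from two different kinds of figures: rates
(kWhets/sec, where bigger is better) and run times (seconds, where smaller is
better). The following sketch shows how each is converted into a "percent
faster" value; the helper names are hypothetical, the input numbers are the
ones quoted from [46] and [26].

```python
def gain_from_rates(new_rate, old_rate):
    """Percent speed gain when performance is given as a rate."""
    return (new_rate / old_rate - 1.0) * 100.0


def gain_from_times(new_time, old_time):
    """Percent speed gain when performance is given as a run time."""
    return (old_time / new_time - 1.0) * 100.0


# Whetstone at 12.5 MHz: Intel 287XL vs. AMD 80C287 [46]
print(round(gain_from_rates(552, 289)))
# MathPak: Intel 287XL 6.9 sec. vs. Intel 80287 11.0 sec. [26]
print(round(gain_from_times(6.9, 11.0)))
```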
|
||
|
||
The 287XL is manufactured in CMOS and therefore uses much less power
|
||
than the older NMOS-based 80287. At 12.5 MHz, the power consumption is
|
||
rated at max. 675 mW, about 1/4 of the 80287 power consumption. The
|
||
287XL is available in either a 40-pin CERDIP (ceramic dual inline
|
||
package) or a 44 pin PLCC (plastic leaded chip carrier). (This latter
|
||
version is called the 287XLT and intended mainly for laptop use.) The
|
||
287XL is rated for speeds of up to 12.5 MHz.
|
||
|
||
|
||
AMD 80C287
|
||
|
||
This chip, manufactured by Advanced Micro Devices (AMD), is an exact
|
||
clone of the old Intel 80287, and was first brought to market by AMD in
|
||
1989. It contains the original microcode of the 80287 and is therefore
|
||
100% compatible with it. However, as the name indicates, the 80C287 is
|
||
manufactured in CMOS and therefore uses less power than an equivalent
|
||
Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW
|
||
or slightly less than that of the Intel 80287XL [27]. There is also
|
||
another version called AMD 80EC287 that uses an 'intelligent' power save
|
||
feature to reduce the power consumption below 80C287 levels. Tests at
|
||
10.7 MHz show typical power consumption for the 80EC287 to be at 30 mW,
|
||
compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and
|
||
1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally
|
||
suited for low power laptop systems.
|
||
|
||
The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. (I have
|
||
only seen it being offered in 10 MHz and 12 MHz versions, however.) At
|
||
about US$ 50, it is currently the cheapest coprocessor available. Note
|
||
that it provides less performance than the newer Intel 287XL (see
|
||
above). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs
|
||
(dual inline package) and as 44 pin PLCC (plastic leaded chip carrier).
|
||
|
||
Due to recent legal battles with Intel over the right to use the 287
|
||
microcode, which AMD lost, AMD may have to discontinue this product
|
||
(disclaimer: I am not a legal expert).
|
||
|
||
|
||
Cyrix 82S87
|
||
|
||
This 80287-compatible chip was developed from the Cyrix 83D87 (Cyrix's
|
||
80387 'clone') and has been available since 1991. It complies completely
|
||
with the IEEE-754 standard for floating-point arithmetic and features
|
||
nearly total compatibility with Intel's coprocessors, including
|
||
implementation of the full Intel 80387 instruction set. It implements
|
||
the transcendental functions with the same degree of accuracy and the
|
||
superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the
|
||
fastest [1] and most accurate 287 compatible coprocessor available.
|
||
Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5
|
||
MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87
|
||
chips manufactured after 1991 use the internals of the Cyrix 387+, which
|
||
succeeds the original 83D87 [73].
|
||
|
||
The 82S87 is a fully static CMOS design with very low power requirements
|
||
that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the
|
||
82S87 to consume about the same amount of power as the AMD 80C287 (see
|
||
above). The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded
|
||
chip carrier) compatible with the pinout of the Intel 287XLT and
|
||
ideally suited for laptop use.
|
||
|
||
|
||
IIT 2C87
|
||
|
||
This chip was the first 80287 clone available, introduced to the market
|
||
in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87
|
||
implements the full 80387 instruction set [38]. Tests I ran on the 3C87
|
||
seem to indicate that it is not fully compatible with the IEEE-754
|
||
standard for floating-point arithmetic (see below for details), so it
|
||
can be assumed that the 2C87 also fails these tests (as it presumably
|
||
uses the same core as the 3C87).
|
||
|
||
The IIT 2C87 provides extra functions not available on any other 287
|
||
chip [38]. It has 24 user-accessible floating-point registers organized
|
||
into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2)
|
||
allow switching from one bank to another. (Transfers between registers
|
||
in different banks are not supported, however, so this feature by itself
|
||
is of limited usefulness. Also, there seems to be only one status
|
||
register (containing the stack top pointer), so it has to be manually
|
||
loaded and stored when switching between banks with a different number
|
||
of registers in use [40]). The register banks' main purpose is to aid
|
||
the fourth additional instruction the 2C87 has (F4X4), which does a full
|
||
multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-
|
||
graphics applications [39]. The built-in matrix multiply speeds this
|
||
operation up by a factor of 6 to 8 when compared to a programmed
|
||
solution according to the manufacturer [38]. Tests show the speed-up to
|
||
be indeed in this range [40]. For the 3C87, I measured the execution
|
||
time of F4X4 to be about 280 clock cycles; the execution time on the
|
||
2C87 should be somewhat larger - I estimate it to be around 310 clock
|
||
cycles due to the higher CPU-NDP communication overhead in instruction
|
||
execution in 286/287 systems (~45-50 clock cycles) compared with 386/387
|
||
systems (~16-20 clock cycles). As desirable as the F4X4 instruction may
|
||
seem, however, there are very few applications that make use of it when
|
||
an IIT coprocessor is detected at run time (among them Schroff
|
||
Development's Silver Screen and Evolution Computing's Fast-CAD 3-D
|
||
[25]).
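The estimate of about 310 clock cycles for F4X4 on the 2C87 can be reproduced
with simple arithmetic. The sketch below is one way to arrive at a figure of
that size: it takes the 280 cycles measured on the 3C87, removes the typical
386/387 communication overhead and adds the typical 286/287 overhead, using
the midpoint of each range. Treating the overhead as a single per-instruction
cost is an assumption made for this illustration.

```python
def estimate_2c87_cycles(measured_3c87=280,
                         overhead_286=(45, 50),
                         overhead_386=(16, 20)):
    """Estimate the F4X4 execution time on a 2C87 from the 3C87 figure by
    swapping the CPU-NDP communication overhead (midpoints of the ranges)."""
    mid_286 = sum(overhead_286) / 2
    mid_386 = sum(overhead_386) / 2
    return measured_3c87 - mid_386 + mid_286


print(estimate_2c87_cycles())
```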
|
||
|
||
The 2C87 is available for speeds of up to 20 MHz. It is implemented in
|
||
an advanced CMOS process and therefore has a low power consumption of
|
||
typically about 500 mW [38].
|
||
|
||
|
||
Intel 80387
|
||
|
||
This chip was the first generation of coprocessors designed specifically
|
||
for the Intel 80386 CPU. It was introduced in 1986, about one year after
|
||
the 80386 was brought to market. Early 386 systems were therefore
|
||
equipped with both an 80287 and an 80387 socket. The 80386 does work with
|
||
an 80287, but the numerical performance is hardly adequate for such a
|
||
system.
|
||
|
||
The 80387 has itself since been superseded by the Intel 387DX introduced
|
||
by a quiet change in 1989 (see below). You might find it when acquiring
|
||
an older 386 machine, though. The old 80387 is about 20% slower than the
|
||
newer 387DX.
|
||
|
||
The 80387 is packaged in a 68-pin ceramic PGA, and was manufactured
|
||
using Intel's older 1.5 micron CHMOS III technology, giving it moderate
|
||
power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW
|
||
typical), at 20 MHz max. 1550 mW (950 mW typical), and at 25 MHz max.
|
||
1950 mW (1250 mW typical) [60].
|
||
|
||
|
||
Intel 387DX
|
||
|
||
The 387DX is the second-generation Intel 387; it was quietly introduced
|
||
to replace the original 80387 in 1989. This version is done in a more
|
||
advanced CMOS process which enables the coprocessor to run at a maximum
|
||
frequency of 33 MHz (the 80387 was limited to a maximum frequency of 25
|
||
MHz). The 387DX is also about 20% faster than the 80387 on the average
|
||
for the same clock frequency. For a 386/387 system operating at 29 MHz
|
||
the Whetstone benchmark (compiled with the highly optimizing Metaware
|
||
High-C V1.6) runs at 2377 kWhetstones/sec for the 80387 and at 2693
|
||
kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation
|
||
programmed in assembly language, the 387DX performance was 28% higher
|
||
than the performance of the 80387. The transcendental functions have
|
||
also sped up from the 80387 to the 387DX. In the Savage benchmark
|
||
(again, compiled with Metaware High-C V1.6 and running on a 29 MHz
|
||
system), the 80387 evaluated 77600 function calls/second, while the
|
||
387DX evaluated 97800 function calls/second, a 26% increase [7]. Some
|
||
instructions have been sped up a lot more than the average 20%. For
|
||
example, the performance of the FBSTP instruction has increased by a
|
||
factor of 3.64.
|
||
|
||
The Intel 387DX (and its predecessor 80387) are the only 387
|
||
coprocessors that support asynchronous operation of CPU and coprocessor.
|
||
The 387 consists of a bus interface unit and a numerical execution unit.
|
||
The bus interface unit always runs at the speed of the CPU clock
|
||
(CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc,
|
||
the numerical execution unit runs at the same speed as the bus interface
|
||
unit. If CKM is tied to ground, the numerical execution unit runs at the
|
||
speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor
|
||
clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10.
|
||
For example, for a 20 MHz 386, the Intel 387DX could be clocked from
|
||
12.5 MHz to 28 MHz via the NUMCLK2 input. (On the Cyrix 83D87, Cyrix
|
||
387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These
|
||
coprocessors are therefore not capable of asynchronous operation and
|
||
always run at the speed of the CPU.)
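The permitted clock range for asynchronous operation follows directly from
the two ratios given above. A small sketch (function names are my own) that
computes the range for a given CPU clock and checks a proposed combination:

```python
from fractions import Fraction

MIN_RATIO = Fraction(10, 16)  # lowest allowed NUMCLK2 : CPUCLK2
MAX_RATIO = Fraction(14, 10)  # highest allowed NUMCLK2 : CPUCLK2


def numclk_range(cpu_mhz):
    """Return (lowest, highest) coprocessor clock in MHz that the ratio
    limits allow for a CPU running at 'cpu_mhz'."""
    cpu = Fraction(cpu_mhz)
    return float(cpu * MIN_RATIO), float(cpu * MAX_RATIO)


def is_allowed(cpu_mhz, fpu_mhz):
    """Check whether a CPU/coprocessor clock pair satisfies the limits.
    This checks the ratio only, not the speed rating of the chip itself."""
    ratio = Fraction(fpu_mhz) / Fraction(cpu_mhz)
    return MIN_RATIO <= ratio <= MAX_RATIO


print(numclk_range(20))    # the 20 MHz example from the text
print(is_allowed(40, 33))  # 33 MHz coprocessor with a 40 MHz CPU
```

Note that the ratio limits say nothing about the coprocessor's own frequency
rating, which still must not be exceeded.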
|
||
|
||
The Intel 387DX is manufactured using Intel's advanced low power CHMOS
|
||
IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW
|
||
typical), at 25 MHz max. 1050 mW (625 mW typical), and at 33 MHz max.
|
||
1250 mW (750 mW typical) [59].
|
||
|
||
|
||
Intel 387SX
|
||
|
||
This is the coprocessor paired with the Intel 386SX CPU. The 386SX is an
|
||
Intel 80386 with a 16-bit, rather than 32-bit, data path. This reduces
|
||
(somewhat) the costs to build a 386SX system as compared to a full 32-
|
||
bit design required by a 386DX. (The 386SX's main *marketing* purpose
|
||
was to replace the 80286 CPU, which was being sold more cheaply by other
|
||
manufacturers [such as AMD], and which Intel subsequently stopped
|
||
producing.) Due to the 16-bit data path, the 386SX is slower than the
|
||
386DX and offers about the same speed as an 80286 at the same clock
|
||
frequency for 16-bit applications. But as the 386SX is a complete 80386
|
||
internally, it also offers the possibility of running 32-bit applications
|
||
and supports the virtual 8086 mode (used for example by Windows' 386
|
||
enhanced mode).
|
||
|
||
The 387SX has all the features of the Intel 80387, including the ability
|
||
of asynchronous operation of CPU and coprocessor (see Intel 387DX
|
||
information, above). Due to the 16 bit data path between the CPU and the
|
||
coprocessor, the 387SX is a bit slower than an 80387 operating at the
|
||
same frequency. In addition, the 387SX is based on the core of the
|
||
original 80387, which executes instructions slower than the second
|
||
generation 387DX.
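The effect of the narrower data path can be made concrete by counting how
many bus transfers an operand needs. This is a simplified sketch that ignores
alignment, wait states, and the rest of the CPU-coprocessor protocol; it only
counts data transfers, and the function name is hypothetical.

```python
def bus_transfers(operand_bits, bus_bits):
    """Number of transfers needed to move an operand of 'operand_bits'
    over a data path 'bus_bits' wide (rounded up)."""
    return -(-operand_bits // bus_bits)


for name, bits in (("single", 32), ("double", 64), ("extended", 80)):
    print(name, bus_transfers(bits, 16), bus_transfers(bits, 32))
```

Since most of an instruction's time is spent computing rather than moving
operands, the doubled transfer count makes the 387SX only a bit slower, as
stated above.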
|
||
|
||
The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package
|
||
and is available in 16 MHz and 20 MHz versions. (Coprocessors for faster
|
||
386SX systems based on the Am386SX CPU are available from IIT, Cyrix,
|
||
and ULSI.) Power consumption for the 387SX at 16 MHz is max. 1250 mW
|
||
(740 mW typical); for the 20 MHz version it is max. 1500 mW (1000 mW
|
||
typical) [62].
|
||
|
||
|
||
Intel 387SL
|
||
|
||
This coprocessor is designed for use in systems that contain an Intel
|
||
386SL as the CPU. The 386SL is directly derived from the 386SX. It is a
|
||
static CHMOS IV design with very low power requirements that is intended
|
||
to be used in notebook and laptop computers. It features an integrated
|
||
cache controller, a programmable memory controller, and hardware support
|
||
for expanded memory according to the LIM EMS 4.0 standard. The 387SL,
|
||
introduced in early 1992, has been designed to accompany the 386SL in
|
||
machines with low power consumption and to replace the 387SX for this
|
||
purpose. It features advanced power saving mechanisms. It is based on
|
||
the 387DX core, rather than on the older and slower 80387 core (which is
|
||
used by the 387SX).
|
||
|
||
|
||
IIT 3C87
|
||
|
||
This IIT chip was introduced in 1989, about the same time as the Cyrix
|
||
83D87. Both coprocessors are faster than Intel's 387DX coprocessor. The
|
||
IIT 3C87 also provides extra functions not available on any other 387
|
||
chip [38]. It has 24 user-accessible floating-point registers organized
|
||
into three register banks. Three additional instructions (FSBP0, FSBP1,
|
||
FSBP2) allow switching from one bank to another. (Transfers between
|
||
registers in different banks are not supported, however, so this feature
|
||
by itself is of limited usefulness. Also, there seems to be only one
|
||
status register [containing the stack top pointer], so it has to be
|
||
manually loaded and stored when switching between banks with a different
|
||
number of registers in use [40]). The register banks' main purpose is to
|
||
aid the fourth additional instruction the 3C87 has (F4X4), which does a
|
||
full multiply of a 4x4 matrix by a 4x1 vector, an operation common in
|
||
3D-graphics applications [39]. The built-in matrix multiply speeds this
|
||
operation up by a factor of 6 to 8 when compared to a programmed
|
||
solution according to the manufacturer [38]. Tests show the speed-up to
|
||
be indeed in this range [40]. I measured the F4X4 to execute in about
|
||
280 clock cycles, during which time it executes 16 multiplications and
|
||
12 additions. The built-in matrix multiply speeds up the matrix-by-
|
||
vector multiply by a factor of 3 compared with a programmed solution
|
||
according to IIT [39]. The results for my own TRNSFORM benchmark support
|
||
this claim (see results below), showing a performance increase by a
|
||
factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly
|
||
as fast as on an Intel 486 at the same clock frequency. As desirable as
|
||
the F4X4 instruction may seem, however, there are very few applications
|
||
that make use of it when an IIT coprocessor is detected at run time
|
||
(among them Schroff Development's Silver Screen and Evolution
|
||
Computing's Fast-CAD 3-D [25]).
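For reference, this is what the "programmed solution" computes. The sketch
below multiplies a 4x4 matrix by a 4x1 vector and counts the operations,
which confirms the 16 multiplications and 12 additions mentioned above. It
shows the arithmetic only and makes no claim about how F4X4 is implemented
inside the chip or how operands are laid out in the register banks.

```python
def matvec4(m, v):
    """Multiply the 4x4 matrix 'm' (list of four rows) by the 4-element
    vector 'v'. Returns (result, multiplications, additions)."""
    mults = adds = 0
    result = []
    for row in m:
        acc = row[0] * v[0]
        mults += 1
        for a, x in zip(row[1:], v[1:]):
            acc += a * x
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds


identity = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]
print(matvec4(identity, [1, 2, 3, 4]))
```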
|
||
|
||
These IIT-specific instructions also work correctly when using a Chips &
|
||
Technologies 38600DX or a Cyrix 486DLC CPU, which are both marketed as
|
||
faster replacements for the Intel 386DX CPU.
|
||
|
||
Tests I ran with the IEEETEST program show that the 3C87 is not fully
|
||
compatible with the IEEE-754 standard for floating-point arithmetic,
|
||
although the manufacturer claims otherwise. It is indeed possible that
|
||
the reported errors are due to personal interpretations of the standard
|
||
by the program's author that have been incorporated into IEEETEST and
|
||
that the standard also supports the different interpretation chosen by
|
||
IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST
|
||
have become somewhat of an industry standard [66] and Intel's 387, 486,
|
||
and RapidCAD chips pass the test without a single failure, so the fact
|
||
that the IIT 3C87 fails some of the tests indicates that it is not fully
|
||
compatible with the Intel 387 coprocessor. My tests also show that the
|
||
IIT 3C87 does not support denormals for the double extended format. It
|
||
is not entirely clear whether the IEEE standard mandates support for
|
||
extended precision denormals, as the IEEE-754 document explicitly only
|
||
mentions single and double-precision denormals. Missing support for
|
||
denormals is not a critical issue for most applications, but there are
|
||
some programs for which support of denormals is at the very least quite
|
||
helpful [41]. In any case, failure of the 3C87 to support extended
|
||
precision denormal numbers does represent an incompatibility with the
|
||
Intel 387 and 486 chips.
|
||
|
||
The 3C87 is implemented in an advanced CMOS process and has low power
|
||
requirements, typically about 600 mW. Like the 387 'clones' from Cyrix
|
||
and ULSI, the 3C87 does not support asynchronous operation of the CPU
|
||
and the coprocessor, but always runs at the full speed of the CPU. It is
|
||
available in 16, 20, 25, 33, and 40 MHz versions.
|
||
|
||
|
||
IIT 3C87SX
|
||
|
||
This is the version of the IIT 3C87 that is intended for use with
|
||
Intel's 386SX or AMD's Am386SX CPU, and is functionally equivalent to
|
||
the IIT 3C87. Due to the 16-bit data path between the CPU and the
|
||
coprocessor in a 386SX-based system, coprocessor instructions will
|
||
execute somewhat more slowly than on the 3C87. At present, the IIT
|
||
3C87SX is the only 387SX coprocessor that is offered at speeds of 16,
|
||
20, 25, and 33 MHz. (I have read that Cyrix has also announced an 83S87-
|
||
33, but haven't seen it being offered yet.) The 3C87SX is packaged in a
|
||
68-pin PLCC.
|
||
|
||
|
||
Cyrix FasMath 83D87
|
||
|
||
This chip was introduced in 1989, only shortly after the coprocessors
|
||
from IIT. It has been found to be the fastest 387-compatible coprocessor
|
||
in several benchmark comparisons [1,7,68,69]. It also came out as the
|
||
fastest coprocessor in my own tests (see benchmark results below).
|
||
Although the Cyrix 83D87 provides up to 50% more performance than the
|
||
Intel 387DX in benchmark comparisons, the speed advantage over other
|
||
387-compatible coprocessors in real applications is usually much
|
||
smaller, because coprocessor instructions represent only a small part of
|
||
the total application code. For example, in a test using the program 3D-
|
||
Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1].
|
||
|
||
Besides being the fastest 387 coprocessor, the 83D87 also offers the
|
||
most accurate transcendental function results of all coprocessors
|
||
tested (see test results below). The new "387+" version of the 83D87,
|
||
available since November 1991, even surpasses the level of accuracy of
|
||
the original 83D87 design. Note that the name 387+ is used in European
|
||
distribution only. In other parts of the world, the new chip still goes
|
||
by the name 83D87.
|
||
|
||
Unlike Intel's coprocessors, which use the CORDIC [18,19] algorithm to
|
||
compute the transcendental functions, Cyrix uses polynomial and rational
|
||
approximations to the functions. In the past the CORDIC method has been
|
||
popular since it requires only shifts and adds, which made it relatively
|
||
easy to implement a reasonably fast algorithm. Recently, the cost for the
|
||
implementation of fast floating-point hardware multipliers has dropped
|
||
significantly (due to the availability of VLSI), making the use of
|
||
polynomial and rational approximations superior to CORDIC for the
|
||
generation of transcendental functions [61]. The Cyrix 83D87 uses a fast
|
||
array multiplier, making its transcendental functions faster than those
|
||
of any other 387 compatible coprocessor. It also uses 75 bits for the
|
||
mantissa in intermediate calculations (as opposed to 68 bits on other
|
||
coprocessors), making its transcendental functions more accurate than
|
||
those of any other coprocessor or FPU (see results below).
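
The difference between the two approaches can be sketched in a few lines of Python. The first function is a textbook rotation-mode CORDIC, which needs only shifts, adds, and a small table of arctangents. The second evaluates an odd polynomial in Horner form, which needs a handful of multiplications. Both are illustrations in ordinary double precision, not the microcode of any chip; in particular, the Taylor coefficients used here merely stand in for the carefully tuned polynomial or rational coefficients a real design would use, and the iteration count is an arbitrary choice of mine.

```python
import math

N_ITER = 40
# Table of atan(2^-i); in hardware this would be a small ROM.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(N_ITER)]
# CORDIC gain correction: product of 1/sqrt(1 + 2^-2i).
K = 1.0
for i in range(N_ITER):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(angle):
    """Rotation-mode CORDIC, valid for |angle| <= pi/2.

    Returns (sin, cos). Each step uses only add, subtract and a
    scaling by 2^-i (a right shift in fixed-point hardware).
    """
    x, y, z = K, 0.0, angle
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * ATAN_TABLE[i]
    return y, x


# Taylor coefficients of sin(x)/x in powers of x^2 (stand-ins only).
SIN_COEFFS = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(6)]


def poly_sin(x):
    """sin(x) for |x| <= pi/4 via an odd polynomial in Horner form."""
    x2 = x * x
    acc = 0.0
    for c in reversed(SIN_COEFFS):
        acc = acc * x2 + c           # one multiply and one add per term
    return acc * x


print(cordic_sin_cos(0.5)[0], poly_sin(0.5), math.sin(0.5))
```

The CORDIC loop gains roughly one bit of accuracy per iteration, while the polynomial reaches a comparable result in a few multiply-add steps. That is why a fast hardware multiplier tips the balance toward the polynomial method.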
|
||
|
||
The 83D87 (and its successor, the 387+) are the 387 'clones' with the
|
||
highest degree of compatibility to the Intel 387DX. A few minor software
|
||
and hardware incompatibilities have been documented by Cyrix [12]. The
|
||
software differences are caused by some bugs present in the 387DX that
|
||
Cyrix fixed in the 83D87. Unlike the Intel 387DX, the 83D87 (and all
|
||
other 387-compatible chips as well) does not support asynchronous
|
||
operation of CPU and coprocessor. There were also problems in the past
|
||
with the CPU-coprocessor communications, causing the 83D87 to
|
||
occasionally hang on some machines. The reason behind this was that
|
||
Cyrix shaved off a wait state in the communication protocol, which
|
||
caused a communications breakdown between the CPU and the 83D87 for some
|
||
systems running at 25 MHz or faster. (One notable example of this
|
||
behavior was the Intel 302 board.) Also there were problems with boards
|
||
based on early revisions of the OPTI chipset. These problems are only
|
||
rarely encountered with the current generation of 386 motherboards, and
|
||
it is possible that they have been entirely eliminated in the 387+, the
|
||
successor to the 83D87.
|
||
|
||
To reduce power consumption, the 83D87 offers advanced power-saving
|
||
features. Those portions of the coprocessor that are not needed are
|
||
automatically shut down. If no coprocessor instructions are being
|
||
executed, *all* parts except the bus interface unit are shut down [12].
|
||
Maximal power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW, while
|
||
typical power consumption at this clock frequency is 500 mW [15].
|
||
|
||
|
||
Cyrix EMC87
|
||
|
||
This coprocessor is basically a special version of the Cyrix 83D87,
|
||
introduced in 1990. In addition to the normal 387 operating mode, in
|
||
which coprocessor-CPU communication is handled through reserved IO
|
||
ports, it also offers a memory-mapped mode of operation similar to the
|
||
operation principle of the Weitek Abacus. Like the Weitek chip, the
|
||
EMC87 occupies a block of memory starting at physical address C0000000h
|
||
(the Abacus occupies a memory block of 64 KB, while the EMC87 uses only
|
||
4 KB [77]). It can therefore only be accessed in the protected or
|
||
virtual modes of the 386 CPU. DOS programs can access the EMC87 with the
|
||
help of DOS extenders or memory managers like EMM386 which run in
|
||
protected/virtual mode themselves. To implement the memory-mapped
|
||
interface, the usual 80x87 architecture has been slightly expanded with
|
||
three additional registers and eleven additional instructions that can
|
||
only be used if the memory-mapped mode is enabled.
|
||
|
||
Using this special mode of the EMC87 provides a significant speed
|
||
advantage. The traditional 387 CPU-coprocessor interface via IO ports
|
||
has an overhead of about 14-20 clock cycles. Since the Cyrix 83D87
|
||
executes some operations like addition and multiplication in much less
|
||
time, its performance is actually limited by the CPU-coprocessor
|
||
interface. Since the memory-mapped mode has much less overhead, it
|
||
allows all coprocessor instructions to be executed at full speed with no
|
||
penalty.
|
||
|
||
Originally, Cyrix claimed support for the fast memory-mapped mode of the
|
||
EMC87 from a number of software vendors (including Borland and
|
||
Microsoft). However, there are only very few applications that make use
|
||
of it, among them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP
|
||
FORTRAN-386 compiler, Metaware's High-C compiler version 1.6 and newer,
|
||
and Intusoft's Spice [63,73]. Part of the problem in supporting the
|
||
memory-mapped mode is that the application must reserve one of the
|
||
general purpose registers of the CPU to use memory-mapped mode
|
||
instructions that access memory.
|
||
|
||
(Note that the EMC87 is *not* compatible with Weitek's Abacus
|
||
coprocessor. They both use the same CPU interface technique [memory
|
||
mapping], but while the EMC87 uses the standard 387 instruction set, the
|
||
Weitek Abacus coprocessors use a different instruction set that is entirely their
|
||
own.)
|
||
|
||
Since the EMC87 also provides the standard 386/387 CPU interface via IO
|
||
ports, it can be used just like any other 387-compatible coprocessor and
|
||
delivers the same performance as the Cyrix 83D87 in this mode. The EMC87
|
||
even allows mixed use of memory-mapped and traditional instructions in
|
||
the same code. Cyrix has also implemented some additional instructions
|
||
in the EMC87 that are also available in the 387-compatible mode:
|
||
FRICHOP, FRINT2, and FRINEAR. These instructions enable rounding to
|
||
integer without setting the rounding mode by manipulating the
|
||
coprocessor control word, and are intended to make life easier for
|
||
compiler writers.
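
The idea of "rounding to integer without touching the rounding mode" can be illustrated with a short Python sketch. The text above does not spell out what each instruction does, so the mapping here is an assumption based on the mnemonics only: FRICHOP is taken to chop (round toward zero) and FRINEAR to round to nearest. FRINT2 is not modeled. The function names are hypothetical, and only finite inputs are handled.

```python
import math


def round_chop(x):
    """Round toward zero, like the x87 'chop' rounding mode.

    Assumed (from the mnemonic) to correspond to FRICHOP.
    """
    return float(math.trunc(x))


def round_nearest_even(x):
    """Round to nearest, ties to even, like the x87 default mode.

    Assumed (from the mnemonic) to correspond to FRINEAR.
    """
    return float(round(x))


for value in (2.5, 3.5, -2.7):
    print(value, round_chop(value), round_nearest_even(value))
```

Without such instructions, a compiler that needs C-style truncation typically has to save the control word, switch to chop mode, execute FRNDINT or FIST, and restore the control word again, which is the overhead these additions are meant to avoid.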
|
||
|
||
In a test, the EMC87 at 33 MHz ran the single-precision Whetstone
|
||
benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had a
|
||
speed of only 5049 kWhetstones/sec, an increase of 50.6% [63]. In
|
||
another test, the EMC87 ran a fractal computation at twice the speed of
|
||
the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third
|
||
test found the EMC87's overall performance to be 20% higher than the
|
||
performance of the Cyrix 83D87 [65].
|
||
|
||
The Cyrix FasMath EMC87 has also been marketed as Cyrix AutoMATH; the
|
||
two chips are identical. Unlike the Cyrix 83D87, which fits into the 68-
|
||
pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and
|
||
requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that
|
||
not all boards have such a socket (a notable exception being IBM's
|
||
PS/2s, for example). The EMC87 is available in 25 and 33 MHz versions.
|
||
Maximum power consumption at 33 MHz is 2000 mW.
|
||
|
||
Cyrix appears currently to be phasing out the EMC87.
|
||
|
||
|
||
Cyrix FasMath 387+
|
||
|
||
This chip is the second-generation successor to the Cyrix 83D87. (The
|
||
name "387+" is only used for European distribution; in other parts of
|
||
the world, it goes by the original 83D87 designation.) According to a
|
||
source within Cyrix [73], the 387+ was designed to make a smaller (and
|
||
thus cheaper to manufacture) coprocessor chip that could also be pushed
|
||
to higher frequencies than the original chip: the 387+ is available in
|
||
versions of up to 40 MHz, whereas the original 83D87 could go no faster
|
||
than 33 MHz.
|
||
|
||
The Cyrix 387+ is ideally suited to be used with Cyrix's 486DLC CPU,
|
||
which is a 486SX-compatible replacement chip for the Intel 386DX.
|
||
Indeed Cyrix sells upgrade kits consisting of a 486DLC CPU and a
|
||
Cyrix 387+.
|
||
|
||
In my tests, I found the Cyrix 387+ to be about 5 to 10 percent
|
||
*slower* than the Cyrix 83D87. However, some instructions like the
|
||
square root (FSQRT) now run at only half the speed at which they ran in
|
||
the 83D87, and most transcendental functions show about a 40% drop in
|
||
performance compared to their 83D87 averages (see performance results,
|
||
below). However, I did find the transcendental functions on the 387+ to
|
||
be a bit *more* accurate than those implemented in the 83D87. The new
|
||
design uses a slower hardware multiplier that needs six clock cycles to
|
||
multiply the floating-point mantissa of an internal precision number,
|
||
while the multiplier in the 83D87 takes only 4 clocks to accomplish the
|
||
same task. Since the transcendental functions in Cyrix math coprocessors
|
||
are generated by polynomial and rational approximations, this slows them
|
||
down significantly.
|
||
|
||
The divide/square root logic has also been changed from the 83D87
|
||
design. The original design used an algorithm that could generate both
|
||
the quotient and square root, so the execution times for these
|
||
instructions were nearly identical. The algorithm chosen for the
|
||
division in the 387+ doesn't allow the square root to be taken so
|
||
easily, so it takes nearly twice as long.
|
||
|
||
In the 387+, the available argument range for the FYL2XP1 instruction
|
||
has been extended, from the usual range -1+sqrt(2)/2..sqrt(2)/2 that is
|
||
found on all 80x87 coprocessors, to include all floating-point numbers.
|
||
Also, four additional instructions have been implemented: FRICHOP
|
||
(opcode DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC), and FTSTP
|
||
(opcode D9 E6).
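
For readers unfamiliar with FYL2XP1: the instruction computes y * log2(x + 1), where x is the stack top and y the next stack element. Its purpose is to keep accuracy for x close to zero, where forming 1 + x first would throw away most of the bits of x. The following Python sketch illustrates that effect with the standard library's log1p. It is only a model of the mathematical function, not of any coprocessor's algorithm, and the function names are my own.

```python
import math


def fyl2xp1(x, y):
    """y * log2(x + 1), computed without forming x + 1 explicitly."""
    return y * math.log1p(x) / math.log(2.0)


def naive_yl2xp1(x, y):
    """Same formula, but 1 + x is rounded before the logarithm."""
    return y * math.log2(1.0 + x)


tiny = 1e-20
print(naive_yl2xp1(tiny, 1.0))   # 1 + tiny rounds to exactly 1.0
print(fyl2xp1(tiny, 1.0))        # small but nonzero result survives
```

With the restricted argument range of the standard 80x87 parts, software must fall back to FYL2X for larger arguments; an extended range removes the need for that case distinction.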
|
||
|
||
|
||
Cyrix FasMath 83S87
|
||
|
||
The 83S87 is the SX version of the Cyrix 83D87. Just as the 83D87 is the
|
||
fastest 387-compatible coprocessor, the Cyrix 83S87 is the fastest of
|
||
the 387SX compatible coprocessors [1], as well as providing the most
|
||
accurate transcendental functions. 83S87 chips manufactured after 1991
|
||
use the internals of the Cyrix 387+, the successor to the original 83D87
|
||
[73] (above). The Cyrix 83S87 is ideally suited to be used with the
|
||
Cyrix Cx486SLC CPU, a 486SX compatible CPU which is a replacement chip
|
||
for the Intel 386SX CPU.
|
||
|
||
The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20, and
|
||
25 MHz versions. Due to the advanced power saving features of the Cyrix
|
||
coprocessor, the typical power consumption of the 20 MHz version is only
|
||
about 350 mW [67].
|
||
|
||
|
||
ULSI Math*Co 83C87
|
||
|
||
The ULSI 83C87 is an 80387-compatible coprocessor first introduced in
|
||
early 1991, well after the IIT 3C87 and Cyrix 83D87 appeared. Like other
|
||
387 clones, it is somewhat faster than the Intel 387DX, particularly in
|
||
its basic arithmetic functions. The transcendental functions, however,
|
||
show only a slight speed improvement over the Intel 387DX (see benchmark
|
||
results below).
|
||
|
||
In my tests, the ULSI had the most inaccurate transcendental functions
|
||
of all tested coprocessors. However, the maximum relative error is still
|
||
within the limits set by Intel, so this is probably not an important
|
||
issue for all but a very few applications. The ULSI 83C87 shows some
|
||
minor flaws in the tests for IEEE 754 compatibility, but this, too, is
|
||
probably unimportant under typical operating conditions. ULSI claims
|
||
that the program IEEETEST, which was used to test for IEEE
|
||
compatibility, contains many personal interpretations of the IEEE
|
||
standard by the program's author and states that there is no ANSI-
|
||
certified IEEE-754 compliance test. While this may be true, it is
|
||
also a fact that the IEEE test vectors used in IEEETEST are a de facto
|
||
industry standard, and that Intel's 387, 486, and RapidCAD chips pass it
|
||
without a single failure, as do the coprocessors from Cyrix. Since the
|
||
ULSI Math*Co 83C87 fails some of the tests, it is certainly less than
|
||
100% compatible with Intel's chips, although this will likely make
|
||
little or no difference in typical operating conditions. (It is
|
||
interesting to note that a ULSI 83S87 manufactured in 92/17 showed
|
||
fewer errors in the IEEETEST test run [74] than the ULSI 83C87,
|
||
manufactured in 91/48, I used in my original test. This indicates that
|
||
ULSI might have applied some quick fixes to newer revisions of their
|
||
math coprocessors.)
|
||
|
||
The ULSI 83C87 fails to be compatible with IEEE-754 in that it does
|
||
not implement the "precision control" feature. While all the internal
|
||
operations of 80x87 coprocessors are usually performed with the maximum
|
||
precision available (double-extended precision with 64 mantissa bits),
|
||
the 80x87 architecture also offers the possibility to force lower
|
||
precision to be used for the basic arithmetic functions (add, subtract,
|
||
multiply, divide, and square root). This feature is required by IEEE-754
|
||
for all coprocessors that cannot store results *directly* to a single
|
||
or double-precision location. Since 80x87 coprocessors lack this storage
|
||
capability, they all implement precision control to provide correctly
|
||
rounded single- and double-precision results according to the floating-
|
||
point standard - except the ULSI chips. For programs that make use of
|
||
precision control (e.g., Interactive UNIX), correct implementation of
|
||
the feature may be essential for correct arithmetic results.
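
What precision control does can be sketched in Python by rounding the result of each basic operation to a shorter mantissa. The example below emulates the "single" setting (24 mantissa bits) by passing results through the IEEE single format with the struct module. This is a rough model under stated assumptions: Python floats are doubles rather than double-extended values, the helper names are hypothetical, and real precision control narrows only the mantissa, not the exponent range.

```python
import struct


def to_single(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add_single(a, b):
    """Add, then round the result to a 24-bit mantissa.

    Roughly what an addition yields with precision control set
    to single precision.
    """
    return to_single(a + b)


small = 2.0 ** -30
print(1.0 + small == 1.0)            # False: the sum fits in a double
print(add_single(1.0, small) == 1.0) # True: lost at 24 mantissa bits
```

A program that expects the second behaviour, but runs on a chip that silently keeps full precision, can produce results that differ from those of a conforming coprocessor.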
|
||
|
||
Like other non-Intel 387 compatibles, the 83C87 does not support
|
||
asynchronous operation of the CPU and the coprocessor. This means that
|
||
the 83C87 always runs at the full speed of the CPU. It is available in
|
||
20, 25, 33, and 40 MHz versions. The ULSI is produced in low power CMOS;
|
||
power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz
|
||
it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625
|
||
mW typical), and at 40 MHz it is max. 1500 mW (750 mW typical) [58]. The 83C87
|
||
is packaged in a 68-pin ceramic PGA.
|
||
|
||
ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc.,
|
||
will replace the coprocessor up to three times free of charge should it
|
||
ever fail to function properly.
|
||
|
||
|
||
ULSI Math*Co 83S87
|
||
|
||
This chip is the SX version of the ULSI 83C87, for use in systems with
|
||
an Intel 386SX or an AMD Am386SX CPU. It is functionally equivalent to
|
||
the 83C87. To aid low-power laptop designs, the ULSI 83S87 features an
|
||
advanced power saving design with a sleep mode and a standby mode with
|
||
only minimal power requirements. Power consumption under normal
|
||
operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW
|
||
typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25
|
||
MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.
|
||
|
||
|
||
C&T SuperMATH 38700DX
|
||
|
||
Produced by Chips&Technologies, this is the latest entry into the 387-
|
||
compatible marketplace. Originally announced in October, 1991, it has
|
||
apparently not been available to end-users before the third quarter of
|
||
1992, at least here in Germany. My tests show that its compatibility
|
||
with Intel products is very good, even for the more arcane features of
|
||
the 387DX, and comparable to that of the coprocessors from Cyrix. Like these
|
||
chips, it passes the IEEETEST program without a single failure. It
|
||
passes, of course, all tests in Chips&Technologies' own compatibility
|
||
test program, SMDIAG. However, some of the tests (the transcendental
|
||
functions) in this program are selected in such a way that the C&T 38700
|
||
passes while the Cyrix 83D87 or Intel RapidCAD fail, so they are not
|
||
very useful. (There is also a 'bug' in the test for FSCALE that hides a
|
||
true bug in the C&T 38700.) My tests show the accuracy of the
|
||
transcendental functions on the C&T 38700DX varies. Overall, accuracy of
|
||
the transcendentals is slightly better than on the Intel 387DX.
|
||
|
||
In my own speed tests [see below] and those reported in [1], the C&T
|
||
38700DX showed performance at about 90-100% the level of the Cyrix
|
||
83D87, which is the 387 clone with the highest performance. For
|
||
floating-point-intensive benchmarks, the C&T 38700DX provides up to 50%
|
||
more computational performance than the Intel 387DX. However, as with
|
||
all other 387 compatible coprocessors, the speed advantage over the
|
||
Intel 387DX is far less significant in real applications.
|
||
|
||
The SuperMATH 38700DX is implemented in 1.2 micron CMOS with on-chip
|
||
power management, which makes for low power consumption. The 38700DX is
|
||
packaged in a 68-pin ceramic PGA (pin grid array) and is available in speeds
|
||
of 16, 20, 25, 33, and 40 MHz.
|
||
|
||
|
||
C&T 38700SX
|
||
|
||
This chip is the SX version of the 38700DX and compatible with the Intel
|
||
387SX. It provides performance comparable to a Cyrix 83S87 [1], the
|
||
387SX clone with the highest performance. Compatibility with the Intel
|
||
387SX is very good and on par with the high degree of the compatibility
|
||
found in the Cyrix 83S87.
|
||
|
||
The 38700SX has low power consumption. It is packaged in a 68-pin PLCC
|
||
(plastic leaded chip carrier) and available in speeds of 16, 20, and 25
|
||
MHz.
|
||
|
||
|
||
Intel RapidCAD
|
||
|
||
The RapidCAD is not a coprocessor, strictly speaking, although it is
|
||
marketed as one. Rather, it is a full replacement for an 80386 CPU:
|
||
basically, an Intel 486DX CPU chip without the internal cache and with a
|
||
standard 386 pinout. RapidCAD is delivered as a set of two chips.
|
||
RapidCAD-1 goes into the 386 socket and contains the CPU and FPU.
|
||
RapidCAD-2 goes into the coprocessor (387) socket and contains a simple
|
||
PAL whose only purpose is to generate the FERR signal normally generated
|
||
by a coprocessor. (This is needed by the motherboard circuitry to provide
|
||
287 compatible coprocessor exception handling in 386/387 systems.) The
|
||
RapidCAD instruction set is compatible with the 386, so it doesn't have
|
||
any newer, 486-specific instructions like BSWAP. However, since the
|
||
RapidCAD CPU core is very similar to the 80486 CPU core, most of the
|
||
register-to-register instructions execute in the same number of clock
|
||
cycles as on the 486.
|
||
|
||
RapidCAD's use of the standard 386 bus interface causes instructions
|
||
that access memory to execute at about the same speed as on the 386. The
|
||
integer performance on the RapidCAD is definitely limited by the low
|
||
memory bandwidth provided by this interface (2 clock cycles per bus
|
||
cycle) and the lack of an internal cache. CPU instructions often execute
|
||
faster than they can be fetched from memory, even with a big and fast
|
||
external cache. Therefore, the integer performance of the RapidCAD
|
||
exceeds that of a 386 by *at most* 35%. This value was derived by
|
||
running some programs that use mostly register-to-register operations
|
||
and few memory accesses, and is supported by the SPEC ratings that Intel
|
||
reports for the 386-33 and the RapidCAD-33: while the 386-33 has a
|
||
SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase.
|
||
(Note that these tests used the old [1989] SPEC benchmarks suite.)
|
||
|
||
While CPU and integer instructions often execute in one clock cycle on
|
||
the RapidCAD, floating-point operations always take more than seven
|
||
clock cycles. They are therefore rarely slowed down by the low-bandwidth
|
||
386 bus interface. My tests show a 70%-100% performance increase for
|
||
floating-point intensive benchmarks over a 386-based system using the
|
||
Intel 387DX math coprocessor. This is consistent with the SPECfp rating
|
||
reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
|
||
the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85%
|
||
increase. This means that a system that uses the RapidCAD is faster than
|
||
*any* 386/387 combination, regardless of the type of 387 used, whether
|
||
an Intel 387DX or a faster 387 clone. The diagnostic disk for the
|
||
RapidCAD also gives some application performance data for the RapidCAD
|
||
compared to the Intel 387DX:
|
||
|
||
Application             Time w/ 387DX   Time w/ RapidCAD   Speedup

AutoCAD 11                   52 sec           32 sec          63%
AutoShade/Renderman         180 sec          108 sec          67%
Mathematica (Windows)       139 sec          103 sec          35%
SPSS/PC+ 4.01                17 sec           14 sec          21%
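
The speedup column in the table above appears to be the ratio of the two run times, minus one, expressed in percent. The short Python sketch below recomputes it from the listed times under that assumption. The definition of "speedup" is my reading of the numbers, not something the diagnostic disk is quoted as stating, and the helper names are hypothetical.

```python
import math

# (application, seconds with 387DX, seconds with RapidCAD), from the table
TIMINGS = [
    ("AutoCAD 11", 52, 32),
    ("AutoShade/Renderman", 180, 108),
    ("Mathematica (Windows)", 139, 103),
    ("SPSS/PC+ 4.01", 17, 14),
]


def speedup_percent(t_old, t_new):
    """Percentage by which the old time exceeds the new time."""
    return (t_old / t_new - 1.0) * 100.0


def round_half_up(x):
    """Round to the nearest integer, with halves rounded upward."""
    return math.floor(x + 0.5)


speedups = [round_half_up(speedup_percent(a, b)) for _, a, b in TIMINGS]
for (name, _, _), pct in zip(TIMINGS, speedups):
    print(f"{name:24s}{pct:3d}%")
```

With this definition the recomputed values agree with the table to the nearest percent.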
|
||
|
||
RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed
|
||
through different channels than the other Intel math coprocessors, and I
|
||
have therefore been unable to obtain a data sheet for it. [78] gives the
|
||
typical power consumption of the 33 MHz RapidCAD as 3500 mW, which is
|
||
the same as for the 33 MHz 486DX. The RapidCAD-1 chip gets quite hot
|
||
when operating. Therefore, I recommend extra cooling for it (see the
|
||
paragraph below on the 486 for details). The RapidCAD-1 is packaged in a
|
||
132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a
|
||
68-pin PGA like an 80387 coprocessor.
|
||
|
||
|
||
Intel 486DX
|
||
|
||
The Intel 486DX is, of course, not solely a coprocessor. This chip,
|
||
first introduced by Intel in 1989, functionally combines the CPU (a
|
||
heavily-pipelined implementation of the 386 architecture) with an
|
||
enhanced 387 (the chip's floating-point unit, FPU) and 8 KB of unified
|
||
on-chip code/data cache. (This description is necessarily simplified;
|
||
for a detailed hardware description, see [52].) The 486DX offers about
|
||
two to three times the integer performance of a 386 at the same clock
|
||
frequency, while floating-point performance is about three to four times
|
||
as high as the Intel 387DX at the same clock rate [29]. Since the FPU is
|
||
on the same chip as the CPU, the considerable communication overhead
|
||
between CPU and coprocessor in a 386/387 system is omitted, letting FPU
|
||
instructions run at the full speed permitted by the implementation. The
|
||
FPU also takes advantage of the on-chip cache and the highly pipelined
|
||
execution unit. The concurrent execution of CPU and coprocessor
|
||
instructions typical for 80x86/80x87 systems is still in existence on
|
||
the 486, but some FPU instructions like FSIN have nearly no concurrency
|
||
with CPU instructions, indicating that they make heavy use of both CPU
|
||
and FPU resources [53, 1].
|
||
|
||
Besides its higher performance, the 486 FPU provides more accurate
|
||
transcendental functions than the 387DX coprocessor, according to my
|
||
tests (see below). To achieve better interrupt latency, FPU instructions
|
||
with long execution times have been made abortable if an interrupt
|
||
occurs during their execution.
|
||
|
||
Due to the considerable amount of heat produced by these chips, and
|
||
taking into consideration the slow air flow provided by the fan in
|
||
garden-variety PC tower cases, I recommend an extra fan directly above
|
||
the CPU for safer operation. If you measure the surface temperature of
|
||
a 486DX after some time of operation in a normal tower case without
|
||
extra cooling, you may well come up with something like 80-90 degrees
|
||
Celsius (that is 175-195 degrees Fahrenheit for those not familiar with
|
||
metric units) [54,55]. You don't need the well known (and expensive)
|
||
IceCap[tm] to effectively cool your CPU; a simple fan mounted directly
|
||
above the CPU can bring the temperature of the chip down to about 50-60
|
||
degrees Celsius (120-140 degrees Fahrenheit), depending on the room
|
||
temperature and the temperature within the PC case (which depends on the
|
||
total power dissipation of all the components and the cooling provided
|
||
by the fan in the system's power supply). According to a simple rule
|
||
known as Arrhenius' Law, lowering the temperature by 10 degrees Celsius
|
||
slows down chemical reactions by a factor of two, so lowering the
|
||
temperature of your CPU by 30 degrees should prolong the life of the
|
||
device by a factor of eight, due to the slower ageing process. If you
|
||
are reluctant to add a fan to your system because of the additional
|
||
noise, settle for a low-noise fan like those available from the German
|
||
manufacturer Pabst (this is not meant to be an advertisement; I am just
|
||
the happy owner of such a fan, and have no other connections to the
|
||
firm).
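
The rule of thumb used in the paragraph above (ageing halves for every 10 degrees Celsius of cooling) is easy to put into numbers. The Python sketch below does so and also reproduces the Celsius-to-Fahrenheit conversions. It encodes only the simple doubling rule as stated in the text, not a full Arrhenius model with activation energies, and the function names are my own.

```python
def life_factor(delta_t_celsius, doubling_step=10.0):
    """Expected lifetime multiplier for a temperature drop.

    Uses the rule of thumb from the text: every 'doubling_step'
    degrees Celsius of cooling halves the ageing rate.
    """
    return 2.0 ** (delta_t_celsius / doubling_step)


def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0


print(life_factor(30.0))               # drop of 30 degrees Celsius
print(celsius_to_fahrenheit(80.0),     # uncooled chip surface, low end
      celsius_to_fahrenheit(90.0))     # uncooled chip surface, high end
```

A drop of 30 degrees gives 2 to the power of 3, which is the factor of eight mentioned in the text.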
|
||
|
||
The 486DX comes in a 168-pin ceramic PGA (pin grid array). It is
|
||
available in 25 MHz and 33 MHz versions. Since the end of 1991, a 50 MHz
|
||
version has also been available, manufactured by a CHMOS V process (the
|
||
25 MHz and 33 MHz are produced using the CHMOS IV process). Maximum
|
||
power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500
|
||
mW for the 33 MHz version (3500 mW typical), and 5000 mW (3875 mW
|
||
typical) for the 50 MHz chip.
|
||
|
||
|
||
Intel 486DX2
|
||
|
||
The 486DX2 represents the latest generation of Intel CPUs. The "DX2"
|
||
suffix (instead of simply DX) is meant to be an indicator that these are
|
||
clock-doubled versions of the basic CPU. A normal 486DX operates at the
|
||
frequency provided by the incoming clock signal. A 486DX2 instead
|
||
generates a new clock signal from the incoming clock by means of a PLL
|
||
(phase locked loop). In the DX2, this clock signal has twice the
|
||
frequency of the incoming clock, hence the name clock-doubler. All
|
||
internal parts of the 486DX2 (cache, CPU core, and FPU) run at this
|
||
higher frequency; only the bus interface runs at the normal (undoubled)
|
||
speed. Using this technique, an Intel 486DX2-50 can run on an unmodified
|
||
motherboard designed for 25 MHz operation. Since motherboards which run
|
||
at 50 MHz are much harder to design and build than those for 25 MHz,
|
||
this makes a 486DX2-50 system cheaper than an 'equivalent' 486DX-50
|
||
system.
|
||
|
||
For all operations that don't access off-chip resources (e.g., register
|
||
operations), a 486DX2-50 provides exactly the same performance as a
|
||
486DX-50, and twice the performance of a 486DX-25. However, since the
|
||
main memory in a 486DX2-50 system still operates at 25 MHz, all
|
||
instructions involving memory accesses are potentially slower than in a
|
||
486DX-50 system, whose memory also (presumably) runs at 50 MHz. The
|
||
internal cache of the 486 helps this problem a bit, but overall
|
||
performance of a 486DX2-50 is still lower than that of a 486DX-50.
|
||
Intel's documentation [32] shows this drop to be quite small, although
|
||
it is highly dependent upon the particular application.
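
The relationship between the three chips can be captured in a deliberately crude Python model: run time is the sum of the on-chip cycles divided by the core clock and the off-chip cycles divided by the bus clock. The cycle counts below are invented purely for illustration, and the model ignores the cache, wait states, and overlap of bus and core activity, so it shows only the direction of the effect, not its real size.

```python
def run_time_us(core_cycles, bus_cycles, core_mhz, bus_mhz):
    """Run time in microseconds for a crude two-clock model."""
    return core_cycles / core_mhz + bus_cycles / bus_mhz


# Hypothetical workload: 1000 on-chip cycles and 100 off-chip cycles.
CORE, BUS = 1000, 100

dx_25 = run_time_us(CORE, BUS, core_mhz=25, bus_mhz=25)    # 486DX-25
dx_50 = run_time_us(CORE, BUS, core_mhz=50, bus_mhz=50)    # 486DX-50
dx2_50 = run_time_us(CORE, BUS, core_mhz=50, bus_mhz=25)   # 486DX2-50

print(dx_25, dx_50, dx2_50)
```

With no off-chip cycles at all, the model gives the 486DX2-50 exactly the speed of the 486DX-50 and twice that of the 486DX-25. The more off-chip traffic there is, the further the DX2 falls behind the DX-50.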
|
||
|
||
The truly wonderful thing about the 486DX2 is that it allows easy
|
||
upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely
|
||
pin-compatible with the 486DX: you just need to take out the 486DX and plug
|
||
in the new 486DX2. Note that power consumption of the 486DX2-50 equals
|
||
that of the 486DX-50 (4000 mW typical, 4750 mW max.), and that the
|
||
486DX2-66 exceeds this by about 25% (4875 mW typical, 6000 mW max.).
|
||
These chips get *really* hot in a standard PC case with no extra
|
||
cooling, even if they come with an attached heat sink by default. (See
|
||
the discussion above for more detailed information on this problem and
|
||
possible solutions).
|
||
|
||
|
||
Intel 487SX
|
||
|
||
The 487SX is the math coprocessor intended for use in 486SX systems. The
|
||
486SX is basically a 486DX without the floating-point unit (FPU) [48,
|
||
50]. (Originally Intel sold 486DXs with a defective FPU as 486SXs but it
|
||
has now completely removed the FPU part from the 486SX mask for mass
|
||
production.) The introduction of the 486SX in 1991 has been viewed by
|
||
many as a marketing 'trick' by Intel to take market share from the 386
|
||
based systems once AMD became successful with their Am386. (AMD has
|
||
taken as much as 40% of the 386 market due to some superior features
|
||
such as higher clock frequency, lower power consumption, fully static
|
||
design, and availability of a 3V version). A 486SX at 20 MHz delivers
|
||
a bit less integer performance than a 40 MHz Am386.
|
||
|
||
To add floating-point capabilities to a 486SX based system, it would
|
||
seem to be easiest to swap the 486SX for a 486DX, which includes the FPU
|
||
on-chip. However, Intel has prevented this easy solution by giving the
|
||
486SX a slightly different pinout [48, 51]. Since only three pins are
|
||
assigned differently, clever board manufacturers have come out with
|
||
boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU
|
||
socket and by doing so provide a clean upgrade path. A set of three
|
||
jumpers ensures correct signal assignment to the changed pins for either
|
||
CPU type. To upgrade 486SX systems without this feature, you are forced
|
||
to buy a 487SX and install it in the "Performance Upgrade Socket"
|
||
(present in most systems).
|
||
|
||
Once the 487SX was available, it was quickly found out that it is just a
|
||
normal 486DX with a slightly different pinout [49]. Technically
|
||
speaking, the solution Intel chose was the only practical way to provide
|
||
a 486SX system with the high level of floating-point performance the
|
||
486DX offers. The CPU and FPU must be on the same chip; otherwise, the
|
||
FPU cannot make use of the CPU's internal cache and there would be
|
||
considerable overhead in CPU-FPU communication (similar to a 386/387
|
||
system), nullifying most of the arithmetic speedups over the 387. That
|
||
the 486SX, 487SX, and 486DX are *not* pin-compatible seems to be purely
|
||
for marketing reasons.
|
||
|
||
To upgrade a 486SX based system, Intel also offers the OverDrive chip,
|
||
which is just the same as a 487SX with internal clock doubling. It also
|
||
goes into the motherboard's "Performance Upgrade Socket". The OverDrive
|
||
roughly doubles the performance of a 486SX/487SX based system. (For an
|
||
explanation of clock doubling, see the description of the Intel 486DX2
|
||
above.)
|
||
|
||
Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX
|
||
system, so the 486SX could be removed once the 487SX is installed. Since
|
||
the shut down is logical, not electrical, the 486SX still uses power if
|
||
used with the 487SX, although it is inoperative. As with the 486SX,
|
||
the 487SX is currently available in 20 MHz and 25 MHz versions. At 20
|
||
MHz, the 487SX has a power consumption of max. 4000 mW (3250 mW
|
||
typical). It is available in a 169 pin ceramic PGA (pin grid array).
|
||
|
||
|
||
Weitek 1167
|
||
|
||
This math coprocessor was the predecessor of the Weitek Abacus 3167. It
|
||
was actually a small printed circuit board with three chips mounted on
|
||
it. In contrast to the Weitek 3167, the 1167 did not have a square root
|
||
instruction; instead, the square root function was computed by means of
|
||
a subroutine in the Weitek transcendental function library. However, the
|
||
1167 did have a mode in which it supported denormal numbers. (The Weitek
|
||
3167 and 4167 only implement the 'fast' mode, in which denormals are not
|
||
supported.) Overall performance of the 1167 is slightly less than that
|
||
of the Weitek 3167.
|
||
|
||
|
||
Weitek 3167
|
||
|
||
The 3167 was introduced by Weitek in 1989 and provided the fastest
|
||
floating-point performance possible on a 386 based system at that time.
|
||
The 3167 is not a real coprocessor, strictly speaking, but rather a
|
||
memory-mapped peripheral device. The architecture of the 3167 was
|
||
optimized for speed wherever possible. Besides using the faster memory
|
||
mapped interface to the CPU (the 80x87 uses I/O ports), it does not
|
||
support many of the features of the 80x87 coprocessors, allowing all of
|
||
the chip's resources to be concentrated on the fast execution of the
|
||
basic arithmetic operations. (For a more detailed description of the
|
||
Weitek 3167, see the first chapter of this document.)
|
||
|
||
In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the
|
||
performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167
|
||
the Whetstone benchmark performed at 7574 kWhetstones/sec compared with
|
||
3743 kWhetstones/sec for the Intel 387DX. (Note, however, that these
|
||
are single-precision results and that the Weitek 3167's performance
|
||
would drop to about half the stated rate for double-precision, while the
|
||
value for the Intel 387DX would change very little.) In any case, before
|
||
the advent of the Intel RapidCAD, the Weitek 3167 usually outperformed
|
||
all 387-compatible coprocessors, even for double-precision operations
|
||
[63,65,69]. For typical applications, the advantage of the Weitek 3167
|
||
over the 387 clones is much smaller. In a benchmark test using
|
||
AutoDesk's 3D-Studio the Weitek 3167 performed at 123% of the Intel
|
||
387DX's performance compared with 106% for the Cyrix FasMath 83D87 and
|
||
118% for the Intel RapidCAD.
|
||
|
||
The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into an
|
||
EMC socket (provided in most 386-based systems). It does *not* fit into
|
||
the normal 68-pin PGA socket intended for a 387 coprocessor.
|
||
|
||
To get the best of both worlds, one might want to use a Weitek 3167 and
|
||
a 387 compatible coprocessor in the same system. These coprocessors can
|
||
coexist in the same system without problems; however, most 386-based
|
||
systems contain only one coprocessor socket, usually of the EMC
|
||
(extended math coprocessor) type. Thus, you can install either a 387
|
||
coprocessor or a Weitek 3167, but not both at the same time. There *are*
|
||
small daughter boards available that plug into the EMC socket and
|
||
provide two sockets, an EMC and a standard coprocessor socket.
|
||
|
||
At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At
|
||
33 MHz, max. power consumption is 2250 mW.
|
||
|
||
|
||
Weitek 4167
|
||
|
||
The 4167 is a memory-mapped coprocessor that has the same architecture
|
||
as the 3167; it is designed to provide 486-based systems with the
|
||
highest floating-point performance available. It executes coprocessor
|
||
instructions at three to four times the speed of the Weitek 3167.
|
||
Although it is up to 80% faster than the Intel 486 in some benchmarks
|
||
[1,69], the performance advantage for real applications is probably more
|
||
like 10%. The introduction of the 486DX2 processors has more or less
|
||
obliterated the need for a Weitek 4167, since the DX2 CPUs provide the
|
||
same performance as the Weitek, as well as the additional features the
|
||
80x87 architecture has that the Weitek does not.
|
||
|
||
The Weitek 4167 is packaged in a 142-pin PGA package that is only
|
||
slightly smaller than the 486's package. At 25 MHz, it has a max. power
|
||
consumption of 2500 mW [32].
|
||
|
||
|
||
|
||
======================================
|
||
Finding out which coprocessor you have
|
||
======================================
|
||
|
||
If you are interested in programming techniques which allow the detection and
|
||
differentiation of the coprocessors described above, I refer you to my
|
||
COMPTEST program. COMPTEST reliably detects the type and clock frequency of
|
||
the CPU and coprocessor installed in your machine. The current version is
|
||
CTEST257.ZIP, with future versions to be called CTEST258, CTEST259 and so on.
|
||
COMPTEST can correctly identify all of the coprocessors described above, with
|
||
the exception of the Weitek chips, for which the detection mechanism is not
|
||
that reliable.
|
||
|
||
COMPTEST is in the public domain and comes with complete source code. It is
|
||
available via anonymous ftp from garbo.uwasa.fi and additional ftp sites that
|
||
mirror garbo.
|
||
|
||
|
||
|
||
================================================
|
||
Current coprocessor prices and purchasing advice
|
||
================================================
|
||
|
||
Due to mid-1992 price slashing by Cyrix (and subsequently, Intel) for 387
|
||
coprocessors, prices have dropped significantly for all 287 and 387
|
||
compatibles, with hardly any price difference between manufacturers. 387DX
|
||
compatible coprocessors typically sell for ~US$ 80 for all speeds except for
|
||
40 MHz versions, which are typically ~US$ 90. 387SX compatible coprocessors
|
||
sell for ~US$ 70, regardless of speed, with the exception of the 33 MHz
|
||
versions, which are ~US$ 80. The Intel 287XL sells for ~US$ 90, while the
|
||
IIT 2C87 and Cyrix 82S87 each sell for about US$ 60. 8087s may be more
|
||
expensive, the price of an 8087-10 being ~US$ 150. I purchased the Intel
|
||
RapidCAD for US$ 300 and haven't seen it offered for a better price. I see the
|
||
Weitek Abacus 3167-33 being offered for US$ 230 and the 4167-33 being offered
|
||
for US$ 850. The Intel 486SX OverDrive is available for ~US$ 570 for the 20 MHz
|
||
version, while the Intel 486DX2-50 costs ~US$ 650. This price information
|
||
reflects the price situation as of 01-11-93; prices can be expected to drop
|
||
slightly in the near future.
|
||
|
||
|
||
Which coprocessor should you buy?
|
||
---------------------------------
|
||
Several computer magazines have published application-level performance
|
||
comparisons for various 387 coprocessors and Weitek's Abacus 3167 and 4167
|
||
chips [1,25,68,70]. Applications tested included AutoCAD R11, RenderStar,
|
||
Quattro Pro, Lotus 1-2-3, and AutoDesk's 3D-Studio. For most tests,
|
||
performance improvements for the 387 clones over Intel's 387DX were small to
|
||
marginal, the clones running the applications no more than 5-15% faster than
|
||
the Intel 387DX. In the test of 3D-Studio, one of the few programs that
|
||
directly supports the Weitek Abacus, the Weitek 3167 improved performance by
|
||
23% over an Intel 387DX and the 4167 improved performance by 10% over the
|
||
486DX [1].
|
||
|
||
If you have a demand for high floating-point performance, you should consider
|
||
buying a full 486-based system, rather than a 386-based system with an
|
||
additional coprocessor. Consider: A 386/33 MHz motherboard currently sells for
|
||
~US$ 270; together with the coprocessor, the cost totals ~US$ 350. A 486/33 MHz
|
||
ISA motherboard sells for US$ 650. While this means that the 486 system is 85%
|
||
more expensive than the 386/387 system, it also provides 100% more integer
|
||
and floating-point performance (twice the performance), giving it better
|
||
price/performance for math-intensive applications. As prices for 486 chips
|
||
fall in the future, the price difference between these two systems should
|
||
become even smaller.
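
The price/performance argument above can be checked with a few lines of
arithmetic. The following Python sketch uses only the prices quoted in
this section (US$ 350 for the 386/387 combination, US$ 650 for the 486
board) and the stated assumption that the 486 delivers twice the
performance. The function and variable names are made up for
illustration.

```python
def price_performance(price_usd, relative_performance):
    """Return performance units per US dollar."""
    return relative_performance / price_usd


# Figures quoted in the text; performance is relative to the 386/387 system.
sys_386 = {"price": 350.0, "perf": 1.0}   # 386/33 board plus 387 coprocessor
sys_486 = {"price": 650.0, "perf": 2.0}   # 486/33 ISA board, twice as fast

# How much more the 486 system costs, as a fraction of the 386/387 price.
extra_cost = sys_486["price"] / sys_386["price"] - 1.0

# Ratio of the two price/performance figures (above 1.0 favours the 486).
pp_ratio = (price_performance(sys_486["price"], sys_486["perf"])
            / price_performance(sys_386["price"], sys_386["perf"]))

print(f"486 system costs {extra_cost:.0%} more")
print(f"486 price/performance is {pp_ratio:.2f} times that of the 386/387")
```

With these inputs the 486 comes out ahead in performance per dollar, but
only by a small margin, so the conclusion is sensitive to the prices and
to the assumed factor of two.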
|
||
|
||
If you want to push your 386-based system to its maximum floating-point
|
||
performance and can't switch to a 486, I recommend the Intel RapidCAD
|
||
chipset. It is both faster [1] and cheaper than installing a Weitek Abacus
|
||
3167 in a 386 system, which used to be the highest performing combination
|
||
before the RapidCAD was introduced.
|
||
|
||
In a similar vein, the introduction of the Intel 486DX2 clock-doubler chips
|
||
has obliterated the need for a Weitek 4167 to get maximum floating-point
|
||
performance out of a 486-based system. A 486DX2-66 performs at or above the
|
||
performance level of a 33 MHz Weitek 4167, even if the latter uses
single-precision rather than double-precision. The 486DX2-66 is rated by Intel at
|
||
24700 double-precision kWhetstones/sec and 3.1 double-precision Linpack
|
||
MFLOPS. (Of course, these benchmarks used the highest performance compilers
|
||
available. But even with a Turbo Pascal 6.0 program, I managed to squeeze 1.6
|
||
double-precision MFLOPS out of the 486DX2-66 for the LLL benchmark [for a
|
||
description of these benchmarks, see the paragraph on benchmarks below].)
|
||
Although I haven't yet seen 486DX2-66 processors being offered to end users
|
||
for upgrade purposes, I recommend the 486DX2-66 to those who need the highest
|
||
floating-point performance and are planning to buy a new PC. The price
|
||
difference between a 33 MHz 486DX motherboard and a 486DX2-66 motherboard is
|
||
around US$ 450, well below the price for the Weitek Abacus 4167.
|
||
|
||
|
||
|
||
============================================================
|
||
The benchmark programs / Coprocessor performance comparisons
|
||
============================================================
|
||
|
||
The performance statistics below were put together with the help of four
|
||
widely-known numeric benchmarks and two benchmarks developed by me. Three
|
||
Pascal programs, one FORTRAN program, and two assembly language programs were
|
||
used. The assembly language programs were linked with Borland's Turbo Pascal
|
||
6.0 for library support, especially to include the coprocessor emulator of
|
||
the TP 6.0 run-time library. The Pascal programs were compiled with Turbo
|
||
Pascal 6.0, a non-optimizing compiler that produces 16-bit code. The FORTRAN
|
||
program was compiled using Microsoft's FORTRAN 5.0, an optimizing compiler
|
||
that generates 16-bit code. All programs use double-precision variables
|
||
(except PEAKFLOP and SAVAGE, which use double extended precision).
|
||
|
||
Note that the use of a highly optimizing compiler producing 32-bit code can
|
||
give much higher performance for some benchmarks. For example, Intel rates
|
||
the 33 MHz 386/387DX at 3290 kWhetstones/sec and 0.4 double-precision LINPACK
|
||
MFLOPS [28,29], and it rates the Intel 486 at 12300 kWhetstones/sec and 1.6
|
||
double-precision LINPACK MFLOPS [30]. The compilers used in these benchmarks
|
||
run by the chip's manufacturer are the ones that give the highest performance
|
||
available, and sell in the US$ 1000+ price range. Some of them may even be
|
||
experimental or prereleased versions not available to the general public. The
|
||
relative performance of one coprocessor to another can and does vary greatly
|
||
depending on the code generated by compilers. Non-optimizing compilers tend
|
||
to generate a high percentage of operations which access variables in memory,
|
||
while optimizing compilers produce code that contains many operations
|
||
involving registers. Thus it is quite possible that coprocessor A beats
|
||
coprocessor B running benchmark Z if compiled with compiler C, but B beats A
|
||
when the same benchmark is compiled using compiler D.
|
||
|
||
All benchmarks in this overview were run from floppy under a 'bare-bones'
MS-DOS 5.0 without the CONFIG.SYS and AUTOEXEC.BAT files. This way, it was made
|
||
sure no TSR or other program unnecessarily stole computing resources from the
|
||
benchmarks.
|
||
|
||
|
||
Description of benchmarks
|
||
-------------------------
|
||
PEAKFLOP is the kernel of a fractal computation. It consists mainly of a
|
||
tight loop written in assembly code and fine-tuned to give maximum
|
||
performance. The whole program fits nicely into even a very small CPU cache.
|
||
All variables are held in the CPU's and coprocessor's registers, so the only
|
||
memory access is for opcode fetches. The main loop contains three
|
||
multiplications and five additions/subtractions; this ratio is fairly
|
||
typical for other floating-point intensive programs as well. Due to the
|
||
nature of this program, its MFLOPS rate is unlikely to be exceeded by any
|
||
program that calculates anything useful; thus the name PEAKFLOP. You will
|
||
find the source code for PEAKFLOP in appendix B.
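
The actual PEAKFLOP source is in appendix B. As a rough illustration
only, the Python sketch below shows a fractal kernel with the same
operation mix (three multiplications and five additions/subtractions per
pass). It assumes a Mandelbrot-style iteration, which the text does not
confirm, and all names in it are hypothetical.

```python
def fractal_kernel(cx, cy, max_iter):
    """Iterate z := z*z + c and count floating-point operations.

    A complete pass uses 3 multiplications and 5 additions/subtractions.
    Returns (completed passes, floating-point operations executed).
    """
    x = y = 0.0
    flops = 0
    iterations = 0
    for _ in range(max_iter):
        xx = x * x              # multiplication 1
        yy = y * y              # multiplication 2
        xy = x * y              # multiplication 3
        if xx + yy > 4.0:       # addition 1 (escape test)
            flops += 4          # the pass was cut short after 4 operations
            break
        x_new = xx - yy + cx    # subtraction and addition (2, 3)
        y = xy + xy + cy        # two additions (4, 5); 2*x*y without a multiply
        x = x_new
        flops += 8
        iterations += 1
    return iterations, flops


def mflops(flop_count, seconds):
    """Convert an operation count and a run time into MFLOPS."""
    return flop_count / seconds / 1e6
```

A point that never escapes within 100 passes accounts for 800
floating-point operations; dividing such a count by the measured run time
gives the MFLOPS figure reported in the tables.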
|
||
|
||
TRNSFORM multiplies an array of 8191 vectors with a 3D-transformation matrix
|
||
(a 4x4 matrix). Each vector consists of four double-precision values.
|
||
Multiplying vectors with a matrix is a typical operation in the manipulation
|
||
(e.g. rotation) of 3D objects which are made up from many vectors describing
|
||
the object. This benchmark stresses addition and multiplication as well as
|
||
memory access. For each vector, 16 multiplications and 12 additions are used,
|
||
and about 256 KB of data is accessed during the benchmark run.
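
To make the operation count concrete, here is a small Python sketch of
the kind of transformation TRNSFORM performs (the benchmark itself is
written in assembly language). The function name and the sample matrix
are illustrative assumptions, not taken from the benchmark source.

```python
def transform(matrix, vector):
    """Multiply a 4x4 matrix by a 4-element vector.

    Each of the 4 result elements needs 4 multiplications and
    3 additions: 16 multiplications and 12 additions in total.
    """
    result = []
    for row in matrix:
        acc = row[0] * vector[0]
        for k in range(1, 4):
            acc += row[k] * vector[k]
        result.append(acc)
    return result


# Example: translate the point (1, 2, 3) by (10, 20, 30) using
# homogeneous coordinates.
translation = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = transform(translation, [1.0, 2.0, 3.0, 1.0])

# Workload figures implied by the description above.
N_VECTORS = 8191
FLOPS_PER_VECTOR = 16 + 12
total_flops = N_VECTORS * FLOPS_PER_VECTOR   # operations per benchmark run
array_bytes = N_VECTORS * 4 * 8              # 4 doubles of 8 bytes each
```

The vector array occupies 8191 * 32 = 262,112 bytes, just under 256 KB,
and one run performs 8191 * 28 = 229,348 floating-point operations.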
|
||
|
||
For the IIT 3C87, a special version of TRNSFORM was written that makes use of
|
||
the special F4X4 instruction available on that coprocessor. F4X4 does a full
|
||
multiplication of a 4x4 matrix by a 4x1 vector in a single instruction.
|
||
TRNSFORM is implemented as an optimized assembler program linked with the
|
||
Turbo Pascal 6.0 library. The full source code can be found in appendix B.
|
||
|
||
LLL is short for Lawrence Livermore Loops [21], a set of kernels taken from
|
||
real floating-point intensive programs. Some of these loops are vectorizable,
|
||
but since we don't deal with vector processors here, this doesn't matter. For
|
||
this test, LLL was adapted from the FORTRAN original [20] to Turbo Pascal
|
||
6.0. By variable overlaying (similar to FORTRAN's EQUIVALENCE statement),
|
||
memory allocation for data was reduced to 64 KB, so all data fits into a
|
||
single 64 KB segment. The older version of LLL is used here which contains 14
|
||
loops. There also exists a newer, more elaborate version consisting of 24
|
||
kernels. The kernels in LLL exercise only multiplication and addition. The
|
||
MFLOPS rate reported is the average of the MFLOPS rate of all 14 kernels.
|
||
All floating-point variables in the programs are of type DOUBLE.
|
||
|
||
Both LLL and Whetstone results (see below) are reported as returned by my
|
||
COMPTEST test program, in which they have been included as a measure of
|
||
coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal
|
||
6.0 with all 'optimizations' on and using my own run-time library, which
|
||
gives higher performance than the one included with TP 6.0. My library is
|
||
available as TPL60N18.ZIP from garbo.uwasa.fi and ftp sites that mirror this
|
||
site.
|
||
|
||
Linpack [5] is a well known floating-point benchmark that also heavily
|
||
exercises the memory system. Linpack operates on large matrices and takes up
|
||
about 570 KB in the version used for this test. This is about the largest
|
||
program size a pure DOS system can accommodate. Linpack was originally
|
||
designed to estimate performance of BLAS, a library of FORTRAN subroutines
|
||
that handles various vector and matrix operations. Note that vendors are
|
||
free to supply optimized (e.g., assembly language) versions of BLAS. Linpack
|
||
uses two routines from BLAS which are thought to be typical of the matrix
|
||
operations used by BLAS. Both routines only use addition/subtraction and
|
||
multiplication. The FORTRAN source code for Linpack can be obtained from
|
||
the automated mail server netlib@ornl.gov. Linpack was compiled using MS
|
||
FORTRAN 5.0 in the HUGE memory model (which can handle data structures
|
||
larger than 64 KB) and with compiler switches set for maximum optimization.
|
||
All floating-point variables in the program are of the DOUBLE type. Linpack
|
||
performs the same test repeatedly. The number reported is the maximum MFLOPS
|
||
rate returned by Linpack. Linpack MFLOPS ratings for a great number of
|
||
machines are contained in [6]. This PostScript document is also available
|
||
from netlib@ornl.gov.
|
||
|
||
Whetstone [2,3,4] is a synthetic benchmark based upon statistics collected
|
||
about the use of certain control and data structures in programs written in
|
||
high level languages. Based on these statistics, it tries to mirror a
|
||
'typical' HLL program. Whetstone performance is expressed by how many
|
||
hypothetical 'whetstone' instructions are executed per second. It was
|
||
originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and Linpack,
|
||
Whetstone not only uses addition and multiplication but exercises all basic
|
||
arithmetic operations as well as some transcendental functions. Whetstone
|
||
performance depends on the speed of the CPU as well as on the coprocessor,
|
||
while PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU.
|
||
|
||
There exist both old and new versions of Whetstone. Note that results from
|
||
the two versions can differ by as much as 20% for the same test configuration.
|
||
For this test, the new version in Pascal from [3] was used. It was compiled
|
||
with Turbo Pascal 6.0 and my own library (see above) with all 'optimizations'
|
||
on. All computations are performed using the DOUBLE type.
|
||
|
||
SAVAGE tests the performance of transcendental function evaluation. It is
|
||
basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt
|
||
functions are combined in a single expression. While sin, cos, arctan, and
|
||
sqrt can be evaluated directly with a single 387 coprocessor instruction
|
||
each, ln and exp need additional preprocessing for argument reduction and
|
||
result conversion. According to [14], the Savage benchmark was devised by
|
||
Bill Savage, and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore
|
||
Front Parkway, Rockaway Beach, NY 11693, USA. Usually, Savage is programmed
|
||
to make 250,000 passes through the loop. Here only 10,000 loops are executed
|
||
for a total of 60,000 transcendental function evaluations. The result is
|
||
expressed in function evaluations per second. SAVAGE source code was taken
|
||
from [7] and compiled with Turbo Pascal 6.0 and my own run-time library
|
||
(see above).
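
The commonly published form of the Savage loop is sketched below in
Python. Whether the Pascal source from [7] uses exactly this expression
is an assumption. Turbo Pascal has no tangent function, so the tangent is
written as sin/cos here, which yields the six functions listed above.

```python
import math


def savage(passes=10_000):
    """Run the Savage loop and return (final value, function evaluations)."""
    a = 1.0
    for _ in range(passes):
        # sqrt, ln, exp and arctan undo each other step by step ...
        t = math.atan(math.exp(math.log(math.sqrt(a * a))))
        # ... and tan(t), written as sin/cos, recovers a before adding 1.
        a = math.sin(t) / math.cos(t) + 1.0
    return a, passes * 6


def evaluations_per_second(evaluations, seconds):
    """Benchmark figure as reported in the tables (Func/sec)."""
    return evaluations / seconds


value, count = savage()
# With exact arithmetic the final value would be passes + 1.
error = abs(value - 10_001.0)
```

The deviation of the final value from 10,001 gives a rough indication of
the accumulated error of the transcendental functions, while the run time
yields the Func/sec figure.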
|
||
|
||
|
||
|
||
Benchmark results using the Intel 386DX CPU and various coprocessors
|
||
--------------------------------------------------------------------
|
||
|
||
My benchmark results for 387 coprocessors, coprocessor emulators and the
|
||
Intel RapidCAD and Intel 486 CPUs, using the programs described above, on
|
||
an Intel 386DX system:
|
||
|
||
|
||
33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
|
||
Intel 386DX WITH:
|
||
EM87 emulator 0.0070 0.0040 0.0050 0.0050 26 418 ##
|
||
Franke387 emu. 0.0307 0.0246 0.0194 0.0179 137 3335 $$
|
||
TP/MS-FORT emu 0.0263 0.0227 0.0167 0.0158 133 3160 %%
|
||
Q387 emulator 0.0920 0.0664 0.0305 0.0304 251 4796 ((
|
||
Intel 387DX 0.7647 0.6004 0.3283 0.2676 2046 43860
|
||
ULSI 83C87 1.0097 0.6609 0.3239 0.2598 2089 47431
|
||
IIT 3C87 0.8455 0.5957 0.3198 0.2646 2203 49020
|
||
IIT 3C87,4X4 0.8455 1.4334 0.3198 0.2646 2203 49020 @@
|
||
C&T 38700 0.9455 0.6907 0.3338 0.2700 2376 62565
|
||
Cyrix 387+ 0.9286 0.6806 0.3293 0.2669 2435 66890
|
||
Cyrix EMC87 1.0400 0.6628 0.3352 0.2808 2540 71685 //
|
||
|
||
Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
|
||
Intel 486DX 2.0800 1.7779 0.9387 0.6682 5143 82192
|
||
|
||
|
||
|
||
40 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
|
||
Intel 386DX WITH:
|
||
EM87 emulator 0.0084 0.0080 0.0060 0.0060 31 502 ##
|
||
Franke387 emu. 0.0369 0.0295 0.0233 0.0215 164 4002 $$
|
||
TP/MS-FORT emu 0.0316 0.0273 0.0200 0.0190 160 3794 %%
|
||
Q387 emulator 0.1103 0.0798 0.0365 0.0364 301 5758 ((
|
||
Intel 387DX 0.9204 0.7212 0.3932 0.3211 2428 52677
|
||
ULSI 83C87 1.2093 0.7936 0.3890 0.3120 2528 56926
|
||
IIT 3C87 1.0196 0.7145 0.3834 0.3179 2663 58766
|
||
IIT 3C87,4X4    1.0196    1.7244   0.3834   0.3179   2663      58766  @@
|
||
C&T 38700 1.0722 0.7908 0.4007 0.3222 2837 74906
|
||
Cyrix 387+ 1.1305 0.8162 0.3945 0.3208 2946 80322
|
||
Cyrix EMC87 1.2381 0.7963 0.4025 0.3324 3061 86083 //
|
||
|
||
Intel RapidCAD 2.2128 1.8931 0.7377 0.5432 4810 86957
|
||
Intel 486DX 2.4762 2.1335 1.1110 0.8204 6195 98522
|
||
|
||
|
||
|
||
Benchmark results using the Cyrix 486DLC CPU and various coprocessors
|
||
---------------------------------------------------------------------
|
||
|
||
The Cyrix 486DLC is the latest entry into the market of 386DX replacement
|
||
processors. It features an Intel 486SX-compatible instruction set, a 1 KB on-
|
||
chip cache, and a 16x16 bit hardware multiplier. The RISC-like execution unit
|
||
of the 486DLC executes many instructions in a single clock cycle. The
|
||
hardware multiplier multiplies 16-bit quantities in 3 clock cycles, as
|
||
compared to 12-25 cycles on a standard Intel 386DX. This is especially useful
|
||
in address calculations (code from non-optimizing compilers may contain many
|
||
MUL instructions for array accesses) and for software floating-point
|
||
arithmetic. The 1 KB cache helps the 486DLC to overcome some of the
|
||
limitations of the 386 bus interface, and although its hit rate averages only
|
||
about 65% under normal program conditions, a 5-15% overall performance
|
||
increase can usually be seen for both integer and floating-point-intensive
|
||
applications when it is enabled.
|
||
|
||
The 486DLC's internal cache is a unified data/instruction write-through type,
|
||
and can be configured as either a direct mapped or a 2-way set associative
|
||
cache. For compatibility reasons, the cache is disabled after a processor
|
||
reset and must be enabled with the help of a small routine provided by
|
||
Cyrix. Cyrix has also defined some additional cache control signals for some
|
||
of the 486DLC pins, intended to improve communication between the on-chip
|
||
cache and an external cache. Current 386 systems ignore these signals, since
|
||
they are not defined for the standard Intel 386DX. However, future systems
|
||
designed with the 486DLC in mind may take advantage of them for increased
|
||
performance.
|
||
|
||
In existing 386 systems, DMA transfers (e.g., by a SCSI controller or a
|
||
soundcard) may cause the 486DLC's entire on-chip cache to be flushed, since
|
||
no other means exist to enforce consistency between the cache contents and
|
||
main memory. This reduces the performance of the 486DLC in these cases. The
|
||
486DLC on-chip cache does, however, allow specification of up to four non-
|
||
cacheable regions, which is particularly useful if your system has memory
|
||
mapped peripherals (e.g., a Weitek coprocessor).
|
||
|
||
Although I successfully ran my test programs on the Cyrix chip with all
|
||
coprocessors, not all of them work well with the 486DLC in all circumstances.
|
||
The IIT 3C87, the Cyrix 83D87 (chips manufactured prior to November 1991),
|
||
and the Cyrix EMC87 should not be used with the 486DLC, since they may cause
|
||
the computer to lock up if the FSAVE and FRSTOR instructions are used. (These
|
||
instructions are typically used in protected mode multiple task environments
|
||
to save and restore the coprocessor state for each task. Note that Microsoft
|
||
Windows also fits this description.) According to Cyrix, this problem occurs
|
||
only with first revision 486DLCs (sample chips) and is fixed on newer ones.
|
||
To be on the safe side, I recommend using the Cyrix 387+ with the 486DLC,
|
||
both for assured compatibility and for best performance. Note that 387+ is a
|
||
'Europe only' name and that this chip is called 83D87 elsewhere, just like
|
||
the old version. You need to get a 83D87 produced after about October 1991
|
||
to guarantee that it works correctly with any 486DLC; the same caveat applies
|
||
to the Cyrix 486SLC and the Cyrix 83S87. If you already have a Cyrix
|
||
coprocessor, use my COMPTEST program to find out whether you have a 'new' or
|
||
'old' coprocessor. COMPTEST is available as CTEST257.ZIP via anonymous ftp
|
||
from garbo.uwasa.fi (in the /systest directory) and other ftp servers that
|
||
mirror garbo.
|
||
|
||
The Cyrix 486DLC is currently the 386 'clone' with the highest integer
|
||
performance. With the internal cache enabled, integer performance of the
|
||
486DLC can be up to 80% higher than that of an Intel 386DX at the same clock
|
||
frequency, with the average speed gain for most applications being about 35%.
|
||
Floating-point applications are typically accelerated by about 15%-30% when
|
||
using a Cyrix 486DLC (with its cache enabled) instead of the Intel 386DX.
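
As a cross-check, the speedup can be computed directly from the benchmark
tables in this chapter. The Python sketch below compares the 33.3 MHz
results for the Cyrix 387+ on an Intel 386DX with those on a Cyrix 486DLC
with the cache enabled. The numbers are copied from the tables; the
helper name is illustrative.

```python
BENCHMARKS = ("PEAKFLOP", "TRNSFORM", "LLL", "Linpack", "Whetstone", "Savage")

# Cyrix 387+ at 33.3 MHz, values copied from the benchmark tables.
intel_386dx = (0.9286, 0.6806, 0.3293, 0.2669, 2435, 66890)
cyrix_486dlc_cache_on = (1.1429, 0.7445, 0.4023, 0.3310, 2866, 85349)


def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


gains = {
    name: percent_gain(new, old)
    for name, new, old in zip(BENCHMARKS, cyrix_486dlc_cache_on, intel_386dx)
}

for name, gain in gains.items():
    print(f"{name:10s} {gain:5.1f}%")
```

For this coprocessor the gains range from roughly 9% (TRNSFORM) to
roughly 28% (Savage). The same calculation can be repeated for any other
row of the tables.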
|
||
|
||
|
||
33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
Cyrix 486DLC
|
||
(cache off) WITH:
|
||
EM87 emulator 0.0089 0.0082 0.0062 0.0063 31 472 ##
|
||
Franke387 emu. 0.0402 0.0324 0.0258 0.0240 184 4807 $$
|
||
TP/MS-FORT emu 0.0346 0.0288 0.0206 0.0212 173 4401 %%
|
||
Q387 emulator 0.1214 0.0810 0.0368 0.0382 320 6020 ((
|
||
Intel 387DX 0.8455 0.6552 0.3659 0.3033 2249 48780
|
||
ULSI 83C87 1.1818 0.7543 0.3752 0.3026 2381 53476
|
||
IIT 3C87 0.9541 0.6609 0.3653 0.3036 2476 55814
|
||
IIT 3C87,4X4 0.9541 1.4988 0.3653 0.3036 2476 55814 @@
|
||
C&T 38700 1.1183 0.7644 0.3796 0.3087 2703 73350
|
||
Cyrix 387+ 1.1305 0.7445 0.3727 0.3060 2731 81967
|
||
Cyrix EMC87 1.2236 0.7593 0.3823 0.3144 2908 88889 //
|
||
|
||
Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
|
||
Intel 486DX 2.0800 1.7779 0.9387 0.6682 5143 82192
|
||
|
||
|
||
|
||
40.0 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
Cyrix 486DLC
|
||
(cache off) WITH:
|
||
EM87 emulator 0.0107 0.0098 0.0075 0.0075 37 567 ##
|
||
Franke387 emu. 0.0488 0.0392 0.0311 0.0288 223 5808 $$
|
||
TP/MS-FORT emu 0.0416 0.0345 0.0246 0.0253 208 5284 %%
|
||
Q387 emulator 0.1463 0.0973 0.0442 0.0458 384 7237 ((
|
||
Intel 387DX 1.0196 0.7880 0.4375 0.3644 2712 58479
|
||
ULSI 83C87 1.4247 0.9064 0.4506 0.3630 2868 64171
|
||
IIT 3C87 1.1556 0.7963 0.4399 0.3611 2988 66964
|
||
IIT 3C87,4X4 1.1556 1.7916 0.4399 0.3611 2988 66964 @@
|
||
C&T 38700 1.3333 0.9210 0.4548 0.3708 3254 88106
|
||
Cyrix 387+ 1.3507 0.8958 0.4477 0.3754 3297 98361
|
||
Cyrix EMC87 1.4648 0.9136 0.4548 0.3773 3505 106572 //
|
||
|
||
Intel RapidCAD 2.2128 1.8931 0.7377 0.5432 4810 86957
|
||
Intel 486DX 2.4762 2.1335 1.1110 0.8204 6195 98522
|
||
|
||
|
||
|
||
33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
Cyrix 486DLC
|
||
(cache on) WITH:
|
||
EM87 emulator 0.0099 0.0089 0.0068 0.0069 35 550 ##
|
||
Franke387 emu. 0.0462 0.0362 0.0288 0.0265 205 5445 $$
|
||
TP/MS-FORT emu 0.0410 0.0330 0.0234 0.0241 198 5339 %%
|
||
Q387 emulator 0.1344 0.0902 0.0389 0.0403 339 6241 ((
|
||
Intel 387DX 0.8525 0.6552 0.3941 0.3279 2332 49834
|
||
ULSI 83C87 1.2093 0.7543 0.4068 0.3270 2478 57197
|
||
IIT 3C87 0.9720 0.6609 0.3959 0.3295 2579 57252
|
||
IIT 3C87,4X4 0.9720 1.5087 0.3959 0.3295 2579 57252 @@
|
||
C&T 38700 1.1305 0.7644 0.4126 0.3343 2839 75949
|
||
Cyrix 387+ 1.1429 0.7445 0.4023 0.3310 2866 85349
|
||
Cyrix EMC87 1.2381 0.7593 0.4150 0.3412 3051 93897 //
|
||
|
||
Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
|
||
Intel 486DX 2.0800 1.7779 0.9387 0.6682 5143 82192
|
||
|
||
|
||
|
||
40.0 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
Cyrix 486DLC
|
||
(cache on) WITH:
|
||
EM87 emulator 0.0118 0.0107 0.0082 0.0082 42 659 ##
|
||
Franke387 emu. 0.0565 0.0438 0.0350 0.0313 248 6585 $$
|
||
TP/MS-FORT emu 0.0491 0.0395 0.0279 0.0296 238 6408 %%
|
||
Q387 emulator 0.1610 0.1084 0.0470 0.0484 407 7509 ((
|
||
Intel 387DX 1.0297 0.7880 0.4748 0.3937 2801 59821
|
||
ULSI 83C87 1.4445 0.9028 0.4891 0.3926 2976 65789
|
||
IIT 3C87 1.1686 0.7963 0.4734 0.3916 3096 68729
|
||
IIT 3C87,4X4 1.1686 1.8057 0.4734 0.3916 3096 68729 @@
|
||
C&T 38700 1.3685 0.9173 0.4958 0.4012 3401 91185
|
||
Cyrix 387+ 1.3867 0.8958 0.4887 0.3962 3448 102564
|
||
Cyrix EMC87 1.4857 0.9100 0.4959 0.4091 3676 112360 //
|
||
|
||
Intel RapidCAD 2.2128 1.8931 0.7377 0.5432 4810 86957
|
||
Intel 486DX 2.4762 2.1335 1.1110 0.8204 6195 98522
|
||
|
||
|
||
|
||
|
||
Benchmark results using the C&T 38600DX CPU and various coprocessors
|
||
--------------------------------------------------------------------
|
||
|
||
The Chips&Technologies 38600DX CPU is marketed as a 100% compatible
|
||
replacement for the Intel 386DX CPU. Unlike AMD's Am386, which uses microcode
|
||
that is identical to the Intel 386DX's, the C&T 38600DX uses microcode
|
||
developed independently by C&T using "clean-room" techniques. C&T even
|
||
included the 386DX's "undocumented" LOADALL386 instruction in the
|
||
instruction set to provide full compatibility with the 386DX. In my tests,
|
||
however, I observed that the 38600DX has severe problems with the CPU-
|
||
coprocessor communication, which causes the floating-point performance to
|
||
drop below that of the Intel 386DX/Intel 387DX for most programs. This
|
||
problem exists with all available 387-compatible coprocessors (ULSI 83C87,
|
||
IIT 3C87, Cyrix EMC87, Cyrix 83D87, Cyrix 387+, C&T 38700, Intel 387DX). A
|
||
net.acquaintance also did tests with the 38600DX and arrived at similar
|
||
results. He contacted C&T and they said that they were aware of the problem.
|
||
|
||
Some instructions execute faster on the C&T 38600DX than on the 386DX, giving
|
||
an average speedup of 5-10% for integer applications. C&T also produces a
|
||
38605DX CPU that includes a 512 byte instruction cache and provides a further
|
||
performance increase. However, the 38605DX needs a bigger socket (144-pin
|
||
PGA) and is therefore *not* pin-compatible with the 386DX. Tests using the
|
||
38600DX were run at 33.3 MHz, as a 40 MHz version was not available as of 09-
|
||
17-92 and running the 33 MHz chip version at 40 MHz locked up the machine
|
||
frequently. Unfortunately, tests using the Intel 387DX consistently locked up
|
||
in the TRNSFORM benchmark when run at 33.3 MHz. It ran fine at 20 MHz, and
|
||
the results were scaled to show expected performance at 33.3 MHz.
|
||
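The scaling mentioned above can be sketched as follows. This is only an
illustration of the arithmetic, assuming that throughput scales linearly
with clock frequency (which ignores memory wait states). The function name
and the 20 MHz input value are hypothetical and chosen for the example; they
are not measured data from the tests.

```python
def scale_to_clock(result, measured_mhz, target_mhz):
    """Linearly scale a throughput figure (e.g. MFLOPS) to another clock rate.

    Assumption: performance is proportional to clock frequency.
    """
    return result * target_mhz / measured_mhz


# Hypothetical 20 MHz TRNSFORM figure, used only to show the calculation.
hypothetical_20mhz_mflops = 0.3375
scaled = scale_to_clock(hypothetical_20mhz_mflops, 20.0, 33.3)
print(f"expected at 33.3 MHz: {scaled:.4f} MFLOPS")
```

Throughput figures (MFLOPS, kWhet/sec, Func/sec) scale up with the clock,
while elapsed times would scale the other way.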
|
||
|
||
33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
|
||
C&T 38600DX WITH:
|
||
Intel 387DX 0.7376 0.5620 0.3337 0.2636 2066 45489
|
||
ULSI 83C87 0.5226 0.4690 0.3236 0.2654 2087 43228
|
||
IIT 3C87 0.7879 0.5762 0.3397 0.2674 2263 51195
|
||
IIT 3C87,4X4 0.7879 0.6181 0.3397 0.2674 2263 51195 @@
|
||
C&T 38700 0.5977 0.5572 0.3463 0.2681 2338 63966
|
||
Cyrix 387+ 0.5896 0.5508 0.3438 0.2673 2375 66741
|
||
|
||
Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
|
||
Intel 486 2.0800 1.7779 0.9387 0.6682 5143 82192
|
||
|
||
|
||
For comparison:
|
||
|
||
PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
|
||
MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
|
||
|
||
i486DX2-66 4.1601 3.4227 1.6531 1.3010 10655 163934
|
||
i486DX2-50 3.0589 2.6665 1.2537 0.9744 7962 123203
|
||
i387, 20 MHz 0.2253 0.3271 0.1434 0.1171 952 21739 ++
|
||
i387DX, 20 MHz 0.3567 0.4444 0.1484 0.1161 1034 24155 &&
|
||
i80287, 5 MHz 0.0281 0.0310 0.0242 0.0222 150 3261 !!
|
||
i8087,9.54 MHz 0.0636 0.0705 0.0321 0.0219 234 5782 **
|
||
|
||
|
||
|
||
Benchmark notes and footnotes
|
||
-----------------------------
|
||
|
||
Hardware configuration for test of 387 coprocessors with C&T 38600DX, Intel
|
||
386DX, Cyrix 486DLC, and Intel RapidCAD CPUs:
|
||
|
||
System A: Motherboard with Forex chip set, 128 KB CPU Cache, 8 MB RAM
|
||
|
||
|
||
Hardware configuration for test of 486 FPU (extra fan for 40 MHz operation):
|
||
|
||
System B: Motherboard with SIS chip set, 256 KB CPU Cache, 8 MB RAM
|
||
|
||
|
||
## EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that
|
||
loads as a TSR. It uses INT 7 traps emitted by 80286, 80386, or 486SX
|
||
systems with no coprocessor upon encountering coprocessor instructions
|
||
to catch coprocessor instructions and emulate them. Whetstone and Savage
|
||
benchmarks for this test were compiled with the original TP 6.0 library,
|
||
as EM87 chokes on the 387-specific FSIN and FCOS instructions used in my
|
||
own library if a 387 is detected. Obviously EM87 identifies itself as a
|
||
387, but it has no support for 387-specific instructions.
|
||
|
||
$$ Franke387 is a commercial 387 emulator that is also available in a
|
||
shareware version. For this test, shareware version V2.4 was used.
|
||
Franke387, unlike many other emulators, supports all 387 instructions.
|
||
It is loaded as a device driver and uses INT 7 to trap coprocessor
|
||
instructions.
|
||
|
||
(( Q387 is an emulator that is distributed as a shareware program by
|
||
Quickware of Austin, Texas. As the name implies, this emulator uses
|
||
386 specific code and supports the full 387 instruction set. The
|
||
program is about 330 kByte in size and loads completely into extended
|
||
memory, using absolutely no DOS memory. It is loaded as a TSR and
|
||
requires an EMM (expanded memory manager) to be present. The emulation
|
||
uses the INT 7 mechanism. The version of Q387 used was 3.0a.
|
||
|
||
%% These benchmarks were run using the built-in coprocessor emulators of
|
||
the TP 6.0 (for Savage, LLL, Whetstone, TRNSFORM, PEAKFLOP) and the MS
|
||
FORTRAN 5.0 (for Linpack) run-time libraries by forcing the libraries
|
||
into not using a coprocessor by using the environment settings NO87=NC
|
||
and 87=N.
|
||
|
||
@@ The 3C87 specific F4X4 instruction was used in the vector transformation
|
||
benchmark.
|
||
|
||
// The EMC87 was used in the 387-compatible mode only. The faster memory-
|
||
mapped mode was *not* used. Times should therefore be identical to the
|
||
Cyrix 83D87.
|
||
|
||
++ Older motherboard with no chip set (discrete logic), no CPU cache, 16 MB
|
||
RAM
|
||
|
||
&& System A, CPU cache disabled via extended set-up, turbo-switch set to
|
||
half speed (that is, 20 MHz)
|
||
|
||
!! 80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM. Due to the
|
||
fast CPU used here, performance figures are somewhat higher than can be
|
||
expected for a 80286/287 combination, except for the PEAKFLOP benchmark,
|
||
which is basically coprocessor limited.
|
||
|
||
** 8086/8087 system with 640 KB RAM
|
||
|
||
|
||
Benchmark results for Weitek coprocessors
|
||
------------------------------------------
|
||
Since neither a Weitek coprocessor nor a compiler that generates code for the
|
||
Weitek chips were available to me, performance data for the Weitek Abacus is
|
||
given here according to [31,32] and scaled to show performance of a 33 MHz
|
||
system. The benchmarks were compiled using highly-optimizing 32-bit
|
||
compilers.
|
||
|
||
Single Prec. Double Prec. Double Prec.
|
||
|
||
3167 4167 3167 4167 387 486
|
||
|
||
Linpack MFLOPS 1.8 5.0 0.8 3.2 0.4 1.6
|
||
Whetstone kWhet/sec 7470 22700 4900 14000 3290 12300
|
||
|
||
Note that for the Intel coprocessors, running programs in single vs. double-
|
||
precision doesn't provide much of a performance advantage since all internal
|
||
calculations are always done in extended precision. Using Weitek
|
||
coprocessors, however, performance nearly doubles in single-precision mode.
|
||
For double-precision calculations using only basic arithmetic, the Weitek
|
||
Abacus can at most provide performance at twice the level of the respective
|
||
Intel coprocessor (387/486) at the same clock speed.
|
||
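The ratios behind these statements can be recomputed from the table above. The
following is a small sketch; all figures are copied from that table, while the
dictionary and function names are made up for the example.

```python
# Figures copied from the Weitek table above (33 MHz systems, per [31,32]).
LINPACK_MFLOPS = {
    "3167": {"single": 1.8, "double": 0.8},
    "4167": {"single": 5.0, "double": 3.2},
}
WHETSTONE_KWHET = {
    "3167": {"single": 7470, "double": 4900},
    "4167": {"single": 22700, "double": 14000},
}
INTEL_DOUBLE = {
    "387": {"linpack": 0.4, "whetstone": 3290},
    "486": {"linpack": 1.6, "whetstone": 12300},
}
# Each Weitek chip is compared with its respective Intel coprocessor.
RESPECTIVE_INTEL = {"3167": "387", "4167": "486"}


def single_over_double(table, chip):
    """Speedup of single precision over double precision on one Weitek chip."""
    return table[chip]["single"] / table[chip]["double"]


def weitek_over_intel(chip, benchmark):
    """Double-precision speedup of a Weitek chip over its Intel counterpart."""
    table = LINPACK_MFLOPS if benchmark == "linpack" else WHETSTONE_KWHET
    intel = INTEL_DOUBLE[RESPECTIVE_INTEL[chip]][benchmark]
    return table[chip]["double"] / intel


for chip in ("3167", "4167"):
    print(chip,
          "single/double Linpack: %.2f" % single_over_double(LINPACK_MFLOPS, chip),
          "single/double Whetstone: %.2f" % single_over_double(WHETSTONE_KWHET, chip),
          "vs Intel Linpack: %.2f" % weitek_over_intel(chip, "linpack"),
          "vs Intel Whetstone: %.2f" % weitek_over_intel(chip, "whetstone"))
```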
|
||
|
||
Comparison of floating-point performance [30,32]
|
||
|
||
single-precision
|
||
|
||
Weitek 4167-33 Intel 486-33 Intel 486DX2-66
|
||
|
||
Linpack MFLOPS 5.0 1.8 3.5
|
||
Whetstones kWhet/sec 22700 12700 25500
|
||
|
||
|
||
double-precision
|
||
|
||
Weitek 4167-33 Intel 486-33 Intel 486DX2-66
|
||
|
||
LINPACK MFLOPS 3.5 1.6 3.1
|
||
kWhetstones/sec 14000 12300 24700
|
||
|
||
|
||
|
||
=============================================================================
|
||
Clock-cycle timings for coprocessor instructions on various coprocessor chips
|
||
=============================================================================
|
||
|
||
Speed of various coprocessor instructions, measured in clock cycles, as
|
||
captured by my program 87TIMES. Error is +/- one clock cycle, except for the
|
||
Intel 80287. Times for the 80287 were determined on a system with a 20 MHz
|
||
80386 and a 5 MHz Intel 80287. Therefore, times may differ from a genuine
|
||
80286/287 system, especially for those instructions that access an operand in
|
||
memory. Since the times are stated as the number of coprocessor clock cycles
|
||
used, the faster 386 which can execute four clock cycles where the 80287
|
||
executes one clock cycle may decrease memory access times as seen by the
|
||
coprocessor.
|
||
|
||
The CPU used in testing the 387 coprocessors was an Intel 386DX. Note that
|
||
due to the improved coprocessor interface of the Cyrix 486DLC the execution
|
||
time of most coprocessor instructions drops by 2-3 clock cycles when used
|
||
with this CPU.
|
||
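To relate the cycle counts in the following tables to wall-clock time, divide
the number of cycles by the clock frequency in MHz, which gives the time in
microseconds. A minimal sketch, using the FSIN (1.0) entries for the Intel
387DX and the Cyrix 83D87 and assuming a 33.3 MHz coprocessor clock (the
function name is made up for the example):

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a clock-cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


# FSIN (1.0) cycle counts taken from the table; 33.3 MHz is an assumed clock.
for chip, cycles in (("Intel 387DX", 509), ("Cyrix 83D87", 114)):
    print("%-12s %4d cycles = %6.2f us" % (
        chip, cycles, cycles_to_microseconds(cycles, 33.3)))
```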
|
||
|
||
Intel Intel Cyrix Cyrix C&T ULSI IIT Intel Intel
|
||
i486 RapidCAD 83D87 387+ 38700 83C87 3C87 387DX 80387
|
||
|
||
FLD1 4 3 14 14 14 18 24 23 26
|
||
FLDZ 4 3 14 14 14 18 24 23 31
|
||
FLDPI 7 8 14 15 14 18 24 38 45
|
||
FLDLG2 7 8 14 14 14 18 24 33 45
|
||
FLDL2T 7 8 14 14 14 19 24 38 45
|
||
FLDL2E 7 8 14 14 14 19 24 38 45
|
||
FLDLN2 7 8 14 14 14 19 24 38 45
|
||
FLD ST(0) 4 4 14 14 14 14 24 20 21
|
||
FST ST(1) 3 4 14 14 14 14 19 18 22
|
||
FSTP ST(0) 4 4 14 14 14 15 19 19 22
|
||
FSTP ST(1) 4 4 15 15 14 15 19 20 22
|
||
FLD ST(1) 4 4 14 14 14 14 24 18 21
|
||
FXCH ST(1) 4 4 14 20 14 19 24 24 27
|
||
FILD [Word] 12 16 33 37 32 42 38 47 62
|
||
FILD [DWord] 8 11 26 26 21 32 28 35 45
|
||
FILD [QWord] 9 15 30 30 25 36 32 34 54
|
||
FLD [DWord] 3 5 26 26 21 23 28 20 25
|
||
FLD [QWord] 3 7 30 30 25 27 32 24 35
|
||
FLD [TByte] 5 11 46 46 46 46 47 46 57
|
||
FBLD [TByte] 83 90 66 86 106 146 197 71 278
|
||
FIST [Word] 31 31 37 40 37 42 51 69 90
|
||
FIST [DWord] 29 30 35 40 35 40 49 66 84
|
||
FST [DWord] 7 7 35 37 32 40 33 37 40
|
||
FST [QWord] 8 9 43 43 39 47 40 45 51
|
||
FISTP [Word] 32 32 42 40 37 43 46 70 90
|
||
FISTP [DWord] 31 31 40 40 35 41 50 67 87
|
||
FISTP [QWord] 29 29 44 44 42 48 56 73 92
|
||
FSTP [DWord] 8 8 38 36 32 41 35 38 43
|
||
FSTP [QWord] 9 9 46 43 39 48 42 46 49
|
||
FSTP [TByte] 8 8 50 45 49 50 48 53 58
|
||
FBSTP [TByte] 170 172 98 98 114 129 218 144 533
|
||
FINIT 17 31 15 16 15 15 16 16 25
|
||
FCLEX 7 20 15 16 16 16 16 16 25
|
||
FCHS 7 8 14 15 14 14 19 30 33
|
||
FABS 5 5 14 15 14 14 19 30 33
|
||
FXAM 12 13 14 15 14 14 19 39 43
|
||
FTST 5 5 19 25 14 24 24 34 38
|
||
FSTENV 67 82 125 125 124 132 124 159 165
|
||
FLDENV 44 59 106 106 112 120 106 119 129
|
||
FSAVE 181 169 355 355 374 361 376 469 511
|
||
FRSTOR 130 203 358 358 385 372 371 420 456
|
||
FSTSW [mem] 4 5 14 14 14 14 14 14 17
|
||
FSTSW AX 3 4 12 12 11 11 11 11 14
|
||
FSTCW [mem] 4 5 14 14 13 13 13 14 18
|
||
FLDCW [mem] 4 11 26 26 31 32 27 32 36
|
||
FADD ST,ST(0) 8 9 19 20 19 19 24 24 32
|
||
FADD ST,ST(1) 9 9 19 20 19 18 24 20 32
|
||
FADD ST(1),ST 10 10 19 20 19 18 24 24 37
|
||
FADDP ST(1),ST 11 11 19 19 19 16 24 25 37
|
||
FADD [DWord] 9 10 25 28 22 23 23 21 34
|
||
FADD [QWord] 9 10 32 32 26 27 27 25 38
|
||
FIADD [Word] 20 21 34 34 33 40 40 52 80
|
||
FIADD [DWord] 20 21 27 28 27 30 30 37 61
|
||
FSUB ST(1),ST 10 10 19 20 19 19 24 24 38
|
||
FSUBR ST(1),ST 9 10 19 22 19 19 24 27 38
|
||
FSUBRP ST(1),ST 10 10 19 19 22 20 24 25 38
|
||
FSUB [DWord] 11 12 27 28 27 23 29 27 32
|
||
FSUB [QWord] 11 12 32 32 31 27 33 26 44
|
||
FISUB [Word] 21 21 34 34 34 40 40 52 80
|
||
FISUB [DWord] 21 22 27 28 27 29 30 40 60
|
||
FMUL ST,ST(1) 16 17 19 25 24 24 29 38 57
|
||
FMUL ST(1),ST 16 17 19 24 24 24 29 40 62
|
||
FMULP ST(1),ST 17 17 19 24 24 25 29 40 58
|
||
FIMUL [Word] 22 23 40 40 37 46 46 52 80
|
||
FIMUL [DWord] 22 23 27 28 27 36 35 45 68
|
||
FMUL [DWord] 11 12 27 28 27 28 29 25 45
|
||
FMUL [QWord] 14 15 32 32 31 32 33 37 61
|
||
FDIV ST,ST(0) 73 74 26 40 59 54 54 89 100
|
||
FDIV ST,ST(1) 73 74 36 45 59 54 54 77 100
|
||
FDIV ST(1),ST 73 74 36 45 59 55 54 78 102
|
||
FDIVR ST(1),ST 73 74 36 45 59 54 54 77 102
|
||
FDIVRP ST(1),ST 73 74 36 44 59 55 54 76 106
|
||
FIDIV [Word] 84 85 52 58 75 76 76 105 141
|
||
FIDIV [DWord] 84 85 45 46 65 65 65 101 123
|
||
FDIV [DWord] 73 74 45 46 63 56 59 77 101
|
||
FDIV [QWord] 73 74 50 50 67 60 63 78 103
|
||
FSQRT (0.0) 25 25 19 19 14 19 24 29 37
|
||
FSQRT (1.0) 83 84 36 74 54 89 59 109 132
|
||
FSQRT (L2T) 86 87 36 74 54 89 59 104 137
|
||
FXTRACT (L2T) 17 17 19 19 19 28 79 53 72
|
||
FSCALE (PI,5) 30 30 36 24 24 49 79 59 82
|
||
FRNDINT (PI) 31 31 19 29 24 34 29 49 82
|
||
FPREM (99,PI) 58 59 54 99 44 54 49 79 96
|
||
FPREM1(99,PI) 90 91 54 99 44 59 54 104 121
|
||
FCOM 5 6 15 20 19 25 19 29 32
|
||
FCOMP 6 6 15 19 19 25 19 30 33
|
||
FCOMPP 7 7 15 19 19 25 19 31 40
|
||
FICOM [Word] 16 17 34 34 33 46 34 58 76
|
||
FICOM [DWord] 16 16 21 28 21 35 23 45 57
|
||
FCOM [DWord] 5 6 21 28 22 23 23 27 34
|
||
FCOM [QWord] 5 8 27 32 25 27 27 31 39
|
||
FSIN (0.0) 24 24 14 99 14 19 24 39 43
|
||
FSIN (1.0) 310 313 114 164 144 494 219 509 596
|
||
FSIN (PI) 88 89 118 189 64 64 214 134 152
|
||
FSIN (LG2) 292 295 72 89 139 454 184 449 531
|
||
FSIN (L2T) 299 302 123 179 164 469 214 454 536
|
||
FCOS (0.0) 24 24 19 159 14 19 24 34 42
|
||
FCOS (1.0) 302 305 84 104 139 489 214 459 547
|
||
FCOS (PI) 88 89 154 254 64 64 224 199 232
|
||
FCOS (LG2) 300 303 108 149 139 454 194 504 583
|
||
FCOS (L2T) 307 310 159 239 164 469 224 509 601
|
||
FSINCOS (0.0) 25 25 14 19 19 18 34 38 55
|
||
FSINCOS (1.0) 353 356 124 174 254 493 419 538 636
|
||
FSINCOS (PI) 105 106 162 263 79 68 424 228 277
|
||
FSINCOS (LG2) 340 343 119 159 249 458 359 533 627
|
||
FSINCOS (L2T) 347 350 168 248 274 473 424 538 646
|
||
FPTAN (0.0) 25 25 14 19 19 18 29 38 46
|
||
FPTAN (1.0) 266 269 119 149 184 538 309 323 396
|
||
FPTAN (PI) 145 146 134 228 104 108 304 168 211
|
||
FPTAN (LG2) 244 246 94 129 179 498 274 298 363
|
||
FPTAN (L2T) 247 249 139 219 204 513 304 298 365
|
||
FPATAN (0.0) 38 39 19 24 19 20 29 95 93
|
||
FPATAN (1.0) 294 298 124 159 29 375 604 360 433
|
||
FPATAN (PI) 304 308 139 188 279 360 424 375 472
|
||
FPATAN (LG2) 290 293 128 154 269 365 379 375 448
|
||
FPATAN (L2T) 304 308 144 189 274 359 424 375 468
|
||
F2XM1 (0.0) 25 25 14 14 14 19 24 34 37
|
||
F2XM1 (LN2) 209 211 89 119 169 394 284 299 348
|
||
F2XM1 (LG2) 204 206 78 104 159 379 284 294 337
|
||
FYL2X (1.0) 60 61 36 39 24 75 94 115 127
|
||
FYL2X (PI) 294 297 108 163 249 450 359 395 504
|
||
FYL2X (LG2) 311 314 108 159 249 460 339 410 518
|
||
FYL2X (L2T) 293 296 108 164 249 439 359 390 501
|
||
FYL2XP1 (LG2) 334 337 99 169 234 460 284 435 538
|
||
|
||
|
||
|
||
80386 + 80386 + 80386 + 80386 +
|
||
Intel Intel Q387 Franke387 TP 6.0 EM87
|
||
8087 80287 Emulator Emulator Emulator Emulator
|
||
|
||
FLD1 26 55 51 481 422 1626
|
||
FLDZ 21 53 39 480 416 1646
|
||
FLDPI 26 55 51 486 443 1626
|
||
FLDLG2 26 56 51 486 423 1626
|
||
FLDL2T 26 55 51 486 440 1626
|
||
FLDL2E 26 53 52 486 423 1626
|
||
FLDLN2 26 55 52 486 441 1626
|
||
FLD ST(0) 31 55 57 493 362 1851
|
||
FST ST(1) 26 54 61 489 355 1931
|
||
FSTP ST(0) 26 54 46 507 358 2115
|
||
FSTP ST(1) 21 55 66 507 356 2116
|
||
FLD ST(1) 26 55 54 493 362 1852
|
||
FXCH ST(1) 21 57 80 497 486 2187
|
||
FILD [Word] 58 90 122 667 712 2259
|
||
FILD [DWord] 64 74 121 608 812 2164
|
||
FILD [QWord] 74 93 179 652 707 2971
|
||
FLD [DWord] 49 44 106 633 473 2077
|
||
FLD [QWord] 54 57 118 641 524 2336
|
||
FLD [TByte] 59 45 102 607 492 2063
|
||
FBLD [TByte] 309 310 736 2019 1512 17827
|
||
FIST [Word] 79 72 143 854 766 2418
|
||
FIST [DWord] 84 80 136 865 518 2325
|
||
FST [DWord] 89 85 124 686 441 2200
|
||
FST [QWord] 99 92 135 703 516 2481
|
||
FISTP [Word] 79 80 154 864 794 2620
|
||
FISTP [DWord] 79 81 144 879 541 2523
|
||
FISTP [QWord] 88 75 184 904 916 3226
|
||
FSTP [DWord] 89 75 133 713 467 2400
|
||
FSTP [QWord] 93 72 142 732 538 2678
|
||
FSTP [TByte] 49 21 111 685 467 2124
|
||
FBSTP [TByte] 528 472 1124 3305 1555 27013
|
||
FINIT 11 10 1079 742 641 1369
|
||
FCLEX 11 10 48 440 323 912
|
||
FCHS 21 54 45 460 354 1744
|
||
FABS 21 54 43 456 349 1738
|
||
FXAM 21 54 72 481 380 1551
|
||
FTST 51 75 70 585 386 2721
|
||
FSTENV 54 57 827 928 519 2104
|
||
FLDENV 48 50 780 1125 450 1631
|
||
FSAVE 214 244 3929 1949 976 2749
|
||
FRSTOR 209 227 2901 2182 657 2225
|
||
FSTSW [mem] 28 10 87 516 401 1189
|
||
FSTSW AX N/A 55 57 451 N/A N/A
|
||
FSTCW [mem] 28 10 74 506 359 1167
|
||
FLDCW [mem] 19 47 91 524 437 1584
|
||
FADD ST,ST(0) 86 128 136 643 706 2805
|
||
FADD ST,ST(1) 85 116 146 707 808 3093
|
||
FADD ST(1),ST 92 131 157 664 812 3146
|
||
FADDP ST(1),ST 92 129 164 704 799 3143
|
||
FADD [DWord] 105 122 221 874 969 3139
|
||
FADD [QWord] 115 122 232 888 1021 3396
|
||
FIADD [Word] 115 122 238 940 1211 3330
|
||
FIADD [DWord] 125 122 239 882 1297 3215
|
||
FSUB ST(1),ST 88 130 171 738 817 3156
|
||
FSUBR ST(1),ST 96 132 181 740 868 3004
|
||
FSUBRP ST(1),ST 99 132 193 733 805 3301
|
||
FSUB [DWord] 119 122 230 918 1018 3127
|
||
FSUB [QWord] 129 123 242 932 1070 3632
|
||
FISUB [Word] 115 123 268 977 1081 3802
|
||
FISUB [DWord] 125 125 289 940 980 4161
|
||
FMUL ST,ST(1) 145 151 297 810 1368 3924
|
||
FMUL ST(1),ST 145 151 296 817 1377 3962
|
||
FMULP ST(1),ST 148 168 304 840 1365 4164
|
||
FIMUL [Word] 132 151 384 1039 1517 4039
|
||
FIMUL [DWord] 141 151 383 980 1643 3976
|
||
FMUL [DWord] 125 123 345 948 1480 3445
|
||
FMUL [QWord] 175 192 387 991 1602 4416
|
||
FDIV ST,ST(0) 201 207 274 726 1536 9789
|
||
FDIV ST,ST(1) 203 218 299 808 1658 10332
|
||
FDIV ST(1),ST 207 214 299 825 1655 10342
|
||
FDIVR ST(1),ST 201 206 302 819 1806 10213
|
||
FDIVRP ST(1),ST 201 205 309 845 1803 10409
|
||
FIDIV [Word] 237 227 390 980 1779 11225
|
||
FIDIV [DWord] 246 227 411 944 1680 11572
|
||
FDIV [DWord] 229 226 352 893 1722 10577
|
||
FDIV [QWord] 236 227 391 993 1777 10829
|
||
FSQRT (0.0) 21 57 60 512 382 1755
|
||
FSQRT (1.0) 186 206 294 1106 2504 37836
|
||
FSQRT (L2T) 186 207 295 1398 2467 37925
|
||
FXTRACT (L2T) 51 56 155 726 571 3326
|
||
FSCALE (PI,5) 41 56 95 817 443 3194
|
||
FRNDINT (PI) 51 58 136 808 800 7092
|
||
FPREM (99,PI) 81 131 322 1696 941 4098
|
||
FPREM1(99,PI) N/A N/A 384 1625 N/A N/A
|
||
FCOM 56 75 155 582 483 2799
|
||
FCOMP 61 92 160 616 485 2983
|
||
FCOMPP 61 90 149 661 476 3198
|
||
FICOM [Word] 79 77 231 808 861 3654
|
||
FICOM [DWord] 89 77 231 750 964 3684
|
||
FCOM [DWord] 74 75 214 741 625 3643
|
||
FCOM [QWord] 74 76 205 754 667 3771
|
||
FSIN (0.0) N/A N/A 137 639 N/A N/A
|
||
FSIN (1.0) N/A N/A 997 4640 N/A N/A
|
||
FSIN (PI) N/A N/A 322 2488 N/A N/A
|
||
FSIN (LG2) N/A N/A 978 3911 N/A N/A
|
||
FSIN (L2T) N/A N/A 1005 3767 N/A N/A
|
||
FCOS (0.0) N/A N/A 182 740 N/A N/A
|
||
FCOS (1.0) N/A N/A 988 4777 N/A N/A
|
||
FCOS (PI) N/A N/A 337 2557 N/A N/A
|
||
FCOS (LG2) N/A N/A 976 4176 N/A N/A
|
||
FCOS (L2T) N/A N/A 1001 3905 N/A N/A
|
||
FSINCOS (0.0) N/A N/A 225 714 N/A N/A
|
||
FSINCOS (1.0) N/A N/A 1841 6049 N/A N/A
|
||
FSINCOS (PI) N/A N/A 1167 4091 N/A N/A
|
||
FSINCOS (LG2) N/A N/A 1525 5640 N/A N/A
|
||
FSINCOS (L2T) N/A N/A 1552 5405 N/A N/A
|
||
FPTAN (0.0) 41 58 90 752 8381 2324
|
||
FPTAN (1.0) 581 582 1182 6366 10817 29824
|
||
FPTAN (PI) 606 587 292 4388 12410 2300
|
||
FPTAN (LG2) 516 513 883 5939 12502 26770
|
||
FPTAN (L2T) 576 586 954 5723 12483 2301
|
||
FPATAN (0.0) 41 55 123 616 1208 10578
|
||
FPATAN (1.0) 736 736 171 1426 13446 34208
|
||
FPATAN (PI) 206 207 11115 2835 13305 46903
|
||
FPATAN (LG2) 756 736 11077 2490 13319 41312
|
||
FPATAN (L2T) 206 204 11117 2922 13364 50149
|
||
F2XM1 (0.0) 16 56 102 563 723 1722
|
||
F2XM1 (LN2) 631 624 905 4178 11070 33823
|
||
F2XM1 (LG2) 611 585 890 4798 11116 32163
|
||
FYL2X (1.0) 56 57 136 961 1214 4327
|
||
FYL2X (PI) 946 961 1008 8987 12858 40148
|
||
FYL2X (LG2) 1081 1038 1035 8933 12748 46821
|
||
FYL2X (L2T) 926 886 1089 8982 12712 38986
|
||
FYL2XP1 (LG2) 1026 1037 1154 10485 11867 44708
|
||
|
||
|
||
Clock-cycle timings for floating-point operations on Weitek coprocessors
|
||
------------------------------------------------------------------------
|
||
|
||
The Weitek 3167 and 4167 coprocessors only implement the basic arithmetic
|
||
functions (add, subtract, multiply, divide, square root) in hardware;
|
||
transcendental functions are implemented by means of a software library
|
||
supplied by Weitek which uses the basic hardware instructions to approximate
|
||
the transcendental functions (using polynomial and rational approximations).
|
||
The clock cycle timings for the transcendental functions are average values,
|
||
since execution time can differ with the value of argument. The speed of
|
||
transcendental functions for the 4167 is estimated based on the numbers in
|
||
[31,33], from which this timing information has been extracted.
|
||
|
||
|
||
Single-precision Double-precision
|
||
|
||
3167 4167 3167 4167
|
||
|
||
ABS 3 2 3 2
|
||
NEG 6 2 6 2
|
||
ADD 6 2 6 2
|
||
SUB 6 2 6 2
|
||
SUBR 6 2 6 2
|
||
MUL 6 2 10 3
|
||
DIVR 38 17 66 31
|
||
SQRT 60 17 118 31
|
||
SIN 146 ~50 292 ~100
|
||
COS 140 ~50 285 ~100
|
||
TAN 188 ~60 340 ~110
|
||
EXP 179 ~60 401 ~130
|
||
LOG 171 ~60 365 ~120
|
||
F->ASCII 1000 N/A 1700 N/A //
|
||
ASCII->F 1100 N/A 1800 N/A //
|
||
|
||
// rough average of the timings given for different numeric
|
||
formats by Weitek. Note that these conversion routines
|
||
do much more work than the FBLD and FBSTP instructions
|
||
provided by the 80x87 coprocessors. FBLD and FBSTP are
|
||
useful for conversion routines but quite a bit of additional
|
||
code is needed for this purpose.
|
||
|
||
|
||
|
||
=============================================================================
|
||
Accuracy of calculations performed by a coprocessor / The IEEETEST program
|
||
=============================================================================
|
||
|
||
Among the 80x87 coprocessors, the IEEE-754 Standard for Binary Floating-Point
|
||
Arithmetic [10,11] was first fully implemented by Intel's 387 coprocessor [17].
|
||
Among other things, this means that the add, subtract, multiply, divide,
|
||
remainder, and square root operations always deliver the 'exact' result. By
|
||
'exact', the standard means that the coprocessor always delivers the machine
|
||
number closest to the real result, which may not always be representable
|
||
exactly in the available numeric format. The 80387 implements the single,
|
||
double, and double extended formats as specified in the IEEE standard, as
|
||
well as all functions required by it [17].
|
||
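The notion of an 'exact' (correctly rounded) result can be checked in
software. The sketch below does this for IEEE double precision using Python,
whose float type is assumed here to be an IEEE-754 double in round-to-nearest
mode (true for CPython on common platforms). It illustrates the definition
only; it is not how IEEETEST works, and it does not exercise the 80x87's
double extended format. The function name is made up for the example, and
math.nextafter requires Python 3.9 or later.

```python
import math
from fractions import Fraction


def is_correctly_rounded(result, exact):
    """Return True if `result` is a double closest to the exact value.

    `exact` is a Fraction holding the infinitely precise result. The check
    compares the error of `result` with the errors of its two neighbouring
    machine numbers. Only meant for finite values well inside the range.
    """
    err = abs(Fraction(result) - exact)
    below = math.nextafter(result, -math.inf)
    above = math.nextafter(result, math.inf)
    return (err <= abs(Fraction(below) - exact)
            and err <= abs(Fraction(above) - exact))


a, b = 1.0, 3.0
exact_quotient = Fraction(a) / Fraction(b)   # exactly 1/3, not representable
print(is_correctly_rounded(a / b, exact_quotient))
```

The quotient 1/3 has no finite binary representation, so the division must
round; the check confirms that the delivered double is the nearest one.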
|
||
Note that earlier Intel coprocessors (the 8087 and the 80287) comply with a
|
||
draft version of the standard that differs from the final version. These
|
||
chips were developed before IEEE-754 was finally accepted in 1985. As with
|
||
the 80387, the basic arithmetic in the 8087 and the 80287 is 'exact' in the
|
||
sense that the computed result is always the machine number closest to the
|
||
real result. However, there are some differences regarding certain operands
|
||
like infinities, and some operations like the remainder are defined
|
||
differently than in the final version of the standard.
|
||
|
||
Some new instructions were introduced with the 80387, most notably the FSIN
|
||
and FCOS operations. The argument range for some transcendental functions has
|
||
also been extended [17]. Note that the IEEE-754 standard says nothing about
|
||
the quality of the implementation of transcendental functions like sin, cos,
|
||
tan, arctan, log. Intel uses a modified CORDIC [18,19] technique to compute
|
||
the transcendental functions; Intel claims that maximum error in the 8087,
|
||
80287, and 80387 for all transcendental functions does not exceed two bits in
|
||
the mantissa of the double extended format, which features 64 mantissa bits
|
||
for an overall accuracy of approximately 19 decimal places [22,23]. This
|
||
claim has been independently verified by a competing vendor [13]. This means
|
||
that at least 62 of the 64 mantissa bits returned as a result by one of the
|
||
transcendental function instructions are guaranteed to be correct.
|
||
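The relation between mantissa bits and decimal places quoted above follows
from the fact that each bit is worth log10(2) decimal digits. A short sketch
of the arithmetic (the function name is made up for the example):

```python
import math


def bits_to_decimal_digits(bits):
    """Number of decimal digits equivalent to the given number of bits."""
    return bits * math.log10(2)


print("64 mantissa bits: %.2f decimal digits" % bits_to_decimal_digits(64))
print("62 correct bits:  %.2f decimal digits" % bits_to_decimal_digits(62))
# Relative error if the last two of 64 mantissa bits are wrong:
print("relative error bound: %.3g" % 2.0 ** -62)
```

So a result with 62 correct mantissa bits is still good to between 18 and 19
decimal places.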
|
||
The Weitek Abacus 3167 and 4167 coprocessors are 'mostly compatible' with
|
||
IEEE-754 [31,32,33]. They support the single-precision and double-precision
|
||
numeric formats described in the standard, as well as the four rounding modes
|
||
required by it. However, due to Weitek's desire for extremely high-speed
|
||
operation, some of the finer points of IEEE-754 have not been implemented.
|
||
One of the most notable omissions is the missing support for denormal
|
||
numbers; denormals are always flushed to zero on Weitek chips.
|
||
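The practical effect of flushing denormals to zero can be illustrated with
ordinary double-precision numbers. The sketch below is a simplified software
model, not a description of the Weitek hardware: it assumes Python floats are
IEEE-754 doubles with gradual underflow, and the flush_to_zero helper
(including its choice to keep the sign of the zero) is made up for the
example.

```python
import math
import sys

SMALLEST_NORMAL = sys.float_info.min   # 2**-1022 for IEEE double precision


def flush_to_zero(x):
    """Model a unit without denormal support: tiny values become zero."""
    if x != 0.0 and abs(x) < SMALLEST_NORMAL:
        return math.copysign(0.0, x)
    return x


a = 1.5 * SMALLEST_NORMAL
b = 1.0 * SMALLEST_NORMAL
diff = a - b                  # 0.5 * 2**-1022, representable only as a denormal

print(a != b)                        # the operands differ
print(diff != 0.0)                   # with denormals the difference is nonzero
print(flush_to_zero(diff) == 0.0)    # flushed: the difference looks like zero
```

With denormal support, a - b is zero only if a equals b; with flush-to-zero
that property is lost for very small operands.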
|
||
The 387 clone manufacturers all claim 100% compatibility with Intel's 80387,
|
||
so one would reasonably expect the same accuracy from their chips as from
|
||
Intel's. For example, the packaging of the IIT 3C87 states that "...the
|
||
requirements of ANSI/IEEE standards are fulfilled and exceeded". Cyrix states
|
||
that their 83D87 complies fully with the IEEE-754 standard [12], and in fact
|
||
delivers with their coprocessors diagnostic software that includes the
|
||
program IEEETEST. This program is based on the IEEE test vectors from the PhD
|
||
thesis of Dr. Jerome T. Coonen [9]. A test using the IEEE test vectors has
|
||
also been included in the RUNDIAG program on the Intel RapidCAD diagnostic
|
||
disk. Rather than performing random tests, the test vectors check specific
|
||
cases that may be hard to get right. Each test vector specifies the operation
|
||
to be performed, the operands, precision and rounding mode to be used, and
|
||
the result (including flags set) to be expected according to the IEEE-754
|
||
standard.
|
||
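The idea of a test vector can be sketched in a few lines. The record layout
below is hypothetical and is NOT the format of the Berkeley test vectors; it
only covers double precision, round-to-nearest, finite in-range operands,
nonzero divisors, and a single flag (inexact), because that is what can be
checked portably with Python floats (assumed to be IEEE-754 doubles). All
names are made up for the example. The '.' and 'F' markers mirror the
pass/fail display used by IEEETEST.

```python
import operator
from dataclasses import dataclass
from fractions import Fraction

OPERATIONS = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "/": operator.truediv}


@dataclass
class Vector:
    operation: str         # one of '+', '-', '*', '/'
    operands: tuple        # two finite doubles (divisor must be nonzero)
    expected: float        # result required by the standard
    expect_inexact: bool   # whether the inexact flag must be raised


def check_vector(vec):
    """Return '.' if result and inexact flag match the vector, else 'F'."""
    a, b = vec.operands
    op = OPERATIONS[vec.operation]
    result = op(a, b)                      # rounded machine result
    exact = op(Fraction(a), Fraction(b))   # infinitely precise result
    inexact = Fraction(result) != exact    # rounding occurred
    numeric_ok = result == vec.expected
    flag_ok = inexact == vec.expect_inexact
    return "." if numeric_ok and flag_ok else "F"


vectors = [
    Vector("+", (1.0, 2.0), 3.0, False),
    Vector("/", (1.0, 3.0), 1.0 / 3.0, True),
    Vector("*", (0.1, 3.0), 0.3, True),    # deliberately wrong expected value
]
print("".join(check_vector(v) for v in vectors))
```

The last vector lists a deliberately incorrect expected value to show what a
numeric failure looks like.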
|
||
I ran IEEETEST on all the available coprocessors/FPUs. The Intel 486, Intel
|
||
RapidCAD, Intel 387, Intel 387DX, C&T 38700, Cyrix 83D87, and Cyrix 387+ passed with
|
||
no errors. The ULSI 83C87 showed some minor flaws in the FCOM, FDIV, FMUL,
|
||
and FSCALE operations, getting flag errors in about 1% of the tested cases,
|
||
but no computational errors. However, for the IIT 3C87, the IEEETEST program
|
||
showed flag *and* some computational errors (that is, wrong results) for all
|
||
tested operations except FXTRACT and FCHS. The Intel 8087 and 80287 show
|
||
numerous errors, but this is not surprising, since they do not comply with
|
||
IEEE-754 but with an earlier draft of that standard, so they do some things
|
||
differently than required by the final version of the standard. In particular
|
||
the Intel 8087/80287 do not feature the IEEE-754 compliant comparison (FUCOM)
|
||
and remainder (FPREM1) instructions available on the Intel 80387 and newer
|
||
coprocessors, so IEEETEST uses the non-compliant FCOM and FPREM instructions
|
||
on these processors. Lack of an IEEE-754 compliant comparison instruction also
|
||
causes a good deal of the errors in the 'Next After' test.
|
||
|
||
Since IEEETEST is written in Turbo Pascal, it was recompiled with the $E+
|
||
switch to enable use of the coprocessor emulator built into the TP 6.0 library.
|
||
Using the emulator, IEEETEST aborted in the following tests with a division
|
||
by zero error: 'Comparison', 'Division', 'Next After'. These tests were removed
|
||
from the suite and the remaining tests were performed. The public domain
|
||
emulator EM87 could be tested, but hung in the last test which checks the
|
||
implementation of the remainder operation. This problem occurred because EM87
|
||
incorrectly identifies itself as a 387-type coprocessor when run on an 80386.
|
||
This causes the 387-specific FUCOM instruction to be used in the 'Comparison'
|
||
and 'Next After' tests and the FPREM1 instruction to be used in the 'Remainder'
|
||
test. Apparently EM87 is not able to emulate these instructions and therefore
|
||
crashes upon trying to execute them. It is interesting to note how the error
|
||
profile of EM87 matches exactly that of the Intel 80287, so it can be assumed
|
||
that EM87 is a very good emulation of the 80287 when run on the 80286. The
|
||
Franke387 V2.4 emulator hangs in the following tests performed by IEEETEST:
|
||
'Division', 'Multiplication', 'Scalb', 'Remainder'. The cause for these
|
||
failures is unknown.
|
||
|
||
|
||
This explanatory text is printed at the start of the IEEETEST program:
|
||
|
||
JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his activities
|
||
as a member of the floating-point working group that defined the IEEE
|
||
754-1985 Standard for Binary Floating-Point Arithmetic. Appendix C of
|
||
his thesis presents FPTEST, a Pascal program written by J Thomas and JT
|
||
Coonen. IEEETEST is a port of FPTEST and runs on PCs whose math
|
||
coprocessor accepts 80387-compatible floating-point instructions.
|
||
|
||
IEEETEST reads test vectors from the file TESTVECS and compares the
|
||
answer returned by the math coprocessor with the answer listed in the
|
||
test vector. If these answers differ an 'F' is displayed, otherwise a
|
||
'.'is displayed. Answers can differ due to two types of failures:
|
||
numeric failures or flag failures. Numeric failures occur when the
|
||
computed answer has the wrong value. Flag failures occur when the status
|
||
(invalid operation, divide by zero, underflow, overflow, inexact) is
|
||
incorrectly identified.
|
||
|
||
TESTVECS is the concatenation of unmodified versions of all the test
|
||
vectors distributed by UC Berkeley. The test data base is copyrighted by
|
||
UC Berkeley (1985) and is being distributed with their permission.
|
||
FPTEST and the test data base can be obtained by asking for 'IEEE-754
|
||
Test Vector' from UC Berkeley, Electrical Engineering and Computer
|
||
Science, Industrial Liaison Program, 479 Corey Hall, Berkeley, CA, 94720
|
||
(415)643-6687.
|
||
|
||
The initial version of this test data base for the proposed IEEE 754
|
||
binary floating-point standard (draft 8.0) was developed for Zilog, Inc.
|
||
and was donated to the floating-point working group for dissemination.
|
||
Errors in or additions to the distributed data base should be reported
|
||
to the agency of distribution, with copies to Zilog, Inc., 1315 Dell
|
||
Avenue, Campbell, CA, 95008.
|
||
|
||
|
||
IEEETEST output for Intel 80387, Intel 387DX (manufactured 91/49), Intel 486,
|
||
C&T 38700 (manufactured 92/19), Cyrix 83D87, Cyrix 387+ (manufactured 92/11),
|
||
and Intel RapidCAD (manufactured 92/05):
|
||
----------------------------------------------------------------------------
|
||
|
||
IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
|
||
| TESTS | numeric TYPE OF FAILURE flag
|
||
Operation Code | Passed Failed | S D E | S D E
|
||
----------------------------------------------------------------------
|
||
Absolute Value A | 216 0 | 0 0 0 | 0 0 0
|
||
Addition + | 3528 0 | 0 0 0 | 0 0 0
|
||
Comparison C | 4320 0 | 0 0 0 | 0 0 0
|
||
Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
|
||
Division / | 4311 0 | 0 0 0 | 0 0 0
|
||
Fraction Part F | 624 0 | 0 0 0 | 0 0 0
|
||
Logb L | 960 0 | 0 0 0 | 0 0 0
|
||
Multiplication * | 3978 0 | 0 0 0 | 0 0 0
|
||
Negation - | 216 0 | 0 0 0 | 0 0 0
|
||
Next After N | 2832 0 | 0 0 0 | 0 0 0
|
||
Round to Integer I | 558 0 | 0 0 0 | 0 0 0
|
||
Scalb S | 948 0 | 0 0 0 | 0 0 0
|
||
Square Root V | 744 0 | 0 0 0 | 0 0 0
|
||
Subtraction - | 3528 0 | 0 0 0 | 0 0 0
|
||
Remainder % | 2984 0 | 0 0 0 | 0 0 0
|
||
Totals | 31235 0 |
|
||
|
||
|
||
IEEETEST output for ULSI 83C87 (manufactured 91/48):
----------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    216      0 |    0    0    0 |    0    0    0
Addition          +  |   3528      0 |    0    0    0 |    0    0    0
Comparison        C  |   4312      8 |    0    0    0 |    0    0    8
Copy Sign         @  |   1488      0 |    0    0    0 |    0    0    0
Division          /  |   4250     61 |    0    0    0 |   28   28    5
Fraction Part     F  |    624      0 |    0    0    0 |    0    0    0
Logb              L  |    960      0 |    0    0    0 |    0    0    0
Multiplication    *  |   3936     42 |    0    0    0 |   19   19    4
Negation          -  |    216      0 |    0    0    0 |    0    0    0
Next After        N  |   2828      4 |    0    0    0 |    0    0    4
Round to Integer  I  |    558      0 |    0    0    0 |    0    0    0
Scalb             S  |    930     18 |    0    0    0 |    6    6    6
Square Root       V  |    744      0 |    0    0    0 |    0    0    0
Subtraction       -  |   3528      0 |    0    0    0 |    0    0    0
Remainder         %  |   2984      0 |    0    0    0 |    0    0    0
Totals               |  31102    133 |
|
||
|
||
|
||
IEEETEST output for ULSI 83S87 (manufactured 92/17)
(data kindly supplied by Bengt Ask, f89ba@efd.lth.se):
------------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    216      0 |    0    0    0 |    0    0    0
Addition          +  |   3528      0 |    0    0    0 |    0    0    0
Comparison        C  |   4320      0 |    0    0    0 |    0    0    0
Copy Sign         @  |   1488      0 |    0    0    0 |    0    0    0
Division          /  |   4296     15 |    0    0    0 |    5    5    5
Fraction Part     F  |    624      0 |    0    0    0 |    0    0    0
Logb              L  |    960      0 |    0    0    0 |    0    0    0
Multiplication    *  |   3966     12 |    0    0    0 |    4    4    4
Negation          -  |    216      0 |    0    0    0 |    0    0    0
Next After        N  |   2828      4 |    0    0    0 |    0    0    4
Round to Integer  I  |    558      0 |    0    0    0 |    0    0    0
Scalb             S  |    930     18 |    0    0    0 |    6    6    6
Square Root       V  |    744      0 |    0    0    0 |    0    0    0
Subtraction       -  |   3528      0 |    0    0    0 |    0    0    0
Remainder         %  |   2984      0 |    0    0    0 |    0    0    0
Totals               |  31102     45 |
|
||
|
||
|
||
IEEETEST output for IIT 3C87 (manufactured 92/20):
--------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    200     16 |    0    0   16 |    0    0    0
Addition          +  |   3336    192 |    0    0  128 |    0    0   96
Comparison        C  |   4224     96 |    0    0   96 |    0    0    0
Copy Sign         @  |   1488      0 |    0    0    0 |    0    0    0
Division          /  |   4159    152 |    0    0  124 |    0    0  116
Fraction Part     F  |    600     24 |    0    0   24 |    0    0   24
Logb              L  |    960      0 |    0    0    0 |    0    0    0
Multiplication    *  |   3702    276 |    0    0  248 |    0    0  100
Negation          -  |    200     16 |    0    0   16 |    0    0    0
Next After        N  |   2248    584 |    0    0  584 |    0    0  168
Round to Integer  I  |    542     16 |    0    0    4 |    0    0   16
Scalb             S  |    874     74 |    5    5   44 |    8    8   20
Square Root       V  |    688     56 |    0    0   56 |    0    0   56
Subtraction       -  |   3336    192 |    0    0  128 |    0    0   96
Remainder         %  |   2844    140 |    0    0  140 |    0    0  116
Totals               |  29401   1834 |
|
||
|
||
|
||
IEEETEST output for Intel 80287 run with a 80386 CPU and Intel 8087:
--------------------------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    216      0 |    0    0    0 |    0    0    0
Addition          +  |   2886    642 |   16   16  112 |  174  174  174
Comparison        C  |   3612    708 |  136  136  136 |  228  228  228
Copy Sign         @  |   1488      0 |    0    0    0 |    0    0    0
Division          /  |   3777    534 |   18   18   37 |  169  169  165
Fraction Part     F  |    552     72 |   24   24   24 |   24   24   24
Logb              L  |    900     60 |   12   12   12 |   20   20   20
Multiplication    *  |   2944   1034 |  105  105  197 |  303  303  231
Negation          -  |    216      0 |    0    0    0 |    0    0    0
Next After        N  |    516   2316 |  168  168  332 |  764  764  764
Round to Integer  I  |    546     12 |    0    0    0 |    4    4    4
Scalb             S  |    663    285 |   45   43   26 |  102   98   46
Square Root       V  |    720     24 |    4    4    4 |    8    8    8
Subtraction       -  |   2886    642 |   16   16  112 |  174  174  174
Remainder         %  |   1490   1494 |  432  432  288 |  342  342  230
Totals               |  23412   7823 |
|
||
|
||
|
||
IEEETEST output for EM87 coprocessor emulator run on an Intel 386 CPU:
----------------------------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    216      0 |    0    0    0 |    0    0    0
Addition          +  |   2886    642 |   16   16  112 |  174  174  174
Comparison        C  |      0   4320 | 1324 1324 1324 | 1332 1332 1332
Copy Sign         @  |   1488      0 |    0    0    0 |    0    0    0
Division          /  |   3777    534 |   18   18   37 |  169  169  165
Fraction Part     F  |    552     72 |   24   24   24 |   24   24   24
Logb              L  |    900     60 |   12   12   12 |   20   20   20
Multiplication    *  |   2944   1034 |  105  105  197 |  303  303  231
Negation          -  |    216      0 |    0    0    0 |    0    0    0
Next After        N  |    348   2484 |  768  768  768 |  504  504  526
Round to Integer  I  |    546     12 |    0    0    0 |    4    4    4
Scalb             S  |    663    285 |   45   43   26 |  102   98   46
Square Root       V  |    720     24 |    4    4    4 |    8    8    8
Subtraction       -  |   2886    642 |   16   16  112 |  174  174  174
Remainder         %  | ######## not run since machine hangs #######
|
||
|
||
|
||
IEEETEST output for Franke387 coprocessor emulator run on an Intel 386:
-----------------------------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    152     64 |    0    0    8 |   24   24    8
Addition          +  |   1587   1941 |  178  178  722 |  508  508  616
Comparison        C  |   3696    624 |  208  208  208 |    4    4  108
Copy Sign         @  |   1200    288 |    0    0    0 |  144  144    0
Division          /  | ######## not run since machine hangs #######
Fraction Part     F  |    624      0 |    0    0    0 |    0    0    0
Logb              L  |    908     52 |    0    0   16 |   16   16    4
Multiplication    *  | ######## not run since machine hangs #######
Negation          -  |    152     64 |    0    0    8 |   24   24    8
Next After        N  |   1404   1420 |  404  404  596 |   80   80  172
Round to Integer  I  |    514     44 |    4    4   20 |    8    8   16
Scalb             S  | ######## not run since machine hangs #######
Square Root       V  |    569    175 |   14   31   54 |   28   48   72
Subtraction       -  |   1827   1701 |   98   98  642 |  452  452  576
Remainder         %  | ######## not run since machine hangs #######
|
||
|
||
|
||
IEEETEST output for Q387 coprocessor emulator run on an Intel 386:
------------------------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    104    112 |   42   38   16 |   24   24    0
Addition          +  |    911   2617 |  746  637  637 |  672  672  380
Comparison        C  |   3180   1140 |  380  380  380 |  108  108  108
Copy Sign         @  |    696    792 |  320  280    0 |  288  288    0
Division          /  |    900   3411 |  673  574  814 |  977  977  821
Fraction Part     F  |    348    276 |  154   82   40 |   24   24   24
Logb              L  |    656    304 |  136  100   36 |   24   24   12
Multiplication    *  |   1023   2955 |  759  663  857 |  670  670  442
Negation          -  |     86    130 |   44   38   32 |   24   24    0
Next After        N  |    464   2368 |  780  780  796 |  344  344  320
Round to Integer  I  |    273    285 |   95   74   52 |   72   72   68
Scalb             S  |    254    694 |  217  192  137 |  176  168  136
Square Root       V  |    128    616 |  192  180  147 |  196  196  188
Subtraction       -  |    911   2617 |  746  637  637 |  672  672  372
Remainder         %  |    558   2426 |  903  859  664 |  508  508  220
Totals               |  10492  20743 |
|
||
|
||
|
||
IEEETEST output for TP 6.0 coprocessor emulator:
------------------------------------------------

IEEE-754 Test Vector Precisions:  S=Single  D=Double  E=Double Extended
                     |     TESTS     |        TYPE OF FAILURE
                     |               |    numeric     |      flag
Operation       Code | Passed Failed |    S    D    E |    S    D    E
----------------------------------------------------------------------
Absolute Value    A  |    168     48 |   16   16   16 |   16    8    0
Addition          +  |   1877   1651 |  294  290  336 |  496  456  416
Comparison        C  | ## not run - program aborts with div-by-0 ##
Copy Sign         @  |   1392     96 |   48   48    0 |   48    0    0
Division          /  | ## not run - program aborts with div-by-0 ##
Fraction Part     F  |    588     36 |   12    0   24 |    0    0    0
Logb              L  |    888     72 |   24   24   24 |   12   12   12
Multiplication    *  |   2148   1830 |  332  310  528 |  520  360  352
Negation          -  |    160     48 |   16   16   16 |   16    8    0
Next After        N  | ## not run - program aborts with div-by-0 ##
Round to Integer  I  |    318    240 |    0    0    4 |   80   80   80
Scalb             S  |    564    384 |  108  100   76 |  112   88   56
Square Root       V  |    180    564 |  143  157  169 |   72   72  128
Subtraction       -  |   1877   1651 |  294  290  336 |  496  456  416
Remainder         %  |   1072   1912 |  652  672  524 |  336  288  216
|
||
|
||
|
||
|
||
|
||
Additional accuracy and compatibility tests
-------------------------------------------

To complement the checks done by IEEETEST, I also wrote the short programs
DENORMTS, RCTRL, and PCTRL in Turbo Pascal 6.0. They test the following
coprocessor functions:

  1. support for denormals in all precisions (single, double, extended)
  2. support for the four IEEE rounding modes (up, down, nearest, chop)
  3. support for precision control
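
As a rough illustration of item 1, here is a short Python sketch. It is not
DENORMTS (which is written in Turbo Pascal), and the helper names to_single
and is_denormal are my own. It checks that the values printed by the SINGLE
and DOUBLE denormal tests really are denormals, i.e. nonzero but smaller than
the smallest normalized number of the format. Python floats are IEEE doubles,
so the extended-precision case cannot be shown this way.

```python
import struct
import sys


def to_single(x: float) -> float:
    """Round a Python float (IEEE double) to single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def is_denormal(x: float, min_normal: float) -> bool:
    """True if x is nonzero but too small to be a normalized number."""
    return 0.0 < abs(x) < min_normal


# Smallest positive normalized numbers of the two formats.
MIN_NORMAL_DOUBLE = sys.float_info.min   # 2**-1022
MIN_NORMAL_SINGLE = 2.0 ** -126

double_denormal = 8.75e-311               # value printed by the DOUBLE test
single_denormal = to_single(4.60943e-41)  # value printed by the SINGLE test

print(is_denormal(double_denormal, MIN_NORMAL_DOUBLE))
print(is_denormal(single_denormal, MIN_NORMAL_SINGLE))
```

On a system without denormal support both values would be flushed to zero
and both checks would fail.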
|
||
|
||
Note that passing all tests is required for IEEE conformance, as well as for
100% compatibility with Intel's coprocessors. Precision control forces the
results of the FADD, FSUB, FMUL, FDIV, and FSQRT instructions to be rounded
to the specified precision (single, double, double extended). This feature is
provided to obtain compatibility with certain programming languages [17]. By
specifying lower precision, one effectively nullifies the advantages of
extended precision intermediate results.
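
The effect of a lower precision setting can be imitated in Python, with the
caveat that Python cannot set the coprocessor's control word. The sketch
below only emulates the idea by rounding a double result a second time to
single precision; real precision control rounds the infinitely precise result
once. The function name round_to_single is my own.

```python
import struct
from fractions import Fraction


def round_to_single(x: float) -> float:
    """Emulate a 24-bit precision setting by rounding a double to single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


exact = Fraction(1, 3)
as_double = 1.0 / 3.0                    # rounded to a 53-bit mantissa
as_single = round_to_single(as_double)   # rounded again to a 24-bit mantissa

# Fraction(float) is exact, so these are the true rounding errors.
err_double = abs(Fraction(as_double) - exact)
err_single = abs(Fraction(as_single) - exact)

print(f"double: {as_double!r}  error = {float(err_double):.3e}")
print(f"single: {as_single!r}  error = {float(err_single):.3e}")
```

The single-precision result carries a much larger error, which is the loss
the paragraph above refers to.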
|
||
|
||
The IEEE-754 standard for floating-point arithmetic demands that processors
and floating-point packages that cannot store the results of operations
*directly* to single and double precision locations must provide precision
control. The programs that test precision control and rounding control are
designed to return a different result for each of the modes for the same
sequence of operations.
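
How one sequence of operations can give a different result in each rounding
mode can be sketched with Python's decimal module. This is decimal, not
binary, arithmetic and it is not the RCTRL program; the function harmonic_sum
and the 6-digit working precision are my own choices for illustration.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

# The four IEEE rounding modes mapped onto decimal's rounding constants.
MODES = {
    "NEAREST": ROUND_HALF_EVEN,
    "DOWN": ROUND_FLOOR,
    "UP": ROUND_CEILING,
    "CHOP": ROUND_DOWN,
}


def harmonic_sum(rounding: str, terms: int = 50, digits: int = 6) -> Decimal:
    """Sum 1/3 + 1/4 + ... with every operation rounded in the given mode."""
    ctx = Context(prec=digits, rounding=rounding)
    total = Decimal(0)
    for k in range(3, 3 + terms):
        total = ctx.add(total, ctx.divide(Decimal(1), Decimal(k)))
    return total


results = {name: harmonic_sum(mode) for name, mode in MODES.items()}
for name, value in results.items():
    print(f"{name:8s} {value}")
```

Because all terms here are positive, CHOP gives the same result as DOWN. With
negative values CHOP follows UP instead, so a test value of the right sign and
size can separate all four modes.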
|
||
|
||
The source code of the programs can be found in appendix A. The Intel 8087
and 80287 were not tested with DENORMTS since Turbo Pascal does not support
extended precision denormals on 8087/80287 processors, so the denormal test
fails anyway. (The 8087 and 287 pass the RCTRL and PCTRL tests without error,
however.)
|
||
|
||
|
||
Test Results for the Intel 387, Intel 387DX, Intel 486, Intel RapidCAD,
Cyrix 83D87, Cyrix 387+, C&T 38700, and the EM87 emulator (on an 80386 system):
-------------------------------------------------------------------------------

Precision Control   SINGLE      1.13311278820037842E+0000
                    DOUBLE      1.23456789006442125E+0000
                    EXTENDED    1.23456789012337585E+0000

Rounding Control    NEAREST    -1.23427629010100635E+0100
                    DOWN       -1.23427623555772409E+0100
                    UP         -1.23457760966801097E+0100
                    CHOP       -1.23397493540770643E+0100

Denormal support

SINGLE denormals supported
SINGLE denormal prints as:      4.60943116855005E-0041
Denormal should be printed as   4.60943...E-0041

DOUBLE denormals supported
DOUBLE denormal prints as:      8.75000000000016E-0311
Denormal should be printed as   8.75...E-0311

EXTENDED denormals supported
EXTENDED denormal prints as:    1.31640625000000E-4934
Denormal should be printed as   1.3164...E-4934
|
||
|
||
|
||
Results for the ULSI 83C87:
---------------------------

Precision Control   SINGLE      1.23456789012337585E+0000
                    DOUBLE      1.23456789012337585E+0000
                    EXTENDED    1.23456789012337585E+0000

Rounding Control    NEAREST    -1.23427629010100635E+0100
                    DOWN       -1.23427623555772409E+0100
                    UP         -1.23457760966801097E+0100
                    CHOP       -1.23397493540770643E+0100

Denormal support

SINGLE denormals supported
SINGLE denormal prints as:      4.60943116855005E-0041
Denormal should be printed as   4.60943...E-0041

DOUBLE denormals supported
DOUBLE denormal prints as:      8.75000000000016E-0311
Denormal should be printed as   8.75...E-0311

EXTENDED denormals supported
EXTENDED denormal prints as:    1.31640625000000E-4934
Denormal should be printed as   1.3164...E-4934
|
||
|
||
|
||
Results for the IIT 3C87:
-------------------------

Precision Control   SINGLE      1.13311278820037842E+0000
                    DOUBLE      1.23456789006442125E+0000
                    EXTENDED    1.23456789012337585E+0000

Rounding Control    NEAREST    -1.23427629010100635E+0100
                    DOWN       -1.23427623555772409E+0100
                    UP         -1.23457760966801097E+0100
                    CHOP       -1.23397493540770643E+0100

Denormal support

SINGLE denormals supported
SINGLE denormal prints as:      4.60943116855005E-0041
Denormal should be printed as   4.60943...E-0041

DOUBLE denormals supported
DOUBLE denormal prints as:      8.75000000000016E-0311
Denormal should be printed as   8.75...E-0311

EXTENDED denormals not supported
|
||
|
||
|
||
Results for the Turbo Pascal 6.0 coprocessor emulator:
------------------------------------------------------

Precision Control   SINGLE      1.23456789012351396E+0000
                    DOUBLE      1.23456789012351396E+0000
                    EXTENDED    1.23456789012351396E+0000

Rounding Control    NEAREST    -1.23457766383395931E+0100
                    DOWN       -1.23457766383395931E+0100
                    UP         -1.23457766383395931E+0100
                    CHOP       -1.23457766383395931E+0100

Denormal support

SINGLE denormals not supported
DOUBLE denormals not supported
EXTENDED denormals not supported
|
||
|
||
|
||
Results for the Q387 coprocessor emulator:
------------------------------------------

Precision Control   SINGLE      1.23456789012337614E+0000
                    DOUBLE      1.23456789012337614E+0000
                    EXTENDED    1.23456789012337614E+0000

Rounding Control    NEAREST    -1.23427621117212139E+0100
                    DOWN       -1.23427621117212139E+0100
                    UP         -1.23427621117212139E+0100
                    CHOP       -1.23427621117212139E+0100

Denormal support

SINGLE denormals not supported
DOUBLE denormals not supported
EXTENDED denormals not supported
|
||
|
||
|
||
The test results show that the IIT 3C87 does not conform to the IEEE-754
floating-point standard in that it does not support denormals in double
extended precision. The ULSI 83C87 does not conform to that standard in that
it does not support precision control, but uses double extended precision for
all operations. The TP 6.0 emulator supports neither precision control nor
rounding control nor denormals of any precision, and the same is true of the
Q387 emulator. In addition, the basic arithmetic operations of both emulators
do not seem to conform to the IEEE standard, as the results of the test
programs differ from every result computed by a coprocessor in any mode.
|
||
|
||
|
||
|
||
================================================
Accuracy of transcendental function calculations
================================================

With regard to the accuracy of transcendental functions, Cyrix claims that
the relative error of the transcendental functions on its 83D87 coprocessor
never exceeds 0.5 ULP of the double extended format [13] (ULP = Unit in the
Last Place, numeric weight of the least significant mantissa bit). This means
that the maximum relative error is below 2**-64, while Intel's published
error limit for the 80387 is 2**-62. While Intel uses a modified CORDIC
algorithm [18,19] to compute the transcendental functions, Cyrix uses
rational approximations that utilize their chip's very fast array multiplier.
(For an explanation of why this approach is superior to CORDIC with today's
technology, see [61].) Also, Cyrix uses an internal 75 bit data path for the
mantissa [15], so intermediate computations in the generation of
transcendental function values will enjoy some additional accuracy over the
64 bits provided by the double extended format. Using 75 mantissa bits also
provides an advantage over other coprocessors like the Intel 387DX and ULSI
83C87, which use only a 68 bit mantissa data path [58,59].
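
The relation between an error in ULPs and a relative error can be worked out
with exact rational arithmetic. The sketch below assumes a 64-bit mantissa m
with 1 <= m < 2, so that one ULP has the weight 2**-63; the function names
are my own, and the conversion of the 2**-62 figure into ULPs is only this
arithmetic, not a statement from Intel's documentation.

```python
from fractions import Fraction

MANTISSA_BITS = 64                             # double extended format
ULP = Fraction(1, 2 ** (MANTISSA_BITS - 1))    # last-bit weight for 1 <= m < 2


def relative_error(ulps: Fraction, mantissa: Fraction) -> Fraction:
    """Relative error that corresponds to an error of `ulps` ULPs."""
    return ulps * ULP / mantissa


def in_ulps(rel: Fraction, mantissa: Fraction) -> Fraction:
    """Error in ULPs that corresponds to a relative error `rel`."""
    return rel * mantissa / ULP


half = Fraction(1, 2)

# The worst case is the smallest mantissa, m = 1: 0.5 ULP is exactly 2**-64.
print(relative_error(half, Fraction(1)) == Fraction(1, 2 ** 64))

# A relative error bound of 2**-62, expressed in ULPs at both ends of the
# mantissa range (m = 2 is the limit, never reached).
print(in_ulps(Fraction(1, 2 ** 62), Fraction(1)))
print(in_ulps(Fraction(1, 2 ** 62), Fraction(2)))
```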
|
||
|
||
Note that a maximum relative error of 0.5 ULP for the Cyrix coprocessor does
not mean that it returns the 'exact' result (the machine number closest to
the infinitely precise result) all the time. Consider the case where the
infinitely precise result of a transcendental function falls nearly halfway
between two machine numbers. A relative error of 0.5 ULP can cause the result
to be either of the numbers after rounding, depending on the direction of the
error. But the 83D87 should deliver results that never differ from the
'exact' result by more than one ULP. Also note that the claim of relative
error being below 0.5 ULPs is slightly exaggerated; 0.6 ULPs would be a more
realistic error limit. Imagine that the infinitely precise result for some
argument to a transcendental was xxx..xxx1001... (where the xxx..xxx
represent the first 64 bits of the result), but that the coprocessor computes
the result as xxx..xxx0111 and then rounds this down to xxx..xxx0000. Then
the relative error is (1001b-0b)/10000b = 0.5625 ULPs, since one ULP is
10000b when measured in units of the last of the four extra bits.
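
The arithmetic of this example can be checked in a few lines of Python. The
variable names are mine; the sketch counts everything in units of the last of
the four extra bits, so that one ULP of the 64-bit result equals 2**4 = 16
units.

```python
from fractions import Fraction

EXTRA_BITS = 4                 # bits shown beyond the 64-bit mantissa
ULP = 1 << EXTRA_BITS          # one ULP, in units of the last extra bit

exact_tail = 0b1001            # infinitely precise result: xxx..xxx1001...
computed_tail = 0b0111         # what the coprocessor computes internally
returned_tail = 0b0000         # computed value after rounding to 64 bits


def round_to_nearest(tail: int) -> int:
    """Round the extra bits away (ties are not handled in this sketch)."""
    return ULP if tail > ULP // 2 else 0


# The internal error is small, but it changes the rounding direction.
internal_error = Fraction(exact_tail - computed_tail, ULP)
error_in_ulps = Fraction(exact_tail - returned_tail, ULP)

print(round_to_nearest(computed_tail) == returned_tail)
print(internal_error, float(internal_error))
print(error_in_ulps, float(error_in_ulps))
```

The internal computation is off by only 0.125 ULP, yet the returned result is
off by 0.5625 ULP because the exact value would have been rounded up.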
|
||
|
||
I tested some of the transcendental functions of the Cyrix 387+ and found the
relative error to be always below 0.6 ULPs. Cyrix also claims that its
transcendental functions satisfy the monotonicity criterion [13]. None of the
competitors makes this claim, which does not mean that the transcendental
functions on the other 387-compatibles are not monotonic, too.
Monotonicity means that for all x1 > x2, it always follows that f(x1) >=
f(x2) for an increasing function like sin on [0..pi/4]. Likewise, for a
decreasing function like cos on [0..pi/4], for all x1 > x2, it follows that
f(x1) <= f(x2).
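
A simple way to look for violations of monotonicity is to evaluate the
function at sorted sample points and compare neighbours. The Python sketch
below does this; the function is_monotonic is my own and tests whatever sine
and cosine the Python math library provides, not a coprocessor instruction.
Sampling can only find violations, it cannot prove that there are none.

```python
import math
from typing import Callable, Sequence


def is_monotonic(f: Callable[[float], float], xs: Sequence[float],
                 increasing: bool = True) -> bool:
    """Check f at sorted sample points; neighbours must not go the wrong way."""
    values = [f(x) for x in sorted(xs)]
    pairs = zip(values, values[1:])
    if increasing:
        return all(lo <= hi for lo, hi in pairs)
    return all(lo >= hi for lo, hi in pairs)


# 1001 equally spaced arguments on [0, pi/4].
samples = [i * (math.pi / 4) / 1000 for i in range(1001)]

print(is_monotonic(math.sin, samples, increasing=True))
print(is_monotonic(math.cos, samples, increasing=False))
```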
|
||
|
||
As previously noted, the Weitek Abacus 3167 and 4167 coprocessors implement
only the basic arithmetic operations (add, subtract, negate, multiply,
divide, square root) in hardware. Transcendental functions are performed via
a software library provided by Weitek. For these library functions Weitek
claims a maximum relative error of 5 ULPs [31,33]. This means that the last
three bits in the mantissa of a double-precision result can be wrong. Note
that the Intel 387 and compatible math coprocessors generate the
transcendental functions with a small relative error with regard to the
*extended double precision* format. Thus, when rounded to double-precision,
their function values are nearly always 'exact'. The problem of 'double
rounding' prevents them from being 'exact' in 100% of all cases. 387 type
coprocessors in general have superior accuracy when compared with Weitek's
coprocessors.
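
The 'double rounding' problem can be shown with exact rational arithmetic in
Python. The value x below is constructed by me for illustration: it lies just
above the midpoint between two neighbouring double-precision numbers.
Rounding it first to 64 bits and then to 53 bits gives a different answer
than rounding it to 53 bits directly. The helper round_to_bits is my own and
assumes round-to-nearest-even and an argument in [1, 2).

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x (assumed in [1, 2)) to a `bits`-bit mantissa, ties to even."""
    scale = 2 ** (bits - 1)
    # round() on a Fraction rounds halfway cases to the even integer.
    return Fraction(round(x * scale), scale)


# Slightly above the midpoint between the doubles 1 and 1 + 2**-52.
x = 1 + Fraction(1, 2 ** 53) + Fraction(1, 2 ** 70)

direct = round_to_bits(x, 53)                      # one rounding
twice = round_to_bits(round_to_bits(x, 64), 53)    # extended, then double

print(direct == 1 + Fraction(1, 2 ** 52))
print(twice == 1)
```

The first rounding to 64 bits discards the small excess over the midpoint, so
the second rounding sees an exact tie and resolves it the other way.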
|
||
|
||
The test diskette distributed with early versions of the Cyrix 83D87
contained a program (TRANCK) that checks the accuracy of the transcendental
functions in the coprocessor against a more precise software arithmetic [16].
I used this program to compare the accuracy of the transcendental functions
on those 287/387/486 coprocessors/FPUs available to me. As TRANCK will not
accept negative numbers as interval limits, I tested each function on an
interval along the positive x-axis. The functions tested were F2XM1 (2**x-1),
FSIN (sine), FCOS (cosine), FPTAN (tangent), FPATAN (arctangent), FYL2X (y *
log2 (x)), and FYL2XP1 (y * log2 (x+1)). These are all the transcendental
functions implemented on the 80387. Note that the square root (FSQRT) is
*not* a transcendental function. For each function, 100,000 arguments were
evaluated, with the arguments uniformly distributed within the interval
tested.

The EM87 emulator could not be checked with TRANCK, since the multiple
precision package in TRANCK would always return with an error message
immediately. However, the Franke387 emulator could be tested.


In the test results below, the following statistics are detailed:

  %wrong       is the percentage of results that differ from the 'exact'
               result (infinitely precise result rounded to 64 bits).
  ULP_hi       is the number of results where the returned result was
               greater than the 'exact' (correctly rounded) result by
               one ULP (the numeric weight of the last mantissa bit,
               2**-63 to 2**-64 depending on the size of the number).
  ULPs_hi      is the number of results where the returned result was
               greater than the 'exact' result by two or more ULPs.
  ULP_lo       is the number of results where the returned result was
               smaller than the 'exact' (correctly rounded) result by
               one ULP (the numeric weight of the last mantissa bit,
               2**-63 to 2**-64 depending on the size of the number).
  ULPs_lo      is the number of results where the returned result was
               smaller than the 'exact' result by two or more ULPs.
  max ULP err  is the maximum deviation of a returned result from the
               'exact' answer expressed in ULPs.
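
How these statistics follow from the individual errors can be sketched in
Python. This is not TRANCK; the function tally and its dictionary keys are my
own. The input is a list of signed errors (returned result minus 'exact'
result) in whole ULPs. The example input is synthetic and merely shaped like
the Intel 80387 SIN row of the tables that follow, assuming that every error
counted under ULPs_lo is exactly 2 ULPs.

```python
from typing import Dict, Iterable


def tally(errors_in_ulps: Iterable[int]) -> Dict[str, float]:
    """Summarize signed errors (returned - exact) given in whole ULPs."""
    stats = {"n": 0, "ULP_hi": 0, "ULPs_hi": 0, "ULP_lo": 0, "ULPs_lo": 0,
             "max_ULP_err": 0}
    for e in errors_in_ulps:
        stats["n"] += 1
        if e == 1:
            stats["ULP_hi"] += 1
        elif e >= 2:
            stats["ULPs_hi"] += 1
        elif e == -1:
            stats["ULP_lo"] += 1
        elif e <= -2:
            stats["ULPs_lo"] += 1
        stats["max_ULP_err"] = max(stats["max_ULP_err"], abs(e))
    wrong = (stats["ULP_hi"] + stats["ULPs_hi"]
             + stats["ULP_lo"] + stats["ULPs_lo"])
    # Assumes at least one trial; an empty input would divide by zero.
    stats["pct_wrong"] = 100 * wrong / stats["n"]
    return stats


# Synthetic data: 2467 results one ULP high, 26392 one ULP low, 13 two low.
errors = [1] * 2467 + [-1] * 26392 + [-2] * 13
errors += [0] * (100_000 - len(errors))
summary = tally(errors)
print(summary)
```

Note that %wrong is simply the sum of the four error counts divided by the
number of trials.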
|
||
|
||
Test results for accuracy of transcendental functions for double extended
precision as returned by the program TRANCK. 100,000 trials per function:

Franke387 V2.4 emulator
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        39.042   25301     708   13029       4        2
COS     0,pi/4        75.714   49827   25887       0       0        3
TAN     0,pi/4        76.976   14230   10029   24323   28394        9
ATAN    0,1           55.826   26028    1529   24044    4225        4
2XM1    0,0.5         96.717       0       0   47910   48807        5
YL2XP1  0,sqrt(2)-1   93.007     578       9   27416   65004        8
YL2X    0.1,10        62.252   16817    4712   37082    3641     2953


Microsoft's coprocessor emulator
(part of MS-C and MS-Fortran libraries)
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4           N/A     N/A     N/A     N/A     N/A      N/A
COS     0,pi/4           N/A     N/A     N/A     N/A     N/A      N/A
TAN     0,pi/4        40.828   27764    1520   11445      99        2
ATAN    0,1           32.307   18893     485   12530     299        2
2XM1    0,0.5         52.163    8585     189   37745    5644        3
YL2XP1  0,sqrt(2)-1   88.801    4714     916   14239   68932       11
YL2X    0.1,10        36.598   13813    3272   13866    5647       11


INTEL 8087, 80287
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4           N/A     N/A     N/A     N/A     N/A      N/A
COS     0,pi/4           N/A     N/A     N/A     N/A     N/A      N/A
TAN     0,pi/4        37.001   18756     524   17405     316        2
ATAN    0,1            9.666    6065       0    3601       0        1
2XM1    0,0.5         19.920       0       0   19920       0        1
YL2XP1  0,sqrt(2)-1    7.780     868       0    6912       0        1
YL2X    0.1,10         1.287     723       0     564       0        1
|
||
|
||
|
||
INTEL 80387
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        28.872    2467       0   26392      13        2
COS     0,pi/4        27.213   27169      35       9       0        2
TAN     0,pi/4        10.532     441       0   10091       0        1
ATAN    0,1            7.088    2386       0    4691       1        2
2XM1    0,0.5         32.024       0       0   32024       0        1
YL2XP1  0,sqrt(2)-1   22.611    3461       0   19150       0        1
YL2X    0.1,10        13.020    6508       0    6512       0        1


INTEL 387DX
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        28.873    2467       0   26393      13        2
COS     0,pi/4        27.121   27090      22       9       0        2
TAN     0,pi/4        10.711     457       0   10254       0        1
ATAN    0,1            7.088    2386       0    4691       1        2
2XM1    0,0.5         32.024       0       0   32024       0        1
YL2XP1  0,sqrt(2)-1   22.611    3461       0   19150       0        1
YL2X    0.1,10        13.020    6508       0    6512       0        1


ULSI 83C87
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        35.530    4989       6   30238     297        2
COS     0,pi/4        43.989   11193     675   31393     728        2
TAN     0,pi/4        48.539   18880    1015   26349    2295        3
ATAN    0,1           20.858      62       0   20796       0        1
2XM1    0,0.5         21.257       4       0   21253       0        1
YL2XP1  0,sqrt(2)-1   27.893    9446       0   18213     234        2
YL2X    0.1,10        13.603    9816       0    3787       0        1


IIT 3C87
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        18.650   11171       0    7479       0        1
COS     0,pi/4         7.700    3024       0    4676       0        1
TAN     0,pi/4        20.973    9681       0   11291       1        2
ATAN    0,1           19.280   13186       0    6094       0        1
2XM1    0,0.5         25.660   17570       0    8090       0        1
YL2XP1  0,sqrt(2)-1   45.830   23503    1896   19654     777        3
YL2X    0.1,10        10.888    5638     357    4845      48        3
|
||
|
||
|
||
C&T 38700DX
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4         1.821    1272       0     549       0        1
COS     0,pi/4        23.358   12458       0   10901       0        1
TAN     0,pi/4        17.178   10725       0    6453       0        1
ATAN    0,1            9.359    7082       0    2277       0        1
2XM1    0,0.5         15.188    3039       0   12149       0        1
YL2XP1  0,sqrt(2)-1   19.497   12109       0    7388       0        1
YL2X    0.1,10        46.868     261       0   46607       0        1


CYRIX 83D87
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4         1.554    1015       0     539       0        1
COS     0,pi/4         0.925     143       0     782       0        1
TAN     0,pi/4         4.147     881       0    3266       0        1
ATAN    0,1            0.656     229       0     427       0        1
2XM1    0,0.5          2.628    1433       0    1194       0        1
YL2XP1  0,sqrt(2)-1    3.242     825       0    2417       0        1
YL2X    0.1,10         0.931     256       0     675       0        1


CYRIX 387+
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4         1.486     864       0     622       0        1
COS     0,pi/4         2.072      12       0    2060       0        1
TAN     0,pi/4         0.602      63       0     539       0        1
ATAN    0,1            0.384      12       0     372       0        1
2XM1    0,0.5          1.985      27       0    1958       0        1
YL2XP1  0,sqrt(2)-1    3.662    1705       0    1957       0        1
YL2X    0.1,10         0.764     367       0     397       0        1


INTEL RapidCAD, Intel 486
                                                                max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        16.991    1517       0   15474       0        1
COS     0,pi/4         9.003    7603       0    1400       0        1
TAN     0,pi/4        10.532     441       0   10091       0        1
ATAN    0,1            7.078    2386       0    4691       1        2
2XM1    0,0.5         32.025       0       0   32025       0        1
YL2XP1  0,sqrt(2)-1   21.800     533       0   21267       0        1
YL2X    0.1,10         3.894    1879       0    2015       0        1
|
||
|
||
|
||
Discussion of the transcendental function tests
-----------------------------------------------

The test results above indicate that none of the 80x87 compatibles exceeds
Intel's stated error bound of 3 ULPs for the transcendental functions.
However, some coprocessors are more accurate than others. Rating the
coprocessors according to the accuracy of their transcendental functions
gives the following list (highest accuracy first): Cyrix 387+, Cyrix 83D87,
Intel 486, Intel RapidCAD, Intel 80287(!), C&T 38700DX, Intel 387DX, Intel
80387, IIT 3C87, ULSI 83C87. The tests also show that the problems with
excessive inaccuracy of the transcendental functions in early versions of the
IIT coprocessors, with errors of up to 8 ULPs [8], have been corrected.
(According to [56], certain problems with the FPATAN instruction on the IIT
3C87 occurring under the UNIX version of AutoCAD were corrected in June,
1990.)

Considering the coprocessor emulators, the Franke387 has acceptable accuracy
for the FSIN, FCOS, and FPATAN instructions, taking into consideration that
according to its documentation, Franke387 uses only 64 bits of precision for
the intermediate results, while coprocessors typically use 68 bits or more.
However, the larger error in the FPTAN, F2XM1, FYL2XP1, and especially the
FYL2X operations shows that the emulator doesn't use state-of-the-art
algorithms, which ensure an error of only a very few ULPs even if no extra
precise intermediate results are available. Microsoft's emulator, meanwhile,
provides transcendental functions with rather good accuracy, except for the
logarithmic operations, which contain some minor flaws. The Q387 emulator,
which came out only recently and is the fastest emulator available, could
unfortunately not be tested since it caused TRANCK to abort with a GP (general
protection) fault for every input that I tried.
|
||
|
||
|
||
|
||
======================================================
Intel 387DX compatibility testing / The SMDIAG program
======================================================

Chips and Technologies has included the program SMDIAG on the V1.0 diagnostic
disk distributed with its SuperMATH 38700DX coprocessor. Its stated purpose
is to test the compatibility of the computational results and flag settings
returned by the C&T coprocessor with the Intel 387DX. However, the tests for
the transcendental functions seem to have been tweaked to let the C&T 38700DX
pass, while coprocessors like the Intel RapidCAD and the Cyrix 83D87 fail.
Also, SMDIAG shows failure in the FSCALE test for the Intel RapidCAD, Cyrix
83D87, Cyrix 387+, and ULSI 83C87, even though they return the correct result
according to Intel's documentation for the Intel 387DX (Intel's second
generation 387), which is indeed returned by the 387DX. (SMDIAG apparently
expects the result returned by the original Intel 80387.)

Note that chip manufacturers often make quiet bug fixes, so it wouldn't be
surprising if somebody else, using different runs of the same manufacturer's
chip, came up with results different from the ones below. The Intel 387 alone
seems to have been produced in four different versions that can be told apart
by software, and Cyrix, ULSI, and IIT have manufactured at least two versions
each of their coprocessors. (The coprocessors I tested have the following
manufacturing dates stamped on them. Intel 387DX: 91/49, C&T 38700DX: 92/19,
Cyrix 387+: 92/11, Intel RapidCAD: 92/05, ULSI 83C87: 91/48, IIT 3C87:
92/20.)
|
||
|
||
Results of running the SMDIAG program on 387-compatible coprocessors
(p = passed, f = failed)

                  Intel   Intel  Intel  Cyrix  Cyrix   IIT   ULSI    C&T
Test            RapidCAD  387DX  80387  387+   83D87  3C87   83C87  38700

 1 (fstore)         f       p      p      p      f      f      f      p   ##,%%
 2 (fiall)          p       p      p      p      p      p      f      p
 3 (faddsub)        p       p      p      p      p      p      p      p
 4 (faddsub_nr)     p       p      p      p      f      f      f      p   %%
 5 (faddsub_cp)     p       p      p      p      f      f      f      p   %%
 6 (faddsub_dn)     p       p      p      p      f      f      f      p   %%
 7 (faddsub_up)     p       p      p      p      f      f      f      p   %%,&&
 8 (fmul)           p       p      p      p      p      f      f      p
 9 (fdivn)          p       p      p      p      p      p      p      p
10 (fdiv)           p       p      p      p      p      p      f      p
11 (fxch)           p       p      p      p      p      p      p      p
12 (fyl2x)          p       p      p      f      f      f      f      p   ++
13 (fyl2xp1)        f       p      p      f      f      f      f      p   ++
14 (fsqrt)          p       p      p      p      p      p      p      p
15 (fsincos)        f       p      p      f      f      f      f      p   ++
16 (fptan)          p       p      p      f      p      f      f      p   ++
17 (fpatan)         p       p      p      f      f      f      f      p   ++
18 (f2xm1)          p       p      p      f      f      f      f      p   ++
19 (fscale)         f       f      p      f      f      f      f      p   **
20 (fcom1)          p       p      p      p      p      f      f      p
21 (fprem)          p       p      p      p      p      p      p      p
22 (misc1)          p       p      p      p      p      f      f      p
23 (misc3)          p       p      p      p      p      p      p      p
24 (misc4)          p       p      p      p      f      f      p      p   %%

failed modules:     4       1      0      7     12     16     17      0
|
||
|
||
|
||
## the failure of the Intel RapidCAD is caused by the fact that
|
||
it stores the value of BCD INDEFINITE differently from the
|
||
Intel 387DX. It uses FFFFC000000000000000, while the 387DX uses
|
||
FFFF8000000000000000. However, both encodings are valid according
|
||
to Intel's documentation, which defines the BCD INDEFINITE as
|
||
FFFFUUUUUUUUUUUUUUUU, where U is undefined. So failure of the
|
||
RapidCAD to deliver the same answer as the 387DX is not an
|
||
"error", just a very slight incompatibility.
|
||
** the FSCALE errors reported for the Intel 387DX, Intel RapidCAD,
|
||
Cyrix 83D87, Cyrix 387+, and ULSI 83C87 are due to a single
|
||
'wrong' result each returned by one of the FSCALE computations.
|
||
SMDIAG expects the result returned by the first generation
|
||
Intel 80387 (and, of course, the C&T 38700DX). However, this
|
||
result is wrong according to Intel's documentation and the
|
||
behavior was corrected in the second generation Intel 387DX.
|
||
Therefore, the Intel RapidCAD, Cyrix 83D87, Cyrix 387+, and ULSI
|
||
83C87 return the correct result compatible with the Intel 387DX.
|
||
%% Failures reported for the Cyrix 83D87 are due to the fact that it
   converts pseudodenormals contained in its registers to normalized
   numbers upon storing them to memory with the FSTP TBYTE PTR
   instruction. Intel's processors store pseudodenormals without
   'normalizing' them. This is an incompatibility, but not an error,
   because both encodings will evaluate to the same value should
   they be reused in a calculation.
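
That both encodings denote the same number can be checked with a small
Python sketch. It assumes the usual 80-bit extended layout (sign bit,
15-bit exponent with bias 16383, 64-bit significand with explicit integer
bit) and the rule that a biased exponent of 0 is interpreted like a biased
exponent of 1. The helper decode_extended is hypothetical and handles
finite values only.

```python
from fractions import Fraction


def decode_extended(sign_exp: int, significand: int) -> Fraction:
    """Exact value of a finite 80-bit extended number.

    sign_exp    -- upper 16 bits (sign bit and 15-bit biased exponent)
    significand -- lower 64 bits (explicit integer bit plus fraction)
    """
    sign = -1 if sign_exp & 0x8000 else 1
    exponent = sign_exp & 0x7FFF
    # Denormals and pseudodenormals (biased exponent 0) use the same
    # scale factor as numbers with a biased exponent of 1.
    effective = max(exponent, 1)
    return sign * Fraction(significand) * Fraction(2) ** (effective - 16383 - 63)


# Pseudodenormal: exponent field 0, but integer bit of the significand set.
pseudodenormal = decode_extended(0x0000, 0x8000000000000001)
# The same significand stored as a normalized number with exponent field 1.
normalized = decode_extended(0x0001, 0x8000000000000001)
print(pseudodenormal == normalized)
```
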
&& Two of the failures reported for the Cyrix 83D87 are actual
   errors where the Cyrix 83D87 fails to deliver the correct result.
   1) control word = 0A7F (closure=proj., round=up, precision=53bit)
      ST(0) = 0001 ABCEF9876542101
      ST(1) = 0001 800000000345FFF
      instruction: FSUBRP ST(1), ST
      result should be: 0000 2BCEF987650EC800, status word = 3A30
      83D87 returns:    0000 3BCEF987650EC000, status word = 3830
   2) control word = 0A7F (closure=proj., round=up, precision=53bit)
      ST(0) = 0001 ABCEF9876542101
      ST(1) = 0001 800000000000000
      instruction: FSUB ST, ST(1)
      result should be: 0000 2BCEF98765432800, status word = 3A30
      83D87 returns:    0000 3BCEF98765432000, status word = 3830
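
The control and status words quoted in these two cases can be taken apart
with a few shifts and masks. The Python sketch below assumes the standard
80x87 layout (precision control in bits 8-9, rounding control in bits
10-11, infinity control in bit 12 of the control word; underflow and
precision flags in bits 4 and 5, C1 in bit 9, stack top in bits 11-13 of
the status word). The function names are invented for this example.

```python
def decode_control_word(cw: int) -> dict:
    """Split an 80x87 control word into the fields quoted in the text."""
    precision = {0: "24 bit", 1: "reserved", 2: "53 bit", 3: "64 bit"}
    rounding = {0: "nearest", 1: "down", 2: "up", 3: "chop"}
    return {
        "precision": precision[(cw >> 8) & 3],
        "rounding": rounding[(cw >> 10) & 3],
        "closure": "affine" if cw & 0x1000 else "projective",
    }


def decode_status_word(sw: int) -> dict:
    """Extract stack top, C1, and the underflow/precision flags."""
    return {
        "top": (sw >> 11) & 7,
        "c1": bool(sw & 0x0200),
        "precision_exception": bool(sw & 0x0020),
        "underflow_exception": bool(sw & 0x0010),
    }


print(decode_control_word(0x0A7F))
print(decode_status_word(0x3A30))  # expected status word
print(decode_status_word(0x3830))  # status word returned by the 83D87
```

Decoded this way, the two status words differ only in bit 9 (C1), which
on the 387 indicates the direction of the last rounding when the
precision flag is set.
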
++ The failures for the test of transcendental functions are caused
   by the tested coprocessor returning results that differ from the
   ones returned by the Intel 387DX. On the Cyrix 83D87, Cyrix 387+,
   and Intel RapidCAD, this is simply due to the improved accuracy
   these coprocessors provide over the Intel 387DX. The failures of
   the IIT 3C87 and ULSI 83C87 are mainly due to the lesser accuracy
   in the transcendental functions of these coprocessors, but for
   the IIT 3C87 an additional source of failures is its inability to
   handle extended-precision denormals.


Another compatibility issue that has been discussed on Usenet is the behavior
of the math coprocessors under protected-mode operating systems. I have seen
postings claiming that coprocessors from ULSI, IIT, and Cyrix locked up the
machine when a protected-mode operating system (several UNIX derivatives were
mentioned) was run on it. However, there have also been reports that some
486-based systems show the same problem, while others do not. Therefore, I
think most of these problems are caused by poor motherboard design,
especially incorrect handling of the error interrupts coming from the
coprocessor. There could also be bugs in the exception handlers of the
operating system.



==========
References
==========

[1]  Schnurer, G.: Zahlenknacker im Vormarsch. c't 1992, Heft 4, Seiten
     170-186

[2]  Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark. Computer Journal,
     Vol. 19, No. 1, 1976, pp. 43-49

[3]  Wichmann, B.A.: Validation code for the Whetstone benchmark. NPL Report
     DITC 107/88, National Physical Laboratory, UK, March 1988

[4]  Curnow, H.J.: Whither Whetstone? The Synthetic Benchmark after 15 Years.
     In: Aad van der Steen (ed.): Evaluating Supercomputers. London: Chapman
     and Hall 1990

[5]  Dongarra, J.J.: The Linpack Benchmark: An Explanation. In: Aad van der
     Steen (ed.): Evaluating Supercomputers. London: Chapman and Hall 1990

[6]  Dongarra, J.J.: Performance of Various Computers Using Standard Linear
     Equations Software. Report CS-89-85, Computer Science Department,
     University of Tennessee, March 11, 1992

[7]  Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test. Design &
     Elektronik 1990, Heft 13, Seiten 105-110

[8]  Ungerer, B.: Sockelfolger. c't 1990, Heft 4, Seiten 162-163

[9]  Coonen, J.T.: Contributions to a Proposed Standard for Binary
     Floating-Point Arithmetic. Ph.D. thesis, University of California,
     Berkeley, 1984

[10] IEEE: IEEE Standard for Binary Floating-Point Arithmetic. SIGPLAN
     Notices, Vol. 22, No. 2, 1985, pp. 9-25

[11] IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std
     754-1985. New York, NY: Institute of Electrical and Electronics
     Engineers 1985

[12] FasMath 83D87 Compatibility Report. Cyrix Corporation, Nov. 1989 Order
     No. B2004

[13] FasMath 83D87 Accuracy Report. Cyrix Corporation, July 1990 Order No.
     B2002

[14] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990 Order No.
     B2004

[15] FasMath 83D87 User's Manual. Cyrix Corporation, June 1990 Order No.
     L2001-003

[16] Brent, R.P.: A FORTRAN multiple-precision arithmetic package. ACM
     Transactions on Mathematical Software, Vol. 4, No. 1, March 1978, pp.
     57-70

[17] 387DX User's Manual, Programmer's Reference. Intel Corporation, 1989
     Order No. 231917-002

[18] Volder, J.E.: The CORDIC Trigonometric Computing Technique. IRE
     Transactions on Electronic Computers, Vol. EC-8, No. 5, September 1959,
     pp. 330-334

[19] Walther, J.S.: A unified algorithm for elementary functions. AFIPS
     Conference Proceedings, Vol. 38, SJCC 1971, pp. 379-385

[20] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der IBM 3090E
     mit Vektoreinrichtung. Arbeitsbericht RRZK-8803, Regionales
     Rechenzentrum an der Universität zu Köln, Februar 1988

[21] McMahon, F.H.: The Livermore Fortran Kernels: A test of the numerical
     performance range. Technical Report UCRL-53745, Lawrence Livermore
     National Laboratory, USA, December 1986

[22] Nave, R.: Implementation of Transcendental Functions on a Numerics
     Processor. Microprocessing and Microprogramming, Vol. 11, No. 3-4,
     March-April 1983, pp. 221-225

[23] Yuen, A.K.: Intel's Floating-Point Processors. Electro/88 Conference
     Record, Boston, MA, USA, 10-12 May 1988, pp. 48/5-1 - 48/5-7

[24] Stiller, A.; Ungerer, B.: Ausgerechnet. c't 1990, Heft 1, Seiten 90-92

[25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
     1991, Seiten 214-237

[26] Intel 80286 Hardware Reference Manual. Intel Corporation, 1987 Order
     No. 210760-002

[27] AMD 80C287 80-bit CMOS Numeric Processor. Advanced Micro Devices, June
     1989 Order No. 11671B/0

[28] Intel RapidCAD(tm) Engineering CoProcessor Performance Brief. Intel
     Corporation, 1992

[29] i486(tm) Microprocessor Performance Report. Intel Corporation, April
     1990 Order No. 240734-001

[30] Intel486(tm) DX2 Microprocessor Performance Brief. Intel Corporation,
     March 1992 Order No. 241254-001

[31] Abacus 3167 Floating-Point Coprocessor Data Book. Weitek Corporation,
     July 1990 DOC No. 9030

[32] WTL 4167 Floating-Point Coprocessor Data Book. Weitek Corporation, July
     1989 DOC No. 8943

[33] Abacus Software Designer's Guide. Weitek Corporation, September 1989 DOC
     No. 8967

[34] Stiller, A.: Cache & Carry. c't 1992, Heft 6, Seiten 118-130

[35] Stiller, A.: Cache & Carry, Teil 2. c't 1992, Heft 7, Seiten 28-34

[36] Palmer, J.F.; Morse, S.P.: Die mathematischen Grundlagen der
     Numerik-Prozessoren 8087/80287. München: tewi 1985

[37] 80C187 80-bit Math Coprocessor Data Sheet. Intel Corporation, September
     1989 Order No. 270640-003

[38] IIT-2C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990

[39] Engineering note 4x4 matrix multiply transformation. IIT, 1989

[40] Tscheuschner, E.: 4 mal 4 auf einen Streich. c't 1990, Heft 3, Seiten
     266-276

[41] Goldberg, D.: Computer Arithmetic. In: Hennessy, J.L.; Patterson, D.A.:
     Computer Architecture A Quantitative Approach. San Mateo, CA: Morgan
     Kaufmann 1990

[42] 8087 Math Coprocessor Data Sheet. Intel Corporation, October 1989, Order
     No. 205835-007

[43] 8086/8088 User's Manual, Programmer's and Hardware Reference. Intel
     Corporation, 1989 Order No. 240487-001

[44] 80286 and 80287 Programmer's Reference Manual. Intel Corporation, 1987
     Order No. 210498-005

[45] 80287XL/XLT CHMOS III Math Coprocessor Data Sheet. Intel Corporation,
     May 1990 Order No. 290376-001

[46] Cyrix FasMath(tm) 82S87 Coprocessor Data Sheet. Cyrix Corporation, 1991
     Document 94018-00 Rev. 1.0

[47] IIT-3C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990

[48] 486(tm)SX(tm) Microprocessor/ 487(tm)SX(tm) Math CoProcessor Data Sheet.
     Intel Corporation, April 1991. Order No. 240950-001

[49] Schnurer, G.: Die große Verlade. c't 1991, Heft 7, Seiten 55-57

[50] Schnurer, G.: Eine 4 für alle. c't 1991, Heft 6, Seite 25

[51] Intel486(tm)DX Microprocessor Data Book. Intel Corporation, June 1991
     Order No. 240440-004

[52] i486(tm) Microprocessor Hardware Reference Manual. Intel Corporation,
     1990 Order No. 240552-001

[53] i486(tm) Microprocessor Programmer's Reference Manual. Intel
     Corporation, 1990 Order No. 240486-001

[54] Ungerer, B.: Kalte Hüte. c't 1992, Heft 8, Seiten 140-144

[55] Ungerer, B.: Heiße Sache. c't 1991, Heft 4, Seiten 104-108

[56] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
     1991, Seiten 214-237

[57] Niederkrüger, W.: Lebendige Vergangenheit. c't 1990, Heft 12, Seiten
     114-116

[58] ULSI Math*Co Advanced Math Coprocessor Technical Specification. ULSI
     System, 5/92, Rev. E

[59] 387(tm)DX Math CoProcessor Data Sheet. Intel Corporation, September
     1990. Order No. 240448-003

[60] 387(tm) Numerics Coprocessor Extension Data Sheet. Intel Corporation,
     February 1989. Order No. 231920-005

[61] Koren, I.; Zinaty, O.: Evaluating Elementary Functions in a Numerical
     Coprocessor Based on Rational Approximations. IEEE Transactions on
     Computers, Vol. C-39, No. 8, August 1990, pp. 1030-1037

[62] 387(tm) SX Math CoProcessor Data Sheet. Intel Corporation, November 1989
     Order No. 240225-005

[63] Frenkel, G.: Coprocessors Speed Numeric Operations. PC-Week, August 27,
     1990

[64] Schnurer, G.; Stiller, A.: Auto-Matt. c't 1991, Heft 10, Seiten 94-96

[65] Grehan, R.: FPU Face-Off. Byte, November 1990, pp. 194-200

[66] Tang, P.T.P.: Testing Computer Arithmetic by Elementary Number Theory.
     Preprint MCS-P84-0889, Mathematics and Computer Science Division,
     Argonne National Laboratory, August 1989

[67] Ferguson, W.E.: Selecting math coprocessors. IEEE Spectrum, July 1991,
     pp. 38-41

[68] Schnabel, J.: Viermal 387. Computer Persönlich 1991, Heft 22, Seiten
     153-156

[69] Hofmann, J.: Starke Rechenknechte. mc 1990, Heft 7, Seiten 64-67

[70] Woerrlein, H.; Hinnenberg, R.: Die Lust an der Power. Computer Live
     1991, Heft 10, Seiten 138-149

[71] email from Peter Forsberg (peterf@vnet.ibm.com), email from Alan Brown
     (abrown@Reston.ICL.COM)

[72] email from Eric Johnson (johnsone%camax01@uunet.UU.NET), email from
     Jerry Whelan (guru@stasi.bradley.edu), email from Arto Viitanen
     (av@cs.uta.fi), email from Richard Krehbiel (richk@grebyn.com)

[73] email from Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM)

[74] correspondence with Bengt Ask (f89ba@efd.lth.se)

[75] email from Thomas Hoberg (tmh@prosun.first.gmd.de)

[76] Microsoft Macro Assembler Programmer's Guide Version 6.0, Microsoft
     Corporation, 1991. Document No. LN06556-0291

[77] FasMath EMC87 User's Manual, Rev. 2. Cyrix Corporation, February 1991
     Order No. 90018-00

[78] Persson, C.: Die 32-Bit-Parade. c't 1992, Heft 9, Seiten 150-156

[79] email from Duncan Murdoch (dmurdoch@mast.QueensU.CA)



========================
Manufacturers' addresses
========================

Intel Corporation
3065 Bowers Avenue
Santa Clara, CA 95051
USA

IIT Integrated Information Technology, Inc.
2540 Mission College Blvd.
Santa Clara, CA 95054
USA

ULSI Systems, Inc.
58 Daggett Drive
San Jose, CA 95134
USA

Chips & Technologies, Inc.
3050 Zanker Road
San Jose, CA 95134
USA

Weitek Corporation
1060 East Arques Avenue
Sunnyvale, CA 94086
USA

AMD Advanced Micro Devices, Inc.
901 Thompson Place
P.O.B. 3453
Sunnyvale, CA 94088-3453
USA

Cyrix Corporation
P.O.B. 850118
Richardson, TX 75085
USA



===============================
Appendix A: Test program source
===============================

{$N+,E+}
PROGRAM PCtrl;

VAR B,c: EXTENDED;
    Precision, L: WORD;

PROCEDURE SetPrecisionControl (Precision: WORD);
(* This procedure sets the internal precision of the NDP. Available *)
(* precision values: 0 - 24 bits (SINGLE)                           *)
(*                   1 - n.a. (mapped to single)                    *)
(*                   2 - 53 bits (DOUBLE)                           *)
(*                   3 - 64 bits (EXTENDED)                         *)

VAR CtrlWord: WORD;

BEGIN {SetPrecisionCtrl}
   IF Precision = 1 THEN
      Precision := 0;
   Precision := Precision SHL 8; { make mask for PC field in ctrl word }
   ASM
      FSTCW   [CtrlWord]      { store NDP control word }
      MOV     AX, [CtrlWord]  { load control word into CPU }
      AND     AX, 0FCFFh      { mask out precision control field }
      OR      AX, [Precision] { set desired precision in PC field }
      MOV     [CtrlWord], AX  { store new control word }
      FLDCW   [CtrlWord]      { set new precision control in NDP }
   END;
END; {SetPrecisionCtrl}

BEGIN {main}
   FOR Precision := 1 TO 3 DO BEGIN
      B := 1.2345678901234567890;
      SetPrecisionControl (Precision);
      FOR L := 1 TO 20 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 20 DO BEGIN
         B := B*B;
      END;
      SetPrecisionControl (3); { full precision for printout }
      WriteLn (Precision, B:28);
   END;
END.
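
For readers without a DOS compiler at hand, the effect PCtrl demonstrates
can be approximated in Python. This is only a sketch under assumptions:
single precision is emulated by rounding every intermediate result through
the struct module, double precision is Python's native float, and the
64-bit extended case cannot be reproduced. The names to_single and
sqrt_square_roundtrip are invented for this example.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sqrt_square_roundtrip(start: float, steps: int, rnd) -> float:
    """Take `steps` square roots, then square `steps` times.

    Every intermediate result is passed through `rnd`, which stands in
    for the precision control setting of the coprocessor.
    """
    b = rnd(start)
    for _ in range(steps):
        b = rnd(math.sqrt(b))
    for _ in range(steps):
        b = rnd(b * b)
    return b


start = 1.2345678901234567890
as_double = sqrt_square_roundtrip(start, 20, float)
as_single = sqrt_square_roundtrip(start, 20, to_single)
print("53 bits:", as_double, " error:", abs(as_double - start))
print("24 bits:", as_single, " error:", abs(as_single - start))
```

With 24 bits the repeated square roots leave only a couple of significant
bits above 1.0, so the value recovered by squaring is far from the
starting value, while with 53 bits most digits survive.
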


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}
PROGRAM RCtrl;

VAR B,c: EXTENDED;
    RoundingMode, L: WORD;


PROCEDURE SetRoundingMode (RCMode: WORD);
(* This procedure selects one of four available rounding modes *)
(* 0 - Round to nearest (default)                              *)
(* 1 - Round down (towards negative infinity)                  *)
(* 2 - Round up (towards positive infinity)                    *)
(* 3 - Chop (truncate, round towards zero)                     *)

VAR CtrlWord: WORD;

BEGIN
   RCMode := RCMode SHL 10; { make mask for RC field in control word }
   ASM
      FSTCW   [CtrlWord]      { store NDP control word }
      MOV     AX, [CtrlWord]  { load control word into CPU }
      AND     AX, 0F3FFh      { mask out rounding control field }
      OR      AX, [RCMode]    { set desired rounding mode in RC field }
      MOV     [CtrlWord], AX  { store new control word }
      FLDCW   [CtrlWord]      { set new rounding control in NDP }
   END;
END;

BEGIN
   FOR RoundingMode := 0 TO 3 DO BEGIN
      B := 1.2345678901234567890e100;
      SetRoundingMode (RoundingMode);
      FOR L := 1 TO 51 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 51 DO BEGIN
         B := -B*B;
      END;
      SetRoundingMode (0); { round to nearest for printout }
      WriteLn (RoundingMode, B:28);
   END;
END.
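
The read-modify-write sequence used by SetPrecisionControl and
SetRoundingMode is plain bit-field insertion. A Python sketch of the same
arithmetic follows; the function names with_precision and with_rounding
are invented here, and 037Fh is assumed as the control word in effect
after FINIT on a 387.

```python
PC_MASK = 0xFCFF  # clears bits 8-9, the precision control (PC) field
RC_MASK = 0xF3FF  # clears bits 10-11, the rounding control (RC) field


def with_precision(control_word: int, precision: int) -> int:
    """Return the control word with the PC field replaced."""
    if precision == 1:          # reserved encoding, mapped to single
        precision = 0
    return (control_word & PC_MASK) | (precision << 8)


def with_rounding(control_word: int, rounding_mode: int) -> int:
    """Return the control word with the RC field replaced."""
    return (control_word & RC_MASK) | (rounding_mode << 10)


default_cw = 0x037F             # assumed state after FINIT
cw = with_rounding(default_cw, 2)   # round up
cw = with_precision(cw, 2)          # 53-bit precision
print(f"{cw:04X}")
```

Starting from 037Fh, selecting round-up and 53-bit precision yields 0A7F,
the control word listed for the two Cyrix 83D87 failures in the SMDIAG
table notes.
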


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM DenormTs;

VAR E: EXTENDED;
    D: DOUBLE;
    S: SINGLE;

BEGIN
   WriteLn ('Testing support and printing of denormals');
   WriteLn;
   Write ('Coprocessor is: ');
   CASE Test8087 OF
      0: WriteLn ('Emulator');
      1: WriteLn ('8087 or compatible');
      2: WriteLn ('80287 or compatible');
      3: WriteLn ('80387 or compatible');
   END;
   WriteLn;
   S := 1.18e-38;
   S := S * 3.90625e-3;
   IF S = 0 THEN
      WriteLn ('SINGLE denormals not supported')
   ELSE BEGIN
      WriteLn ('SINGLE denormals supported');
      WriteLn ('SINGLE denormal prints as: ', S);
      WriteLn ('Denormal should be printed as 4.60943...E-0041');
   END;
   WriteLn;
   D := 2.24e-308;
   D := D * 3.90625e-3;
   IF D = 0 THEN
      WriteLn ('DOUBLE denormals not supported')
   ELSE BEGIN
      WriteLn ('DOUBLE denormals supported');
      WriteLn ('DOUBLE denormal prints as: ', D);
      WriteLn ('Denormal should be printed as 8.75...E-0311');
   END;
   WriteLn;
   E := 3.37e-4932;
   E := E * 3.90625e-3;
   IF E = 0 THEN
      WriteLn ('EXTENDED denormals not supported')
   ELSE BEGIN
      WriteLn ('EXTENDED denormals supported');
      WriteLn ('EXTENDED denormal prints as: ', E);
      WriteLn ('Denormal should be printed as 1.3164...E-4934');
   END;
END.
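
The expected SINGLE and DOUBLE values printed by DenormTs can be
cross-checked on any machine with IEEE arithmetic. The Python sketch below
assumes that Python floats are IEEE doubles with gradual underflow and
emulates the SINGLE case by rounding through the struct module; the
EXTENDED case has no counterpart in standard Python and is left out. The
helper to_single is an invention of this example.

```python
import struct
import sys


def to_single(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


SCALE = 3.90625e-3                      # exactly 2**-8

# SINGLE: start just above the smallest normalized single, scale down.
s = to_single(to_single(1.18e-38) * SCALE)
# DOUBLE: start just above the smallest normalized double, scale down.
d = 2.24e-308 * SCALE

print("SINGLE denormal:", s)
print("DOUBLE denormal:", d)
```
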



====================================
Appendix B: Benchmark program source
====================================


; FILE: APFELM4.ASM
; assemble with MASM /e APFELM4 or TASM /e APFELM4


CODE       SEGMENT BYTE PUBLIC 'CODE'
           ASSUME  CS: CODE

           PAGE    ,120

           PUBLIC  APPLE87;

APPLE87    PROC    NEAR
           PUSH    BP                    ; save caller's base pointer
           MOV     BP, SP                ; make new frame pointer
           PUSH    DS                    ; save caller's data segment
           PUSH    SI                    ; save register
           PUSH    DI                    ; variables
           LDS     BX, [BP+04]           ; pointer to parameter record
           FINIT                         ; init 80x87           FSP->R0
           FILD    WORD PTR [BX+02]      ; maxrad               FSP->R7
           FLD     QWORD PTR [BX+08]     ; qmax                 FSP->R6
           FSUB    QWORD PTR [BX+16]     ; qmax-qmin            FSP->R6
           DEC     WORD PTR [BX+04]      ; ymax-1
           FIDIV   WORD PTR [BX+04]      ; (qmax-qmin)/(ymax-1) FSP->R6
           FSTP    QWORD PTR [BX+16]     ; save delta_q         FSP->R7
           FLD     QWORD PTR [BX+24]     ; pmax                 FSP->R6
           FSUB    QWORD PTR [BX+32]     ; pmax-pmin            FSP->R6
           DEC     WORD PTR [BX+06]      ; xmax-1
           FIDIV   WORD PTR [BX+06]      ; delta_p              FSP->R6
           MOV     AX, [BX]              ; save maxiter,[BX] needed for
           MOV     [BX+2], AX            ; 80x87 status now
           XOR     BP, BP                ; y=0
           FLD     QWORD PTR [BX+08]     ; qmax                 FSP->R5
           CMP     WORD PTR [BX+40], 0   ; fast mode on 8087 desired ?
           JE      yloop                 ; no, normal mode
           FSTCW   [BX]                  ; save NDP control word
           AND     WORD PTR [BX], 0FCFFh ; set PCTRL = single-precision
           FLDCW   [BX]                  ; get back NDP control word
yloop:     XOR     DI, DI                ; x=0
           FLD     QWORD PTR [BX+32]     ; pmin                 FSP->R4
xloop:     FLDZ                          ; j**2= 0              FSP->R3
           FLDZ                          ; 2ij = 0              FSP->R2
           FLDZ                          ; i**2= 0              FSP->R1
           MOV     CX, [BX+2]            ; maxiter
           MOV     DL, 41h               ; mask for C0 and C3 cond.bits
iteration: FSUB    ST, ST(2)             ; i**2-j**2            FSP->R1
           FADD    ST, ST(3)             ; i**2-j**2+p = i      FSP->R1
           FLD     ST(0)                 ; duplicate i          FSP->R0
           FMUL    ST(1), ST             ; i**2                 FSP->R0
           FADD    ST, ST(0)             ; 2i                   FSP->R0
           FXCH    ST(2)                 ; 2*i*j                FSP->R0
           FADD    ST, ST(5)             ; 2*i*j+q = j          FSP->R0
           FMUL    ST(2), ST             ; 2*i*j                FSP->R0
           FMUL    ST, ST(0)             ; j**2                 FSP->R0
           FST     ST(3)                 ; save j**2            FSP->R0
           FADD    ST, ST(1)             ; i**2+j**2            FSP->R0
           FCOMP   ST(7)                 ; i**2+j**2 > maxrad?  FSP->R1
           FSTSW   [BX]                  ; save 80x87 cond.code FSP->R1
           TEST    BYTE PTR [BX+1], DL   ; test carry and zero flags
           LOOPNZ  iteration             ; until maxiter if not diverg.
           MOV     DX, CX                ; number of loops executed
           NEG     CX                    ; carry set if CX <> 0
           ADC     DX, 0                 ; adjust DX if no. of loops<>0

           ; plot point here (DI = X, BP = y, DX has the color)

           FSTP    ST(0)                 ; pop i**2             FSP->R2
           FSTP    ST(0)                 ; pop 2ij              FSP->R3
           FSTP    ST(0)                 ; pop j**2             FSP->R4
           FADD    ST, ST(2)             ; p=p+delta_p          FSP->R4
           INC     DI                    ; x:=x+1
           CMP     DI, [BX+6]            ; x > xmax ?
           JBE     xloop                 ; no, continue on same line
           FSTP    ST(0)                 ; pop p                FSP->R5
           FSUB    QWORD PTR [BX+16]     ; q=q-delta_q          FSP->R5
           INC     BP                    ; y:=y+1
           CMP     BP, [BX+4]            ; y > ymax ?
           JBE     yloop                 ; no, picture not done yet

groesser:  POP     DI                    ; restore
           POP     SI                    ; register variables
           POP     DS                    ; restore caller's data segm.
           POP     BP                    ; restore caller's base pointer
           RET     4                     ; pop parameters and return
APPLE87    ENDP

CODE       ENDS

           END
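
For readers who do not want to trace the FPU stack by hand, the inner loop
of APPLE87 corresponds roughly to the following Python sketch. It is a
reading of the assembly above, not a replacement for it: the function name
apple87_iterations is made up, and it returns the number of iterations
executed, whereas the assembly derives the color in DX from the remaining
loop count.

```python
def apple87_iterations(p: float, q: float, max_iter: int, max_rad: float) -> int:
    """Iterate i,j for the point (p, q) as the 'iteration' loop does.

    The three running values mirror the FPU stack contents:
    i2 = i**2, two_ij = 2*i*j, j2 = j**2.
    """
    i2 = two_ij = j2 = 0.0
    for n in range(1, max_iter + 1):
        i = i2 - j2 + p          # FSUB / FADD: i**2 - j**2 + p
        j = two_ij + q           # FXCH / FADD: 2*i*j + q
        i2 = i * i               # FMUL ST(1), ST
        two_ij = (i + i) * j     # FADD ST, ST(0) followed by FMUL ST(2), ST
        j2 = j * j               # FMUL ST, ST(0)
        if i2 + j2 > max_rad:    # FCOMP against maxrad: diverged
            return n
    return max_iter


print(apple87_iterations(0.0, 0.0, 50, 30.0))   # point inside the set
print(apple87_iterations(2.0, 2.0, 50, 30.0))   # point that diverges at once
```
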

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

UNIT Time;

INTERFACE

FUNCTION Clock: LONGINT; { same as VMS; time in milliseconds }


IMPLEMENTATION

FUNCTION Clock: LONGINT; ASSEMBLER;
ASM
           PUSH    DS              { save caller's data segment }
           XOR     DX, DX          { initialize data segment to }
           MOV     DS, DX          { access ticker counter }
           MOV     BX, 46Ch        { offset of ticker counter in segm. }
           MOV     DX, 43h         { timer chip control port }
           MOV     AL, 4           { freeze timer 0 }
           PUSHF                   { save caller's int flag setting }
           STI                     { allow update of ticker counter }
           LES     DI, DS:[BX]     { read BIOS ticker counter }
           OUT     DX, AL          { latch timer 0 }
           LDS     SI, DS:[BX]     { read BIOS ticker counter }
           IN      AL, 40h         { read latched timer 0 lo-byte }
           MOV     AH, AL          { save lo-byte }
           IN      AL, 40h         { read latched timer 0 hi-byte }
           POPF                    { restore caller's int flag }
           XCHG    AL, AH          { correct order of hi and lo }
           MOV     CX, ES          { ticker counter 1 in CX:DI:AX }
           CMP     DI, SI          { ticker counter updated ? }
           JE      @no_update      { no }
           OR      AX, AX          { update before timer freeze ? }
           JNS     @no_update      { no }
           MOV     DI, SI          { use second }
           MOV     CX, DS          { ticker counter }
@no_update:NOT     AX              { counter counts down }
           MOV     BX, 36EDh       { load multiplier }
           MUL     BX              { W1 * M }
           MOV     SI, DX          { save W1 * M (hi) }
           MOV     AX, BX          { get M }
           MUL     DI              { W2 * M }
           XCHG    BX, AX          { AX = M, BX = W2 * M (lo) }
           MOV     DI, DX          { DI = W2 * M (hi) }
           ADD     BX, SI          { accumulate }
           ADC     DI, 0           { result }
           XOR     SI, SI          { load zero }
           MUL     CX              { W3 * M }
           ADD     AX, DI          { accumulate }
           ADC     DX, SI          { result in DX:AX:BX }
           MOV     DH, DL          { move result }
           MOV     DL, AH          { from DL:AX:BX }
           MOV     AH, AL          { to }
           MOV     AL, BH          { DX:AX:BH }
           MOV     DI, DX          { save result }
           MOV     CX, AX          { in DI:CX }
           MOV     AX, 25110       { calculate correction }
           MUL     DX              { factor }
           SUB     CX, DX          { subtract correction }
           SBB     DI, SI          { factor }
           XCHG    AX, CX          { result back }
           MOV     DX, DI          { to DX:AX }
           POP     DS              { restore caller's data segment }
END;


BEGIN
   Port [$43] := $34; { need rate generator, not square wave }
   Port [$40] := 0;   { generator as prog. by some BIOSes }
   Port [$40] := 0;   { for timer 0 }
END. { Time }
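
The fixed-point arithmetic at the end of Clock is easier to follow when
written out with ordinary integers. The Python sketch below is my reading
of the assembly, under the assumption that timer 0 is clocked at the
standard PC rate of 1193182 Hz: the 48-bit count built from the BIOS
ticker and the inverted timer value is multiplied by 36EDh, shifted right
by 24 bits, and reduced by a small correction term, which approximates a
division by 1193.182 to obtain milliseconds. The name clock_ms and its
parameters are invented for this example.

```python
PIT_HZ = 1193182        # assumed input clock of timer 0 in Hz


def clock_ms(ticks: int, timer: int) -> int:
    """Convert a BIOS tick count and a latched timer 0 value to ms.

    ticks -- 32-bit BIOS ticker counter (one tick = 65536 timer cycles)
    timer -- 16-bit latched value of timer 0, which counts down
    """
    count = (ticks << 16) | (~timer & 0xFFFF)           # CX:DI:AX
    scaled = ((count * 0x36ED) >> 24) & 0xFFFFFFFF      # DI:CX
    correction = ((scaled >> 16) * 25110) >> 16         # hi word of MUL DX
    return scaled - correction


# 10 seconds and 1 hour worth of timer cycles, split into ticks and timer.
for seconds in (10, 3600):
    cycles = PIT_HZ * seconds
    ticks, low = cycles >> 16, cycles & 0xFFFF
    print(seconds, "s ->", clock_ms(ticks, 0xFFFF - low), "ms")
```

The multiplier alone is slightly too large (14061/2**24 versus
1/1193.182); the correction term removes most of that excess, which keeps
the accumulated error to a few milliseconds per hour in this sketch.
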


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$A+,B-,R-,I-,V-,N+,E+}
PROGRAM PeakFlop;

USES Time;

TYPE ParamRec = RECORD
                   MaxIter, MaxRad, YMax, XMax: WORD;
                   Qmax, Qmin, Pmax, Pmin: DOUBLE;
                   FastMod: WORD;
                   PlotFkt: POINTER;
                   FLOPS:LONGINT;
                END;

VAR Param: ParamRec;
    Start: LONGINT;


{$L APFELM4.OBJ}

PROCEDURE Apple87 (VAR Param: ParamRec); EXTERNAL;


BEGIN
   WITH Param DO BEGIN
      MaxIter:= 50;
      MaxRad := 30;
      YMax   := 30;
      XMax   := 30;
      Pmin   :=-2.1;
      Pmax   := 1.1;
      Qmin   :=-1.2;
      Qmax   := 1.2;
      FastMod:= Word (FALSE);
      PlotFkt:= NIL;
      Flops  := 0;
   END;
   Start := Clock;
   Apple87 (Param);        { executes 104002 FLOP }
   Start := Clock - Start; { elapsed time in milliseconds }
   WriteLn ('Peak-MFLOPS: ', 104.002 / Start);
END.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

; FILE: M4X4.ASM
;
; assemble with TASM /e M4X4 or MASM /e M4X4

CODE       SEGMENT BYTE PUBLIC 'CODE'

           ASSUME  CS:CODE

           PUBLIC  MUL_4x4
           PUBLIC  IIT_MUL_4x4


FSBP0      EQU     DB 0DBh, 0E8h         ; declare special IIT
FSBP1      EQU     DB 0DBh, 0EBh         ; instructions
FSBP2      EQU     DB 0DBh, 0EAh
F4X4       EQU     DB 0DBh, 0F1h


;---------------------------------------------------------------------
;
; MUL_4x4 multiplies a four-by-four matrix by an array of
; four-dimensional vectors. This operation is needed for 3D
; transformations in graphics data processing. There are arrays for
; each component of a vector. Thus there is an array containing all
; the x components, another containing all the y components and so on.
; Each component is an 8 byte IEEE floating-point number. Two indices
; into the array of vectors are given. The first is the index of the
; vector that will be processed first, the second is the index of the
; vector processed last.
;
;---------------------------------------------------------------------

MUL_4x4    PROC    NEAR

AddrX      EQU     DWORD PTR [BP+24]     ; address of X component array
AddrY      EQU     DWORD PTR [BP+20]     ; address of Y component array
AddrZ      EQU     DWORD PTR [BP+16]     ; address of Z component array
AddrW      EQU     DWORD PTR [BP+12]     ; address of W component array
AddrT      EQU     DWORD PTR [BP+8]      ; addr. of 4x4 transform. mat.
F          EQU     WORD PTR [BP+6]       ; first vector to process
K          EQU     WORD PTR [BP+4]       ; last vector to process
RetAddr    EQU     WORD PTR [BP+2]       ; return address saved by call
SavdBP     EQU     WORD PTR [BP+0]       ; saved frame pointer
SavdDS     EQU     WORD PTR [BP-2]       ; caller's data segment

           PUSH    BP                    ; save TURBO-Pascal frame ptr
           MOV     BP, SP                ; new frame pointer
           PUSH    DS                    ; save TURBO-Pascal data segmnt

           MOV     CX, K                 ; final index
           SUB     CX, F                 ; final index - start index
           JNC     $ok                   ; must not
           JMP     $nothing              ; be negative
$ok:       INC     CX                    ; number of elements

           MOV     SI, F                 ; init offset into arrays
           SHL     SI, 1                 ; each
           SHL     SI, 1                 ; element
           SHL     SI, 1                 ; has 8 bytes

           LDS     DI, AddrT             ; addr. of transformation mat.
           FLD     QWORD PTR [DI]        ; load a[0,0] = R7
           FLD     QWORD PTR [DI+8]      ; load a[0,1] = R6

$mat_mul:  LES     BX, AddrX             ; addr. of x component array
           FLD     QWORD PTR ES:[BX+SI]  ; load x[a] = R5
           LES     BX, AddrY             ; addr. of y component array
           FLD     QWORD PTR ES:[BX+SI]  ; load y[a] = R4
           LES     BX, AddrZ             ; addr. of z component array
           FLD     QWORD PTR ES:[BX+SI]  ; load z[a] = R3
           LES     BX, AddrW             ; addr. of w component array
           FLD     QWORD PTR ES:[BX+SI]  ; load w[a] = R2

           FLD     ST(5)                 ; load a[0,0] = R1
           FMUL    ST, ST(4)             ; a[0,0] * x[a] = R1
           FLD     ST(5)                 ; load a[0,1] = R0
           FMUL    ST, ST(4)             ; a[0,1] * y[a] = R0
           FADDP   ST(1), ST             ; a[0,0]*x[a]+a[0,1]*y[a]=R1
           FLD     QWORD PTR [DI+16]     ; load a[0,2] = R0
           FMUL    ST, ST(3)             ; a[0,2] * z[a] = R0
           FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,2]*z[a]=R1
           FLD     QWORD PTR [DI+24]     ; load a[0,3] = R0
           FMUL    ST, ST(2)             ; a[0,3] * w[a] = R0
           FADDP   ST(1), ST             ; a[0,0]*x[a]...a[0,3]*w[a]=R1
           LES     BX, AddrX             ; get address of x vector
           FSTP    QWORD PTR ES:[BX+SI]  ; write new x[a]

           FLD     QWORD PTR [DI+32]     ; load a[1,0] = R1
           FMUL    ST, ST(4)             ; a[1,0] * x[a] = R1
           FLD     QWORD PTR [DI+40]     ; load a[1,1] = R0
           FMUL    ST, ST(4)             ; a[1,1] * y[a] = R0
           FADDP   ST(1), ST             ; a[1,0]*x[a]+a[1,1]*y[a]=R1
           FLD     QWORD PTR [DI+48]     ; load a[1,2] = R0
           FMUL    ST, ST(3)             ; a[1,2] * z[a] = R0
           FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,2]*z[a]=R1
           FLD     QWORD PTR [DI+56]     ; load a[1,3] = R0
           FMUL    ST, ST(2)             ; a[1,3] * w[a] = R0
           FADDP   ST(1), ST             ; a[1,0]*x[a]...a[1,3]*w[a]=R1
           LES     BX, AddrY             ; get address of y vector
           FSTP    QWORD PTR ES:[BX+SI]  ; write new y[a]

           FLD     QWORD PTR [DI+64]     ; load a[2,0] = R1
           FMUL    ST, ST(4)             ; a[2,0] * x[a] = R1
           FLD     QWORD PTR [DI+72]     ; load a[2,1] = R0
           FMUL    ST, ST(4)             ; a[2,1] * y[a] = R0
           FADDP   ST(1), ST             ; a[2,0]*x[a]+a[2,1]*y[a]=R1
           FLD     QWORD PTR [DI+80]     ; load a[2,2] = R0
           FMUL    ST, ST(3)             ; a[2,2] * z[a] = R0
           FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,2]*z[a]=R1
           FLD     QWORD PTR [DI+88]     ; load a[2,3] = R0
           FMUL    ST, ST(2)             ; a[2,3] * w[a] = R0
           FADDP   ST(1), ST             ; a[2,0]*x[a]...a[2,3]*w[a]=R1
           LES     BX, AddrZ             ; get address of z vector
           FSTP    QWORD PTR ES:[BX+SI]  ; write new z[a]

           FLD     QWORD PTR [DI+96]     ; load a[3,0] = R1
           FMULP   ST(4), ST             ; a[3,0] * x[a] = R5
           FLD     QWORD PTR [DI+104]    ; load a[3,1] = R1
           FMULP   ST(3), ST             ; a[3,1] * y[a] = R4
           FLD     QWORD PTR [DI+112]    ; load a[3,2] = R1
           FMULP   ST(2), ST             ; a[3,2] * z[a] = R3
           FLD     QWORD PTR [DI+120]    ; load a[3,3] = R1
           FMULP   ST(1), ST             ; a[3,3] * w[a] = R2
           FADDP   ST(1), ST             ; a[3,3]*w[a]+a[3,2]*z[a]=R3
           FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,1]*y[a]=R4
           FADDP   ST(1), ST             ; a[3,3]*w[a]...a[3,0]*x[a]=R5
           LES     BX, AddrW             ; get address of w vector
           FSTP    QWORD PTR ES:[BX+SI]  ; write new w[a]

           ADD     SI, 8                 ; new offset into arrays
           DEC     CX                    ; decrement element counter
           JZ      $done                 ; no elements left, done
           JMP     $mat_mul              ; transform next vector

$done:     FSTP    ST(0)                 ; clear
           FSTP    ST(0)                 ; FPU stack
$nothing:  POP     DS                    ; restore TP data segment
           POP     BP                    ; restore TP frame pointer
           RET     24                    ; pop parameters and return

MUL_4X4    ENDP
|
||
|
||
;---------------------------------------------------------------------
|
||
;
|
||
; IIT_MUL_4x4 multiplicates a four-by-four matrix by an array of four
|
||
; dimensional vectors. This operation is needed for 3D transformations
|
||
; in graphics data processing. There are arrays for each component of
|
||
; a vector. Thus there is an array containing all the x components,
|
||
; another containing all the y components and so on. Each component is
|
||
; an 8 byte IEEE floating-point number. Two indices into the array of
|
||
; vectors are given. The first is the index of the vector that will be
|
||
; processed first, the second is the index of the vector processed
|
||
; last. This subroutine uses the special instructions only available
|
||
; on IIT coprocessors to provide fast matrix multiply capabilities.
|
||
; So make sure to use it only on IIT coprocessors.
|
||
;
|
||
;---------------------------------------------------------------------
|
||
|
||
IIT_MUL_4x4 PROC NEAR
|
||
|
||
AddrX EQU DWORD PTR [BP+24] ; address of X component array
|
||
AddrY EQU DWORD PTR [BP+20] ; address of Y component array
|
||
AddrZ EQU DWORD PTR [BP+16] ; address of Z component array
|
||
AddrW EQU DWORD PTR [BP+12] ; address of W component array
|
||
AddrT EQU DWORD PTR [BP+8] ; addr. of 4x4 transf. matrix
|
||
F EQU WORD PTR [BP+6] ; first vector to process
|
||
K EQU WORD PTR [BP+4] ; last vector to process
|
||
RetAddr EQU WORD PTR [BP+2] ; return address saved by call
|
||
SavdBP EQU WORD PTR [BP+0] ; saved frame pointer
|
||
SavdDS EQU WORD PTR [BP-2] ; caller's data segment
|
||
Ctrl87 EQU WORD PTR [BP-4] ; caller's 80x87 control word
|
||
|
||
PUSH BP ; save TURBO-Pascal frame ptr
|
||
MOV BP, SP ; new frame pointer
|
||
PUSH DS ; save TURBO-Pascal data seg.
|
||
SUB SP, 2 ; make local variabe
|
||
FSTCW [Ctrl87] ; save 80x87 ctrl word
|
||
LES SI, AddrT ; ptr to transformation matrix
|
||
FINIT ; initialize coprocessor
|
||
FSBP2 ; set register bank 2
|
||
FLD QWORD PTR ES:[SI] ; load a[0,0]
|
||
FLD QWORD PTR ES:[SI+32] ; load a[1,0]
|
||
FLD QWORD PTR ES:[SI+64] ; load a[2,0]
|
||
FLD QWORD PTR ES:[SI+96] ; load a[3,0]
|
||
FLD QWORD PTR ES:[SI+8] ; load a[0,1]
|
||
FLD QWORD PTR ES:[SI+40] ; load a[1,1]
|
||
FLD QWORD PTR ES:[SI+72] ; load a[2,1]
|
||
FLD QWORD PTR ES:[SI+104] ; load a[3,1]
|
          FINIT                           ; initialize coprocessor
          FSBP1                           ; set register bank 1
          FLD     QWORD PTR ES:[SI+16]    ; load a[0,2]
          FLD     QWORD PTR ES:[SI+48]    ; load a[1,2]
          FLD     QWORD PTR ES:[SI+80]    ; load a[2,2]
          FLD     QWORD PTR ES:[SI+112]   ; load a[3,2]
          FLD     QWORD PTR ES:[SI+24]    ; load a[0,3]
          FLD     QWORD PTR ES:[SI+56]    ; load a[1,3]
          FLD     QWORD PTR ES:[SI+88]    ; load a[2,3]
          FLD     QWORD PTR ES:[SI+120]   ; load a[3,3]

                                          ; transformation matrix loaded

          MOV     AX, F                   ; index of first vector
          MOV     DX, K                   ; index of last vector

          MOV     BX, AX                  ; index 1st vector to process
          MOV     CL, 3                   ; component has 8 (2**3) bytes
          SHL     BX, CL                  ; compute offset into arrays

          FINIT                           ; initialize coprocessor
          FSBP0                           ; set register bank 0

$mat_loop:LES     SI, AddrW               ; addr. of W component array
          FLD     QWORD PTR ES:[SI+BX]    ; W component current vector
          LES     SI, AddrZ               ; addr. of Z component array
          FLD     QWORD PTR ES:[SI+BX]    ; Z component current vector
          LES     SI, AddrY               ; addr. of Y component array
          FLD     QWORD PTR ES:[SI+BX]    ; Y component current vector
          LES     SI, AddrX               ; addr. of X component array
          FLD     QWORD PTR ES:[SI+BX]    ; X component current vector
          F4X4                            ; mul 4x4 matrix by 4x1 vector
          INC     AX                      ; next vector
          MOV     DI, AX                  ; copy index of next vector
          SHL     DI, CL                  ; offset of vector into arrays

          FSTP    QWORD PTR ES:[SI+BX]    ; store X comp. of curr. vect.
          LES     SI, AddrY               ; address of Y component array
          FSTP    QWORD PTR ES:[SI+BX]    ; store Y comp. of curr. vect.
          LES     SI, AddrZ               ; address of Z component array
          FSTP    QWORD PTR ES:[SI+BX]    ; store Z comp. of curr. vect.
          LES     SI, AddrW               ; address of W component array
          FSTP    QWORD PTR ES:[SI+BX]    ; store W comp. of curr. vect.

          MOV     BX, DI                  ; ofs nxt vect. in comp. arrays
          CMP     AX, DX                  ; nxt vector past upper bound?
          JLE     $mat_loop               ; no, transform next vector
          FLDCW   [Ctrl87]                ; restore orig 80x87 ctrl word

          ADD     SP, 2                   ; get rid of local variable
          POP     DS                      ; restore TP data segment
          POP     BP                      ; restore TP frame pointer
          RET     24                      ; pop parameters and return
IIT_MUL_4x4 ENDP

CODE      ENDS

          END

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM Trnsform;

USES Time;

CONST VectorLen = 8190;

TYPE  Vector    = ARRAY [0..VectorLen] OF DOUBLE;
      VectorPtr = ^Vector;
      Mat4      = ARRAY [1..4, 1..4] OF DOUBLE;

VAR   X:       VectorPtr;
      Y:       VectorPtr;
      Z:       VectorPtr;
      W:       VectorPtr;
      T:       Mat4;
      K:       INTEGER;
      L:       INTEGER;
      First:   INTEGER;
      Last:    INTEGER;
      Start:   LONGINT;
      Elapsed: LONGINT;

PROCEDURE MUL_4X4 (X, Y, Z, W: VectorPtr;
                   VAR T: Mat4; First, Last: INTEGER); EXTERNAL;
PROCEDURE IIT_MUL_4X4 (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4; First, Last: INTEGER); EXTERNAL;

{$L M4X4.OBJ}

BEGIN
   WriteLn ('Test8087 = ', Test8087);
   New (X);
   New (Y);
   New (Z);
   New (W);
   FOR L := 1 TO VectorLen DO BEGIN
      X^ [L] := Random;
      Y^ [L] := Random;
      Z^ [L] := Random;
      W^ [L] := Random;
   END;
   X^ [0] := 1;
   Y^ [0] := 1;
   Z^ [0] := 1;
   W^ [0] := 1;
   FOR K := 1 TO 4 DO BEGIN
      FOR L := 1 TO 4 DO BEGIN
         T [K, L] := (K-1)*4 + L;
      END;
   END;
   First := 0;
   Last  := VectorLen;
   Start := Clock;
   MUL_4X4 (X, Y, Z, W, T, First, Last);
   { IIT_MUL_4X4 (X, Y, Z, W, T, First, Last); }
   Elapsed := Clock - Start;
   WriteLn ('Number of vectors: ', Last-First+1);
   WriteLn ('Time: ', Elapsed, ' ms');
   { 4x4 matrix times 4x1 vector: 16 mul + 12 add = 28 flops per vector }
   WriteLn ('Equivalent to ', (28.0*(Last-First+1)/1e6)/
                              (Elapsed*1e-3):0:4, ' MFLOPS');
   WriteLn;
   WriteLn ('Last vector:');
   WriteLn;
   WriteLn (X^[Last]);
   WriteLn (Y^[Last]);
   WriteLn (Z^[Last]);
   WriteLn (W^[Last]);
END.