520 lines
30 KiB
Plaintext
520 lines
30 KiB
Plaintext
From: S_JUFFA@IRAV1.ira.uka.de (|S| Norbert Juffa)
|
||
Newsgroups: comp.sys.ibm.pc.hardware,comp.sys.intel
|
||
Subject: Performance comparison Intel 386DX, Cyrix 486DLC, C&T 38600DX, Intel R
|
||
Message-ID: <1j69c9INN8qg@iraul1.ira.uka.de>
|
||
Date: 15 Jan 1993 12:06:33 GMT
|
||
Organization: University of Karlsruhe, FRG
|
||
Lines: 511
|
||
|
||
Performance Comparison Intel 386DX, Intel RapidCAD, C&T 38600DX, Cyrix 486DLC
|
||
|
||
This document, containing descriptions and a benchmark comparison of the
|
||
Intel 386DX, Intel RapidCAD, C&T 38600DX, and the Cyrix 486DLC has been
|
||
put together for the benefit of the net.community. I believe, but cannot
|
||
guarantee, that all data given herein is correct. If you would like to
|
||
report errors in this document, or have suggestions for its enhancement,
|
||
feel free to contact me at the following email address:
|
||
|
||
S_JUFFA@IRAVCL.IRA.UKA.DE
|
||
|
||
You can also write to me at my snail mail address:
|
||
|
||
Norbert Juffa
|
||
Wielandstr. 14
|
||
7500 Karlsruhe 1
|
||
Germany
|
||
|
||
Corrections pertaining to spelling or grammatical errors are also encouraged,
|
||
especially from those net.users that are native speakers of (American) English.
|
||
|
||
|
||
This is the second version of this document. Thanks to the people that helped
|
||
improve it by their comments on the previous version:
|
||
|
||
Alex Martelli (martelli@cadlab.sublink.org),
|
||
Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM)
|
||
|
||
A special thanks for editing this article goes to:
|
||
|
||
David Ruggiero (osiris@halcyon.halcyon.com)
|
||
|
||
|
||
---------------------------------------------------------------------------
|
||
|
||
In 1992, three new CPUs were introduced into the PC marketplace. Each has the
|
||
capability to replace the standard Intel 386DX CPU and provide higher
|
||
performance for existing 386 systems. These new chips are the Intel RapidCAD,
|
||
the Chips&Technologies (C&T) Super386 38600DX, and the Cyrix 486DLC. At the
|
||
moment, all of these are only available in 33 MHz versions, but C&T and Cyrix
|
||
have announced 40 MHz versions for delivery in early 1993. (Of course you could
|
||
try to push existing 33 MHz chips to 40 MHz, but I do not recommend this, as my
|
||
experience shows that operation at the higher frequency tends to be unreliable,
|
||
with the possibility of the machine locking up unexpectedly.) While the Intel
|
||
RapidCAD is marketed only as an end-user product (which means PC manufacturers
|
||
will not ship systems with the RapidCAD already installed), both C&T and Cyrix
|
||
sell their products directly to the end user as well as to PC motherboard
|
||
manufacturers. Cyrix also manufactures the 486SLC, which is a replacement for
|
||
the Intel/AMD 386SX CPUs. C&T has announced a 38600SX chip, but it will not
|
||
ship before sometime in 1993, if at all.
|
||
|
||
The Intel RapidCAD
|
||
------------------
|
||
|
||
The Intel RapidCAD is, in rough terms, an Intel 486DX with the 8 kB on-chip
|
||
cache removed and a 386 pinout. To software, it appears to be a 386DX with a
|
||
math coprocessor, as the 486-specific instructions have been removed from its
|
||
instruction set. It is marketed by Intel as "the ultimate coprocessor" and, as
|
||
the name implies, its main purpose is to replace the 386DX in existing systems
|
||
and thereby boost the performance of floating-point intensive applications such
|
||
as CAD software, spreadsheets, and math packages (e.g. SPSS, Mathematica).
|
||
|
||
RapidCAD is delivered as a set of two chips. RapidCAD-1 is a 132-pin PGA (pin
|
||
grid array) chip that goes into the CPU socket and replaces the 386DX. It
|
||
contains the CPU and the FPU (floating-point unit). RapidCAD-2 is a 68-pin PGA
|
||
chip that goes into the 387 coprocessor (or EMC) socket; it contains a simple
|
||
PLA whose only purpose it to generate the FERR signal for the motherboard
|
||
logic: this provides full PC-compatible handling of floating point exceptions.
|
||
Many CPU instructions execute in one clock cycle in the RapidCAD, just as on
|
||
the Intel 486. The RapidCAD is constrained, however, by use of the standard
|
||
386DX bus interface, since every bus cycle takes two CPU clock cycles. This
|
||
means that instructions may be executed by the RapidCAD faster than they can be
|
||
fetched from memory. But as most FPU instructions take longer to execute than
|
||
CPU instructions, they are not greatly slowed down by this slow bus interface,
|
||
and most of them execute in the same time as in the Intel 486DX. Therefore, the
|
||
Intel RapidCAD provides higher overall floating-point performance than any
|
||
existing 386/387 combination. Results from the SPEC benchmarks, a common
|
||
workstation UNIX benchmark suite, show that the RapidCAD provides 85% more
|
||
floating point and 15% more integer performance than an Intel 386DX/387DX
|
||
combination at the same clock frequency.
|
||
|
||
Power consumption of the RapidCAD is typically 3500 mW at 33 MHz. The current
|
||
street price of the 33 MHz version is ~300 US$.
|
||
|
||
The C&T 38600DX
|
||
---------------
|
||
|
||
The Chips&Technologies 38600DX is designed to be 100% compatible to the Intel
|
||
386DX CPU. Unlike AMD's Am386 CPUs (which use microcode that is identical to
|
||
the Intel 386DX's microcode), the C&T 38600DX uses new microcode developed
|
||
within C&T using "clean-room" techniques. C&T even included the 386DX's
|
||
"undocumented" LOADALL386 instruction in the instruction set to provide full
|
||
compatibility with the Intel chip.
|
||
|
||
Some instructions execute faster on the 38600DX than on the 386DX. C&T also
|
||
produces a 38605DX CPU that includes a 512-byte instruction cache and provides
|
||
further performance increases over the Intel part. The 38605DX needs a bigger
|
||
socket (144-pin PGA) and is therefore *not* pin-compatible with the 386DX.
|
||
|
||
In my tests, I found that the 38600DX has severe problems with the CPU-
|
||
coprocessor communication, which causes its floating-point performance to drop
|
||
below the level provided by the Intel 386DX/Intel 387DX for most programs. This
|
||
problem exists with all available 387 compatible coprocessors (ULSI 83C87, IIT
|
||
3C87, Cyrix EMC87, Cyrix 83D87, Cyrix 387+, C&T 38700, and the Intel 387DX). (A
|
||
net.acquaintance also did tests with the 38600DX and arrived at similar
|
||
results. He contacted C&T and they said they were aware of the problem.)
|
||
|
||
Typical power consumption of the 40 MHz 38600DX is 1650 mW, which is less than
|
||
the typical power consumption of the Intel 386DX at 33 MHz (2000 mW). The
|
||
38600DX currently sells for ~US$ 80 for the 33 MHz version.
|
||
|
||
The Cyrix 486DLC
|
||
----------------
|
||
|
||
The Cyrix 486DLC is the latest entry into the market of 386DX replacements. It
|
||
features a 486SX compatible instruction set, a 1 kB on-chip cache, and a 16x16-
|
||
bit hardware multiplier. The RISC-like execution unit of the 486DLC executes
|
||
many instructions in a single clock cycle. The hardware multiplier multiplies
|
||
16-bit quantities in 3 clock cycles, as compared to 12-25 cycles on a standard
|
||
Intel 386DX. This is especially useful in address calculations (code from non-
|
||
optimizing compilers may contain many MUL instructions for array accesses) and
|
||
for software floating-point arithmetic (e.g., a coprocessor emulator). The 1 kB
|
||
cache helps the 486DLC to overcome the limitations of the 386 bus interface,
|
||
although its hit rate averages only about 65% under normal program conditions.
|
||
|
||
In existing 386 systems, DMA transfers (such as those performed by a SCSI
|
||
controller or a sound board) may force the internal cache of the 486DLC to be
|
||
flushed as the only means available in a 386 system to enforce consistency
|
||
between the contents of the on-chip cache and external memory. This stems from
|
||
the fact that the 386 bus interface was designed without provisions for an
|
||
on-chip cache. This problem can reduce the performance of the 486DLC in systems
|
||
that do a sizeable amount of (bus master) DMA. Cyrix has, however, defined some
|
||
additional cache control signals for some of the 486DLC lines; they can be used
|
||
to improve communications between the on-chip cache and a larger external cache
|
||
or main memory and prevent this problem. Although current 386 systems ignore
|
||
these signals (since they are not defined for the Intel 386DX), future systems
|
||
designed with the Cyrix chip in mind may take advantage of them and thereby
|
||
gain increased performance.
|
||
|
||
The Cyrix cache is a unified data/instruction write-through type and can be
|
||
configured as either a direct-mapped or 2-way set associative cache. It permits
|
||
definition of up to four non-cacheable regions, which are particularly useful
|
||
if a system has memory mapped peripherals (e.g., the Weitek math coprocessor).
|
||
For compatibility reasons, the cache is disabled after a processor reset
|
||
and must be enabled with the help of a small program provided by Cyrix. This
|
||
prevents problems with BIOSes that are not "486DLC aware". (I am certain that
|
||
future versions of the AMI BIOS and other BIOSes will take the 486DLC into
|
||
consideration and directly support the Cyrix chip's cache.)
|
||
|
||
The 486DLC will not work correctly with all math coprocessors in all
|
||
circumstances, with protected mode multitasked environments (e.g. MS-Windows
|
||
386-enhanced mode) being especially critical. Using the 486DLC with the Cyrix
|
||
EMC87, Cyrix 83D87 (chips produced prior to November 1991), and IIT 3C87, I
|
||
have been able to completely lock up the machine due to synchronization
|
||
problems between the CPU and the coprocessor while executing the FSAVE or
|
||
FRSTOR instructions (which are used to save and restore the coprocessor status
|
||
during task switches). According to Cyrix, this problem only occurs with the
|
||
first revision of the 486DLC and is fixed on newer ones. To be on the safe
|
||
side, the 486DLC should best be used with the Cyrix 387+ (its "Europe-only"
|
||
name) or with the identical Cyrix 83D87 (US-bound chips manufactured after
|
||
October 1991): these are not only are the highest performing 387 coprocessors
|
||
on the market, but they also work properly even with the first generation
|
||
486DLC.
|
||
|
||
If you already have a Cyrix 83D87 coprocessor and want to know whether it is
|
||
the old or new type, I recommend you use my COMPTEST program, available as
|
||
CTEST257.ZIP via anonymous ftp from garbo@uwasa.fi and other fine ftp servers.
|
||
If COMPTEST reports a 387+, you either have the 387+ or the identical newer
|
||
version of the 83D87 installed and can use any version of the Cyrix 486DLC
|
||
without problems. If you believe that you may have problems with a 486DLC/387
|
||
combination, I suggest you contact Cyrix technical support (1-800-FAS-MATH in
|
||
the US).
|
||
|
||
Power consumption of a 40 MHz 486DLC is typically 2800 mW. The 486DLC sells for
|
||
~115 US$ for the 33 MHz version, according to its German distributor.
|
||
|
||
Tests and benchmarks
|
||
--------------------
|
||
|
||
HW configuration: 33.3/40 MHz motherboard with Forex chip set and AMI BIOS. 128
|
||
kB zero- wait-state, direct mapped, write-through CPU cache with one write
|
||
buffer, 4 bytes per cache line, and 4 clock cycles penalty for a cache line
|
||
miss. 8 MB of main memory with an average access time of 1.6 wait states. Cyrix
|
||
EMC87 in 387 compatibility mode as math coprocessor. (This and the Cyrix 83D87
|
||
/ 387+ are the fastest coprocessors available for use with the
|
||
386DX/486DLC/38600DX). Conner 3204F hard disk, 203 MB capacity, IDE interface
|
||
(CORETEST throughput 1100 kB/s, seek time 16 ms). Diamond SpeedSTAR HiColor,
|
||
ISA bus SVGA card using Tseng's ET4000 chip, 1 MB DRAM as frame buffer, *no*
|
||
accelerator. The switches on the card were set for fastest reliable operation,
|
||
with a throughput of 6500 bytes/ms at 40 MHz and 5400 bytes/ms at 33.3 MHz.
|
||
|
||
SW configuration: MS-DOS 5.0, MS Windows 3.1, HyperDisk 4.32 disk cache program
|
||
in write-back mode, using 2 MB of extended memory, 386MAX 6.01 used as memory
|
||
manager and DPMI provider in some benchmarks. Latest Tseng (Colorview) driver
|
||
for Windows 3.1 at 1024x768x256, using the 8514 fonts.
|
||
|
||
|
||
For the Whetstone, Dhrystone, WINTACH, DODUC, LINPACK, LLL, and Savage
|
||
benchmarks, *higher* numbers indicate *faster* performance.
|
||
|
||
For the MAKE RTL, MAKE TRANCK, and String-Test benchmarks, *lower* numbers
|
||
indicate *faster* performance.
|
||
|
||
|
||
Intel C&T Intel Cyrix Cyrix
|
||
33.3 MHz 386DX 38600DX RapidCAD 486DLC 486DLC
|
||
cache off cache on
|
||
integer
|
||
|
||
Whetstone [kWhets/s] 447 585 563 695 803
|
||
Dhrystone (C) [Dhry./s] 11688 11819 12357 14150 15488
|
||
Dhrystone (Pas) [Dhry./s]10455 10877 10751 12154 13858
|
||
String-Test [ms] 459 453 441 347 327
|
||
MAKE RTL [s] 51.32 47.10 46.34 43.45 39.13
|
||
MAKE TRANCK [s] 62.42 55.47 55.37 53.64 46.12
|
||
WINTACH [overall RPM] 4.85 4.90 5.49 5.53 6.14
|
||
|
||
float
|
||
|
||
DODUC [Rapidity index] 79.0 76.4 150.3 89.4 90.7
|
||
LINPACK [MFLOPS] 0.2808 0.2707 0.4578 0.3158 0.3438
|
||
LLL [MFLOPS] 0.3352 0.3537 0.6083 0.3816 0.4139
|
||
Whetstone [kWhets/s] 2540 2340 3990 2908 3061
|
||
Savage [function eval/s] 71685 53191 72464 88757 93897
|
||
|
||
|
||
Intel C&T Intel Cyrix Cyrix
|
||
40.0 MHz 386DX 38600DX RapidCAD 486DLC 486DLC
|
||
cache off cache on
|
||
integer
|
||
|
||
Whetstone [kWhets/s] 536 702 676 835 963
|
||
Dhrystone (C) [Dhry./s] 14128 14116 14836 16987 18750
|
||
Dhrystone (Pas) [Dhry./s]12490 13067 12890 14573 16624
|
||
String-Test [ms] 384 377 368 289 273
|
||
MAKE RTL [s] 43.46 40.11 39.84 37.25 33.54
|
||
MAKE TRANCK [s] 53.00 47.59 47.07 45.36 39.00
|
||
WINTACH [overall RPM] 5.65 5.73 6.41 6.46 7.23
|
||
|
||
float
|
||
|
||
DODUC [Rapidity index] 94.9 77.5 180.3 105.1 106.6
|
||
LINPACK [MFLOPS] 0.3324 0.3260 0.5418 0.3789 0.4131
|
||
LLL [MFLOPS] 0.4025 0.4204 0.7263 0.4562 0.4956
|
||
Whetstone [kWhets/s] 3061 2632 4798 3505 3677
|
||
Savage [function eval/s] 86083 49587 86957 106762 112360
|
||
|
||
|
||
To complete the picture, I ran the CPU/FPU standard benchmarks on an Intel
|
||
486DX running at 33.3/40 MHz. Since the 486 machine was configured with a
|
||
different hard disk than the 386 system, and the compilers and tools installed
|
||
on the 386 machine were not present, the MAKE benchmarks could unfortunately
|
||
not be included in the tests.
|
||
|
||
486DX, 256 kBytes CPU cache (write-thru), 8 MB of RAM, AMI-BIOS, Diamond
|
||
SpeedSTAR HiColor (VIDSPEED thoughput: 6500 bytes/ms at 40 MHz and 5400
|
||
bytes/ms at 33.3 MHz), MS-DOS 5.0, MS-Windows 3.1:
|
||
|
||
integer 33.3 MHz 40 Mhz
|
||
|
||
Whetstone [kWhets/s] 707 848
|
||
Dhrystone (C) [Dhry./s] 19394 23265
|
||
Dhrystone (Pas) [Dhry./s] 16978 20368
|
||
String-Test [ms] 333 279
|
||
WINTACH 8.59 10.14
|
||
|
||
float
|
||
|
||
DODUC [Rapidity index] 184.0 220.7
|
||
LINPACK [MFLOPS] 0.6682 0.8204
|
||
LLL [MFLOPS] 0.9387 1.1110
|
||
Whetstone [kWhets/s] 5143 6195
|
||
Savage [function eval/s] 82192 98522
|
||
|
||
|
||
Conclusions
|
||
-----------
|
||
|
||
The Cyrix 486DLC is the 386DX replacement with the highest integer performance.
|
||
With the internal cache enabled, integer performance of the 486DLC can be up to
|
||
80% higher compared with a Intel 386DX at the same clock frequency, with the
|
||
average speed gain for integer applications being 35%. Enabling the internal
|
||
cache provides about 5-15% more performance than with the cache disabled for
|
||
both integer and floating-point applications. Floating-point applications are
|
||
accelerated by about 15%-30% if the Cyrix 486DLC (with cache enabled) is used
|
||
instead of the Intel 386DX. Compared with the Intel 486DX, the Cyrix 486DLC
|
||
provides about 70% of the integer performance and about 50% of the floating
|
||
point performance at the same clock frequency.
|
||
|
||
The Intel RapidCAD is the 386DX replacement that provides the highest floating
|
||
point performance. It can speed up most floating-point intensive programs by
|
||
60%-90% compared with the fastest Intel 386DX/math coprocessor combination; it
|
||
provides nearly 75% of the floating-point performance of a Intel 486DX at the
|
||
same clock frequency. Integer performance increases by an average 15% by using
|
||
the RapidCAD instead of the standard Intel 386DX, with the maximum performance
|
||
gain being 35%.
|
||
|
||
The Chips&Technologies 38600DX has a slightly higher integer performance than
|
||
the Intel 386DX, with the speedup ranging from 0%-30% and an average speedup on
|
||
the order of 10%.
|
||
|
||
|
||
Description of benchmarks
|
||
-------------------------
|
||
DHRYSTONE [9] is a synthetic benchmark developed by R. Weicker from Siemens in
|
||
1984. The frequency of operations and data types used by Dhrystone are modeled
|
||
after statistics collected for 'typical' programs that are written in a HLL
|
||
(high level language) such as C or Pascal that do not use floating-point
|
||
arithmetic. Thus there is a certain distribution of global and local variables
|
||
being used in procedures, there is a certain percentage of use of records
|
||
(structs) or strings out of the total number of variables accessed and there is
|
||
a certain percentage of procedure calls and if-statements out of the total
|
||
lines of codes executed. All these percentages match the statistics used in the
|
||
development of Dhrystone quite closely. To preserve the typical distributions,
|
||
the measurement rules for Dhrystone forbid function inlining (a frequently used
|
||
optimization technique optionally performed by most optimizing compilers). The
|
||
current version of Dhrystone is 2.1, and this is the version used for my tests.
|
||
Version 2.1 [10] differs from previous versions in that it contains additional
|
||
code that prevents optimizing compilers from throwing out most of the code by
|
||
dead code elimination (since Dhrystone has no input file, many expressions can
|
||
be computed at compile time), thus artificially inflating performance numbers.
|
||
The Dhrystone benchmark exists in equivalent Ada, Pascal and C versions, which
|
||
are all available from netlib@ornl.gov. Dhrystone measures the time to execute
|
||
its main loop and sets this time into relation with a fixed reference time, the
|
||
result being a performance index given in "Dhrystones per second". I used two
|
||
Dhrystone executables. One was compiled using the non-optimizing Turbo Pascal
|
||
6.0 compiler, the other one was compiled with the MS-C 7.0 compiler using the
|
||
large memory model and maximum optimization but without the use of function
|
||
inlining. Because the Turbo Pascal executable uses more memory operands and the
|
||
MS-C executable uses more register-to-register operations, the speedup for the
|
||
executables on different CPUs can be noticeably different.
|
||
|
||
WINTACH is a public domain benchmark program by Texas Instruments that measures
|
||
the speed of graphics output for four typical MS-Windows applications: a word
|
||
processor, a spreadsheet, a CAD program, and a paint program. TI has collected
|
||
profiles for each class of applications under 'typical' user loads. The four
|
||
parts of the WINTACH benchmark were then modeled after these profiles, so that
|
||
the percentage of certain GDI (graphics device interface) calls found in the
|
||
profiles is also present in the benchmark. WINTACH determines the performance
|
||
of each program part relative to the performance of a standard VGA card in a 20
|
||
MHz 386DX machine. It also combines the four values into an overall relative
|
||
performance index, which is reported in my tables. Using this single number to
|
||
represent graphics performance is justified in this case since the four partial
|
||
results do *not* deviate much from the average for the SVGA tested (a Diamond
|
||
SpeedSTAR HiColor). I used a frame buffer card for the test because for these
|
||
graphics card type, graphics performance depends solely on the performance of
|
||
the CPU, which handles all accesses to the display memory. On an accelerated
|
||
card, some of the low-level operations are delegated to the accelerator chip
|
||
and the influence of the CPU speed on the graphics performance is not seen so
|
||
clearly. For this test, I used the latest Tseng Windows 3.1 driver (Colorview)
|
||
at a resolution of 1024x768x256 with the large 8514 fonts (WINTACH code C8).
|
||
|
||
MAKE RTL is not a single program. Rather it is a build of a complete project,
|
||
my own run-time library for Turbo Pascal 6.0. The project consists of about 200
|
||
source files with a total size of ~650 kByte. Most of the source files are
|
||
assembly language (.ASM) files, while there are also some Pascal (.PAS) files.
|
||
Building the complete project using MAKE, about 200 binary files (.OBJ and
|
||
.TPU) are produced with a combined size of ~300 kB. The programs used during
|
||
the build are MAKE 3.5, TPC 6.01, and TASM 2.01, all by Borland, Inc. All files
|
||
are read from and written to the hard disk, using a 2 MB write back disk cache
|
||
provided by the HyperDisk 4.32 program. This eliminates most of the I/O
|
||
overhead. No memory manager was installed during the test. The time reported is
|
||
the time from starting the MAKE utility to the reappearance of the DOS prompt.
|
||
MAKE RTL can be thought of being typical of applications that operate on a lot
|
||
of small files, and use only integer instructions.
|
||
|
||
MAKE TRANCK is a project build for a project consisting of two assembly
|
||
language modules, two C modules and one FORTRAN module. The combined size of
|
||
the source files is approximately 120 kBytes. All modules are compiled
|
||
(assembled) and combined into a single program linking with both, the C and the
|
||
FORTRAN libraries. Programs used are MAKE 3.5 by Borland and MASM 6.0, MS-C
|
||
7.0, MS-FORTRAN 5.0, and LINK 5.3, all by Microsoft, Inc. The C compiler and
|
||
the linker run in protected mode and require a DPMI host to be present. 386MAX
|
||
6.01 by Qualitas was installed for this purpose. To minimize I/O overhead, the
|
||
HyperDisk 4.32 disk caching program was installed, with 2 MByte of extended
|
||
memory allocated to the cache. The modules are compiled with compiler switches
|
||
set for maximum optimization, and most of the time for the project build is
|
||
spent in the optimizing stages of the compilers.
|
||
|
||
STRING-TEST is a simple benchmark that tests the string handling functions
|
||
built into the Turbo Pascal language. Nearly all of the execution time is spent
|
||
in repeated execution of the STOS, CMPS, SCAS, and MOVS instructions. The data
|
||
and code fit into about five kBytes of memory. In cached systems, almost all
|
||
memory accesses will be cache hits. The program was written and compiled with
|
||
Turbo Pascal 6.0 and linked with my own run-time library, which provides much
|
||
faster string functions than the run-time library delivered by Borland.
|
||
Compiler switches were set for fastest execution. The time reported is the time
|
||
to complete the whole benchmark in milliseconds, with an accuracy of +/-1
|
||
millisecond.
|
||
|
||
LLL is short for Lawrence Livermore Loops [8], a set of kernels taken from real
|
||
floating point extensive programs. Some of these loops are vectorizable, but
|
||
since we don't deal with vector processors here, this doesn't matter. For this
|
||
test, LLL was adapted from the FORTRAN original in [7] to Turbo Pascal. By
|
||
variable overlaying (similar to FORTRAN's EQUIVALENCE statement) memory
|
||
allocation for data was reduced to 64 kB, so all data fits into a single 64 kB
|
||
segment. The older version of LLL is used here which contains 14 loops. There
|
||
also exists a newer, more elaborate version consisting of 24 kernels. The
|
||
kernels in LLL exercise only multiplication and addition. The MFLOPS rate
|
||
reported is the average of the MFLOPS rate of all 14 kernels as reported by the
|
||
LLL program. LLL and Whetstone results (see below) are reported as returned by
|
||
my COMPTEST test program in which they have been included as a measure of
|
||
coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal 6.0
|
||
with all 'optimizations' on and using my own run-time library, which gives
|
||
higher performance than the one included with TP 6.0. My library is available
|
||
as TPL60N17.ZIP from garbo.uwasa.fi and ftp-sites that mirror this site. All
|
||
floating point variables in the program where declared as DOUBLE.
|
||
|
||
LINPACK [4] is a well known floating-point benchmark that also heavily
|
||
exercises the memory system. Linpack operates on large matrices and takes up
|
||
about 570 kB in the version used for this test. This is about the largest
|
||
program size a pure DOS system can accommodate. Linpack was originally designed
|
||
to estimate performance of BLAS, a library of FORTRAN subroutines that handles
|
||
various vector and matrix operations. Note that vendors are free to supply
|
||
optimized (e.g. assembly language versions) of BLAS. Linpack uses two routines
|
||
from BLAS which are thought to be typical of the matrix operations used by
|
||
BLAS. Both routines only use addition/subtraction and multiplication. The
|
||
FORTRAN source code for Linpack can be obtained from the automated mail server
|
||
netlib@ornl.gov. Linpack was compiled using MS FORTRAN 5.0 in the HUGE memory
|
||
model (which can handle data structures larger than 64 kB) and with compiler
|
||
switches set for maximum optimization. Linpack performs the same test
|
||
repeatedly. The number reported is the maximum MFLOPS rate returned by Linpack.
|
||
All floating point variables in the program were declared as DOUBLE. Linpack
|
||
MFLOPS ratings for a great number of machines are contained in [5]. This
|
||
PostScript document is also available from netlib@ornl.gov.
|
||
|
||
WHETSTONE [1,2,3] is a synthetic benchmark based on statistics collected about
|
||
the use of certain control and data structures in programs written in high
|
||
level languages. Based on these statistics, Whetstone tries to mirror a
|
||
'typical' HLL program. Whetstone performance is expressed by how many
|
||
theoretical 'whetstone' instructions are executed per second. It was originally
|
||
implemented in ALGOL. Unlike LLL and Linpack, Whetstone not only uses addition
|
||
and multiplication but exercises all basic arithmetic operations as well as
|
||
some transcendental functions. Whetstone performance depends on the speed of
|
||
the coprocessor as well as on the speed of the CPU, while LLL and Linpack place
|
||
a heavier burden on the coprocessor/FPU. There exist an old and a new version
|
||
of Whetstone. Note that results from the two versions can differ by as much as
|
||
20% for the same test configuration. For this test, the new version in Pascal
|
||
from [2] was used. It was compiled with Turbo Pascal 6.0 and my own library
|
||
(see above) with all 'optimizations' on. For the integer test, software fp
|
||
arithmetic using the REAL type was utilized. Using the software arithmetic
|
||
exercises only the CPU, in particular the execution of MOV, SHL, SHR, RCR,
|
||
ADD, SUB, MUL, and DIV instructions. For the floating point test, the hardware
|
||
floating-point arithmetic of the coprocessor was used and computations were
|
||
performed using the DOUBLE type.
|
||
|
||
SAVAGE tests the performance of transcendental function evaluation. It is
|
||
basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt
|
||
functions are combined in a single expression. While sin, cos, arctan, and sqrt
|
||
can be evaluated directly with a single 387 coprocessor instruction each, ln
|
||
and exp need additional preprocessing for argument reduction and result
|
||
conversion. According to [11], the Savage benchmark was devised by Bill Savage,
|
||
and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore Front Parkway,
|
||
Rockaway Beach, NY 11693, USA. Usually, Savage is programmed to make 250,000
|
||
passes though the loop. Here only 10,000 loops are executed for a total of
|
||
60,000 transcendental function evaluations. The result is expressed in function
|
||
evaluations per second. SAVAGE source code was taken from [6] and compiled with
|
||
Turbo Pascal 6.0 and my own run-time library (see above).
|
||
|
||
DODUC [12] is a modified application program by Nhuan Doduc that is also part
|
||
of the SPEC benchmark suite. It is a nuclear safety analysis code that
|
||
simulates the time evolution of a thermohydraulic modelization for a nuclear
|
||
reactor's component and the benchmark is created from the computational kernel
|
||
of the original application. The benchmark consist of 5323 FORTRAN statement
|
||
lines and the executable size is about 350 kBytes. Almost all processing is
|
||
done using double precision floating-point values. There is not much array
|
||
processing. Instead the program has an iterative structure with an abundance of
|
||
short branches and small loops. The version used for this test was compiled
|
||
with the highly optimizing NDP FORTRAN V3.0 compiler from Microway, which
|
||
generates a 32-bit mode executable that runs in protected mode using a
|
||
protected mode loader provided by Microway. The execution time of the DODUC
|
||
program is set into relation with fixed reference times and the result is
|
||
scaled to give a "Rapidity index".
|
||
|
||
|
||
References
|
||
----------
|
||
|
||
[1] Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark.
|
||
Computer Journal, Vol. 19, No. 1, 1976, pp. 43-49
|
||
[2] Wichmann, B.A.: Validation code for the Whetstone benchmark.
|
||
NPL Report DITC 107/88, National Physics Laboratory, UK, March 1988
|
||
[3] Curnow, H.J.: Wither Whetstone? The Synthetic Benchmark after 15 Years.
|
||
In: Aad van der Steen (ed.): Evaluating Supercomputers.
|
||
London: Chapman and Hall 1990
|
||
[4] Dongarra, J.J.: The Linpack Benchmark: An Explanation.
|
||
In: Aad van der Steen (ed.): Evaluating Supercomputers.
|
||
London: Chapman and Hall 1990
|
||
[5] Dongarra, J.J.: Performance of Various Computers Using Standard Linear
|
||
Equations Software.
|
||
Report CS-89-85, Computer Science Department, University of Tennessee,
|
||
March 11, 1992
|
||
[6] Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test.
|
||
Design & Elektronik 1990, Heft 13, Seiten 105-110
|
||
[7] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der IBM 3090E mit
|
||
Vektoreinrichtung.
|
||
Arbeitsbericht RRZK-8803, Regionales Rechenzentrum an der Universit"at zu
|
||
K"oln, Februar 1988
|
||
[8] McMahon, H.H.: The Livermore Fortran Kernels: A test of the numerical
|
||
performance range.
|
||
Technical Report UCRL-53745, Lawrence Livermore National
|
||
Laboratory, USA, December 1986
|
||
[9] Weicker, R.P.: Dhrystone: A Synthetic Systems Programming Benchmark.
|
||
Communications of the ACM, Vol. 27, No. 10, October 1984, pp. 1013-1030
|
||
[10] Weicker, R.P.: Dhrystone Benchmark: Rationale for Version 2 and
|
||
Measurement Rules.
|
||
SIGPLAN Notices. Vol. 23, No. 8, August 1988, pp. 49-62
|
||
[11] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990
|
||
Order No. B2004
|
||
[12] Doduc, Nhuan: Fortran Execution Time Benchmark.
|
||
Unpublished manuscript, version 45, available from ndoduc@framentec.fr
|
||
|