---------------------------------
           OPTIMIZE.TXT
---------------------------------

Some useful information about how to optimize code on a 386/486/Pentium

This was initially a posting on the Internet made by
Michael Kunstelj.
Contact at : virteam@smoke.canberra.edu.au
          or mick@cairo.anu.edu.au

Initially it included 486 and Pentium optimizations with some "holes"
(like the profiling instructions). I filled the Pentium "holes",
then refined some explanations (uh! maybe I made 'em worse! :} ),
then added some not-so-obvious cache optimization techniques
and other info about specific optimizations you can make on a 386.

Lorenzo Micheletto

N.B.
I added some generic examples about how a CPU works,
BUT they are just examples; Intel CPUs work in slightly different ways.

N.B./2
Always check the books for anything you are not sure about;
there may be typos here.
----------------------------------
Brief intro about how a CPU works
----------------------------------

Now a brief intro to the innards of Intel CPUs.

Most processors these days have within them a system termed a "pipeline".
The 486 / 586 are certainly within this category. However, most people
aren't quite sure what exactly a pipeline is, so here follows a complete
explanation:

The execution of a machine code instruction can be roughly split
into five stages: [n.b. this is a generic pipeline]
 1) FETCH instruction opcode and immediate data
 2) DECODE opcode and align immediate data into temporary registers
 3) CALCULATE ADDRESS OF OPERANDS
    (calculate the address of the memory locations to access
     and select the register operands to use)
    (sometimes called DECODE 2)
 4) EXECUTE instruction (loading memory operands if needed)
    & write results to INTERNAL registers
 5) WRITEBACK memory-operand results to memory

Every stage takes ONE OR MORE cpu cycles to be completed, but usually
a modern cpu needs just one cpu cycle for every execution stage
(excluding stages 1 and 5, which have to deal with memory/cache access,
and stage 4 when "complex" microcoded instructions are executed).

A cpu-core (the "real" cpu, not the cache support nor the other things
that are usually integrated into a modern cpu) has five "specialized"
units, one for every stage of instruction execution, plus a "register file"
(an ultra-fast on-chip memory where internal register values
are stored) and a CONTROL UNIT.

The control unit "coordinates" the action of all the other units
so they can work together.

Here follows an ascii drawing
of ONE of the many ways these units can be connected:
   MEMORY INPUT ("memory load" unit)
        |
        v
  +-----------+                    +---------+
  |   FETCH   |<------------------>|         |
  +-----------+                    |         |
        | (instruction pointer)    |         |
        v                          |         |
  +-----------+                    |         |
  |  DECODE   |<------------------>|         |
  +-----------+                    |         |
        |                          | CONTROL |
        v                          |  UNIT   |
  +-------------+                  |         |
  | ADDRESS C.  |<---------------->|         |
  +-------------+                  |         |
        |                          |         |
        v                          |         |
  +-----------+   +-----------+    |         |
  | REGISTER  |<->|  EXECUTE  |<-->|         |
  |   FILE    |   +-----------+    |         |
  +-----------+         |          |         |
                        v          |         |
                  +-----------+    |         |
                  | WRITEBACK |<-->|         |
                  +-----------+    +---------+
                        |
                        v
                  MEMORY OUTPUT ("memory store" unit)

If you look at the drawing, the control unit needs to communicate
with all the other units to make 'em work together.
Usually the "control unit" is scattered in little units that coordinate
instruction and data flow between adjacent units, but you can think
of it as a single independent unit.

The five execution units are what gives the cpu its "raw" power,
but the control unit is what lets you fully utilize 'em.

Let's suppose every unit performs one operation in one step.

With a "cheap" control unit, to execute a complete machine language
instruction you first enable FETCH, then you enable DECODE,
and so on until WRITEBACK is completed and the control unit
enables FETCH again to execute the next instruction.

This is what happens in an "absolutely not pipelined" cpu
with "one cycle" stages: the fastest instruction takes 5 cycles,
but the control unit is very simple
(just a circular shift register that enables one unit at a time).

Of course this is the worst case; nobody is such a jerk as to build
a cpu like that.

Every cpu stage-execution unit is a stand-alone thing: it has its hardware
and its "temporary registers", so it is capable of operating "alone".

So, IF THE CONTROL UNIT can SYNCHRONIZE the operations
of all the stage-execution units, it can make 'em WORK IN PIPELINE
(like in a factory, where every worker/robot in a production line
 a) receives a partially refined product from the previous worker in line
 b) performs some simple operations to refine the product a little more
 c) passes the result to the next worker in line
).
A cpu with such a control unit is called a pipelined CPU.

If you have to execute in sequence instructions A,B,C,D,E
on a pipelined processor ....

While the WRITEBACK unit is performing stage 5 of A
the EXECUTE unit can perform stage 4 of B
the ADDRESS CALCULATE unit can perform stage 3 of C
the DECODE unit can perform stage 2 of D
and the FETCH unit can perform stage 1 of E

So if you start the cpu at instruction A
it looks like instruction A takes 5 cycles,
while instructions B,C,D,E (immediately one stage behind) look like they take
JUST ONE CYCLE!!!

So if you execute a SEQUENCE of 8 instructions
on a "not pipelined" processor with a five stage "pipe",
it takes 40 steps to execute the sequence,
while on a pipelined processor it takes 5+7=12 steps!!!
( 5 steps for the first instruction to get thru the pipeline,
  then 7 steps for the other instructions immediately "one step" behind)
And the more instructions are executed in sequence, the more it looks like
every instruction takes "just" one cycle to execute.

This is of course the optimal situation, when the processor pipeline
is filled and WHEN EVERY INSTRUCTION DOES NOT USE MEMORY/REGISTER OPERANDS
"still under processing" IN SUCCESSIVE EXECUTION-STAGES!!!!!!
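The cycle counts above follow a simple formula. Here is a tiny Python sketch of that arithmetic (just the idealized model from the text, ignoring stalls):

```python
def cycles_not_pipelined(n_instructions, stages=5):
    # every instruction walks through all five stages alone
    return n_instructions * stages

def cycles_pipelined(n_instructions, stages=5):
    # the first instruction fills the pipe (5 steps),
    # then the rest retire one per cycle
    return stages + (n_instructions - 1)

print(cycles_not_pipelined(8))  # 40 steps, as in the text
print(cycles_pipelined(8))      # 5 + 7 = 12 steps
```

The longer the straight-line sequence, the closer the average cost gets to one cycle per instruction.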
-------------------------------------------------------------------------------
THINGS A CPU MUST CHECK TO MAKE A PIPELINE WORK CORRECTLY

A pipelined processor control-unit must "check" :

A) at stage 1 (FETCH)
   IF in stages 2..4 there are JUMP instructions
   [ they change the program counter
     (the same register used by the fetch unit)
   ]
   THEN stage 1 must "wait" until the jump is completed
   and then "restart", loading the memory location pointed to by the
   new program counter value.

   Because the jump opcode must pass thru the decode stage ...
   ... if the DECODE unit "detects" a jump opcode, it makes the FETCH unit
   "stop" for AT LEAST two cycles until the jump is performed.
   This of course means the processor pipeline "wastes" some cpu cycles.

   A 486 handles jumps in a smarter way: when the FETCH unit loads
   a jump instruction, it goes on assuming the jump WON'T BE TAKEN
   (it reads the instructions following the jump).
   If the jump is taken, the values in the DECODE 1 and DECODE 2 units
   are "discarded" and the FETCH unit starts loading from the new
   program counter value.
   This explains why on a 486 relative jumps (the fastest jumps) take
     3 cycles if taken     (two cycles "lost")
     1 cycle  if not taken (no cycles lost)

   A "smarter" control unit can recognize unconditional jumps in advance
   and immediately start loading opcodes from the new program counter
   location,
   AND/OR try to "predict" the jump destination of conditional jumps
   (like the Pentium does).

B) at stage 3 (ADDRESS CALCULATION)
   IF in stage 3 a register needed for address calculation is under processing
   in stage 4 (EXECUTE)
   THEN the stage 3 unit generates a "do nothing" control sequence
   for one cpu cycle.

C) memory access conflicts
   These can be caused by lots of different things; they are usually
   resolved by the memory load/store units.

Put shortly: in a fully pipelined processor,
when the pipeline is "working on a sequence",
the first instruction in the sequence takes the full execution time
while the other instructions take a shorter time (usually one cycle)
because they are immediately behind. This means they "look faster"
(and in fact they are, because the pipeline fully utilizes all the functional
units it has).

If some instructions "modify" values needed in the previous cpu stages,
the control unit stops the unit needing those values (and the others behind it)
and inserts "do nothing" codes into the successive units in the processor pipe
to make things work as expected (this is called a PIPELINE STALL
or a PIPELINE FLUSH).

This explains why a taken jump usually takes more time than one that's not
taken (when a jump is performed you must flush the pipeline).

The newer processors include JUMP PREDICTION UNITS that handle
jumps faster, but the less you jump, the better.
------------------------------------------------------------------------------
PIPELINING IN INTEL PROCESSORS:

The 386 is a "partially pipelined" processor with some optimizations.
It checks for jumps (the A check) in a quite slow way
and most of the time can execute every stage in one cycle.
It cannot perform the address generation check (the B check),
so it must execute stages 3 and 4 in "not pipelined mode".
This explains why THE FASTEST 386 INSTRUCTIONS TAKE TWO CYCLES
( the stage 3 unit must wait for stage 4 to complete before managing
  the next instruction).

The 486 has an improved EXECUTE unit (stage 4), improved memory access,
AND its control unit can perform the address generation check,
so it can pipe stages 3 and 4 if there are no
register/memory conflicts between two successive instructions.
This explains why lots of 486 instructions take just one cycle
(if the pipeline is filled and no pipeline stalls are produced).
What's more, pipeline reloading is improved, so even jumps have less
impact on execution time.
It also has an 8Kbyte internal cache that boosts memory access
and allows big gains from strip-mining optimizations.
Another nice thing is the BURST READ access that accelerates memory reads
when it needs to update its internal cache.
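Strip-mining is only mentioned in passing here, so a quick illustration of the idea: instead of making several full passes over a large data set (each pass re-loading everything from memory), you split the data into strips small enough to fit in the cache and make all the passes over one strip before moving to the next. A Python sketch of the transformation (the strip size and data are made-up numbers; Python obviously won't show the cache effect, only that the rewrite computes the same thing):

```python
data = list(range(10000))
STRIP = 2048   # elements per strip; pick it so one strip fits in the data cache

def two_passes(chunk):
    # two passes over the same chunk; in the strip-mined version the
    # second pass finds the chunk still sitting in the data cache
    return sum(x * 2 for x in chunk) + sum(x + 1 for x in chunk)

# naive: two full passes over all the data
naive = two_passes(data)

# strip-mined: both passes, strip by strip
mined = sum(two_passes(data[i:i + STRIP]) for i in range(0, len(data), STRIP))

assert naive == mined   # same result, better cache behaviour on a real cpu
```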
PIPELINED, SUPERSCALAR & BRANCH PREDICTING CPU: THE PENTIUM

Well, what's beyond a fully pipelined processor like the 486 ?
There is the 586/Pentium processor, which is SUPERSCALAR (more than one
pipeline) and BRANCH PREDICTING (it has a specialized unit that records the
position and the result of some of the most recent jumps, and so
it can try to "guess" where the next instruction after the jump
will be loaded from; if it guesses correctly a pipeline stall is avoided,
else it stalls) (this helps speed up the execution of nested loops).

The 586 can execute UP TO TWO integer operations in a single step
(because it has TWO pipelines, but with some limitations),
it can BURST READ/WRITE and has TWO independent 8k caches
(Harvard style CPU-to-cache architecture); this means you can
strip-mine to the max if strips fit into the 8Kbyte data cache.

Well, now you know how they work, let's see how to make 'em work to the max.
------------------------------------------------------------------------------
SPEEDING UP 486/586 CODE.
------------------------------------------------------------------------------

-------------------------------
ADDRESS GENERATION STALLS (AGI)
-------------------------------
An Address Generation Interlock occurs when a register which is
currently being used as the base or an index was the destination
of a previous instruction (the stage 3 to 4 pipeline stall described above).

For example:

   add edx, 4
   ; AGI, ONE CLOCK STALL
   mov esi, [edx]

For the 486, AGIs can only occur on adjacent instructions.

On the 586 or Pentium, instructions up to 3 locations away can cause an AGI.
( i.e. an instruction waiting to execute stage 4 on pipeline 1
  may need to wait for the result produced by an instruction
  still executing on pipeline 2)

   pipeline stage                    pipe1   pipe2
   address_calculate/fetch_operand     C       D    |
   execute                             B       A    |
                                                    V

If C needs the results of A, it must stall pipeline 1 for one cycle
to wait for A to "exit from" pipeline 2.
If C needs the results of D, it must stall pipeline 1 for two cycles
to wait for D to "exit from" pipeline 2.

An example of a 3-instructions-away AGI:

   add esi, 4
   pop ebx
   dec ebx
   mov edx, [esi]

Takes 4 clocks on a 486. On a 586, the mov command must wait for the
add command, thus AGI stalling for one clock. The code above then
would run in three clock cycles on a 586.

Remember that even instructions that read or write registers implicitly
cause AGI stalls:

   mov esp,ebp
   pop ebp

is executed as...

   mov esp,ebp
   [agi]        ; pop uses esp as a pointer
   pop ebp
----------------------------
INSTRUCTION DECODING STALLS:
----------------------------

When decoding an instruction with an IMMEDIATE VALUE
AND
either an INDEX OFFSET or an IMMEDIATE DISPLACEMENT,
486's have a one clock penalty. (586's do not)

Example:

   mov spam, 248
   mov dword ptr [esp+4], 1

This happens because the decode stage of the pipeline can read and decode
ONE "immediate" value at a time,
while an immediate value plus an immediate offset are TWO immediate values.
The 586 instead has no such problem (when decoding) (execution is different).

---------------------------------
REGISTER ACCESS CPU STALLS:
---------------------------------

486's have a 1 clock penalty when modifying the lower part of a DWORD register
that's used in full immediately after it; 586's do not.
Example:

   mov al,0
   mov [ebp], eax

(a one clock penalty on a 486, no penalty on a 586)
-------------------------------
CODE ALIGNMENT
-------------------------------

To speed up the instructions, code should be aligned on the
beginning of cache-line boundaries.
(32 bytes on 586, 16 bytes on 486)

Thus, start your routine at the start of a cache line.
This way, when someone calls the routine, the processor
will just have to load one cache line to get the first sequence to feed
into the pipeline, and while that is executing it will have
enough time to load the other cache lines needed.

This will speed up the processing of all the instructions within that line.
The effect of this speedup on the 586 is not as dramatic as it is on the 486,
because the 586 has bigger caches and faster access to memory,
so a cache line load/store has less impact on the cpu.

------------------------------
DATA ALIGNMENT
------------------------------

A misaligned access in the data cache costs an extra 3 cycles on both the
486 and 586.

Ways to speed up data access:
For DWORD data, alignment should be on a 4-byte boundary.
For WORD data, alignment should be on a 2-byte boundary for the 486,
and simply within a 4-byte span for the 586.
For 8 byte data (64 bits), data should be aligned on an 8-byte boundary
so it is possible to use BURST READS (and burst writes too, on a Pentium).

And yes, on many applications with tight inner loops, these things do
accumulate to make a noticeable speed-up.
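The usual trick for aligning an address up to a power-of-two boundary is to add boundary-1 and mask off the low bits. A small sketch of the arithmetic (Python here, but the same expression works in asm or C):

```python
def align_up(address, boundary):
    # boundary must be a power of two (2, 4, 8, 16, 32 ...)
    return (address + boundary - 1) & ~(boundary - 1)

def is_aligned(address, boundary):
    return address & (boundary - 1) == 0

print(hex(align_up(0x1003, 4)))   # 0x1004: next DWORD boundary
print(hex(align_up(0x1003, 8)))   # 0x1008: next 8-byte boundary (burst reads)
```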
-------------------------------------------------------------------------------
SPEEDING UP REGISTER AND OPERAND USAGE
-------------------------------------------------------------------------------

Use EAX as much as possible.
Many instructions are 1 byte shorter when using EAX.
That's less code to move around and slightly less internal processing.
Use the DS register as much as possible for roughly the same reason.
In the case of several references being made to a variable addressed with a
displacement, load the displacement into a register.

Try to use the ESP register to reference the stack in the lower levels of
your subroutines (faster than push/pop-ing things, and it leaves EBP free
for other uses).
-------------------------------------------------------------------------------
HANDY INFO ON SPEEDING UP INTEGER INSTRUCTIONS
-------------------------------------------------------------------------------

1. Avoid using complex instructions like LEAVE, ENTER, LOOP, string
   instructions etc. Simpler instructions will get the job done faster.
   Simpler instructions have been getting faster with every new processor
   that has come along.

2. With 8-bit operands, do your best to use the byte opcodes, rather than
   using the 32 bit instructions with sign and zero extended modes.
   Internal sign extension is expensive, and speed increases can be found by
   simplifying the process as much as possible.

3. LEA's generally increase the chance of AGI's. However, LEA's can be
   advantageous because:
   * In many cases an LEA instruction may be used to replace constant
     multiply instructions. (a sequence of LEA, add and shift for example)
   * LEA may be used as a three/four operand addition instruction.
       LEA ECX, [EAX+EBX*4+ARRAY_NAME]
   * It can be advantageous to avoid copying a register when both operands
     to an ADD are being used after the ADD, as LEA need not overwrite its
     operands.

   The general rule is that the "generic"

       LEA A,[B+C*INDEX+DISPLACEMENT]

   where A can be a register or a memory location, B and C are registers,
   INDEX = 1,2,4,8
   and DISPLACEMENT = 0 ... 4*1024*1024*1024-1
   or (if performing signed int operations)
   -2*1024*1024*1024 ... +(2*1024*1024*1024 - 1)

   replaces the "generic" worst-case sequence

       MOV X,C            ; X is a "dummy" register
       MOV A,B
       MUL X,INDEX        ; actually SHL X,(log2(INDEX))
       ADD A,DISPLACEMENT
       ADD A,X

   So using LEA you can actually "pack" up to FIVE instructions into one.
   Even counting a "worst case" of TWO OR THREE AGIs caused by the LEA,
   this is very fast compared to "normal" code.
   What's more, cpu registers are precious, and using LEA
   you don't need a dummy "X" register to preserve the values of B and C.

4. Zero-extension with short ints: terror tales

   MOVZX takes four cycles to execute due to zero-extension
   wobblies. A better way to load a byte into a register is:

       xor eax,eax
       mov al,memory

   As the xor just clears the top parts of EAX, the xor may be placed
   OUTSIDE of a loop that uses just byte values.
   The 586 shows greater response to such actions.

   It is recommended that 16 bit data be accessed with MOVSX and
   MOVZX if you cannot place the XOR outside of the loop.

   N.B. Do the "replacement" only for movsx/movzx inside loops.

5. When comparing a value in a register with 0, use the TEST instruction.

   TEST works by ANDing the operands together without spending any
   internal time worrying about a destination register.
   Use TEST when comparing the result of a boolean AND with an
   immediate constant for equality or inequality if the register is EAX.
   You can also use it for zero testing.
   (i.e. test ebx,ebx sets the zero flag if ebx is zero)
6. Address Calculations

   Pull address calculations into load and store instructions. Memory
   reference instructions have 4 operands: a relocatable segment base, a
   base register, a scaled index register and a displacement. Often several
   integer instructions can be eliminated by fully using the operands of
   memory addresses. (more on this later)

7. INTEGER DIVIDE
   In most cases, an integer divide is preceded by a CDQ instruction,
   because the divide instructions use EDX:EAX as the dividend and CDQ sets
   up EDX.
   It is better to copy EAX into EDX, then arithmetic-right-shift EDX 31
   places to sign extend.
   The copy/shift instructions take the same number of clocks as CDQ;
   however, on 586's they allow two other instructions to execute at the
   same time. If you know the value is positive, use XOR EDX,EDX.
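What CDQ (and the mov/sar replacement) computes can be checked by treating EAX as a signed 32-bit value; EDX must become all zeros or all ones so that EDX:EAX is the 64-bit sign extension of EAX:

```python
def sign_extend_edx(eax):
    # mov edx,eax ; sar edx,31  -> edx is 0 for non-negative EAX,
    # 0xFFFFFFFF (all ones) for negative EAX, exactly like CDQ
    return (eax >> 31) & 0xFFFFFFFF   # Python's >> is an arithmetic shift

assert sign_extend_edx(1234) == 0
assert sign_extend_edx(-5) == 0xFFFFFFFF

# EDX:EAX then holds the 64-bit dividend:
eax = -5
edx = sign_extend_edx(eax)
dividend = (edx << 32) | (eax & 0xFFFFFFFF)
assert dividend == 0xFFFFFFFFFFFFFFFB   # -5 as 64-bit two's complement
```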
8. INTEGER MULTIPLY
   The integer multiply by an immediate can usually be replaced with a
   faster and simpler series of shifts, subs, adds and lea's.
   As a rule of thumb, when 6 or fewer bits are set in the binary
   representation of the constant, it is better to look at other ways of
   multiplying and not use INTEGER MULTIPLY. (the thumb value is 8 on a 586)
   A simple way to do it is to shift and add for each bit set, or use LEA.

   Here the LEA instruction comes in as a major cpu booster, for example:

       LEA ECX,[EDX*2]       ; multiply EDX by 2 and store result into ECX
       LEA ECX,[EDX+EDX*2]   ; multiply EDX by 3 and store result into ECX
       LEA ECX,[EDX*4]       ; multiply EDX by 4 and store result into ECX
       LEA ECX,[EDX+EDX*4]   ; multiply EDX by 5 and store result into ECX
       LEA ECX,[EDX*8]       ; multiply EDX by 8 and store result into ECX
       LEA ECX,[EDX+EDX*8]   ; multiply EDX by 9 and store result into ECX

   And you can combine LEAs too!!!!

       lea ecx,[edx+edx*2]   ;
       lea ecx,[ecx+ecx*8]   ; ecx <-- edx*27

   (of course, if you can, put three instructions between the two LEAs
   so that even on Pentiums no AGIs will be produced).
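The LEA pair above factors *27 as (x*3)*9, and the shift-and-add rule costs one shift+add per set bit of the constant. A quick Python check of both decompositions:

```python
def mul27_lea(x):
    # lea ecx,[edx+edx*2]   ; ecx = x*3
    # lea ecx,[ecx+ecx*8]   ; ecx = (x*3)*9 = x*27
    ecx = x + x * 2
    ecx = ecx + ecx * 8
    return ecx

def mul_shift_add(x, constant):
    # generic shift-and-add: one term per set bit
    # (27 = 11011b has 4 bits set, well under the rule-of-thumb limit)
    total, bit = 0, 0
    while constant >> bit:
        if (constant >> bit) & 1:
            total += x << bit
        bit += 1
    return total

assert mul27_lea(7) == 189
assert mul_shift_add(7, 27) == 7 * 27
```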
9. Clearing registers
   Using XOR reg,reg is fast but sets the condition codes.
   A slightly slower way to do it is MOV reg,0,
   which preserves the condition codes.

10. Avoid ENTER; instead, try something like:

       push ebp
       mov ebp, esp
       sub esp, BYTE_COUNT

11. JUMP INSTRUCTIONS
    Jump instructions come in two forms: a short form that jumps between
    -128 and +127 bytes of the current location, and a 32 bit version that
    does a full jump.
    The short form is quite a bit faster; unfortunately many compilers
    put long jumps where short jumps would suffice.
    To ensure that short jumps are used (when you know it is possible),
    explicitly specify the destination as being byte length
    (i.e. use jmp short instead of plain jmp, if you can).

12. Task switching
    Task switching is evil. It is slow, real slow. Avoid task switching too
    often, as more and more of your time will be spent in processing the
    task switches.
    For faster task switching, perform your task switching in software.
    This allows a smaller processor state to be saved and restored.
    Anything that shaves time off 75+ clock cycles isn't all that bad in
    my book.

13. Minimize segment register loads and the use of far pointers as much as
    you can. If you are doing a lot of processing on a small piece of data
    far away, it might be faster just to copy the data nearby and work
    on it there (by the way, this is a good reason to use flat protected
    mode).

...and....

14. THINK ABOUT WHAT YOU WANT TO DO
    All the other optimisations of unrolling loops, moving invariant data
    out of loops etc. still apply. That, however, is more an algorithmic
    problem for you, but still keep it firmly in mind.
_______________________________________________________________________________
586 Specific Optimizations
-------------------------------------------------------------------------------

The 586/Pentium has a five stage pipeline structure.
However, the pipeline is split into two different pipelines, known
as pipeline U and pipeline V. This allows two instructions to be
executed in parallel and thus presents the opportunity of
executing two instructions per clock.
The U pipeline is very much like the 486 pipeline, in that it can handle
the entire set of instructions. Pipeline V on the other hand can
handle only "simple" instructions.
The fast paths of the U and V pipelines are possible because much of their
function is hardwired, not microcoded.

Anyway, I've blurted on about that for a bit, but what you want to know
is "How do I get two instructions running in one clock cycle?"

Lament no further, here are the criteria:

1. Both instructions must be simple (see below)
2. There must not be a read-after-write or write-after-read
   register/flag dependency between them.
3. Neither can have both a displacement and an immediate.
4. Instructions with prefixes can only occur in the U-pipe
   (except for branches on condition: jz, jc, etc. etc.)

The following are considered "simple" instructions
(ALU means any ALU instruction: ADD etc.):

1.  mov reg, reg/mem/immed
2.  mov mem, reg/immed
3.  ALU reg, reg/mem/immed
4.  ALU mem, reg/immed
5.  inc reg/mem
6.  dec reg/mem
7.  push reg/mem
8.  pop reg
9.  nop
10. lea reg,mem
11. jmp / call / jcc near

N.B. Remember only the U pipe can perform SHIFTS/ROTATES,
so "couple" every shift/rol with a simple instruction
before and after to be sure both pipelines stay filled.

Note, however, that while both pipelines can perform ALU
instructions concurrently, this is not the case when
one ALU instruction requires the output of the
other pipeline's ALU instruction.
Example: an ALU instruction in pipe U
while performing an ADC (which needs the carry flag) in pipe V.

There are two exceptions to this rule:
1) In the case of a compare/branch combination.
2) With push/pop combinations (because of optimized stack access)
--------------------------
Branch Prediction
-------------------------

Another nice optimisation in the 586 hardware is that whenever there is
a possibility of a jump, for example a Jg, the 586 can automatically
start "reading ahead" code from the jump location, just in case the
jump is taken. Remember the CODE alignment thing above? Well,
it couldn't hurt your grandmother to align the jump labels in your code too.

------------------------------------------
Amazing new 586 optimizations and speedups
------------------------------------------

The 586 has a lot of really nice optimizations and kick-ass
features in addition to what I've mentioned above. Unfortunately,
this information is proprietary and will only be given
to people who sign non-disclosure agreements and shell out a
$$ or two... Bummer...
(Methinks 586 clone-makers won't be too happy about that... :-) )
Compiler/OS makers will probably be the purchasers.
Quite a few of these optimisations won't work on anything less than
a 586, so don't sweat too much about it. Hey, just write
for the 386/486/586 and you'll be ok.

However, I found a nice article in the July 1994 Byte about some
"secret weapons of Pentium coders"....
============================================================================
|
||
PENTIUM'S PROFILING REGISTERS:
|
||
|
||
Pentiums have a set of 64bit "machine specific registers" (MSR)
|
||
that are accessed by way of the RDMSR (read MSR) and WRMSR (write MSR)
|
||
instructions.
|
||
|
||
THESE ARE PROTECTED MODE, RING 0 INSTRUCTIONS (CPU LEVEL 0)
|
||
( can work in a 386Powered programs only if executing under VCPI
|
||
XMS or "no V86 manager"
|
||
or if you find a way to "toggle" DPMI from ring 3 to ring 0)
|
||
(I think they can work even if you boot ms-dos in real mode
|
||
and use the proper instruction prefixes needed to execute
|
||
32bit instructions in real mode).
|
||
|
||
-----------------------------------------------------------
RDMSR   Read MSR
        Copies the MSR register indexed by ECX
        into the 64bit pair EDX:EAX

        Macro to insert it into code if your assembler
        does not support Pentiums:

        RDMSR macro
          db 0Fh,032h
        endm
-----------------------------------------------------------
WRMSR   Write MSR
        Copies into the MSR register indexed by ECX
        the value contained in the 64bit pair EDX:EAX

        Macro to insert it into code if your assembler
        does not support Pentiums:

        WRMSR macro
          db 0Fh,030h
        endm
-----------------------------------------------------------

The Intel Pentium user manuals document only MSR 0, 1 and 0Eh,
and state that MSR 3, 0Fh and everything above 13h are reserved and illegal.

So, what's left? And what's useful to you?

MSR 10h is the counter of CPU CYCLES since last power-up.
That value can also be read with the RDTSC (read time stamp counter)
instruction.
RDTSC is 0Fh,031h in bytes; it needs ring 0 too, unless the OS
leaves the TSD flag in CR4 clear.

MSR 10h is the most precise timer available on the Pentium
because it "ticks" at the CPU frequency.

Then come MSR 11h, 12h and 13h: there you will find the hot stuff.

The lower 32 bits of MSR 11h are actually two INDEXES
INTO THE CPU PROFILING REGISTERS!!!!

The profiling registers are CPU EVENT TIMERS AND COUNTERS:
they can keep track of what's going on inside and outside
your super-duper processor, they can detect nearly everything the CPU does.

The first 16 bits of MSR 11h select
the profiling register accessible thru MSR 12h.
The second 16 bits of MSR 11h select
the profiling register accessible thru MSR 13h.

Here comes the format of the "profiling register indexes":
bit 0..5 : type of the profiling register to access
bit 6    : set if you want to monitor the events in cpu ring 0,1,2 (system)
bit 7    : set if you want to monitor the events in cpu ring 3 (user level)
bit 8    : 0 = access count-of-hardware-events
           1 = access count-of-total-cpu-cycles used to process the
               cumulated events
           (i'm not sure of this, maybe 0 means count time and 1 count events)
bit 9..15: UNKNOWN, DO NOT MODIFY

Profiling register types:

INDEX  NAME
  0    data write
  1    data read
  2    data TLB miss (translation look-aside buffer)
  3    data read miss
  4    data write miss
  5    write (hit) to M or E state lines
  6    data cache lines written back
  7    data cache snoops
  8    data cache snoop hits
  9    memory accesses in both pipes
 0Ah   bank conflict
 0Bh   misaligned data memory reference
 0Ch   code read
 0Dh   code TLB miss
 0Eh   code cache miss
 0Fh   segment load
 10h   ????
 11h   ????
 12h   branches
 13h   BTB hits (Branch Target Buffer)
 14h   taken branch OR BTB hit
 15h   pipeline flushes
 16h   instructions executed
 17h   instructions executed in V-pipe
 18h   bus utilization (clocks)
 19h   pipeline stalled by write backup
 1Ah   pipeline stalled by data memory write
 1Bh   pipeline stalled by write to E or M line
 1Ch   locked bus cycles
 1Dh   i/o read or write cycles
 1Eh   non cacheable memory references
 1Fh   AGI (Address Generation Interlock)
 20h   ????
 21h   ????
 22h   FPU operations
 23h   breakpoint 0 match
 24h   breakpoint 1 match
 25h   breakpoint 2 match
 26h   breakpoint 3 match
 27h   hardware interrupts
 28h   data read or data write
 29h   data read miss or data write miss

So if you want to profile things:
0) Set up the test environment properly
   (cache empty, cache filled with all the routine code, etc. etc.)
1) Set MSR 11h to the two counters you want to read
2) Read MSR 12h,13h to get the initial values
3) Execute the code you want to profile
4) Read the final values of MSR 12h,13h
   (without setting MSR 11h again this time)
5) Calculate the difference between the initial and final values

This way, if you are not sure how to optimize things,
you can actually test how they work and see what's going on.

USING THE PROFILING REGISTERS YOU CAN "TEST ON THE ROAD"
YOUR CODE DOWN TO THE LAST CPU CYCLE!!!!!

-------------------------------------------------------------------------------
386 Optimization
-------------------------------------------------------------------------------

Well, nobody buys a 386 now, right?
But lots of people still have one....
So if you wanna be 386 friendly, remember.....

--------------------------
DECODE-AFTER-JUMP OVERHEAD
--------------------------

When a jump is performed, a 386 spends some extra time decoding
the next instruction to execute and flushing the pipeline.

THE FIRST INSTRUCTION after the jump costs exactly ONE EXTRA CPU CYCLE
for every COMPONENT (prefixes, opcode, mod/rm, sib, offset, immediate value)
of the instruction. Notice I said COMPONENT!!!
An 8bit displacement or a 32bit one always counts as 1 extra cycle.

So, in 32bit mode (where no prefixes are needed for 32bit instructions):

loopme: inc esi             ; opcode
        mov [edi+1234h],eax ; opcode, mod/rm, immediate offset
        dec ecx             ; opcode
        jne short loopme    ; opcode, short displacement

IS FASTER THAN:

loopme: mov [edi+1234h],eax
        inc esi
        dec ecx
        jne short loopme

because placing the three-component mov instruction first
adds 2 cycles to the loop execution time. Weird, uh?
But if you remember this thing, you can boost 386 code a lot.

By the way, remember that "pipeline overhead" is not so obvious to calculate.
Look at this:
        add eax,ebx     ; opcode, mod/rm
        add eax,1234h   ; opcode, immediate value
        stosd           ; opcode <--- these "slow" instructions pipe-in faster
        pop eax         ; opcode <--- so if you can, put 'em first

------------------
SHORT INSTRUCTIONS
------------------

Short instructions are loaded and decoded faster than longer ones,
and since the 386 has no internal cache and less memory access bandwidth
than a plain 486, this helps a lot.

Well, that's all for a 386.

-------------------------------------------------------------------------------
CACHE OPTIMIZATION TECHNIQUES (THEY CAN WORK ON ANY CPU!!!!!!!!!!!!)
-------------------------------------------------------------------------------

Usually "code optimization" is cpu-based, but there are more things
up in the sky and down on earth than just a cpu... the cache, for example!

Well, the difference between a cache hit and a cache miss
means lots of cpu cycles lost, so better hit the cached locations to the max.

386s usually have an external 64k...128k cache (mine has none, sigh! :( )
486s have an 8k internal cache and a 64k...512k (usually 256k) second level cache
Pentiums have a Harvard 8k code + 8k data cache
plus a 256k..1Mbyte second level cache.

Use "short" loops so there is more probability that the code to execute
will reside fully in the cache.

Then remember you can count on an external cache of at least 64k
(usually 128k..256k for a 486).

So, if you have to process big arrays or big bitmaps with multiple passes,
do not "scan" the whole array on every pass; use a "strip by strip"
strategy instead, so every "strip" fully fits into the cache
and is ready for the second and third pass without cache misses.

This technique is called STRIP-MINING. You can include in your program
a routine that finds the optimal "strip size" by trying a multi-pass test
on 64k, 128k, 256k, 512k strips and the "complete array",
and then sets the optimal "strip size" to use
when performing "strip-mining" routines.

On a Pentium you can use 8k "strips" so your "strip mined" routines
will work at the full speed of the internal data cache
(remember the internal cache works at full cpu speed while
the secondary cache may be running at half that).

The advantage of strip mining comes from the fact that
the additional jumping/looping needed to process arrays in "strips"
of adjacent memory locations that can fit together into the cache
costs A LOT LESS than the time caused by a single cache miss.

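As a minimal sketch (in C, since the technique is cpu-independent), here is a two-pass transform done strip by strip; STRIP is the tuning knob discussed above, and the two pass bodies are just placeholders:

```c
#include <stddef.h>

enum { STRIP = 8 * 1024 };   /* e.g. the Pentium internal data cache */

/* Run both passes over one strip while it is still cached,
   instead of sweeping the whole array twice. */
void two_pass_strip_mined(unsigned char *a, size_t n)
{
    for (size_t base = 0; base < n; base += STRIP) {
        size_t end = (base + STRIP < n) ? base + STRIP : n;
        for (size_t i = base; i < end; i++)   /* pass 1 loads the strip */
            a[i] = (unsigned char)(a[i] + 1);
        for (size_t i = base; i < end; i++)   /* pass 2 hits the cache  */
            a[i] = (unsigned char)(a[i] * 2);
    }
}
```

The result is identical to running both passes over the whole array; only the access order (and therefore the cache behaviour) changes.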
-------------------------------------------------------------------------------
NOT SO OBVIOUS OPTIMIZATIONS
-------------------------------------------------------------------------------

----------------------
COMPLEX INSTRUCTIONS
----------------------
Intel says complex instructions are evil on the 486 and up, but there are
exceptions... especially if you want to make things run smooth on a 386 too.

STRING instructions are FASTER on a 386, especially if they are the first
in a loop.

If you have to move data in 32bit chunks you'd better use REP MOVSD if you can,
because it alone replaces this "simple instructions only"
super-optimized sequence:

rep_loop:
        mov eax,[esi]
        add esi,4       ; or -4
        mov [edi],eax
        add edi,4       ; or -4
        dec ecx
        jne rep_loop

REP MOVSD takes 2+7*n cycles on a 386 or 486,

while the "optimized" sequence uses EAX
and takes [(2+4+2+2+2+2+7)*n - 4] = [21*n - 4] cycles on a 386
and [(1+1+1+1+1+3)*n - 2] = [8*n - 2] cycles on a 486.

Cache-line aligning the "optimized" code on a Pentium,
I think it takes [3*n] cycles.
So if 486s are your base system, don't throw away REP MOVSD.

Also remember that REP MOVSD takes 2 bytes instead of 13,
does not need to use EAX (one more register free for other things),
and does not use the Pentium's BTB (Branch Target Buffer),
so you can be sure it never mispredicts, and the outer loops
with entries in the BTB can go one level deeper.

What's more, the AMD K5 automatically splits a complex instruction
into an optimized sequence of simple instructions
and issues them to fully utilize the four execution units it has.
Guess what that means for poor old movsd :)

<Intel bashing ON>
Heck! I think those Intel engineers missed a big thing when they
decided not to improve MOVS/LODS/STOS. A "blitter" unit directly
controlled by string instructions (with "fetch ahead", "bufferize"
and "auto align" capabilities) could be a great thing for us.

Think about this: a REP MOVSB/W/D "translated" to
a "head" byte move (to align things),
a "central" qword move (burst reads/burst writes)
and a "tail" byte move (to align things).
And all this without trashing the processor data cache!!!!
Can you imagine the boost your code may get?
Heck! Sun engineers ADDED "blitter" support in their new 64bit Sparc cpus
because they saw their software moving data a lot; why didn't Intel's?

On a Pentium you can get the optimal 2-cycle memory-to-memory MOV
by way of pipelining, but this cannot take advantage of the fact
that when performing REP MOVS the processor KNOWS AHEAD how many locations
will be moved (read: on a "smarter" cpu this could help optimize the
processor-to-memory bandwidth to the max and avoid caching overhead),
nor can it take advantage of the full power of burst reads/writes,
nor can it exploit the fact that WHILE THE STRING MOVE
IS GOING ON, the processor pipelines could execute OTHER instructions
after the MOVS that don't touch the memory range "under transfer"
or ECX, ESI, EDI and the Direction Flag.
I hope they will take note of this for the P7.

<Intel bashing OFF>

-------------------------------------------------------------
THE ADDRESSING MODE ADVANTAGE
-------------------------------------------------------------
Don't be fooled by the current Riscy trend, be cooler than the rest.
Some Intel "complex addressing" modes are really powerful
if you just have enough creativity.
Let's suppose you need to add together data from "four"
streams of data....

A riscy (risc-like) way to do this could be...

                                ; 486 timings when looping
riscy_way:
        mov eax,[esi]           ; 1
        add esi,4               ; 1
        add eax,[edx]           ; 2
        add edx,4               ; 1
        add eax,[ebx]           ; 2
        add ebx,4               ; 1
        add eax,[ecx]           ; 2
        add ecx,4               ; 1
        mov [edi],eax           ; 1
        add edi,4               ; 1
        dec ebp                 ; 1
        jne riscy_way           ; 3
                                ; loop cycles = 17

Now let's see the "intelly" way! :)
Let's suppose the "counter" EBP won't exceed 2^31 - 1;
we can do the following...

        ; move pointers past the end...
        lea esi,[esi+ebp*4]
        lea edx,[edx+ebp*4]
        lea ebx,[ebx+ebp*4]
        lea ecx,[ecx+ebp*4]
        lea edi,[edi+ebp*4]
        neg ebp                 ; negate increment
        ; then you can fully use the address unit ALU

                                ; 486 timings when looping
intelly_way:
        mov eax,[esi+ebp*4]     ; 1
        add eax,[edx+ebp*4]     ; 2
        add eax,[ebx+ebp*4]     ; 2
        add eax,[ecx+ebp*4]     ; 2
        mov [edi+ebp*4],eax     ; 1
        inc ebp                 ; 1
        jnz intelly_way         ; 3
                                ; loop cycles = 12

On a Pentium, "riscy" and "intelly" run at nearly the same speed,
BUT ON A 486, the "riscy" way runs 30% slower than "intelly"!!!!

This means that using the "riscy" code
your 486 will look like a slow cow compared to a Pentium,
while using the "intelly" code your 486 will look good enough
(not to mention this helps make the difference between
"needs a Pentium" and "this flies on a 486 too!!").

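The same end-pointer / negative-index idiom carries over to C; a sketch (the naming is mine, not from the original routines):

```c
#include <stddef.h>

/* Sum four input streams into out, walking a negative index up to
   zero from pointers moved past the end - the C analogue of the
   [reg+ebp*4] trick with a negated EBP. */
void add4_streams(int *out, const int *a, const int *b,
                  const int *c, const int *d, ptrdiff_t n)
{
    out += n; a += n; b += n; c += n; d += n;  /* lea reg,[reg+ebp*4] */
    for (ptrdiff_t i = -n; i != 0; i++)        /* inc ebp / jnz       */
        out[i] = a[i] + b[i] + c[i] + d[i];
}
```

One counter drives all five streams, so no per-stream pointer increments are needed inside the loop.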
-------------------------------------------------------------
32bit ACCESS SPEAKS FOR ITSELF, just remember to fully use it
-------------------------------------------------------------

Everywhere you can, use 32bit instructions,
and if you can, align data on 32bit boundaries.

For example, let's say you have to manipulate lots of strings:
if you align them on dword boundaries (including the string length,
padded with null bytes) you can boost processing time MORE THAN 8 TIMES!!!!

The same thing applies to manipulating 8bit bitmaps and "software" sound mixing.

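A sketch of why the padding pays off, in C: once both strings are dword aligned and null-padded to a multiple of 4 bytes, one 32bit compare does the work of four byte compares (an illustration, not one of the original routines):

```c
#include <stdint.h>

/* Compare two strings stored dword aligned and padded with 0 bytes
   up to a multiple of 4 - each 32bit compare replaces four 8bit ones. */
int equal_padded(const uint32_t *a, const uint32_t *b, unsigned ndwords)
{
    for (unsigned i = 0; i < ndwords; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The padding guarantees the trailing bytes compare equal, so no byte-level tail loop is needed.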
In 386mixer I coded a routine that mixes 4 sound channels simultaneously,
running only on internal registers and mixing up to 4 samples at a time
from the 4 channels (look at the "intelly" example above).

If you need speed, you can even tolerate small calculation errors
or reduced precision,
and manipulate successive 8/16bit values in a single 32bit instruction.

Let's say you have an array of unsigned words and you need to multiply them
by a constant, say 45. How can you do that?
If you know the values won't overflow the 16 bits they use
(if they overflow you will have to choose another method), you can do the
following:

        mov ecx,count
        mov esi,start_address
handle_two_words:
        mov eax,[esi]       ; load two words at a time
                            ; an AGI happens, but i can't eliminate it
        lea eax,[eax+eax*4] ; multiply by 5
        add esi,4           ; increment pointer, avoid 486 AGI
                            ; reduce by 1 the Pentium AGIs
        lea eax,[eax+eax*8] ; then multiply by 9 (5*9 = 45)
        dec ecx             ; decrement counter & keep Pentium at full power
        mov [esi-4],eax     ; store back (esi was already advanced;
                            ; mov does not touch the flags set by dec)
        jne handle_two_words

And if you have very big arrays, you can even unroll the loop two or
three times to further speed up this code.

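The packed trick is easy to sanity-check in C; this sketch mirrors the two lea steps and shows that no carry crosses the 16bit halves as long as each individual product still fits in 16 bits:

```c
#include <stdint.h>

/* Multiply two unsigned 16bit words packed in one 32bit value by 45,
   mirroring lea eax,[eax+eax*4] then lea eax,[eax+eax*8] (5*9 = 45).
   Valid while each individual product still fits in 16 bits. */
static uint32_t mul45_packed(uint32_t two_words)
{
    uint32_t x = two_words + two_words * 4;  /* *5 */
    return x + x * 8;                        /* *9 */
}
```

For example, 100 and 200 packed together give 4500 and 9000 back in their own halves, since both products stay below 65536.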
------------------
SELF COMPILED CODE
------------------

Sometimes you need to execute the same "jump filled" instruction
sequence lots of times, and you know ahead of time which jumps will
be taken and which won't.
What's worse, if there are lots of conditional jumps (Jcc) "in sequence",
they may be enough to overflow the capacity of the branch-prediction units!!!

So, what can you do?
My answer to this is a SELF-COMPILER!!!
A small finite state machine that, instead of crunching data directly,
GENERATES SUPER-SPECIALIZED ROUTINES that will crunch the data in the
fastest way the code generator knows.

I've done a thing like that for the _PageFLip1 routine in 386video.asm.
At mode initialization a code generator generates
all the 256 blit-skip sequences the _PageFLip1 routines may need
when copying "modified" pixels to the screen.

This way a call is performed only once every 32 blitted pixels, instead
of jumping every 2..4 pixels (or every pixel in the worst case).

By the way, this is NOT self-modifying code,
this is an "integrated code compiler".

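A data-driven cousin of the idea can be sketched in C (the real thing emits machine code; here the 256 mask cases are "compiled" at init time into offset lists, so the blit loop never tests bits pixel by pixel - the names are mine):

```c
#include <stdint.h>

static uint8_t offs[256][8];  /* for each mask: offsets of set bits */
static uint8_t cnt[256];      /* how many offsets per mask          */

/* "Compile" all 256 modified-pixel masks once, at init time. */
void compile_masks(void)
{
    for (int m = 0; m < 256; m++) {
        cnt[m] = 0;
        for (int bit = 0; bit < 8; bit++)
            if (m & (1 << bit))
                offs[m][cnt[m]++] = (uint8_t)bit;
    }
}

/* Copy only the pixels flagged in mask; no per-pixel branching. */
void blit8(uint8_t *dst, const uint8_t *src, uint8_t mask)
{
    for (int i = 0; i < cnt[mask]; i++)
        dst[offs[mask][i]] = src[offs[mask][i]];
}
```

All the conditional-jump work is paid once in compile_masks(); the inner loop then runs the same straight-line pattern every frame, which is exactly what the generated blit-skip sequences achieve in assembly.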
------------------------------------------------------------------------------
I hope this information has been helpful for you.
Now make some coffee, brush your teeth, phone up your girlfriend and
go and see a movie. This document will be here when you get back, and I
imagine there is only so much of this you can take in one day.... :-) :-) :-)

Live long and code well...

Regards,

Michael Kunstelj.

Regards from me too,

Lorenzo Micheletto