From: olson@umbc.edu (Bryan G. Olson)
Newsgroups: sci.crypt
Subject: A Quick IDEA, was: Speed of DES/IDEA implementations
Date: 7 Dec 1993 21:49:41 -0500

A while ago I posted a message claiming a speed of 238,000
bytes/sec for an implementation of IDEA on a 33 MHz 486.  Below is
an explanation and some code to show how it works.  The basic
trick should be useful on many (but not all) processors.  I
expect only those familiar with IDEA and its reference
implementation will be able to follow the discussion.  See:

    Lai, Xuejia and Massey, James L.  A Proposal for a New Block
    Encryption Standard, Eurocrypt 90

For those who have been asking for the code, sorry I kept
putting it off.  I wanted to get it out of Turbo Pascal
ideal-mode, but I never had the time.

Colin Plumb wrote IDEA-386 code which is included in PGP
2.3a and uses the same tricks.  I don't know whose is
faster, but I expect they will be very close.  Now
here's how it's done.

A major bottleneck in software IDEA is the mul() routine, which
is used 34 times per 64 bit block.  The routine performs
multiplication in the multiplicative group mod 2^16+1.  The two
factors are each in a 16 bit word, and the output is also in a 16
bit word.  Note that 0 is not a member of the multiplicative
group and 2^16 does not fit in 16 bits.  We therefore use the 0
word to represent 2^16.  Now group elements map one to one onto
all possible 16 bit words, since 2^16+1 is prime.

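To make the representation concrete, here is a rough sketch (not
the code from [Lai]) that multiplies straight from the definition:
decode the 0 word to 2^16, multiply, reduce mod 2^16+1, and
truncate back to 16 bits.  The names idea_decode, idea_encode,
mul_by_definition and check_mul_by_definition are invented for
this example, and it assumes a C99 compiler with <stdint.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 16 bit word into a group element: the 0 word stands for 2^16. */
uint32_t idea_decode(uint16_t w) {
    return w == 0 ? 0x10000u : (uint32_t)w;
}

/* Encode a group element in 1..2^16 as a word; 2^16 truncates to 0. */
uint16_t idea_encode(uint32_t x) {
    return (uint16_t)(x & 0xFFFFu);
}

/* Multiply by the definition, with no shortcuts (slow but obvious). */
uint16_t mul_by_definition(uint16_t a, uint16_t b) {
    uint64_t product = (uint64_t)idea_decode(a) * idea_decode(b);
    return idea_encode((uint32_t)(product % 0x10001u));
}

/* A few hand-checked cases; 2^16 is congruent to -1 mod 2^16+1. */
void check_mul_by_definition(void) {
    assert(mul_by_definition(2, 3) == 6);
    assert(mul_by_definition(0, 1) == 0);   /* 2^16 * 1 = 2^16, word 0 */
    assert(mul_by_definition(0, 0) == 1);   /* (-1) * (-1) = 1         */
}
```

This version is only meant as a yardstick for checking faster
routines; the 64 bit modulo makes it far too slow for real use.
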
Here is (essentially) the reference implementation from [Lai].


unsigned mul( unsigned a, unsigned b ) {
    long int p ;
    long unsigned q ;
    if( a==0 ) p= 0x00010001 - b ;
    else if( b==0 ) p= 0x00010001 - a ;
    else {
        q= (long unsigned)a * b ;    /* widen before multiplying */
        p= (long)(q & 0xffff) - (long)(q>>16) ;
        if( p<0 ) p= p + 0x00010001 ;
    }
    return (unsigned)(p & 0xffff) ;
}


Note the method of reducing a 32 bit word modulo 2^16+1.  We
subtract the high word from the low word, and add the modulus
back if the result is less than 0.  [Lai] contains a proof that
this works, and you can convince yourself fairly easily.

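One way to convince yourself is to compare the low-minus-high
reduction against the % operator.  The sketch below does that for
a sample of products; reduce_low_minus_high and check_reduction
are names made up for this example, and the sampling strides are
arbitrary.  It works because 2^16 is congruent to -1 mod 2^16+1,
so hi*2^16 + lo is congruent to lo - hi.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a 32 bit value q modulo 2^16+1: low word minus high word,
   then add the modulus back if the difference went negative. */
uint32_t reduce_low_minus_high(uint32_t q) {
    int32_t p = (int32_t)(q & 0xFFFFu) - (int32_t)(q >> 16);
    if (p < 0) p += 0x10001;
    return (uint32_t)p;
}

/* Compare against the % operator on a sample of products a*b. */
void check_reduction(void) {
    for (uint32_t a = 1; a <= 0xFFFFu; a += 251) {
        for (uint32_t b = 1; b <= 0xFFFFu; b += 257) {
            uint32_t q = a * b;   /* fits: both factors are below 2^16 */
            assert(reduce_low_minus_high(q) == q % 0x10001u);
        }
    }
}
```

Note that the result can be 2^16 itself (for example q = 10000
hex), which is why mul() masks with 0xffff at the end.
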
To speed up this routine, we note that the tests for a=0 and b=0
will rarely be true.  With the possible exception of the first 2
of the 34 multiplications, 0 should be no more likely than any of
the other 65535 numbers.  Note that if (and only if) either a or
b is 0 then q will also be 0, and we can check for this in one
instruction if our processor sets a zero flag for multiplication
(as the 68000 does but 80x86 does not).

Fortunately p will also be zero after the subtraction if and only
if either a or b is 0.  Proof: p will be zero when the high order
word of q equals the low order word, and that happens when q is
divisible by 00010001 hex.  Since 00010001h = 2^16+1 is prime,
this happens only if either a or b is a multiple of 2^16+1, and 0
is the only such multiple which will fit in a 16 bit word.

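The claim can also be spot-checked by brute force.  In this
sketch, low_minus_high16 mimics a 16 bit subtract of the high word
from the low word, and count_zero_test_mismatches counts operand
pairs where "difference is zero" disagrees with "an operand is
zero".  Both names and the step parameter are inventions for this
example; a step of 1 covers all 2^32 pairs but takes a while.

```c
#include <assert.h>
#include <stdint.h>

/* Low word of a*b minus high word, wrapped to 16 bits. */
uint16_t low_minus_high16(uint16_t a, uint16_t b) {
    uint32_t q = (uint32_t)a * b;
    return (uint16_t)((q & 0xFFFFu) - (q >> 16));
}

/* Count pairs (a, b), sampled every `step` values starting at 0,
   where the zero test on the difference gives the wrong answer.
   step must be at least 1. */
unsigned long count_zero_test_mismatches(uint32_t step) {
    unsigned long mismatches = 0;
    for (uint32_t a = 0; a <= 0xFFFFu; a += step) {
        for (uint32_t b = 0; b <= 0xFFFFu; b += step) {
            int diff_is_zero =
                low_minus_high16((uint16_t)a, (uint16_t)b) == 0;
            int operand_is_zero = (a == 0 || b == 0);
            if (diff_is_zero != operand_is_zero) mismatches++;
        }
    }
    return mismatches;
}

/* Sampled check; the stride 97 is arbitrary. */
void check_zero_test(void) {
    assert(count_zero_test_mismatches(97) == 0);
}
```
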
The speed-up strategy is to proceed under the assumption that a
and b are not 0, check to be sure in one instruction, and
recompute if the assumption was wrong.  Here's some 8086
assembler code:

        mov  ax, [a]
        mul  [b]        ; ax is implied.  q is now in DX AX
        sub  ax, dx     ; mod 2^16+1
        jnz  not0       ; Jump if neither op was 0.  Usually taken.

        mov  ax, 1      ; recompute result knowing one op is 0.
        sub  ax, [a]
        sub  ax, [b]
        jmp  out        ; Just jump over adding the carry.
not0:
        adc  ax, 0      ; If p<0 add 1, otherwise do nothing.
out:                    ; Result is now in ax


Note that when p<0 we add 1 instead of 2^16+1 since the 2^16 part
overflows out of the result.  The "adc ax, 0" does all the work
of checking for a negative result and adding the modulus if
needed.

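For readers without an 8086 handy, here is a hedged C rendering
of the same sequence, with the carry flag modelled as an explicit
borrow.  It is a sketch of the idea rather than a translation any
compiler is guaranteed to turn into the assembler above.  The
names mul_optimistic, mul_reference and check_mul_optimistic are
made up for this example; mul_reference is the slow
definition-based multiply used only for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Slow multiply straight from the definition (0 word means 2^16). */
uint16_t mul_reference(uint16_t a, uint16_t b) {
    uint64_t x = a ? a : 0x10000u;
    uint64_t y = b ? b : 0x10000u;
    return (uint16_t)((x * y) % 0x10001u);
}

/* Multiply first, test for a zero operand afterwards. */
uint16_t mul_optimistic(uint16_t a, uint16_t b) {
    uint32_t q = (uint32_t)a * b;          /* mul: q in DX AX         */
    uint16_t lo = (uint16_t)q;
    uint16_t hi = (uint16_t)(q >> 16);
    uint16_t diff = (uint16_t)(lo - hi);   /* sub ax, dx (wraps)      */
    if (diff == 0) {                       /* rare: a or b was 0      */
        return (uint16_t)(1u - a - b);     /* 1 - a - b, mod 2^16     */
    }
    unsigned borrow = lo < hi;             /* carry flag from the sub */
    return (uint16_t)(diff + borrow);      /* adc ax, 0               */
}

/* Compare on every pair with a zero operand plus a sample of the rest. */
void check_mul_optimistic(void) {
    for (uint32_t v = 0; v <= 0xFFFFu; v++) {
        assert(mul_optimistic(0, (uint16_t)v) == mul_reference(0, (uint16_t)v));
        assert(mul_optimistic((uint16_t)v, 0) == mul_reference((uint16_t)v, 0));
    }
    for (uint32_t a = 1; a <= 0xFFFFu; a += 251) {
        for (uint32_t b = 1; b <= 0xFFFFu; b += 257) {
            assert(mul_optimistic((uint16_t)a, (uint16_t)b) ==
                   mul_reference((uint16_t)a, (uint16_t)b));
        }
    }
}
```

The zero branch works because the 0 word means 2^16, which is
congruent to -1, so the product is the negative of the other
operand; 1 - a - b wrapped to 16 bits gives that in both cases.
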
The multiplication takes 9 instructions, 4 of which are rarely
executed.  I believe similar tricks are possible on many
processors.  The one drawback to the check-after-multiply tactic
is that we can't let the multiply overwrite the only copy of an
operand.

Note that most software implementations of IDEA will run at
slightly different speeds when 0's come up in the multiply
routine.  The reference implementation is faster on 0; this one
is faster on non-zero.  This may be a problem for some real-time
stuff, and also suggests an attack based on timing.

Finally, below is an implementation of the complete encryption
function in 8086 assembler, to replace the cipher_idea() function
in PGP.  It takes the same parameters as the function from PGP,
and uses the C language calling conventions.  I tested it using
the debug features of the idea.c file in PGP.  You will need to
add segment/assume directives.  This version uses no global data
and should be reentrant.

The handling of zero multipliers is outside the inner loop so
that a short conditional jump can loop back to the beginning.
Forward conditional jumps are usually not taken and backward
jumps are usually taken, which is consistent with 586 branch
prediction (or so I've heard).  Stalls where the output of one
instruction is needed for the next seem unavoidable.

Last I heard, IDEA was patent pending.  My code is up for grabs,
although I would get a kick out of being credited if you use it.
On the other hand Colin's code is already tested and ready
to assemble and link with PGP.

--Bryan

____________________CODE STARTS BELOW THIS LINE_________

; Called as: asmcrypt( inbuff, outbuff, zkey ) just like PGP

PROC    _asmcrypt

; establish parameter and local space on stack
; follow c language calling conventions

        ARG  inblock:Word, outblock:Word, zkey:Word
        LOCAL sx1:Word,sx4:Word,skk:Word,done8:Word =stacksize

        push bp
        mov  bp, sp
        sub  sp, stacksize

;       push ax         ; My compiler assumes these are not saved.
;       push bx
;       push cx
;       push dx

        push si
        push di

; Put the 16 bit sub-blocks in registers and/or local variables
        mov  si, [inblock]
        mov  ax, [si]
        mov  [sx1], ax          ; x1 is in ax and sx1
        mov  di, [si+2]         ; x2 is in di
        mov  bx, [si+4]         ; x3 is in bx
        mov  dx, [si+6]
        mov  [sx4], dx          ; x4 is in sx4

        mov  si, [zkey]         ; si points to next subkey
        mov  [done8], si
        add  [done8], 96        ; we will be finished with 8 rounds
                                ; when si=done8

@@loop:                         ; 8 rounds of this
        add  di, [si+2]         ; x2+=zkey[2] is in di
        add  bx, [si+4]         ; x3+=zkey[4] is in bx

        mul  [Word si]          ; x1 *= zkey[0]
        sub  ax, dx
        jz   @@x1               ; if 0, use special case multiply
        adc  ax, 0
@@x1out:
        mov  [sx1], ax          ; x1 is in ax and sx1

        xor  ax, bx             ; ax= x1^x3
        mul  [Word si+8]        ; compute kk
        sub  ax, dx             ; if 0, use special case multiply
        jz   @@kk
        adc  ax, 0
@@kkout:
        mov  cx, ax             ; kk is in cx

        mov  ax, [sx4]          ; x4 *= zkey[6]
        mul  [Word si+6]
        sub  ax, dx
        jz   @@x4               ; if 0, use special case multiply
        adc  ax, 0
@@x4out:
        mov  [sx4], ax          ; x4 is in sx4 and ax

        xor  ax, di             ; x4^x2
        add  ax, cx             ; kk+(x2^x4)
        mul  [Word si+10]       ; compute t1
        sub  ax, dx
        jz   @@t1               ; if 0, use special case multiply
        adc  ax, 0
@@t1out:                        ; t1 is in ax

        add  cx, ax             ; t2 is in cx  kk+t1

        xor  [sx4], cx          ; x4 in sx4
        xor  di, cx             ; new x3 in di
        xor  bx, ax             ; new x2 in bx
        xchg bx, di             ; x2 in di, x3 in bx
        xor  ax, [sx1]          ; x1 in ax
        mov  [sx1], ax          ; and [sx1]

        add  si, 12             ; point to next subkey
        cmp  si, [done8]
        jne  @@loop
        jmp  @@out8

;------------------------------------------
; Special case multiplications, when one factor is 0

@@x1:   mov  ax, 1
        sub  ax, [sx1]
        sub  ax, [Word si]
        jmp  @@x1out

@@kk:   mov  ax, [sx1]          ; rebuild overwritten operand
        xor  ax, bx
        neg  ax
        inc  ax
        sub  ax, [si+8]
        jmp  @@kkout

@@x4:   mov  ax, 1
        sub  ax, [sx4]
        sub  ax, [Word si+6]
        jmp  @@x4out

@@t1:   mov  ax, [sx4]          ; rebuild
        xor  ax, di
        add  ax, cx
        neg  ax
        inc  ax
        sub  ax, [si+10]
        jmp  @@t1out

;---------------------------------------------------
; 8 rounds are done, now that extra pseudo-round

@@out8:
        push di
        mov  di, [outblock]

        mul  [Word si]
        sub  ax, dx
        jnz  @@o1n              ; jump over special case code
        mov  ax, 1
        sub  ax, [sx1]
        sub  ax, [si]
        jmp  @@o1out
@@o1n:  adc  ax, 0
@@o1out: mov [di], ax           ; final ciphertext block 1

        mov  ax, [sx4]
        mul  [Word si+6]
        sub  ax, dx
        jnz  @@o4n              ; jump over special case code
        mov  ax, 1
        sub  ax, [sx4]
        sub  ax, [si+6]
        jmp  @@o4out
@@o4n:  adc  ax, 0
@@o4out: mov [di+6], ax         ; final ciphertext block 4

        add  bx, [si+2]
        mov  [di+2], bx         ; final ciphertext block 2
        pop  ax
        add  ax, [si+4]
        mov  [di+4], ax         ; final ciphertext block 3

; Restore the stack and return

        pop  di
        pop  si
;       pop  dx
;       pop  cx
;       pop  bx
;       pop  ax

        mov  sp, bp
        pop  bp
        ret
ENDP    _asmcrypt