- First up, let me say I don't like writing in assembler. It is not portable,
- dependent on the particular CPU architecture release and is generally a pig
- to debug and get right. Having said that, the x86 architecture is probably
- the most important for speed due to the number of boxes, and since
- it appears to be the worst architecture to get
- good C compilers for. So due to this, I have lowered myself to do
- assembler for the inner DES routines in libdes :-).
- The file to implement in assembler is des_enc.c. Replace the following
- 4 functions
- des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
- des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
- des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
- des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
- They encrypt/decrypt the 64 bits held in 'data' using
- the 'ks' key schedules. The differences between the 4 functions are that
- des_encrypt2() does not perform IP() or FP() on the data (this is an
- optimization for when doing triple DES), and that des_encrypt3() and
- des_decrypt3() perform triple DES. The triple DES routines are in here
- because it does make a big difference to have them located near the
- des_encrypt2 function at link time.
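- To make the relationship between the 4 functions concrete, here is a minimal
- C sketch. Everything named toy_* is hypothetical and is not libdes code: the
- round core is a toy Feistel network, toy_ip()/toy_fp() are just a pair of
- mutually inverse permutations, and the encrypt-decrypt-encrypt ordering for
- triple DES is assumed. It only shows why skipping IP()/FP() between the
- three passes gives the same answer as three full passes.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TOY_LONG;                 /* stand-in for DES_LONG */

static TOY_LONG rotl(TOY_LONG v, int n) { return (v << n) | (v >> (32 - n)); }

/* Toy IP()/FP(): any two mutually inverse permutations will do here. */
static void toy_ip(TOY_LONG d[2]) { TOY_LONG t = d[0]; d[0] = rotl(d[1], 3);  d[1] = t; }
static void toy_fp(TOY_LONG d[2]) { TOY_LONG t = d[1]; d[1] = rotl(d[0], 29); d[0] = t; }

/* Round core only, no IP()/FP(). encrypt == 0 walks the 16 round keys
 * backwards, which inverts the encrypt != 0 direction. */
void toy_encrypt2(TOY_LONG d[2], const TOY_LONG ks[16], int encrypt)
{
    TOY_LONG l = d[0], r = d[1];
    for (int i = 0; i < 16; i++) {
        TOY_LONG k = ks[encrypt ? i : 15 - i];
        TOY_LONG t = l ^ (rotl(r, 5) + k);  /* toy round function */
        l = r;
        r = t;
    }
    d[0] = r;                               /* final swap: the loop undoes itself */
    d[1] = l;
}

/* Full single pass: IP, rounds, FP. */
void toy_encrypt1(TOY_LONG d[2], const TOY_LONG ks[16], int encrypt)
{
    toy_ip(d);
    toy_encrypt2(d, ks, encrypt);
    toy_fp(d);
}

/* Triple pass with one IP up front and one FP at the end. */
void toy_encrypt3(TOY_LONG d[2], const TOY_LONG k1[16],
                  const TOY_LONG k2[16], const TOY_LONG k3[16])
{
    toy_ip(d);
    toy_encrypt2(d, k1, 1);
    toy_encrypt2(d, k2, 0);
    toy_encrypt2(d, k3, 1);
    toy_fp(d);
}

void toy_decrypt3(TOY_LONG d[2], const TOY_LONG k1[16],
                  const TOY_LONG k2[16], const TOY_LONG k3[16])
{
    toy_ip(d);
    toy_encrypt2(d, k3, 0);
    toy_encrypt2(d, k2, 1);
    toy_encrypt2(d, k1, 0);
    toy_fp(d);
}

/* Check that the fused routine equals three full passes and round-trips. */
void toy_selfcheck(void)
{
    TOY_LONG k1[16], k2[16], k3[16];
    for (int i = 0; i < 16; i++) {
        k1[i] = 0x01010101u * (TOY_LONG)(i + 1);
        k2[i] = 0x10203040u + (TOY_LONG)i;
        k3[i] = 0xdeadbeefu ^ (TOY_LONG)(i * 977);
    }
    TOY_LONG a[2] = { 0x01234567u, 0x89abcdefu };
    TOY_LONG b[2] = { 0x01234567u, 0x89abcdefu };

    toy_encrypt3(a, k1, k2, k3);
    toy_encrypt1(b, k1, 1);
    toy_encrypt1(b, k2, 0);
    toy_encrypt1(b, k3, 1);
    assert(a[0] == b[0] && a[1] == b[1]);

    toy_decrypt3(a, k1, k2, k3);
    assert(a[0] == 0x01234567u && a[1] == 0x89abcdefu);
}
```

- Calling toy_selfcheck() from a main() exercises both properties. The
- intermediate FP()/IP() pairs cancel, so the fused version saves four
- permutations per block.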
- Now as we all know, there are lots of different operating systems running on
- x86 boxes, and unfortunately they normally try to make sure their assembler
- formatting is not the same as other people's.
- The 4 main formats I know of are
-   Microsoft   Windows 95/Windows NT
-   Elf         Includes Linux and FreeBSD(?).
-   a.out       The older Linux.
-   Solaris     Same as Elf but different comments :-(.
- Now I was not overly keen to write 4 different copies of the same code,
- so I wrote a few perl routines to output the correct assembler, given
- a target assembler type. This code is ugly and is just a hack.
- The libraries are x86unix.pl and x86ms.pl.
- des586.pl, des686.pl and des-som[23].pl are the programs to actually
- generate the assembler.
- So to generate elf assembler
- perl des-som3.pl elf >dx86-elf.s
- For Windows 95/NT
- perl des-som2.pl win32 >win32.asm
- [ update 4 Jan 1996 ]
- I have added another way to do things.
- perl des-som3.pl cpp >dx86-cpp.s
- generates a file that will be included by dx86unix.cpp when it is compiled.
- To build for elf, a.out, solaris, bsdi etc,
- cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
- cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
- cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
- cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
- This was done to cut down the number of files in the distribution.
- Now the ugly part. I acquired my copy of Intel's
- "Optimizations for Intel's 32-Bit Processors" and found a few interesting
- things. First, the aim of the exercise is to 'extract' one byte at a time
- from a word and do an array lookup. This involves getting the byte from
- the 4 locations in the word and moving it to a new word and doing the lookup.
- The most obvious way to do this is
- xor eax, eax # clear word
- mov al, cl # get low byte
- xor edi, DWORD PTR 0x100+des_SP[eax] # xor in word
- mov al, ch # get next byte
- xor edi, DWORD PTR 0x300+des_SP[eax] # xor in word
- shr ecx, 16 # line up the upper two bytes
- which seems ok. For the pentium, this system appears to be the best.
- One has to do instruction interleaving to keep both functional units
- operating, but it is basically very efficient.
- Now the crunch. When a full register is used after a partial write, e.g.
- mov al, cl
- xor edi, DWORD PTR 0x100+des_SP[eax]
- 386 - 1 cycle stall
- 486 - 1 cycle stall
- 586 - 0 cycle stall
- 686 - at least 7 cycle stall (page 22 of the above mentioned document).
- So the technique that produces the best results on a pentium, according to
- the documentation, will produce hideous results on a pentium pro.
- To get around this, des686.pl will generate code that is not as fast on
- a pentium but should be very good on a pentium pro.
- mov eax, ecx # copy word
- shr ecx, 8 # line up next byte
- and eax, 0fch # mask byte
- xor edi, DWORD PTR 0x100+des_SP[eax] # xor in array lookup
- mov eax, ecx # get word
- shr ecx, 8 # line up next byte
- and eax, 0fch # mask byte
- xor edi, DWORD PTR 0x300+des_SP[eax] # xor in array lookup
- Due to the execution units in the pentium, this actually works quite well.
- For a pentium pro it should be very good. This is the type of output
- Visual C++ generates.
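- For readers who prefer C to assembler, the two instruction sequences above
- can be modelled roughly as follows. This is only a sketch: toy_SP is a
- hypothetical stand-in for des_SP filled with arbitrary data, the function
- names are made up, and I am assuming the word has already been masked (for
- example with 0xfcfcfcfc) before the byte-move form is used, since that form
- does no masking of its own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for des_SP: 8 tables of 64 32-bit words stored back
 * to back, so table n starts at byte offset n * 0x100. Filler data only. */
static uint32_t toy_SP[8 * 64];

void toy_sp_init(void)
{
    for (uint32_t i = 0; i < 8 * 64; i++)
        toy_SP[i] = i * 2654435761u;
}

/* Load the 32-bit word at a byte offset, like DWORD PTR off+des_SP[eax]. */
static uint32_t sp_at(uint32_t byte_off)
{
    uint32_t w;
    memcpy(&w, (const unsigned char *)toy_SP + byte_off, sizeof w);
    return w;
}

/* Byte-move form: write only the low byte of a cleared index register. */
uint32_t lookup_byte_moves(uint32_t acc, uint32_t w)
{
    uint32_t idx = 0;                            /* xor eax, eax */
    idx = (idx & ~0xffu) | (w & 0xffu);          /* mov al, cl   */
    acc ^= sp_at(0x100 + idx);
    idx = (idx & ~0xffu) | ((w >> 8) & 0xffu);   /* mov al, ch   */
    acc ^= sp_at(0x300 + idx);
    return acc;
}

/* Shift-and-mask form: whole-register copy, shift, then mask with 0xfc. */
uint32_t lookup_shift_mask(uint32_t acc, uint32_t w)
{
    uint32_t idx;
    idx = w; w >>= 8; idx &= 0xfcu;              /* mov / shr / and */
    acc ^= sp_at(0x100 + idx);
    idx = w; w >>= 8; idx &= 0xfcu;
    acc ^= sp_at(0x300 + idx);
    return acc;
}

/* Both forms should agree on any pre-masked word. */
void toy_lookup_selfcheck(void)
{
    toy_sp_init();
    for (uint32_t i = 0; i < 1000; i++) {
        uint32_t w = (i * 0x9e3779b9u) & 0xfcfcfcfcu;
        assert(lookup_byte_moves(0, w) == lookup_shift_mask(0, w));
    }
}
```

- The mask value 0xfc keeps every index a multiple of 4, so each lookup lands
- on a whole 32-bit table entry. A C compiler is free to pick either
- instruction sequence for either function, so this shows the arithmetic, not
- the timing.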
- There is a third option. Instead of using
- mov al, ch
- which is bad on the pentium pro, one may be able to use
- movzx eax, ch
- which may not incur the partial write penalty. On the pentium,
- this instruction takes 4 cycles so is not worth using but on the
- pentium pro it appears it may be worthwhile. I need access to one to
- experiment :-).
- eric (20 Oct 1996)
- 22 Nov 1996 - I have asked people to run the 2 different versions on pentium
- pros and it appears that the intel documentation is wrong. The
- mov al,bh is still faster on a pentium pro, so just use des586.pl
- rather than des686.pl.
- 3 Dec 1996 - I added des_encrypt3/des_decrypt3 after moving these
- functions into des_enc.c, because it does make a massive performance
- difference on some boxes to have the functions' code located close to
- the des_encrypt2() function.
- 9 Jan 1997 - des-som2.pl is now the correct perl script to use for
- pentiums. It contains an inner loop from
- Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
- 273,000 per second. He had a previous version at 250,000 and the best
- I was able to get was 203,000. The content has not changed; this is all
- due to instruction sequencing (and actual instruction choice), which is able
- to keep both functional units of the pentium going.
- We may have lost the ugly register usage restrictions when x86 went 32-bit,
- but for the pentium they have been replaced by evil instruction ordering tricks.
- 13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
- raw DES at 281,000 per second on a pentium 100.
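- To put that last figure in perspective, the rate can be turned into clock
- cycles per block. The helper below is a hypothetical one-liner, assuming a
- "pentium 100" runs at 100 MHz and counting the usual 16 DES rounds per
- block; at 281,000 calls per second it works out to roughly 356 cycles per
- block, or about 22 cycles per round.

```c
#include <assert.h>

/* Clock cycles spent per DES block, given a clock rate in Hz and a
 * measured number of block operations per second. */
double cycles_per_block(double clock_hz, double blocks_per_sec)
{
    return clock_hz / blocks_per_sec;
}

/* Same figure spread over the 16 rounds of one DES block. */
double cycles_per_round(double clock_hz, double blocks_per_sec)
{
    return cycles_per_block(clock_hz, blocks_per_sec) / 16.0;
}
```

- With so few cycles available per round, it is easy to see why keeping both
- functional units busy matters so much.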