Technology: SIMD / MMX / SSE / SSE2 /

ShaneWu 2009-11-06

展開全文

Technology: SIMD / MMX / SSE / SSE2 / 3DNow! 收藏

An increasing number of newer processors come with extensions to their instruction set commonly referred to as SIMD instructions. SIMD stands for Single Instruction Multiple Data meaning that a single instruction, such as "add", operates on a number of data items in parallel. A typical SIMD instruction for example will add 8 16 bit values in parallel. Obviously this can increase execution speed dramatically which is why these instruction set extensions were created.

You can either read this page sequentially or jump directly to the following subsections:

[The 3DNow! instruction set extension is not covered, however, the SSE instruction set is similar.]

MMX Instruction Set

In 1997 Intel launched the Pentium Processor with MMX Technology. Subsequently the Pentium II, III and 4 were launched and other processors, such as the AMD Athlon and Duron, added this instruction set extension.

The MMX instruction set introduces 8 new 64 bit wide registers named mm0 .. mm7; however, these are mapped on the FPU registers, so FPU and MMX instructions can not be intermixed. The registers can hold the following data types:

Data type	MMX-Register
1x 64 bit (raw) integer	Bits 63 .. 0
2x 32 bit integers	32 bit word 1				16 bit word 0
4x 16 bit integers	16 bit word 3		16 bit word 2		16 bit word 1		16 bit word 0
8x 8 bit integers	Byte 7	Byte 6	Byte 5	Byte 4	Byte 3	Byte 2	Byte 1	Byte 0

Corresponding integer instructions were introduced:

Instruction class	Instruction	Operation
Data movement	movq	64 bit move from/to mmx register
Data movement	movd	32 bit move from/to mmx register
Data format conversion / movement	packsswb/dw	Parallel pack signed words into bytes / dwords into words with signed saturation
	packuswb/dw	Parallel pack unsigned words into bytes / dwords into words with unsigned saturation
	punpckhbw/wd/dq	Parallel unpack high 32 bits: bytes to words / words to dwords / dwords to qwords
	punpcklbw/wd/dq	Parallel unpack low 32 bits: bytes to words / words to dwords / dwords to qwords
Arithmetical operations	paddb/w/d	Parallel add for bytes / words / dwords
	psubb/w/d	Parallel subtraction for bytes / words / dwords
	paddsb/w	Parallel saturated add for signed bytes / words
	paddusb/w	Parallel saturated add for unsigned bytes / words
	psubsb/w	Parallel saturated subtraction for signed bytes / words
	psubusb/w	Parallel saturated subtraction for unsigned bytes / words
	pmaddwd	Parallel multiply signed words and add the results
	pmullhw	Parallel multiply signed words and store high 16 bits of results
	pmulllw	Parallel multiply signed words and store low 16 bits of results
	pcmpeqb/w/d	Parallel compare signed bytes / words / dwords for equality
	pcmpgtb/w/d	Parallel compare signed bytes / words / dwords for "greater than"
Logical operations	pand	Parallel and operation on all 64 bits
	pandn	Parallel and not operation on all 64 bits
	por	Parallel or operation on all 64 bits
	pxor	Parallel xor operation on all 64 bits
Shift operations	psllw/d/q	Parallel shift logical left words / dwords / qwords
	psraw/d	Parallel shift right signed words / dwords
	psrlw/d/q	Parallel shift right unsigned words / dwords / qwords
Enable FPU instructions	emms	Resets the FPU register states to enable FPU instructions

All operations are either register, register or register, memory operations. No memory, register or memory, memory operations are available.

Here come some examples:

Instruction	Operation
pxor mm0,mm0	Clear mm0
paddb mm0,mm1	Parallel add bytes from mm1 to mm0
psubusw mm7,[eax]	Parallel add unsigned words with saturation from [eax] to mm7
psslw mm5,3	Parallel shift left word in mm5 by 3 positions, filling in zeros

As can be seen from the above instruction list there are a number of limitations on this initial set of instructions:

Not all data types are supported for all operations, i.e. unsigned multiplication, byte-wide shifts are missing etc.
Various interesting operations are missing, i.e. min/max
The registers are shared with the FPU registers, prohibiting any FPU operations in parallel
No floating point operations are possible
The registers are only 64 bits wide
There are only 8 registers
There are not prefetch or write-through instructions
There are no real provisions for unaligned access
Constants are not supported

The subsequent introduction of the SSE and SSE2 instruction set extensions remove many of these limitations.

Jump to top of page

SSE Instruction Set

The Pentium III processor, introduced in 1999, was the first processor with the SSE instruction set extension. However, AMD at that time already had added the similar 3DNow! instruction set extension to their AMD K6 CPU.

The SSE instruction set in fact consists of 3 distinct enhancements:

Additional MMX instructions such as min/max
Prefetch and write-through instructions for optimizing data movement from and to the L2/L3 caches and main memory
New 128 bit XMM registers and corresponding 32 bit floating point (single precision) instructions

The SSE instruction set introduces 8 new 128 bit wide registers named xmm0 .. xmm7. The registers can hold the following data types:

Data type	XMM-Register
1x 128 bit raw integer	Bits 127 .. 0
4x 32 bit floating point values	32 bit floating point value 3	32 bit floating point value 2	32 bit floating point value 1	32 bit floating point value 0

The SSE instruction set includes the MMX instruction set; however, the following table lists only the additional SSE instructions:

Instruction class	Instruction	Operation
Prefetch and write-through instructions	prefetchx	Prefetch cache line into L1/L2/L3 cache
	movntq	Store data directly into main memory, bypassing the caches (MMX only)
	movntps	Store data directly into main memory, bypassing the caches (XMM only)
	maskmovq	Store individual bytes directly into main memory, bypassing the caches (MMX only)
XMM data movement	movaps	128 bit move from/to xmm register, memory data must be aligned
	movups	128 bit move from/to xmm register, memory data may be unaligned
	movss	32 bit move from/to xmm register
	movlps	64 bit move from/to lower 64 bit of xmm register
	movhps	64 bit move from/to high 64 bit of xmm register
	movlhps	64 bit move from lower 64 bit of xmm register to high 64 bit of xmm register
	movhlps	64 bit move from higher 64 bit of xmm register to lower 64 bit of xmm register
	movmskps	Extract upper bits of all dwords and store in a general register
Data format conversion / movement	pextrw	Extract word into general register
	pinsrw	Insert word from general register
	pmovmskb	Extract upper bits of all bytes and store in a general register
	pshufw	Shuffle words (MMX only)
	pshufps	Shuffle dwords (XMM only)
	unpckhps	Parallel unpack and interleave high dwords (XMM only)
	unpcklps	Parallel unpack and interleave low dwords (XMM only)
	cvtpi2ps	Parallel convert integer dwords to single precision floating point values (XMM only)
	cvtpi2ss	Convert integer dword to single precision floating point value (XMM only)
	cvtps2si	Parallel convert single precision floating point values to integer dwords (XMM only)
	cvtss2si	Convert single precision floating point value to integer dword (XMM only)
MMX arithmetical operations	pavgb/w	Parallel average for unsigned bytes / words
	pminub	Parallel minimum for unsigned bytes
	pmaxub	Parallel maximum for unsigned bytes
	pminsw	Parallel minimum for signed words
	pmaxsw	Parallel maximum for signed words
	pmulhuw	Parallel multiply unsigned words and store high 16 bits of results
	psadbw	Parallel absolute difference of unsigned bytes and sum up the results
XMM arithmetical operations	addps	Parallel add single precision floating point values
	addss	Add single precision floating point value in lower 32 bits
	subps	Parallel subtract single precision floating point values
	subss	Subtract single precision floating point value in lower 32 bits
	mulps	Parallel multiply single precision floating point values
	mulss	Multiply single precision floating point value in lower 32 bits
	divps	Parallel divide single precision floating point values
	divss	Divide single precision floating point value in lower 32 bits
	rcpps	Parallel approximate reciprocal single precision floating point values
	rcpss	Approximate reciprocal single precision floating point value in lower 32 bits
	sqrtps	Parallel square root single precision floating point values
	sqrtss	Square root single precision floating point value in lower 32 bits
	rsqrtps	Parallel approximate reciprocal square root single precision floating point values
	rsqrtss	Approximate reciprocal square root single precision floating point value in lower 32 bits
	minps	Parallel minimum of single precision floating point values
	minss	Minimum of precision floating point value in lower 32 bits
	maxps	Parallel maximum of single precision floating point values
	maxss	Maximum of precision floating point value in lower 32 bits
	cmpps	Parallel ompare single precision floating point values and return masks as result
	cmpss	Compare single precision floating point value in lower 32 bits and return mask as result
	comiss	Compare single precision floating point value in lower 32 bits and return result in flag register
	ucomiss	Compare unordered single precision floating point value in lower 32 bits and return result in flag register
XMM Logical operations	andps	Parallel and operation on all 128 bits
	andnps	Parallel and not operation on all 128 bits
	orps	Parallel or operation on all 128 bits
	xorps	Parallel xor operation on all 128 bits

Here come some examples:

Instruction	Operation
prefetchnta [esi+64]	Prefetch data into L1 (or L2) cache from [esi+64]
pminub mm3,mm4	Parallel minimum of unsigned bytes of mm3 and mm4
mulps xmm0,xmm5	Parallel single precision floating point add xmm5 to xmm0
maxps xmm1,[edi]	Parallel single precision floating point maximum of xmm1 and data at [edi]

As can be seen from the above instruction list there are still a number of limitations on this set of instructions:

Not all data types are supported for all operations, i.e. 32 bit integer multiplies, byte-wide shifts are missing etc.
Only 32 bit floating point data types are available
There are no integer instructions for the XMM registers
There are only 8 registers
There are no real provisions for unaligned access
Constants are not supported

The subsequent introduction of the SSE2 instruction set extension removes some of these limitations.

Jump to top of page

SSE2 Instruction Set

The Pentium 4 processor introduced the SSE2 instruction set extensions in 2000.

Two important improvements were made:

Support 64 bit (double precision) floating point data types
Integer instruction for the XMM registers

Because the SSE2 instruction set introduces integer instructions to the XMM registers and double precision floating data types the XMM registers can now hold following data types:

Data type

XMMx-Register

1x 128 bit raw integer

Bits 127 .. 0

2x 64 bit integer

64 bit word 1

64 bit word 0

4x 32 bit integer

32 bit word 3

32 bit word 2

32 bit word 1

32 bit word 0

8x 16 bit integer

16 bit word 7

16 bit word 6

16 bit word 5

16 bit word 4

16 bit word 3

16 bit word 2

16 bit word 1

16 bit word 0

16x 8 bit integer

Byte 15

Byte 14

Byte 13

Byte 12

Byte 11

Byte 10

Byte 9

Byte 8

Byte 7

Byte 6

Byte 5

Byte 4

Byte 3

Byte 2

Byte 1

Byte 0

2x 64 bit floating point values

64 bit floating point value 1

64 bit floating point value 0

4x 32 bit floating point values

32 bit floating point value 3

32 bit floating point value 2

32 bit floating point value 1

32 bit floating point value 0

The SSE2 instruction set includes the SSE instruction set (the MMX and SSE integer instructions also apply to the XMM registers now and the single precision floating point instructions of the SSE instruction set also apply to the double precision data types); the following table lists only the additional SSE2 instructions:

Instruction class	Instruction	Operation
Prefetch and write-through instructions	movntdq / movntpd	Store data directly into main memory, bypassing the caches (XMM only)
	movnti	Store data in general register directly into main memory, bypassing the caches
	maskmovdqu	Store individual bytes directly into main memory, bypassing the caches (XMM only)
XMM data movement	movapd / movdqa	128 bit move from/to xmm register, memory data must be aligned
	movupd / movdqu	128 bit move from/to xmm register, memory data may be unaligned
	movsd	64 bit move from/to xmm register
	movlpd	64 bit move from/to lower 64 bit of xmm register
	movhpd	64 bit move from/to high 64 bit of xmm register
	movlhpd	64 bit move from lower 64 bit of xmm register to high 64 bit of xmm register
	movhlpd	64 bit move from higher 64 bit of xmm register to lower 64 bit of xmm register
	movmskpd	Extract upper bits of all qwords and store in a general register
Data format conversion / movement	pmovmskpd	Extract upper bits of all qwords and store in a general register
	pshuflw	Shuffle words in lower 64 bits (XMM only)
	pshufhw	Shuffle words in high 64 bits (XMM only)
	pshufpd	Shuffle dwords (XMM only)
	unpckhpd	Parallel unpack and interleave high qwords (XMM only)
	unpcklpd	Parallel unpack and interleave low qwords (XMM only)
	cvtpi2pd	Parallel convert integer dwords to double precision floating point values (XMM only)
	punpckhqdq	Parallel unpack high qwords
	punpcklqdq	Parallel unpack lower qwords
	cvtpi2sd	Convert integer dword to double precision floating point value (XMM only)
	cvtpd2si	Parallel convert double precision floating point values to integer qwords (XMM only)
	cvtsd2si	Convert double precision floating point value to integer qword (XMM only)
	cvtps2pd	Parallel convert single precision floating point values to double precision floating point values
	cvtpd2ps	Parallel convert double precision floating point values to single precision floating point values
	cvtss2sd	Convert single precision floating point value to double precision floating point value
	cvtsd2ss	Convert double precision floating point value to single precision floating point value
	cvtps2dq	Parallel convert single precision floating point values to signed dword integers
	cvtdq2ps	Parallel convert signed dword integers to single precision floating point values
	movq2dq	Move MMX register value to XMM register
	movdq2q	Move lower 64 bit of XMM register to MMX register
Integer arithmetical operations	paddq	Parallel add for 64 bit integers
	psubq	Parallel subtraction for 64 bit integers
	pmuludq	Parallel multiply 32 bit unsigned integers and return 64 bit result
XMM arithmetical operations	addpd	Parallel add double precision floating point values
	addsd	Add double precision floating point value in lower 64 bits
	subpd	Parallel subtract double precision floating point values
	subsd	Subtract double precision floating point value in lower 64 bits
	mulpd	Parallel multiply double precision floating point values
	mulsd	Multiply double precision floating point value in lower 64 bits
	divpd	Parallel divide double precision floating point values
	divsd	Divide double precision floating point value in lower 64 bits
	rcppd	Parallel approximate reciprocal double precision floating point values
	rcpsd	Approximate reciprocal double precision floating point value in lower 64 bits
	sqrtpd	Parallel square root double precision floating point values
	sqrtsd	Square root double precision floating point value in lower 64 bits
	rsqrtpd	Parallel approximate reciprocal square root double precision floating point values
	rsqrtsd	Approximate reciprocal square root double precision floating point value in lower 64 bits
	minpd	Parallel minimum of double precision floating point values
	minsd	Minimum of precision floating point value in lower 64 bits
	maxpd	Parallel maximum of double precision floating point values
	maxsd	Maximum of precision floating point value in lower 64 bits
	cmpps	Compare double precision floating point values and return masks as result
	cmpsd	Compare double precision floating point value in lower 64 bits and return mask as result
	comisd	Compare double precision floating point value in lower 64 bits and return result in flag register
	ucomisd	Compare unordered double precision floating point value in lower 64 bits and return result in flag register
XMM logical operations	andpd	Parallel and operation on all 128 bits
	andnpd	Parallel and not operation on all 128 bits
	orpd	Parallel or operation on all 128 bits
	xorpd	Parallel xor operation on all 128 bits
XMM shift operations	pslldq	Shift all 128 bits left
XMM shift operations	psrldq	Shift all 128 bits rights, filling in zeros

Here come some examples:

Instruction	Operation
paddd xmm0,xmm6	Parallel add integer dwords from xmm6 to xmm0
pmaddwd xmm2,[edx]	Parallel multiply signed integer words in xmm2 and at [edi] and add results
sqrtpd xmm0,xmm0	Parallel double precision floating point square root of values in xmm0
addpd xmm2,[ebx+16]	Parallel add double precision floating point values from [ebx+16] to xmm2
psrldq xmm5,1	Shift right all 128 bits by one, filling in a zero

As can be seen from the above instruction list there are still a number of limitations on this set of instructions:

Not all data types are supported for all operations, i.e. 32 bit saturated adds, byte-wide shifts are missing etc.
There are only 8 registers
There are no real provisions for unaligned access
Constants are not supported

It is unclear whether future instruction set extensions will remove some of these limitations.

Nevertheless the SSE2 instruction set in effect is a parallel version of the typical integer and floating point instructions found in the basic instruction set of most processors, thereby enabling impressive speed-ups.

Jump to top of page

General Issues with SIMD Instructions

The basic characteristic of SIMD instructions is that they operate on n data items in parallel, n typically being 1, 2, 4, 8 or 16. There are mainly 5 problems with this:

If the data layout does not match the SIMD requirements SIMD instructions can either not be used or data rearrangement code is necessary
In case of unaligned data the performance will suffer dramatically; this problem leads to massively more complicated alignment code being necessary
If the problem does not match the possible values of n, for example in case of a filter which would require n = 5, the solution is rather complicated
(Parallel) Table lookups are not possible
Compare operations are limited

Basically there are solutions to these problems, but they require some clever or additional code. Due to these problems even compilers which generate SIMD code will in many cases not do so or not be able to do so.

Jump to top of page

Speed-Ups

SIMD instructions have the potential to speed-up software by factors of 2 .. 16 or even more. This especially applies to compute-intensive software dealing with arrays of data.

Please also see the case studies which give numbers on speed-up factors.

Jump to top of page

Do Compilers Create SIMD Code?

Most compilers do not create SIMD code, especially neither the Microsoft Visual C/C++ 6.0 nor the GCC compiler 3.0x for x86 do so. Please check your compiler manual for details.

This simple fact deprives any compiler-only generated program of significant possible speed-ups.

In fact things are even worse: Because (in case of the SSE/SSE2 instruction sets) the prefetch and write-though instructions are part of the instruction set extensions, these are also not supported.

Jump to top of page

Conclusion

SIMD instructions have the potential to speed-up software by factors of 2 .. 16 or even more.

However, as most compilers do not utilize the (full) potential of these instructions these speed-ups can generally only be achieved by an optimization expert.

Hayes Technologies offers services to help your software realize the high potential of the SIMD instruction sets.

Jump to top of page