乡下人产国偷v产偷v自拍,国产午夜片在线观看,婷婷成人亚洲综合国产麻豆,久久综合给合久久狠狠狠9

  • <output id="e9wm2"></output>
    <s id="e9wm2"><nobr id="e9wm2"><ins id="e9wm2"></ins></nobr></s>

    • 分享

      Technology: SIMD / MMX / SSE / SSE2 /

       ShaneWu 2009-11-06

      Technology: SIMD / MMX / SSE / SSE2 / 3DNow! 收藏

      An increasing number of newer processors come with extensions to their instruction set commonly referred to as SIMD instructions. SIMD stands for Single Instruction Multiple Data meaning that a single instruction, such as "add", operates on a number of data items in parallel. A typical SIMD instruction for example will add 8 16 bit values in parallel. Obviously this can increase execution speed dramatically which is why these instruction set extensions were created.

      You can either read this page sequentially or jump directly to the following subsections:

      [The 3DNow! instruction set extension is not covered, however, the SSE instruction set is similar.]

      MMX Instruction Set

      In 1997 Intel launched the Pentium Processor with MMX Technology. Subsequently the Pentium II, III and 4 were launched and other processors, such as the AMD Athlon and Duron, added this instruction set extension.

      The MMX instruction set introduces 8 new 64 bit wide registers named mm0 .. mm7; however, these are mapped on the FPU registers, so FPU and MMX instructions can not be intermixed. The registers can hold the following data types:

      Data type MMX-Register
      1x 64 bit (raw) integer Bits 63 .. 0
      2x 32 bit integers 32 bit word 1 16 bit word 0
      4x 16 bit integers 16 bit word 3 16 bit word 2 16 bit word 1 16 bit word 0
      8x 8 bit integers Byte 7 Byte 6 Byte 5 Byte 4 Byte 3 Byte 2 Byte 1 Byte 0

      Corresponding integer instructions were introduced:

      Instruction class

      Instruction

      Operation

      Data movement movq 64 bit move from/to mmx register
      movd 32 bit move from/to mmx register
      Data format conversion / movement packsswb/dw Parallel pack signed words into bytes / dwords into words with signed saturation
      packuswb/dw Parallel pack unsigned words into bytes / dwords into words with unsigned saturation
      punpckhbw/wd/dq Parallel unpack high 32 bits: bytes to words / words to dwords / dwords to qwords 
      punpcklbw/wd/dq Parallel unpack low 32 bits: bytes to words / words to dwords / dwords to qwords
      Arithmetical operations paddb/w/d Parallel add for bytes / words / dwords
      psubb/w/d Parallel subtraction for bytes / words / dwords
      paddsb/w Parallel saturated add for signed bytes / words
      paddusb/w Parallel saturated add for unsigned bytes / words
      psubsb/w Parallel saturated subtraction for signed bytes / words
      psubusb/w Parallel saturated subtraction for unsigned bytes / words
      pmaddwd Parallel multiply signed words and add the results
      pmullhw Parallel multiply signed words and store high 16 bits of results
      pmulllw Parallel multiply signed words and store low 16 bits of results
      pcmpeqb/w/d Parallel compare signed bytes / words / dwords for equality
      pcmpgtb/w/d Parallel compare signed bytes / words / dwords for "greater than"
      Logical operations pand Parallel and operation on all 64 bits
      pandn Parallel and not operation on all 64 bits
      por Parallel or operation on all 64 bits
      pxor Parallel xor operation on all 64 bits
      Shift operations psllw/d/q Parallel shift logical left words / dwords / qwords
      psraw/d Parallel shift right signed words / dwords
      psrlw/d/q Parallel shift right unsigned words / dwords / qwords
      Enable FPU instructions emms Resets the FPU register states to enable FPU instructions

      All operations are either register, register or register, memory operations. No memory, register or memory, memory operations are available.

      Here come some examples:

      Instruction

      Operation

      pxor mm0,mm0 Clear mm0
      paddb mm0,mm1 Parallel add bytes from mm1 to mm0
      psubusw mm7,[eax] Parallel add unsigned words with saturation from [eax] to mm7
      psslw mm5,3 Parallel shift left word in mm5 by 3 positions, filling in zeros

      As can be seen from the above instruction list there are a number of limitations on this initial set of instructions:

      • Not all data types are supported for all operations, i.e. unsigned multiplication, byte-wide shifts are missing etc.
      • Various interesting operations are missing, i.e. min/max
      • The registers are shared with the FPU registers, prohibiting any FPU operations in parallel
      • No floating point operations are possible
      • The registers are only 64 bits wide
      • There are only 8 registers
      • There are not prefetch or write-through instructions
      • There are no real provisions for unaligned access
      • Constants are not supported

      The subsequent introduction of the SSE and SSE2 instruction set extensions remove many of these limitations.

      Jump to top of page

      SSE Instruction Set

      The Pentium III processor, introduced in 1999, was the first processor with the SSE instruction set extension. However, AMD at that time already had added the similar 3DNow! instruction set extension to their AMD K6 CPU.

      The SSE instruction set in fact consists of 3 distinct enhancements:

      • Additional MMX instructions such as min/max
      • Prefetch and write-through instructions for optimizing data movement from and to the L2/L3 caches and main memory
      • New 128 bit XMM registers and corresponding 32 bit floating point (single precision) instructions

      The SSE instruction set introduces 8 new 128 bit wide registers named xmm0 .. xmm7. The registers can hold the following data types:

      Data type XMM-Register
      1x 128 bit raw integer Bits 127 .. 0
      4x 32 bit floating point values 32 bit floating point value 3 32 bit floating point value 2 32 bit floating point value 1 32 bit floating point value 0

      The SSE instruction set includes the MMX instruction set; however, the following table lists only the additional SSE instructions:

      Instruction class

      Instruction

      Operation

      Prefetch and write-through instructions prefetchx Prefetch cache line into L1/L2/L3 cache
      movntq Store data directly into main memory, bypassing the caches (MMX only)
      movntps Store data directly into main memory, bypassing the caches (XMM only)
      maskmovq Store individual bytes directly into main memory, bypassing the caches (MMX only)
      XMM data movement movaps 128 bit move from/to xmm register, memory data must be aligned
      movups 128 bit move from/to xmm register, memory data may be unaligned
      movss 32 bit move from/to xmm register
      movlps 64 bit move from/to lower 64 bit of xmm register
      movhps 64 bit move from/to high 64 bit of xmm register
      movlhps 64 bit move from lower 64 bit of xmm register to high 64 bit of xmm register
      movhlps 64 bit move from higher 64 bit of xmm register to lower 64 bit of xmm register
      movmskps Extract upper bits of all dwords and store in a general register 
      Data format conversion / movement pextrw Extract word into general register
      pinsrw Insert word from general register
      pmovmskb Extract upper bits of all bytes and store in a general register 
      pshufw Shuffle words (MMX only)
      pshufps Shuffle dwords (XMM only)
      unpckhps Parallel unpack and interleave high dwords (XMM only)
      unpcklps Parallel unpack and interleave low dwords (XMM only)
      cvtpi2ps Parallel convert integer dwords to single precision floating point values (XMM only)
      cvtpi2ss Convert integer dword to single precision floating point value (XMM only)
      cvtps2si Parallel convert single precision floating point values to integer dwords (XMM only)
      cvtss2si Convert single precision floating point value to integer dword (XMM only) 
      MMX arithmetical operations pavgb/w Parallel average for unsigned bytes / words
      pminub Parallel minimum for unsigned bytes
      pmaxub Parallel maximum for unsigned bytes
      pminsw Parallel minimum for signed words
      pmaxsw Parallel maximum for signed words
      pmulhuw Parallel multiply unsigned words and store high 16 bits of results
      psadbw Parallel absolute difference of unsigned bytes and sum up the results
      XMM arithmetical operations addps Parallel add single precision floating point values
      addss Add single precision floating point value in lower 32 bits
      subps Parallel subtract single precision floating point values
      subss Subtract single precision floating point value in lower 32 bits
      mulps Parallel multiply single precision floating point values
      mulss Multiply single precision floating point value in lower 32 bits
      divps Parallel divide single precision floating point values
      divss Divide single precision floating point value in lower 32 bits
      rcpps Parallel approximate reciprocal single precision floating point values
      rcpss Approximate reciprocal single precision floating point value in lower 32 bits
      sqrtps Parallel square root single precision floating point values
      sqrtss Square root single precision floating point value in lower 32 bits
      rsqrtps Parallel approximate reciprocal square root single precision floating point values
      rsqrtss Approximate reciprocal square root single precision floating point value in lower 32 bits
      minps Parallel minimum of single precision floating point values
      minss Minimum of precision floating point value in lower 32 bits
      maxps Parallel maximum of single precision floating point values
      maxss Maximum of precision floating point value in lower 32 bits
      cmpps Parallel ompare single precision floating point values and return masks as result
      cmpss Compare single precision floating point value in lower 32 bits and return mask as result
      comiss Compare single precision floating point value in lower 32 bits and return result in flag register
      ucomiss Compare unordered single precision floating point value in lower 32 bits and return result in flag register
      XMM Logical operations andps Parallel and operation on all 128 bits
      andnps Parallel and not operation on all 128 bits
      orps Parallel or operation on all 128 bits
      xorps Parallel xor operation on all 128 bits

      Here come some examples:

      Instruction

      Operation

      prefetchnta [esi+64] Prefetch data into L1 (or L2) cache from [esi+64]
      pminub mm3,mm4 Parallel minimum of unsigned bytes of mm3 and mm4
      mulps xmm0,xmm5 Parallel single precision floating point add xmm5 to xmm0
      maxps xmm1,[edi] Parallel single precision floating point maximum of xmm1 and data at [edi]

      As can be seen from the above instruction list there are still a number of limitations on this set of instructions:

      • Not all data types are supported for all operations, i.e. 32 bit integer multiplies, byte-wide shifts are missing etc.
      • Only 32 bit floating point data types are available
      • There are no integer instructions for the XMM registers
      • There are only 8 registers
      • There are no real provisions for unaligned access
      • Constants are not supported

      The subsequent introduction of the SSE2 instruction set extension removes some of these limitations.

      Jump to top of page

      SSE2 Instruction Set

      The Pentium 4 processor introduced the SSE2 instruction set extensions in 2000.

      Two important improvements were made:

      • Support 64 bit (double precision) floating point data types
      • Integer instruction for the XMM registers 

      Because the SSE2 instruction set introduces integer instructions to the XMM registers and double precision floating data types the XMM registers can now hold following data types:

      Data type XMMx-Register
      1x 128 bit raw integer Bits 127 .. 0
      2x 64 bit integer 64 bit word 1 64 bit word 0
      4x 32 bit integer 32 bit word 3 32 bit word 2 32 bit word 1 32 bit word 0
      8x 16 bit integer 16 bit word 7 16 bit word 6 16 bit word 5 16 bit word 4 16 bit word 3 16 bit word 2 16 bit word 1 16 bit word 0
      16x 8 bit integer Byte 15 Byte 14 Byte 13 Byte 12 Byte 11 Byte 10 Byte 9 Byte 8 Byte 7 Byte 6 Byte 5 Byte 4 Byte 3 Byte 2 Byte 1 Byte 0
      2x 64 bit floating point values 64 bit floating point value 1 64 bit floating point value 0
      4x 32 bit floating point values 32 bit floating point value 3 32 bit floating point value 2 32 bit floating point value 1 32 bit floating point value 0

      The SSE2 instruction set includes the SSE instruction set (the MMX and SSE integer instructions also apply to the XMM registers now and the single precision floating point instructions of the SSE instruction set also apply to the double precision data types); the following table lists only the additional SSE2 instructions:

      Instruction class

      Instruction

      Operation

      Prefetch and write-through instructions movntdq / movntpd Store data directly into main memory, bypassing the caches (XMM only)
      movnti Store data in general register directly into main memory, bypassing the caches
      maskmovdqu Store individual bytes directly into main memory, bypassing the caches (XMM only)
      XMM data movement movapd / movdqa 128 bit move from/to xmm register, memory data must be aligned
      movupd / movdqu 128 bit move from/to xmm register, memory data may be unaligned
      movsd 64 bit move from/to xmm register
      movlpd 64 bit move from/to lower 64 bit of xmm register
      movhpd 64 bit move from/to high 64 bit of xmm register
      movlhpd 64 bit move from lower 64 bit of xmm register to high 64 bit of xmm register
      movhlpd 64 bit move from higher 64 bit of xmm register to lower 64 bit of xmm register
      movmskpd Extract upper bits of all qwords and store in a general register 
      Data format conversion / movement pmovmskpd Extract upper bits of all qwords and store in a general register 
      pshuflw Shuffle words in lower 64 bits (XMM only)
      pshufhw Shuffle words in high 64 bits (XMM only)
      pshufpd Shuffle dwords (XMM only)
      unpckhpd Parallel unpack and interleave high qwords (XMM only)
      unpcklpd Parallel unpack and interleave low qwords (XMM only)
      cvtpi2pd Parallel convert integer dwords to double precision floating point values (XMM only)
      punpckhqdq Parallel unpack high qwords
      punpcklqdq Parallel unpack lower qwords
      cvtpi2sd Convert integer dword to double precision floating point value (XMM only)
      cvtpd2si Parallel convert double precision floating point values to integer qwords (XMM only)
      cvtsd2si Convert double precision floating point value to integer qword (XMM only) 
      cvtps2pd Parallel convert single precision floating point values to double precision floating point values
      cvtpd2ps Parallel convert double precision floating point values to single precision floating point values
      cvtss2sd Convert single precision floating point value to double precision floating point value
      cvtsd2ss Convert double precision floating point value to single precision floating point value
      cvtps2dq Parallel convert single precision floating point values to signed dword integers
      cvtdq2ps Parallel convert signed dword integers to single precision floating point values
      movq2dq Move MMX register value to XMM register
      movdq2q Move lower 64 bit of XMM register to MMX register
      Integer arithmetical operations paddq Parallel add for 64 bit integers
      psubq Parallel subtraction for 64 bit integers
      pmuludq Parallel multiply 32 bit unsigned integers and return 64 bit result
      XMM arithmetical operations addpd Parallel add double precision floating point values
      addsd Add double precision floating point value in lower 64 bits
      subpd Parallel subtract double precision floating point values
      subsd Subtract double precision floating point value in lower 64 bits
      mulpd Parallel multiply double precision floating point values
      mulsd Multiply double precision floating point value in lower 64 bits
      divpd Parallel divide double precision floating point values
      divsd Divide double precision floating point value in lower 64 bits
      rcppd Parallel approximate reciprocal double precision floating point values
      rcpsd Approximate reciprocal double precision floating point value in lower 64 bits
      sqrtpd Parallel square root double precision floating point values
      sqrtsd Square root double precision floating point value in lower 64 bits
      rsqrtpd Parallel approximate reciprocal square root double precision floating point values
      rsqrtsd Approximate reciprocal square root double precision floating point value in lower 64 bits
      minpd Parallel minimum of double precision floating point values
      minsd Minimum of precision floating point value in lower 64 bits
      maxpd Parallel maximum of double precision floating point values
      maxsd Maximum of precision floating point value in lower 64 bits
      cmpps Compare double precision floating point values and return masks as result
      cmpsd Compare double precision floating point value in lower 64 bits and return mask as result
      comisd Compare double precision floating point value in lower 64 bits and return result in flag register
      ucomisd Compare unordered double precision floating point value in lower 64 bits and return result in flag register
      XMM logical operations andpd Parallel and operation on all 128 bits
      andnpd Parallel and not operation on all 128 bits
      orpd Parallel or operation on all 128 bits
      xorpd Parallel xor operation on all 128 bits
      XMM shift operations pslldq Shift all 128 bits left
      psrldq Shift all 128 bits rights, filling in zeros

      Here come some examples:

      Instruction

      Operation

      paddd xmm0,xmm6 Parallel add integer dwords from xmm6 to xmm0
      pmaddwd xmm2,[edx] Parallel multiply signed integer words in xmm2 and at [edi] and add results
      sqrtpd xmm0,xmm0 Parallel double precision floating point square root of values in xmm0
      addpd xmm2,[ebx+16] Parallel add double precision floating point values from [ebx+16] to xmm2
      psrldq xmm5,1 Shift right all 128 bits by one, filling in a zero

      As can be seen from the above instruction list there are still a number of limitations on this set of instructions:

      • Not all data types are supported for all operations, i.e. 32 bit saturated adds, byte-wide shifts are missing etc.
      • There are only 8 registers
      • There are no real provisions for unaligned access
      • Constants are not supported

      It is unclear whether future instruction set extensions will remove some of these limitations.

      Nevertheless the SSE2 instruction set in effect is a parallel version of the typical integer and floating point instructions found in the basic instruction set of most processors, thereby enabling impressive speed-ups.

      Jump to top of page

      General Issues with SIMD Instructions

      The basic characteristic of SIMD instructions is that they operate on n data items in parallel, n typically being 1, 2, 4, 8 or 16. There are mainly 5 problems with this:

      • If the data layout does not match the SIMD requirements SIMD instructions can either not be used or data rearrangement code is necessary
      • In case of unaligned data the performance will suffer dramatically; this problem leads to massively more complicated alignment code being necessary
      • If the problem does not match the possible values of n, for example in case of a filter which would require n = 5, the solution is rather complicated 
      • (Parallel) Table lookups are not possible
      • Compare operations are limited

      Basically there are solutions to these problems, but they require some clever or additional code. Due to these problems even compilers which generate SIMD code will in many cases not do so or not be able to do so.

      Jump to top of page

      Speed-Ups

      SIMD instructions have the potential to speed-up software by factors of 2 .. 16 or even more. This especially applies to compute-intensive software dealing with arrays of data.

      Please also see the case studies which give numbers on speed-up factors.

      Jump to top of page

      Do Compilers Create SIMD Code?

      Most compilers do not create SIMD code, especially neither the Microsoft Visual C/C++ 6.0 nor the GCC compiler 3.0x for x86 do so. Please check your compiler manual for details.

      This simple fact deprives any compiler-only generated program of significant possible speed-ups.

      In fact things are even worse: Because (in case of the SSE/SSE2 instruction sets) the prefetch and write-though instructions are part of the instruction set extensions, these are also not supported.

      Jump to top of page

      Conclusion

      SIMD instructions have the potential to speed-up software by factors of 2 .. 16 or even more.

      However, as most compilers do not utilize the (full) potential of these instructions these speed-ups can generally only be achieved by an optimization expert.

      Hayes Technologies offers services to help your software realize the high potential of the SIMD instruction sets.

      Jump to top of page

        本站是提供個人知識管理的網絡存儲空間,所有內容均由用戶發(fā)布,不代表本站觀點。請注意甄別內容中的聯系方式、誘導購買等信息,謹防詐騙。如發(fā)現有害或侵權內容,請點擊一鍵舉報。
        轉藏 分享 獻花(0

        0條評論

        發(fā)表

        請遵守用戶 評論公約

        類似文章 更多