After experimenting with OpenCL, I became frustrated with the poor performance of the NVIDIA 9400M card and started playing with SSE vector instructions. While there is some good documentation on these instructions, it is spread out and tedious to navigate. This page is a combination of my notes, a large table containing all MMX and SSE instructions up to SSE 4.2, and some JavaScript to filter the table.

The original Intel instruction set is called IA32. The early Pentium had a set of seven integer General Purpose Registers (GPRs) and a stack of eight floating point unit (FPU) registers. Instructions placed their results into a single register or memory location. This is called a scalar operation.

Multi-Media Extensions (MMX) was Intel's first attempt at providing vector instructions. MMX re-uses the floating point unit's registers but treats each 64-bit register as an array, or vector, of packed integers. An MMX instruction can operate on eight 8-bit, four 16-bit, or two 32-bit integers in a single MMX register. For operations that could be performed in parallel, MMX could potentially make the code four times as fast or better. This idea is known as Single Instruction Multiple Data (SIMD). The downside to MMX was that the code could not do MMX "packed integer" operations and floating point operations at the same time. After using an MMX instruction, the programmer had to remember to clear the floating point registers with the EMMS instruction.

AMD invented the 3DNow! instruction set, which treated the FPU/MMX registers as vectors of 32-bit floating point values. This suffered from the same register overlap problem as MMX. Even though 3DNow! provided many floating point instructions, it did not provide the more complex functions like trigonometric or logarithmic functions, so many programs still needed to switch modes. The 3DNow! instructions did not win the market away from Intel, and AMD discontinued 3DNow! in 2010 in favor of Intel's later design.

With the Streaming SIMD Extensions (SSE), Intel created an entirely new set of registers. XMM registers were 128 bits long and each could contain four 32-bit floating point values. Like 3DNow!, SSE provided instructions for doing basic floating point math but did not provide the more complex instructions from the IA32 FPU. Later, SSE2 added the ability to use the XMM registers as two 64-bit floating point values or as packed integers.

If the programmer did not use the MMX registers, then the FPU stack could be kept in floating point mode for access to the complex functions. If the program did not require complex floating point math, the MMX registers could be used for scalar integer operations to reduce memory accesses. The extra size of the SSE registers still makes them more attractive for integer vector operations.

SSE2 made the instruction set more consistent and provided many conversion instructions for moving values between different vector formats and general purpose registers. Later SSE versions were refinements that added additional instructions, but did not significantly change the programming environment.

AVX is the next generation. It expands the SSE registers to 256 bits each, which could double the throughput of SSE code, and it provides a new encoding scheme (VEX) that allows non-destructive three-operand instructions.

Version | Intel | AMD |
---|---|---|
MMX | 1996, Pentium P5 w/MMX; Pentium II | 1997, K6 |
3DNow! | not invented here | 1998, K6-2 |
SSE | 1999, Pentium III | 2001, K7 Palomino |
SSE2 | 2001, Pentium 4, Xeon | 2005, K8 core |
SSE3 | 2004, Pentium 4 w/HT | 2005, K8 core |
SSSE3 | 2006, Core 2; Xeon | 2011, Bobcat, Bulldozer |
SSE4.1 | 2007, Penryn 45nm process | 2011, Bulldozer |
SSE4.2 | 2008, Nehalem Core i7 | 2011, Bulldozer |
AVX | 2011, Sandy Bridge | 2011, Bulldozer |
AVX2 | 2013, Haswell | no data |

MMX should be avoided due to its conflicts with the FPU. SSE3 has been supported by both Intel and AMD for six years now, so it is reasonable to treat SSE3 as the baseline for programming in 2012. The next major jump in performance will come with AVX, which AMD also supports, but AVX has only been on the market since 2011.

Before using vector instructions, you should check which instructions your current processor supports using the CPUID instruction. Below are two ways of formatting that information.

```c
// Inline ASM for GCC.
#define cpuid(func, ax, bx, cx, dx) \
    __asm__ __volatile__ ("cpuid" : "=a" (ax), "=b" (bx), "=c" (cx), "=d" (dx) : "a" (func));

#define xgetbv(func, lo, hi) \
    __asm__ __volatile__ ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (func));

// If we assume that each SSE level implies that all previous levels are
// supported, we can reduce the check to a single number.
int get_sse_level()
{
    int a, b, c, d, e, f;
    cpuid(1, a, b, c, d);
    if ((c & 0x18000000) == 0x18000000) {   // AVX bit, OSXSAVE bit
        xgetbv(0, e, f);
        if ((e & 6) == 6) {
            return 500;                     // AVX
        }
    }
    if (c & (1 << 20)) return 420;          // SSE4.2
    if (c & (1 << 19)) return 410;          // SSE4.1
    if (c & (1 << 9))  return 310;          // SSSE3
    if (c & (1 << 0))  return 300;          // SSE3
    if (d & (1 << 26)) return 200;          // SSE2
    if (d & (1 << 25)) return 100;          // SSE
    if (d & (1 << 23)) return 10;           // MMX
    return 0;
}

// It is safer to look at the individual bits for each instruction group
// and optional instructions.
int get_sse_bits()
{
    int a, b, c, d, e, f;
    cpuid(1, a, b, c, d);
    int bits = 0;
    if (d & (1 << 23)) bits |= 0x0001;      // MMX
    if (d & (1 << 24)) bits |= 0x0002;      // FXSAVE, FXRSTOR (always with SSE)
    if (d & (1 << 25)) bits |= 0x0004;      // SSE
    if (d & (1 << 26)) bits |= 0x0008;      // SSE2
    if (c & (1 << 0))  bits |= 0x0010;      // SSE3
    if (c & (1 << 9))  bits |= 0x0020;      // SSSE3
    if (c & (1 << 19)) bits |= 0x0040;      // SSE4.1
    if (c & (1 << 20)) bits |= 0x0080;      // SSE4.2
    if (c & (1 << 28)) {                    // AVX
        if (c & (1 << 27)) {                // OSXSAVE
            xgetbv(0, e, f);
            if ((e & 6) == 6) {             // XMM and YMM state enabled by the OS
                bits |= 0x0100;             // AVX
            }
        }
    }
    if (c & (1 << 3))  bits |= 0x01000;     // MONITOR, MWAIT (SSE3 option)
    if (d & (1 << 19)) bits |= 0x02000;     // CLFLUSH (SSE2 option)
    if (c & (1 << 1))  bits |= 0x04000;     // PCLMULQDQ (SSE option)
    if (c & (1 << 12)) bits |= 0x08000;     // FMA (AVX option)
    if (c & (1 << 23)) bits |= 0x10000;     // POPCNT
    return bits;
}
```

When Intel introduced the MMX instruction set, they implemented special "intrinsic" functions in their compiler so that vector values could be manipulated directly from C/C++ code. Microsoft and the GCC developers copied those intrinsics in their own compilers. For the MMX instructions, most assembler instructions have two intrinsics: one formed by prepending _m_ to the assembler mnemonic, and a more descriptive one beginning with _mm_. So although the assembler mnemonic PUNPCKHBW has an intrinsic _m_punpckhbw, most programmers would use the more descriptive _mm_unpackhi_pi8 instead. There are exceptions, such as the bidirectional MOVD (_m_from_int = _mm_cvtsi32_si64 and _m_to_int = _mm_cvtsi64_si32), and composite intrinsics such as _mm_set_pi8 that correspond to more than one assembler instruction. For SSE, many of the _m_ forms were never defined.

The GCC developers implemented a set of intrinsics that correspond to assembler instructions (PBLENDW = __builtin_ia32_pblendw128) but they also provided header files to duplicate the Intel intrinsics. You only need to include one header file: use the one that provides the latest SSE version that you use. The current header files are:

Instruction set | Header file |
---|---|
MMX | mmintrin.h |
SSE | xmmintrin.h |
SSE2 | emmintrin.h |
SSE3 | pmmintrin.h |
SSSE3 | tmmintrin.h |
SSE4.1 | smmintrin.h |
SSE4.2 | nmmintrin.h |
AVX | immintrin.h |

Note that some of the intrinsics documented here have nothing to do with vector operations. Instructions such as PAUSE, POPCNT, and CRC32 were introduced by Intel within a group of SSE instructions but they do not really belong there. Note also that some instructions were introduced with an SSE group but have their own CPUID bit, so you must check for each of them individually.

I Set | Fun Cat | Data Type | Intrinsic | Description |
---|---|---|---|---|

MMX | BITLOGIC | INTMMX | __m64 _mm_and_si64(__m64 __m1, __m64 __m2) | Bit-wise AND the 64-bit values in M1 and M2. |

MMX | BITLOGIC | INTMMX | __m64 _m_pand(__m64 __m1, __m64 __m2) | Bit-wise AND the 64-bit values in M1 and M2. |

MMX | BITLOGIC | INTMMX | __m64 _mm_andnot_si64(__m64 __m1, __m64 __m2) | Bit-wise complement the 64-bit value in M1 and bit-wise AND it with the 64-bit value in M2. |

MMX | BITLOGIC | INTMMX | __m64 _m_pandn(__m64 __m1, __m64 __m2) | Bit-wise complement the 64-bit value in M1 and bit-wise AND it with the 64-bit value in M2. |

MMX | BITLOGIC | INTMMX | __m64 _mm_or_si64(__m64 __m1, __m64 __m2) | Bit-wise inclusive OR the 64-bit values in M1 and M2. |

MMX | BITLOGIC | INTMMX | __m64 _m_por(__m64 __m1, __m64 __m2) | Bit-wise inclusive OR the 64-bit values in M1 and M2. |

MMX | BITLOGIC | INTMMX | __m64 _mm_xor_si64(__m64 __m1, __m64 __m2) | Bit-wise exclusive OR the 64-bit values in M1 and M2. |

MMX | BITLOGIC | INTMMX | __m64 _m_pxor(__m64 __m1, __m64 __m2) | Bit-wise exclusive OR the 64-bit values in M1 and M2. |

MMX | BITSHIFT | INTMMX | __m64 _mm_sll_pi16(__m64 __m, __m64 __count) | Shift four 16-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _m_psllw(__m64 __m, __m64 __count) | Shift four 16-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _mm_slli_pi16(__m64 __m, int __count) | Shift four 16-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _m_psllwi(__m64 __m, int __count) | Shift four 16-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _mm_sll_pi32(__m64 __m, __m64 __count) | Shift two 32-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _m_pslld(__m64 __m, __m64 __count) | Shift two 32-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _mm_slli_pi32(__m64 __m, int __count) | Shift two 32-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _m_pslldi(__m64 __m, int __count) | Shift two 32-bit values in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _mm_sll_si64(__m64 __m, __m64 __count) | Shift the 64-bit value in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _m_psllq(__m64 __m, __m64 __count) | Shift the 64-bit value in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _mm_slli_si64(__m64 __m, int __count) | Shift the 64-bit value in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _m_psllqi(__m64 __m, int __count) | Shift the 64-bit value in M left by COUNT. |

MMX | BITSHIFT | INTMMX | __m64 _mm_sra_pi16(__m64 __m, __m64 __count) | Shift four 16-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _m_psraw(__m64 __m, __m64 __count) | Shift four 16-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srai_pi16(__m64 __m, int __count) | Shift four 16-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrawi(__m64 __m, int __count) | Shift four 16-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _mm_sra_pi32(__m64 __m, __m64 __count) | Shift two 32-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrad(__m64 __m, __m64 __count) | Shift two 32-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srai_pi32(__m64 __m, int __count) | Shift two 32-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _m_psradi(__m64 __m, int __count) | Shift two 32-bit values in M right by COUNT; shift in the sign bit. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srl_pi16(__m64 __m, __m64 __count) | Shift four 16-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrlw(__m64 __m, __m64 __count) | Shift four 16-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srli_pi16(__m64 __m, int __count) | Shift four 16-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrlwi(__m64 __m, int __count) | Shift four 16-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srl_pi32(__m64 __m, __m64 __count) | Shift two 32-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrld(__m64 __m, __m64 __count) | Shift two 32-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srli_pi32(__m64 __m, int __count) | Shift two 32-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrldi(__m64 __m, int __count) | Shift two 32-bit values in M right by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srl_si64(__m64 __m, __m64 __count) | Shift the 64-bit value in M left by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrlq(__m64 __m, __m64 __count) | Shift the 64-bit value in M left by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _mm_srli_si64(__m64 __m, int __count) | Shift the 64-bit value in M left by COUNT; shift in zeros. |

MMX | BITSHIFT | INTMMX | __m64 _m_psrlqi(__m64 __m, int __count) | Shift the 64-bit value in M left by COUNT; shift in zeros. |

MMX | COMPARE | INTMMX | __m64 _mm_cmpeq_pi8(__m64 __m1, __m64 __m2) | Compare eight 8-bit values. The result of the comparison is 0xFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _m_pcmpeqb(__m64 __m1, __m64 __m2) | Compare eight 8-bit values. The result of the comparison is 0xFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _mm_cmpgt_pi8(__m64 __m1, __m64 __m2) | Compare eight 8-bit values. The result of the comparison is 0xFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _m_pcmpgtb(__m64 __m1, __m64 __m2) | |

MMX | COMPARE | INTMMX | __m64 _mm_cmpeq_pi16(__m64 __m1, __m64 __m2) | Compare four 16-bit values. The result of the comparison is 0xFFFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _m_pcmpeqw(__m64 __m1, __m64 __m2) | Compare four 16-bit values. The result of the comparison is 0xFFFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _mm_cmpgt_pi16(__m64 __m1, __m64 __m2) | Compare four 16-bit values. The result of the comparison is 0xFFFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _m_pcmpgtw(__m64 __m1, __m64 __m2) | |

MMX | COMPARE | INTMMX | __m64 _mm_cmpeq_pi32(__m64 __m1, __m64 __m2) | Compare two 32-bit values. The result of the comparison is 0xFFFFFFFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _m_pcmpeqd(__m64 __m1, __m64 __m2) | Compare two 32-bit values. The result of the comparison is 0xFFFFFFFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _mm_cmpgt_pi32(__m64 __m1, __m64 __m2) | Compare two 32-bit values. The result of the comparison is 0xFFFFFFFF if the test is true and zero if false. |

MMX | COMPARE | INTMMX | __m64 _m_pcmpgtd(__m64 __m1, __m64 __m2) | |

MMX | CONVERT | INTMMX | __m64 _mm_cvtsi32_si64(int __i) | Convert I to a __m64 object. The integer is zero-extended to 64-bits. |

MMX | CONVERT | INTMMX | __m64 _m_from_int(int __i) | Convert I to a __m64 object. The integer is zero-extended to 64-bits. |

MMX | CONVERT | INTMMX | __m64 _m_from_int64(long long __i) | Convert I to a __m64 object. |

MMX | CONVERT | INTMMX | __m64 _mm_cvtsi64_m64(long long __i) | Convert I to a __m64 object. |

MMX | CONVERT | INTMMX | __m64 _mm_cvtsi64x_si64(long long __i) | Convert I to a __m64 object. |

MMX | CONVERT | INTMMX | __m64 _mm_set_pi64x(long long __i) | Convert I to a __m64 object. |

SSE | CONVERT | INTMMX | __m64 _mm_cvtps_pi32(__m128 __A) | Convert two lowest floats in vector to doubleword integers in MMX register. Round per MXCSR. |

SSE | CONVERT | INTMMX | __m64 _mm_cvt_ps2pi(__m128 __A) | Convert two lowest floats in vector to doubleword integers in MMX register. Round per MXCSR. |

SSE | CONVERT | INTMMX | __m64 _mm_cvttps_pi32(__m128 __A) | Convert two lowest floats in vector to doubleword integers in MMX register. Round by truncation. |

SSE | CONVERT | INTMMX | __m64 _mm_cvtt_ps2pi(__m128 __A) | Convert two lowest floats in vector to doubleword integers in MMX register. Round by truncation. |

SSE | CONVERT | INTMMX | __m128 _mm_cvtpi32_ps(__m128 __A, __m64 __B) | Convert two doubleword integers in B to floats and replace two low elements in A. |

SSE | CONVERT | INTMMX | __m128 _mm_cvt_pi2ps(__m128 __A, __m64 __B) | Convert two doubleword integers in B to floats and replace two low elements in A. |

SSE | CONVERT | INTMMX | __m128 _mm_cvtpi16_ps(__m64 __A) | Convert four signed word integers to floats. |

SSE | CONVERT | INTMMX | __m128 _mm_cvtpu16_ps(__m64 __A) | Convert four unsigned word integers to floats. |

SSE | CONVERT | INTMMX | __m128 _mm_cvtpi8_ps(__m64 __A) | Convert low four signed bytes to floats. |

SSE | CONVERT | INTMMX | __m128 _mm_cvtpu8_ps(__m64 __A) | Convert low four unsigned bytes to floats. |

SSE | CONVERT | INTMMX | __m128 _mm_cvtpi32x2_ps(__m64 __A, __m64 __B) | Convert four signed doubleword integers to floats. |

SSE | CONVERT | INTMMX | __m64 _mm_cvtps_pi16(__m128 __A) | Convert the four SPFP values in A to four signed 16-bit integers. |

SSE | CONVERT | INTMMX | __m64 _mm_cvtps_pi8(__m128 __A) | Convert the four SPFP values in A to four signed 8-bit integers. |

SSE2 | CONVERT | INTMMX | __m64 _mm_cvtpd_pi32(__m128d __A) | Convert two double floats to doubleword integers. Round per MXCSR. |

SSE2 | CONVERT | INTMMX | __m64 _mm_cvttpd_pi32(__m128d __A) | Convert two double floats to doubleword integers. Round by truncation. |

SSE2 | CONVERT | INTMMX | __m128d _mm_cvtpi32_pd(__m64 __A) | Convert two doubleword integers to double floats. |

SSE2 | CONVERT | INTMMX | __m64 _mm_movepi64_pi64(__m128i __B) | Move low quadword from XMM to MMX register. |

SSE2 | CONVERT | INTMMX | __m128i _mm_movpi64_epi64(__m64 __A) | Move MMX register to low quadword of XMM register. |

MMX | EXTRACT | INTMMX | int _mm_cvtsi64_si32(__m64 __i) | Convert the lower 32 bits of the __m64 object into an integer. |

MMX | EXTRACT | INTMMX | int _m_to_int(__m64 __i) | Convert the lower 32 bits of the __m64 object into an integer. |

MMX | EXTRACT | INTMMX | long long _m_to_int64(__m64 __i) | Convert the __m64 object to a 64bit integer. |

MMX | EXTRACT | INTMMX | long long _mm_cvtm64_si64(__m64 __i) | Convert the __m64 object to a 64bit integer. |

MMX | EXTRACT | INTMMX | long long _mm_cvtsi64_si64x(__m64 __i) | Convert the __m64 object to a 64bit integer. |

SSE | EXTRACT | INTMMX | int _mm_extract_pi16(__m64 const __A, int const __N) | Extracts one of the four words of A. The selector N must be immediate. |

SSE | EXTRACT | INTMMX | int _m_pextrw(__m64 const __A, int const __N) | Extracts one of the four words of A. The selector N must be immediate. |

SSE | EXTRACT | INTMMX | int _mm_movemask_pi8(__m64 __A) | Create an 8-bit mask of the signs of 8-bit values. |

SSE | EXTRACT | INTMMX | int _m_pmovmskb(__m64 __A) | Create an 8-bit mask of the signs of 8-bit values. |

SSE | INSERT | INTMMX | __m64 _mm_insert_pi16(__m64 const __A, int const __D, int const __N) | Inserts word D into one of four words of A. The selector N must be immediate. |

SSE | INSERT | INTMMX | __m64 _m_pinsrw(__m64 const __A, int const __D, int const __N) | Inserts word D into one of four words of A. The selector N must be immediate. |

SSE | LOADSTORE | INTMMX | void _mm_stream_pi(__m64 * __P, __m64 __A) | Write value to memory without polluting caches. |

SSE | MATHOP | INTMMX | __m64 _mm_max_pi16(__m64 __A, __m64 __B) | Compute the element-wise maximum of signed 16-bit values. |

SSE | MATHOP | INTMMX | __m64 _m_pmaxsw(__m64 __A, __m64 __B) | Compute the element-wise maximum of signed 16-bit values. |

SSE | MATHOP | INTMMX | __m64 _mm_max_pu8(__m64 __A, __m64 __B) | Compute the element-wise maximum of unsigned 8-bit values. |

SSE | MATHOP | INTMMX | __m64 _m_pmaxub(__m64 __A, __m64 __B) | Compute the element-wise maximum of unsigned 8-bit values. |

SSE | MATHOP | INTMMX | __m64 _mm_min_pi16(__m64 __A, __m64 __B) | Compute the element-wise minimum of signed 16-bit values. |

SSE | MATHOP | INTMMX | __m64 _m_pminsw(__m64 __A, __m64 __B) | Compute the element-wise minimum of signed 16-bit values. |

SSE | MATHOP | INTMMX | __m64 _mm_min_pu8(__m64 __A, __m64 __B) | Compute the element-wise minimum of unsigned 8-bit values. |

SSE | MATHOP | INTMMX | __m64 _m_pminub(__m64 __A, __m64 __B) | Compute the element-wise minimum of unsigned 8-bit values. |

SSE | MATHOP | INTMMX | __m64 _mm_mulhi_pu16(__m64 __A, __m64 __B) | Multiply four unsigned 16-bit values in A by four unsigned 16-bit values in B and produce the high 16 bits of the 32-bit results. |

SSE | MATHOP | INTMMX | __m64 _m_pmulhuw(__m64 __A, __m64 __B) | Multiply four unsigned 16-bit values in A by four unsigned 16-bit values in B and produce the high 16 bits of the 32-bit results. |

SSE | MATHOP | INTMMX | __m64 _mm_avg_pu8(__m64 __A, __m64 __B) | Compute the rounded averages of the unsigned 8-bit values in A and B. |

SSE | MATHOP | INTMMX | __m64 _m_pavgb(__m64 __A, __m64 __B) | Compute the rounded averages of the unsigned 8-bit values in A and B. |

SSE | MATHOP | INTMMX | __m64 _mm_avg_pu16(__m64 __A, __m64 __B) | Compute the rounded averages of the unsigned 16-bit values in A and B. |

SSE | MATHOP | INTMMX | __m64 _m_pavgw(__m64 __A, __m64 __B) | Compute the rounded averages of the unsigned 16-bit values in A and B. |

SSE | MATHOP | INTMMX | __m64 _mm_sad_pu8(__m64 __A, __m64 __B) | Compute the sum of the absolute differences of the unsigned 8-bit values in A and B. Return the value in the lower 16-bit word; the upper words are cleared. |

SSE | MATHOP | INTMMX | __m64 _m_psadbw(__m64 __A, __m64 __B) | Compute the sum of the absolute differences of the unsigned 8-bit values in A and B. Return the value in the lower 16-bit word; the upper words are cleared. |

SSE | MATHOP | INTMMX | __m64 _mm_add_pi8(__m64 __m1, __m64 __m2) | Add the 8-bit values in M1 to the 8-bit values in M2. |

SSE | MATHOP | INTMMX | __m64 _m_paddb(__m64 __m1, __m64 __m2) | Add the 8-bit values in M1 to the 8-bit values in M2. |

SSE | MATHOP | INTMMX | __m64 _mm_add_pi16(__m64 __m1, __m64 __m2) | Add the 16-bit values in M1 to the 16-bit values in M2. |

SSE | MATHOP | INTMMX | __m64 _m_paddw(__m64 __m1, __m64 __m2) | Add the 16-bit values in M1 to the 16-bit values in M2. |

SSE | MATHOP | INTMMX | __m64 _mm_add_pi32(__m64 __m1, __m64 __m2) | Add the 32-bit values in M1 to the 32-bit values in M2. |

SSE | MATHOP | INTMMX | __m64 _m_paddd(__m64 __m1, __m64 __m2) | Add the 32-bit values in M1 to the 32-bit values in M2. |

SSE | MATHOP | INTMMX | __m64 _mm_adds_pi8(__m64 __m1, __m64 __m2) | Add the 8-bit values in M1 to the 8-bit values in M2 using signed saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_paddsb(__m64 __m1, __m64 __m2) | Add the 8-bit values in M1 to the 8-bit values in M2 using signed saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_adds_pi16(__m64 __m1, __m64 __m2) | Add the 16-bit values in M1 to the 16-bit values in M2 using signed saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_paddsw(__m64 __m1, __m64 __m2) | Add the 16-bit values in M1 to the 16-bit values in M2 using signed saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_adds_pu8(__m64 __m1, __m64 __m2) | Add the 8-bit values in M1 to the 8-bit values in M2 using unsigned saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_paddusb(__m64 __m1, __m64 __m2) | Add the 8-bit values in M1 to the 8-bit values in M2 using unsigned saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_adds_pu16(__m64 __m1, __m64 __m2) | Add the 16-bit values in M1 to the 16-bit values in M2 using unsigned saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_paddusw(__m64 __m1, __m64 __m2) | Add the 16-bit values in M1 to the 16-bit values in M2 using unsigned saturated arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_sub_pi8(__m64 __m1, __m64 __m2) | Subtract the 8-bit values in M2 from the 8-bit values in M1. |

SSE | MATHOP | INTMMX | __m64 _m_psubb(__m64 __m1, __m64 __m2) | Subtract the 8-bit values in M2 from the 8-bit values in M1. |

SSE | MATHOP | INTMMX | __m64 _mm_sub_pi16(__m64 __m1, __m64 __m2) | Subtract the 16-bit values in M2 from the 16-bit values in M1. |

SSE | MATHOP | INTMMX | __m64 _m_psubw(__m64 __m1, __m64 __m2) | Subtract the 16-bit values in M2 from the 16-bit values in M1. |

SSE | MATHOP | INTMMX | __m64 _mm_sub_pi32(__m64 __m1, __m64 __m2) | Subtract the 32-bit values in M2 from the 32-bit values in M1. |

SSE | MATHOP | INTMMX | __m64 _m_psubd(__m64 __m1, __m64 __m2) | Subtract the 32-bit values in M2 from the 32-bit values in M1. |

SSE | MATHOP | INTMMX | __m64 _mm_subs_pi8(__m64 __m1, __m64 __m2) | Subtract the 8-bit values in M2 from the 8-bit values in M1 using signed saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_psubsb(__m64 __m1, __m64 __m2) | Subtract the 8-bit values in M2 from the 8-bit values in M1 using signed saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_subs_pi16(__m64 __m1, __m64 __m2) | Subtract the 16-bit values in M2 from the 16-bit values in M1 using signed saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_psubsw(__m64 __m1, __m64 __m2) | Subtract the 16-bit values in M2 from the 16-bit values in M1 using signed saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_subs_pu8(__m64 __m1, __m64 __m2) | Subtract the 8-bit values in M2 from the 8-bit values in M1 using unsigned saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_psubusb(__m64 __m1, __m64 __m2) | Subtract the 8-bit values in M2 from the 8-bit values in M1 using unsigned saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_subs_pu16(__m64 __m1, __m64 __m2) | Subtract the 16-bit values in M2 from the 16-bit values in M1 using unsigned saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _m_psubusw(__m64 __m1, __m64 __m2) | Subtract the 16-bit values in M2 from the 16-bit values in M1 using unsigned saturating arithmetic. |

SSE | MATHOP | INTMMX | __m64 _mm_madd_pi16(__m64 __m1, __m64 __m2) | Multiply four 16-bit values in M1 by four 16-bit values in M2 producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results. |

SSE | MATHOP | INTMMX | __m64 _m_pmaddwd(__m64 __m1, __m64 __m2) | Multiply four 16-bit values in M1 by four 16-bit values in M2 producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results. |

SSE | MATHOP | INTMMX | __m64 _mm_mulhi_pi16(__m64 __m1, __m64 __m2) | Multiply four signed 16-bit values in M1 by four signed 16-bit values in M2 and produce the high 16 bits of the 32-bit results. |

SSE | MATHOP | INTMMX | __m64 _m_pmulhw(__m64 __m1, __m64 __m2) | Multiply four signed 16-bit values in M1 by four signed 16-bit values in M2 and produce the high 16 bits of the 32-bit results. |

SSE | MATHOP | INTMMX | __m64 _mm_mullo_pi16(__m64 __m1, __m64 __m2) | Multiply four 16-bit values in M1 by four 16-bit values in M2 and produce the low 16 bits of the results. |

SSE | MATHOP | INTMMX | __m64 _m_pmullw(__m64 __m1, __m64 __m2) | Multiply four 16-bit values in M1 by four 16-bit values in M2 and produce the low 16 bits of the results. |

SSE2 | MATHOP | INTMMX | __m64 _mm_add_si64(__m64 __m1, __m64 __m2) | Add the 64-bit value in M1 to the 64-bit value in M2. |

SSE2 | MATHOP | INTMMX | __m64 _mm_sub_si64(__m64 __m1, __m64 __m2) | Subtract the 64-bit value in M2 from the 64-bit value in M1. |

SSE2 | MATHOP | INTMMX | __m64 _mm_mul_su32(__m64 __A, __m64 __B) | Multiply low unsigned doublewords and returns quadword result. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_abs_pi8(__m64 __X) | Get absolute values of signed elements. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_abs_pi16(__m64 __X) | Get absolute values of signed elements. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_abs_pi32(__m64 __X) | Get absolute values of signed elements. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_hadd_pi16(__m64 __X, __m64 __Y) | Horizontal addition across vectors. Returns [[Xi0+Xi1] [Xi2+Xi3] [Yi0+Yi1] [Yi2+Yi3]]. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_hadd_pi32(__m64 __X, __m64 __Y) | Horizontal addition across vectors. Returns [[Xi0+Xi1] [Yi0+Yi1]]. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_hadds_pi16(__m64 __X, __m64 __Y) | Horizontal addition across vectors with signed saturation. Returns [[Xi0+Xi1] [Xi2+Xi3] [Yi0+Yi1] [Yi2+Yi3]]. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_hsub_pi16(__m64 __X, __m64 __Y) | Horizontal subtraction across vectors. Returns [[Xi0-Xi1] [Xi2-Xi3] [Yi0-Yi1] [Yi2-Yi3]]. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_hsub_pi32(__m64 __X, __m64 __Y) | Horizontal subtraction across vectors. Returns [[Xi0-Xi1] [Yi0-Yi1]]. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_hsubs_pi16(__m64 __X, __m64 __Y) | Horizontal subtraction across vectors with signed saturation. Returns [[Xi0-Xi1] [Xi2-Xi3] [Yi0-Yi1] [Yi2-Yi3]]. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_maddubs_pi16(__m64 __X, __m64 __Y) | Multiplies vertically each unsigned byte of X with the corresponding signed byte of Y, producing intermediate signed 16-bit integers. Each adjacent pair of signed words is added and the saturated result is returned. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_mulhrs_pi16(__m64 __X, __m64 __Y) | Multiplies vertically each signed 16-bit integer from X with the corresponding signed 16-bit integer of Y, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and packed. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_sign_pi8(__m64 __X, __m64 __Y) | Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_sign_pi16(__m64 __X, __m64 __Y) | Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_sign_pi32(__m64 __X, __m64 __Y) | Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y. |

SSSE3 | MATHOP | INTMMX | __m64 _mm_alignr_pi8(__m64 __X, __m64 __Y, int __N) | Concatenates X and Y into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result. |

MMX | OTHER | INTMMX | void _mm_empty(void ) | Empty the multimedia state. |

MMX | OTHER | INTMMX | void _m_empty(void ) | Empty the multimedia state. |

MMX | SET | INTMMX | __m64 _mm_setzero_si64(void ) | Creates a 64-bit zero. |

MMX | SET | INTMMX | __m64 _mm_set_pi32(int __i1, int __i0) | Creates a vector of two 32-bit values; I0 is least significant. |

MMX | SET | INTMMX | __m64 _mm_set_pi16(short __w3, short __w2, short __w1, short __w0) | Creates a vector of four 16-bit values; W0 is least significant. |

MMX | SET | INTMMX | __m64 _mm_set_pi8(char __b7, ... char __b0) | Creates a vector of eight 8-bit values; B0 is least significant. |

MMX | SET | INTMMX | __m64 _mm_setr_pi32(int __i0, int __i1) | Creates a vector of two 32-bit values; I0 is least significant. |

MMX | SET | INTMMX | __m64 _mm_setr_pi16(short __w0, short __w1, short __w2, short __w3) | Creates a vector of four 16-bit values; W0 is least significant. |

MMX | SET | INTMMX | __m64 _mm_setr_pi8(char __b0, ... char __b7) | Creates a vector of eight 8-bit values; B0 is least significant. |

MMX | SET | INTMMX | __m64 _mm_set1_pi32(int __i) | Creates a vector of two 32-bit values, both elements containing I. |

MMX | SET | INTMMX | __m64 _mm_set1_pi16(short __w) | Creates a vector of four 16-bit values, all elements containing W. |

MMX | SET | INTMMX | __m64 _mm_set1_pi8(char __b) | Creates a vector of eight 8-bit values, all elements containing B. |

MMX | SHUFFLE | INTMMX | __m64 _mm_packs_pi16(__m64 __m1, __m64 __m2) | Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with signed saturation. |

MMX | SHUFFLE | INTMMX | __m64 _m_packsswb(__m64 __m1, __m64 __m2) | Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with signed saturation. |

MMX | SHUFFLE | INTMMX | __m64 _mm_packs_pi32(__m64 __m1, __m64 __m2) | Pack the two 32-bit values from M1 into the lower two 16-bit values of the result, and the two 32-bit values from M2 into the upper two 16-bit values of the result, all with signed saturation. |

MMX | SHUFFLE | INTMMX | __m64 _m_packssdw(__m64 __m1, __m64 __m2) | Pack the two 32-bit values from M1 into the lower two 16-bit values of the result, and the two 32-bit values from M2 into the upper two 16-bit values of the result, all with signed saturation. |

MMX | SHUFFLE | INTMMX | __m64 _mm_packs_pu16(__m64 __m1, __m64 __m2) | Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with unsigned saturation. |

MMX | SHUFFLE | INTMMX | __m64 _m_packuswb(__m64 __m1, __m64 __m2) | Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with unsigned saturation. |

MMX | SHUFFLE | INTMMX | __m64 _mm_unpackhi_pi8(__m64 __m1, __m64 __m2) | Interleave the four 8-bit values from the high half of M1 with the four 8-bit values from the high half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _m_punpckhbw(__m64 __m1, __m64 __m2) | Interleave the four 8-bit values from the high half of M1 with the four 8-bit values from the high half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _mm_unpackhi_pi16(__m64 __m1, __m64 __m2) | Interleave the two 16-bit values from the high half of M1 with the two 16-bit values from the high half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _m_punpckhwd(__m64 __m1, __m64 __m2) | Interleave the two 16-bit values from the high half of M1 with the two 16-bit values from the high half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _mm_unpackhi_pi32(__m64 __m1, __m64 __m2) | Interleave the 32-bit value from the high half of M1 with the 32-bit value from the high half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _m_punpckhdq(__m64 __m1, __m64 __m2) | Interleave the 32-bit value from the high half of M1 with the 32-bit value from the high half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _mm_unpacklo_pi8(__m64 __m1, __m64 __m2) | Interleave the four 8-bit values from the low half of M1 with the four 8-bit values from the low half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _m_punpcklbw(__m64 __m1, __m64 __m2) | Interleave the four 8-bit values from the low half of M1 with the four 8-bit values from the low half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _mm_unpacklo_pi16(__m64 __m1, __m64 __m2) | Interleave the two 16-bit values from the low half of M1 with the two 16-bit values from the low half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _m_punpcklwd(__m64 __m1, __m64 __m2) | Interleave the two 16-bit values from the low half of M1 with the two 16-bit values from the low half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _mm_unpacklo_pi32(__m64 __m1, __m64 __m2) | Interleave the 32-bit value from the low half of M1 with the 32-bit value from the low half of M2. |

MMX | SHUFFLE | INTMMX | __m64 _m_punpckldq(__m64 __m1, __m64 __m2) | Interleave the 32-bit value from the low half of M1 with the 32-bit value from the low half of M2. |

SSE | SHUFFLE | INTMMX | __m64 _mm_shuffle_pi16(__m64 __A, int __N) | Return a combination of the four 16-bit values in A. The selector must be an immediate. |

SSE | SHUFFLE | INTMMX | __m64 _m_pshufw(__m64 __A, int __N) | Return a combination of the four 16-bit values in A. The selector must be an immediate. |

SSE | SHUFFLE | INTMMX | void _mm_maskmove_si64(__m64 __A, __m64 __N, char * __P) | Conditionally store byte elements of A into P. The high bit of each byte in the selector N determines whether the corresponding byte from A is stored. |

SSE | SHUFFLE | INTMMX | void _m_maskmovq(__m64 __A, __m64 __N, char * __P) | Conditionally store byte elements of A into P. The high bit of each byte in the selector N determines whether the corresponding byte from A is stored. |

SSSE3 | SHUFFLE | INTMMX | __m64 _mm_shuffle_pi8(__m64 __X, __m64 __Y) | Permute bytes in X. For each byte in Y, if the high bit is set, the corresponding byte in X is zeroed out. Otherwise, the low bits of the Y byte specify the source of the byte in X. |

SSE2 | BITLOGIC | INTSSE | __m128i _mm_and_si128(__m128i __A, __m128i __B) | Bitwise logic |

SSE2 | BITLOGIC | INTSSE | __m128i _mm_andnot_si128(__m128i __A, __m128i __B) | Bitwise logic |

SSE2 | BITLOGIC | INTSSE | __m128i _mm_or_si128(__m128i __A, __m128i __B) | Bitwise logic |

SSE2 | BITLOGIC | INTSSE | __m128i _mm_xor_si128(__m128i __A, __m128i __B) | Bitwise logic |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_slli_epi16(__m128i __A, int __B) | Shift left logical (shift in zeros) by immediate count B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_slli_epi32(__m128i __A, int __B) | Shift left logical (shift in zeros) by immediate count B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_slli_epi64(__m128i __A, int __B) | Shift left logical (shift in zeros) by immediate count B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_slli_si128(__m128i __A, int __B) | Shift whole register left by immediate count B (shift in zeros). Unlike the element forms, B counts bytes, not bits. |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srli_epi16(__m128i __A, int __B) | Shift right logical (shift in zeros) by immediate count B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srli_epi32(__m128i __A, int __B) | Shift right logical (shift in zeros) by immediate count B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srli_epi64(__m128i __A, int __B) | Shift right logical (shift in zeros) by immediate count B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srli_si128(__m128i __A, int __B) | Shift whole register right by immediate count B (shift in zeros). Unlike the element forms, B counts bytes, not bits. |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srai_epi16(__m128i __A, int __B) | Shift right arithmetic (duplicate sign bit) by immediate count B. (No byte or 128-bit forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srai_epi32(__m128i __A, int __B) | Shift right arithmetic (duplicate sign bit) by immediate count B. (No byte or 128-bit forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_sll_epi16(__m128i __A, __m128i __B) | Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_sll_epi32(__m128i __A, __m128i __B) | Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_sll_epi64(__m128i __A, __m128i __B) | Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_sll_si128(__m128i __A, __m128i __B) | Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srl_epi16(__m128i __A, __m128i __B) | Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srl_epi32(__m128i __A, __m128i __B) | Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srl_epi64(__m128i __A, __m128i __B) | Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_srl_si128(__m128i __A, __m128i __B) | Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_sra_epi16(__m128i __A, __m128i __B) | Shift right arithmetic (duplicate sign bit) by count in low 64 bits of B. (No byte or 128-bit forms.) |

SSE2 | BITSHIFT | INTSSE | __m128i _mm_sra_epi32(__m128i __A, __m128i __B) | Shift right arithmetic (duplicate sign bit) by count in low 64 bits of B. (No byte or 128-bit forms.) |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmpeq_epi8(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmpeq_epi16(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmpeq_epi32(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmplt_epi8(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmplt_epi16(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmplt_epi32(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmpgt_epi8(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmpgt_epi16(__m128i __A, __m128i __B) | Compare packed integers. |

SSE2 | COMPARE | INTSSE | __m128i _mm_cmpgt_epi32(__m128i __A, __m128i __B) | Compare packed integers. |

SSE41 | COMPARE | INTSSE | __m128i _mm_cmpeq_epi64(__m128i __X, __m128i __Y) | Packed integer 64-bit comparison; each element of the result is set to all zeros or all ones. |

SSE41 | COMPARE | INTSSE | int _mm_testz_si128(__m128i __M, __m128i __V) | Packed integer 128-bit bitwise comparison. Return 1 if (__V & __M) == 0. |

SSE41 | COMPARE | INTSSE | int _mm_testc_si128(__m128i __M, __m128i __V) | Packed integer 128-bit bitwise comparison. Return 1 if (__V & ~__M) == 0. |

SSE41 | COMPARE | INTSSE | int _mm_testnzc_si128(__m128i __M, __m128i __V) | Packed integer 128-bit bitwise comparison. Return 1 if (__V & __M) != 0 && (__V & ~__M) != 0. |

SSE42 | COMPARE | INTSSE | __m128i _mm_cmpgt_epi64(__m128i __X, __m128i __Y) | Packed integer 64-bit comparison; each element of the result is set to all zeros or all ones. |

SSE2 | CONVERT | INTSSE | __m128i _mm_move_epi64(__m128i __A) | Copy the low 64-bit element of A to the result and clear the upper 64 bits. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepi8_epi32(__m128i __X) | Packed integer sign-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepi16_epi32(__m128i __X) | Packed integer sign-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepi8_epi64(__m128i __X) | Packed integer sign-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepi32_epi64(__m128i __X) | Packed integer sign-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepi16_epi64(__m128i __X) | Packed integer sign-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepi8_epi16(__m128i __X) | Packed integer sign-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepu8_epi32(__m128i __X) | Packed integer zero-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepu16_epi32(__m128i __X) | Packed integer zero-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepu8_epi64(__m128i __X) | Packed integer zero-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepu32_epi64(__m128i __X) | Packed integer zero-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepu16_epi64(__m128i __X) | Packed integer zero-extension. |

SSE41 | CONVERT | INTSSE | __m128i _mm_cvtepu8_epi16(__m128i __X) | Packed integer zero-extension. |

SSE2 | EXTRACT | INTSSE | int _mm_cvtsi128_si32(__m128i __A) | Extract low 32-bit integer from vector. |

SSE2 | EXTRACT | INTSSE | long long _mm_cvtsi128_si64(__m128i __A) | Extract low 64-bit integer from vector. |

SSE2 | EXTRACT | INTSSE | long long _mm_cvtsi128_si64x(__m128i __A) | Extract low 64-bit integer from vector. |

SSE2 | EXTRACT | INTSSE | int _mm_movemask_epi8(__m128i __A) | Extract sign bits of byte components into integer. |

SSE2 | EXTRACT | INTSSE | int _mm_extract_epi16(__m128i const __A, int const __N) | Extract word at index N from vector. No sign extension. |

SSE2 | EXTRACT | INTSSE | void _mm_maskmoveu_si128(__m128i __A, __m128i __B, char * __C) | Write bytes of A to memory based on mask in B. If high bit of a byte in B is set, byte is written. C may be unaligned. |

SSE41 | EXTRACT | INTSSE | int _mm_extract_epi8(__m128i __X, const int __N) | Extract integer from packed integer array element of X selected by index N. |

SSE41 | EXTRACT | INTSSE | int _mm_extract_epi32(__m128i __X, const int __N) | Extract integer from packed integer array element of X selected by index N. |

SSE41 | EXTRACT | INTSSE | long long _mm_extract_epi64(__m128i __X, const int __N) | Extract integer from packed integer array element of X selected by index N. |

SSE2 | INSERT | INTSSE | __m128i _mm_insert_epi16(__m128i const __A, int const __D, int const __N) | Insert word D into vector at index N. |

SSE41 | INSERT | INTSSE | __m128i _mm_insert_epi8(__m128i __D, int __S, const int __N) | Insert integer, S, into packed integer array element of D selected by index N. |

SSE41 | INSERT | INTSSE | __m128i _mm_insert_epi32(__m128i __D, int __S, const int __N) | Insert integer, S, into packed integer array element of D selected by index N. |

SSE41 | INSERT | INTSSE | __m128i _mm_insert_epi64(__m128i __D, long long __S, const int __N) | Insert integer, S, into packed integer array element of D selected by index N. |

SSE41 | LOADSTORE | INTSSE | __m128i _mm_stream_load_si128(__m128i * __X) | Load double quadword using non-temporal aligned hint. |

SSE2 | LOADSTORE | INTSSE | __m128i _mm_load_si128(__m128i const * __P) | Load 128-bit integer from aligned address. |

SSE2 | LOADSTORE | INTSSE | __m128i _mm_loadu_si128(__m128i const * __P) | Load 128-bit integer from unaligned address. |

SSE2 | LOADSTORE | INTSSE | void _mm_store_si128(__m128i * __P, __m128i __B) | Store 128-bit integer at aligned address. |

SSE2 | LOADSTORE | INTSSE | void _mm_storeu_si128(__m128i * __P, __m128i __B) | Store 128-bit integer at unaligned address. |

SSE2 | LOADSTORE | INTSSE | __m128i _mm_loadl_epi64(__m128i const * __P) | Load 64-bit integer into low element of vector. |

SSE2 | LOADSTORE | INTSSE | void _mm_storel_epi64(__m128i * __P, __m128i __B) | Store low 64-bit element of vector to memory. |

SSE2 | LOADSTORE | INTSSE | void _mm_stream_si32(int * __A, int __B) | Write value to memory without polluting caches. |

SSE2 | LOADSTORE | INTSSE | void _mm_stream_si128(__m128i * __A, __m128i __B) | Write value to memory without polluting caches. |

SSE2 | MATHOP | INTSSE | __m128i _mm_add_epi8(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_add_epi16(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_add_epi32(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_add_epi64(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_sub_epi8(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_sub_epi16(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_sub_epi32(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_sub_epi64(__m128i __A, __m128i __B) | Integer math, wraparound |

SSE2 | MATHOP | INTSSE | __m128i _mm_adds_epi8(__m128i __A, __m128i __B) | Integer math, signed saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_adds_epi16(__m128i __A, __m128i __B) | Integer math, signed saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_subs_epi8(__m128i __A, __m128i __B) | Integer math, signed saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_subs_epi16(__m128i __A, __m128i __B) | Integer math, signed saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_adds_epu8(__m128i __A, __m128i __B) | Integer math, unsigned saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_adds_epu16(__m128i __A, __m128i __B) | Integer math, unsigned saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_subs_epu8(__m128i __A, __m128i __B) | Integer math, unsigned saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_subs_epu16(__m128i __A, __m128i __B) | Integer math, unsigned saturation |

SSE2 | MATHOP | INTSSE | __m128i _mm_madd_epi16(__m128i __A, __m128i __B) | Multiply packed word integers and add adjacent doublewords. |

SSE2 | MATHOP | INTSSE | __m128i _mm_mullo_epi16(__m128i __A, __m128i __B) | Multiply packed words and return low half of each result. |

SSE2 | MATHOP | INTSSE | __m128i _mm_mulhi_epi16(__m128i __A, __m128i __B) | Multiply packed signed words and return high half of each result. |

SSE2 | MATHOP | INTSSE | __m128i _mm_mulhi_epu16(__m128i __A, __m128i __B) | Multiply packed unsigned words and return high half of each result. |

SSE2 | MATHOP | INTSSE | __m128i _mm_mul_epu32(__m128i __A, __m128i __B) | Multiply the first and third unsigned doublewords and return the two quadword results. |

SSE2 | MATHOP | INTSSE | __m128i _mm_max_epi16(__m128i __A, __m128i __B) | Get min/max of signed words. |

SSE2 | MATHOP | INTSSE | __m128i _mm_min_epi16(__m128i __A, __m128i __B) | Get min/max of signed words. |

SSE2 | MATHOP | INTSSE | __m128i _mm_max_epu8(__m128i __A, __m128i __B) | Get min/max of unsigned bytes. |

SSE2 | MATHOP | INTSSE | __m128i _mm_min_epu8(__m128i __A, __m128i __B) | Get min/max of unsigned bytes. |

SSE2 | MATHOP | INTSSE | __m128i _mm_avg_epu8(__m128i __A, __m128i __B) | Average unsigned components. |

SSE2 | MATHOP | INTSSE | __m128i _mm_avg_epu16(__m128i __A, __m128i __B) | Average unsigned components. |

SSE2 | MATHOP | INTSSE | __m128i _mm_sad_epu8(__m128i __A, __m128i __B) | The sum of absolute differences of the low eight unsigned bytes is stored in the low half; the sum for the high eight unsigned bytes is stored in the high half. |

SSE41 | MATHOP | INTSSE | __m128i _mm_min_epi8(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_max_epi8(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_min_epu16(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_max_epu16(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_min_epi32(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_max_epi32(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_min_epu32(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_max_epu32(__m128i __X, __m128i __Y) | Min/max packed integer instructions. |

SSE41 | MATHOP | INTSSE | __m128i _mm_mullo_epi32(__m128i __X, __m128i __Y) | Packed integer 32-bit multiplication with truncation of upper halves of results. |

SSE41 | MATHOP | INTSSE | __m128i _mm_mul_epi32(__m128i __X, __m128i __Y) | Packed integer 32-bit multiplication of 2 pairs of operands with two 64-bit results. |

SSE41 | MATHOP | INTSSE | __m128i _mm_mpsadbw_epu8(__m128i __X, __m128i __Y, const int __M) | Sum absolute 8-bit integer difference of adjacent groups of 4 byte integers in the first 2 operands. Starting offsets within operands are determined by the 3rd mask operand. |

SSE41 | MATHOP | INTSSE | __m128i _mm_minpos_epu16(__m128i __X) | Return horizontal packed word minimum and its index in bits [15:0] and bits [18:16] respectively. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_abs_epi8(__m128i __X) | Get absolute values of signed elements. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_abs_epi16(__m128i __X) | Get absolute values of signed elements. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_abs_epi32(__m128i __X) | Get absolute values of signed elements. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_hadd_epi16(__m128i __X, __m128i __Y) | Horizontal addition across vectors. Returns [[Xi0+Xi1] [Xi2+Xi3] ... [Yi4+Yi5] [Yi6+Yi7]]. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_hadd_epi32(__m128i __X, __m128i __Y) | Horizontal addition across vectors. Returns [[Xi0+Xi1] ... [Yi2+Yi3]]. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_hadds_epi16(__m128i __X, __m128i __Y) | Horizontal addition across vectors with signed saturation. Returns [[Xi0+Xi1] [Xi2+Xi3] ... [Yi4+Yi5] [Yi6+Yi7]]. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_hsub_epi16(__m128i __X, __m128i __Y) | Horizontal subtraction across vectors. Returns [[Xi0-Xi1] [Xi2-Xi3] ... [Yi4-Yi5] [Yi6-Yi7]]. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_hsub_epi32(__m128i __X, __m128i __Y) | Horizontal subtraction across vectors. Returns [[Xi0-Xi1] ... [Yi2-Yi3]]. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_hsubs_epi16(__m128i __X, __m128i __Y) | Horizontal subtraction across vectors with signed saturation. Returns [[Xi0-Xi1] [Xi2-Xi3] ... [Yi4-Yi5] [Yi6-Yi7]]. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_maddubs_epi16(__m128i __X, __m128i __Y) | Multiplies vertically each unsigned byte of X with the corresponding signed byte of Y, producing intermediate signed 16-bit integers. Each adjacent pair of signed words is added and the saturated result is packed. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_mulhrs_epi16(__m128i __X, __m128i __Y) | Multiplies vertically each signed 16-bit integer from X with the corresponding signed 16-bit integer of Y, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and packed. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_sign_epi8(__m128i __X, __m128i __Y) | Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_sign_epi16(__m128i __X, __m128i __Y) | Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_sign_epi32(__m128i __X, __m128i __Y) | Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y. |

SSSE3 | MATHOP | INTSSE | __m128i _mm_alignr_epi8(__m128i __X, __m128i __Y, int __N) | Concatenates X and Y into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result. |

SSE2 | SET | INTSSE | __m128i _mm_set_epi64x(long long __q1, long long __q0) | Create vector from elements, lowest element last. |

SSE2 | SET | INTSSE | __m128i _mm_set_epi64(__m64 __q1, __m64 __q0) | Create vector from elements, lowest element last. |

SSE2 | SET | INTSSE | __m128i _mm_set_epi32(int __q3, int __q2, int __q1, int __q0) | Create vector from elements, lowest element last. |

SSE2 | SET | INTSSE | __m128i _mm_set_epi16(short __q7, ... short __q0) | Create vector from elements, lowest element last. |

SSE2 | SET | INTSSE | __m128i _mm_set_epi8(char __q15, ... char __q00) | Create vector from elements, lowest element last. |

SSE2 | SET | INTSSE | __m128i _mm_setr_epi64(__m64 __q0, __m64 __q1) | Create vector from elements, lowest element first. |

SSE2 | SET | INTSSE | __m128i _mm_setr_epi32(int __q0, int __q1, int __q2, int __q3) | Create vector from elements, lowest element first. |

SSE2 | SET | INTSSE | __m128i _mm_setr_epi16(short __q0, ... short __q7) | Create vector from elements, lowest element first. |

SSE2 | SET | INTSSE | __m128i _mm_setr_epi8(char __q00, ... char __q15) | Create vector from elements, lowest element first. |

SSE2 | SET | INTSSE | __m128i _mm_setzero_si128(void ) | Create a vector of zeros. |

SSE2 | SET | INTSSE | __m128i _mm_cvtsi32_si128(int __A) | Set low doubleword of vector to A and clear high bits. |

SSE2 | SET | INTSSE | __m128i _mm_cvtsi64_si128(long long __A) | Set low quadword of vector to A and clear high bits. |

SSE2 | SET | INTSSE | __m128i _mm_cvtsi64x_si128(long long __A) | Set low quadword of vector to A and clear high bits. |

SSE2 | SET | INTSSE | __m128i _mm_set1_epi64x(long long __A) | Set all components of the vector to A. |

SSE2 | SET | INTSSE | __m128i _mm_set1_epi64(__m64 __A) | Set all components of the vector to A. |

SSE2 | SET | INTSSE | __m128i _mm_set1_epi32(int __A) | Set all components of the vector to A. |

SSE2 | SET | INTSSE | __m128i _mm_set1_epi16(short __A) | Set all components of the vector to A. |

SSE2 | SET | INTSSE | __m128i _mm_set1_epi8(char __A) | Set all components of the vector to A. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_packs_epi16(__m128i __A, __m128i __B) | Pack eight words from each operand into sixteen bytes using signed saturation. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_packs_epi32(__m128i __A, __m128i __B) | Pack four doublewords from each operand into eight words using signed saturation. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_packus_epi16(__m128i __A, __m128i __B) | Pack eight words from each operand into sixteen bytes using unsigned saturation. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpackhi_epi8(__m128i __A, __m128i __B) | Unpack and interleave high components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpackhi_epi16(__m128i __A, __m128i __B) | Unpack and interleave high components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpackhi_epi32(__m128i __A, __m128i __B) | Unpack and interleave high components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpackhi_epi64(__m128i __A, __m128i __B) | Unpack and interleave high components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpacklo_epi8(__m128i __A, __m128i __B) | Unpack and interleave low components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpacklo_epi16(__m128i __A, __m128i __B) | Unpack and interleave low components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpacklo_epi32(__m128i __A, __m128i __B) | Unpack and interleave low components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_unpacklo_epi64(__m128i __A, __m128i __B) | Unpack and interleave low components of operands. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_shufflehi_epi16(__m128i __A, int __B) | Shuffle high words of input based on fields of __B. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_shufflelo_epi16(__m128i __A, int __B) | Shuffle low words of input based on fields of __B. |

SSE2 | SHUFFLE | INTSSE | __m128i _mm_shuffle_epi32(__m128i __A, int __B) | Shuffle doublewords of input based on fields of __B. |

SSE41 | SHUFFLE | INTSSE | __m128i _mm_blend_epi16(__m128i __X, __m128i __Y, const int __M) | Integer blend instructions - select data from 2 sources using constant/variable mask. |

SSE41 | SHUFFLE | INTSSE | __m128i _mm_blendv_epi8(__m128i __X, __m128i __Y, __m128i __M) | Integer blend instructions - select data from 2 sources using constant/variable mask. |

SSE41 | SHUFFLE | INTSSE | __m128i _mm_packus_epi32(__m128i __X, __m128i __Y) | Pack 8 double words from 2 operands into 8 words of result with unsigned saturation. |

SSSE3 | SHUFFLE | INTSSE | __m128i _mm_shuffle_epi8(__m128i __X, __m128i __Y) | Permute bytes in X. For each byte in Y, if the high bit is set, the corresponding byte in X is zeroed out. Otherwise, the low bits of the Y byte specify the source of the byte in X. |

SSE42 | STRING | INTSSE | __m128i _mm_cmpistrm(__m128i __X, __m128i __Y, const int __M) | Intrinsics for text/string processing. |

SSE42 | STRING | INTSSE | int _mm_cmpistri(__m128i __X, __m128i __Y, const int __M) | Intrinsics for text/string processing. |

SSE42 | STRING | INTSSE | __m128i _mm_cmpestrm(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) | Intrinsics for text/string processing. |

SSE42 | STRING | INTSSE | int _mm_cmpestri(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) | Intrinsics for text/string processing. |

SSE42 | STRING | INTSSE | int _mm_cmpistra(__m128i __X, __m128i __Y, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpistrc(__m128i __X, __m128i __Y, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpistro(__m128i __X, __m128i __Y, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpistrs(__m128i __X, __m128i __Y, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpistrz(__m128i __X, __m128i __Y, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpestra(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpestrc(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpestro(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpestrs(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE42 | STRING | INTSSE | int _mm_cmpestrz(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) | Intrinsics for text/string processing and reading values of EFlags. |

SSE2 | CASTING | MIXED | __m128 _mm_castpd_ps(__m128d __A) | Type conversion for compiler. No value modification. |

SSE2 | CASTING | MIXED | __m128i _mm_castpd_si128(__m128d __A) | Type conversion for compiler. No value modification. |

SSE2 | CASTING | MIXED | __m128d _mm_castps_pd(__m128 __A) | Type conversion for compiler. No value modification. |

SSE2 | CASTING | MIXED | __m128i _mm_castps_si128(__m128 __A) | Type conversion for compiler. No value modification. |

SSE2 | CASTING | MIXED | __m128 _mm_castsi128_ps(__m128i __A) | Type conversion for compiler. No value modification. |

SSE2 | CASTING | MIXED | __m128d _mm_castsi128_pd(__m128i __A) | Type conversion for compiler. No value modification. |

SSE | CONVERT | MIXED | int _mm_cvtss_si32(__m128 __A) | Convert lowest float in vector to integer. Round per MXCSR. |

SSE | CONVERT | MIXED | int _mm_cvt_ss2si(__m128 __A) | Convert lowest float in vector to integer. Round per MXCSR. |

SSE | CONVERT | MIXED | int _mm_cvttss_si32(__m128 __A) | Convert lowest float in vector to integer. Round by truncation. |

SSE | CONVERT | MIXED | int _mm_cvtt_ss2si(__m128 __A) | Convert lowest float in vector to integer. Round by truncation. |

SSE | CONVERT | MIXED | long long _mm_cvtss_si64(__m128 __A) | Convert lowest float in vector to quadword integer. Round per MXCSR. |

SSE | CONVERT | MIXED | long long _mm_cvtss_si64x(__m128 __A) | Convert lowest float in vector to quadword integer. Round per MXCSR. |

SSE | CONVERT | MIXED | long long _mm_cvttss_si64(__m128 __A) | Convert lowest float in vector to quadword integer. Round by truncation. |

SSE | CONVERT | MIXED | long long _mm_cvttss_si64x(__m128 __A) | Convert lowest float in vector to quadword integer. Round by truncation. |

SSE | CONVERT | MIXED | __m128 _mm_cvtsi32_ss(__m128 __A, int __B) | Convert B to a float and replace low element in A. |

SSE | CONVERT | MIXED | __m128 _mm_cvt_si2ss(__m128 __A, int __B) | Convert B to a float and replace low element in A. |

SSE | CONVERT | MIXED | __m128 _mm_cvtsi64_ss(__m128 __A, long long __B) | Convert B to a float and replace low element in A. |

SSE | CONVERT | MIXED | __m128 _mm_cvtsi64x_ss(__m128 __A, long long __B) | Convert B to a float and replace low element in A. |

SSE2 | CONVERT | MIXED | __m128d _mm_cvtepi32_pd(__m128i __A) | Convert two lowest doubleword integers to double floats. |

SSE2 | CONVERT | MIXED | __m128 _mm_cvtepi32_ps(__m128i __A) | Convert four doubleword integers to four single floats. |

SSE2 | CONVERT | MIXED | __m128i _mm_cvtpd_epi32(__m128d __A) | Convert two double floats to lowest doubleword integers. High half is cleared. Round per MXCSR. |

SSE2 | CONVERT | MIXED | __m128 _mm_cvtpd_ps(__m128d __A) | Convert two double floats to lowest single floats. High half is cleared. |

SSE2 | CONVERT | MIXED | __m128i _mm_cvttpd_epi32(__m128d __A) | Convert two double floats to lowest doubleword integers. High half is cleared. Round by truncation. |

SSE2 | CONVERT | MIXED | __m128i _mm_cvtps_epi32(__m128 __A) | Convert four single floats to doubleword integers. Round per MXCSR. |

SSE2 | CONVERT | MIXED | __m128i _mm_cvttps_epi32(__m128 __A) | Convert four single floats to doubleword integers. Round by truncation. |

SSE2 | CONVERT | MIXED | __m128d _mm_cvtps_pd(__m128 __A) | Convert two lowest single floats to double floats. |

SSE2 | CONVERT | MIXED | int _mm_cvtsd_si32(__m128d __A) | Convert lowest double float to integer. Round per MXCSR. |

SSE2 | CONVERT | MIXED | long long _mm_cvtsd_si64(__m128d __A) | Convert lowest double float to quadword integer. Round per MXCSR. |

SSE2 | CONVERT | MIXED | long long _mm_cvtsd_si64x(__m128d __A) | Convert lowest double float to quadword integer. Round per MXCSR. |

SSE2 | CONVERT | MIXED | int _mm_cvttsd_si32(__m128d __A) | Convert lowest double float to integer. Round by truncation. |

SSE2 | CONVERT | MIXED | long long _mm_cvttsd_si64(__m128d __A) | Convert lowest double float to quadword integer. Round by truncation. |

SSE2 | CONVERT | MIXED | long long _mm_cvttsd_si64x(__m128d __A) | Convert lowest double float to quadword integer. Round by truncation. |

SSE2 | CONVERT | MIXED | __m128 _mm_cvtsd_ss(__m128 __A, __m128d __B) | Convert lowest double float of B to lowest single float. High bits are copied from A. |

SSE2 | CONVERT | MIXED | __m128d _mm_cvtsi32_sd(__m128d __A, int __B) | Convert signed doubleword integer to lowest double float. High bits are copied from A. |

SSE2 | CONVERT | MIXED | __m128d _mm_cvtsi64_sd(__m128d __A, long long __B) | Convert signed quadword integer to lowest double float. High bits are copied from A. |

SSE2 | CONVERT | MIXED | __m128d _mm_cvtsi64x_sd(__m128d __A, long long __B) | Convert signed quadword integer to lowest double float. High bits are copied from A. |

SSE2 | CONVERT | MIXED | __m128d _mm_cvtss_sd(__m128d __A, __m128 __B) | Convert lowest single float of B to lowest double float. High bits are copied from A. |
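
The practical difference between the MXCSR-governed conversions and the truncating (`tt`) conversions shows up for values like 2.7. A minimal sketch (the wrapper names are mine):

```c
#include <xmmintrin.h>

/* Convert using the current MXCSR rounding mode
   (round-to-nearest-even by default). */
int cvt_nearest(float f) {
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Convert by truncation toward zero, regardless of MXCSR. */
int cvt_truncate(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

`cvt_nearest(2.7f)` yields 3 while `cvt_truncate(2.7f)` yields 2; note that with 2.5f the default round-to-nearest-even mode gives 2, not 3.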

SSE | FPUMODE | OTHER | unsigned int _mm_getcsr(void ) | Get contents of the MXCSR control/status register. |

SSE | FPUMODE | OTHER | void _mm_setcsr(unsigned int __I) | Set contents of the MXCSR control/status register. |

SSE | FPUMODE | OTHER | unsigned int _MM_GET_EXCEPTION_STATE(void ) | Read bits from the control register. |

SSE | FPUMODE | OTHER | unsigned int _MM_GET_EXCEPTION_MASK(void ) | Read bits from the control register. |

SSE | FPUMODE | OTHER | unsigned int _MM_GET_ROUNDING_MODE(void ) | Read bits from the control register. |

SSE | FPUMODE | OTHER | unsigned int _MM_GET_FLUSH_ZERO_MODE(void ) | Read bits from the control register. |

SSE | FPUMODE | OTHER | void _MM_SET_EXCEPTION_STATE(unsigned int __mask) | Set bits in the control register. |

SSE | FPUMODE | OTHER | void _MM_SET_EXCEPTION_MASK(unsigned int __mask) | Set bits in the control register. |

SSE | FPUMODE | OTHER | void _MM_SET_ROUNDING_MODE(unsigned int __mode) | Set bits in the control register. |

SSE | FPUMODE | OTHER | void _MM_SET_FLUSH_ZERO_MODE(unsigned int __mode) | Set bits in the control register. |

SSE3 | FPUMODE | OTHER | void _MM_SET_DENORMALS_ZERO_MODE(int mode) | Set bits in the control register. |

SSE3 | FPUMODE | OTHER | int _MM_GET_DENORMALS_ZERO_MODE(void ) | Read bits from the control register. |
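
As a sketch of how these macros are used in practice, flush-to-zero mode can be toggled around a computation that would otherwise produce denormal results. The helper name is mine, and MXCSR is saved and restored around the change:

```c
#include <xmmintrin.h>

/* Multiply two floats on the SSE unit, optionally flushing
   denormal results to zero. */
float mul_with_ftz(float a, float b, int flush) {
    volatile float va = a, vb = b;       /* defeat constant folding */
    unsigned int saved = _mm_getcsr();   /* save MXCSR */
    _MM_SET_FLUSH_ZERO_MODE(flush ? _MM_FLUSH_ZERO_ON
                                  : _MM_FLUSH_ZERO_OFF);
    float r = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    _mm_setcsr(saved);                   /* restore MXCSR */
    return r;
}
```

With a = 1e-30f and b = 1e-10f the exact product 1e-40 is a single precision denormal: with FTZ on the multiply returns 0.0f, with FTZ off it returns a nonzero denormal.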

SSE42 | MATHOP | OTHER | int _mm_popcnt_u32(unsigned int __X) | Count the number of bits set to 1 (population count). |

SSE42 | MATHOP | OTHER | long long _mm_popcnt_u64(unsigned long long __X) | Count the number of bits set to 1 (population count). |

SSE42 | MATHOP | OTHER | unsigned int _mm_crc32_u8(unsigned int __C, unsigned char __V) | Accumulate CRC32 (polynomial 0x11EDC6F41) value. |

SSE42 | MATHOP | OTHER | unsigned int _mm_crc32_u16(unsigned int __C, unsigned short __V) | Accumulate CRC32 (polynomial 0x11EDC6F41) value. |

SSE42 | MATHOP | OTHER | unsigned int _mm_crc32_u32(unsigned int __C, unsigned int __V) | Accumulate CRC32 (polynomial 0x11EDC6F41) value. |

SSE42 | MATHOP | OTHER | unsigned long long _mm_crc32_u64(unsigned long long __C, unsigned long long __V) | Accumulate CRC32 (polynomial 0x11EDC6F41) value. |

SSE | MEMORY | OTHER | void _mm_prefetch(void * __P, enum _mm_hint __I) | Loads one cache line from address P to a location "closer" to the processor. The selector I specifies the type of prefetch operation. Largely unnecessary on Pentium 4 and later processors, which prefetch speculatively on their own. |

SSE | MEMORY | OTHER | void _mm_sfence(void ) | Store fence. Serializing instruction for cache manipulation. |

SSE2 | MEMORY | OTHER | void _mm_clflush(void const * __A) | Flush cache line. Note: check CPUID bit for availability. |

SSE2 | MEMORY | OTHER | void _mm_lfence(void ) | Load fence. Serializing instruction for cache manipulation. |

SSE2 | MEMORY | OTHER | void _mm_mfence(void ) | Memory fence. Serializing instruction for cache manipulation. |

SSE3 | MEMORY | OTHER | void _mm_monitor(void const * __P, unsigned int __E, unsigned int __H) | Specify address to watch for future _mm_mwait() call. |

SSE3 | MEMORY | OTHER | void _mm_mwait(unsigned int __E, unsigned int __H) | Enter a low power state and wait for store operation at address specified by _mm_monitor() or certain system-defined events. Used for power management or multiprocessor synchronization. |

SSE3 | LOADSTORE | INTSSE | __m128i _mm_lddqu_si128(__m128i const * __P) | Load value from unaligned address where cache line splits are a performance problem. |

SSE | OTHER | OTHER | void _mm_pause(void ) | Pause processor for an implementation specific amount of time. May save power. |
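
`_mm_pause` is typically dropped into a spin-wait loop so the processor backs off between polls of a shared flag. A hedged sketch of the idiom (single-threaded here just to show the shape; the function names are mine):

```c
#include <xmmintrin.h>
#include <stdatomic.h>

/* Spin until *flag becomes nonzero, yielding pipeline
   resources between polls. */
void spin_until_set(atomic_int *flag) {
    while (!atomic_load_explicit(flag, memory_order_acquire))
        _mm_pause();
}

/* Demo: the flag is already set, so the loop exits immediately. */
int pause_demo(void) {
    atomic_int flag = 1;
    spin_until_set(&flag);
    return 1;
}
```

In real code another thread would store to the flag with release ordering; the pause keeps the spinning core from saturating the memory pipeline.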

SSE | BITLOGIC | REAL32 | __m128 _mm_and_ps(__m128 __A, __m128 __B) | Bitwise logic |

SSE | BITLOGIC | REAL32 | __m128 _mm_andnot_ps(__m128 __A, __m128 __B) | Bitwise logic |

SSE | BITLOGIC | REAL32 | __m128 _mm_or_ps(__m128 __A, __m128 __B) | Bitwise logic |

SSE | BITLOGIC | REAL32 | __m128 _mm_xor_ps(__m128 __A, __m128 __B) | Bitwise logic |

SSE | COMPARE | REAL32 | __m128 _mm_cmpeq_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmplt_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmple_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpgt_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpge_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpneq_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpnlt_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpnle_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpngt_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpnge_ss(__m128 __A, __m128 __B) | Compare low elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpord_ss(__m128 __A, __m128 __B) | Compare low elements. Unordered means one or both operands is a NaN. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpunord_ss(__m128 __A, __m128 __B) | Compare low elements. Unordered means one or both operands is a NaN. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpeq_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmplt_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmple_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpgt_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpge_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpneq_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpnlt_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpnle_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpngt_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpnge_ps(__m128 __A, __m128 __B) | Compare all elements. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpord_ps(__m128 __A, __m128 __B) | Compare all elements. Unordered means one or both operands is a NaN. |

SSE | COMPARE | REAL32 | __m128 _mm_cmpunord_ps(__m128 __A, __m128 __B) | Compare all elements. Unordered means one or both operands is a NaN. |

SSE | COMPARE | REAL32 | int _mm_comieq_ss(__m128 __A, __m128 __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE | COMPARE | REAL32 | int _mm_comilt_ss(__m128 __A, __m128 __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE | COMPARE | REAL32 | int _mm_comile_ss(__m128 __A, __m128 __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE | COMPARE | REAL32 | int _mm_comigt_ss(__m128 __A, __m128 __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE | COMPARE | REAL32 | int _mm_comige_ss(__m128 __A, __m128 __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE | COMPARE | REAL32 | int _mm_comineq_ss(__m128 __A, __m128 __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE | COMPARE | REAL32 | int _mm_ucomieq_ss(__m128 __A, __m128 __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE | COMPARE | REAL32 | int _mm_ucomilt_ss(__m128 __A, __m128 __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE | COMPARE | REAL32 | int _mm_ucomile_ss(__m128 __A, __m128 __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE | COMPARE | REAL32 | int _mm_ucomigt_ss(__m128 __A, __m128 __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE | COMPARE | REAL32 | int _mm_ucomige_ss(__m128 __A, __m128 __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE | COMPARE | REAL32 | int _mm_ucomineq_ss(__m128 __A, __m128 __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE | EXTRACT | REAL32 | int _mm_movemask_ps(__m128 __A) | Extract sign bits of all components into integer. |
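
Packed compares produce all-ones or all-zeros lanes, and `_mm_movemask_ps` compresses the four sign bits into an ordinary integer, which makes vector conditions branchable. A sketch (the helper names are mine):

```c
#include <xmmintrin.h>

/* Bit i of the result is set when a[i] < b[i]. */
int lanes_less(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}

/* True when every element of a is less than the matching element of b. */
int all_less(__m128 a, __m128 b) {
    return lanes_less(a, b) == 0xF;
}
```

For example, `lanes_less` on [0 1 2 3] versus [1 0 3 3] returns 0b0101: lanes 0 and 2 compare true.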

SSE | EXTRACT | REAL32 | float _mm_cvtss_f32(__m128 __A) | Extract low element of vector. |

SSE | EXTRACT | REAL32 | __m128 _mm_move_ss(__m128 __A, __m128 __B) | Sets the low SPFP value of A from the low value of B. |

SSE41 | EXTRACT | REAL32 | int _mm_extract_ps(__m128 __X, const int __N) | Extract binary representation of single precision float from packed single precision array element of X selected by index N. |

SSE41 | INSERT | REAL32 | __m128 _mm_insert_ps(__m128 __D, __m128 __S, const int __N) | Insert single precision float into packed single precision array element selected by index N. The bits [7-6] of N define S index, the bits [5-4] define D index, and bits [3-0] define zeroing mask for D. |

SSE | LOADSTORE | REAL32 | void _mm_stream_ps(float * __P, __m128 __A) | Write value to memory without polluting caches. The address must be 16-byte aligned. |

SSE | LOADSTORE | REAL32 | __m128 _mm_loadh_pi(__m128 __A, __m64 const * __P) | Sets the upper two SPFP values with 64-bits of data loaded from P; the lower two values are passed through from A. |

SSE | LOADSTORE | REAL32 | void _mm_storeh_pi(__m64 * __P, __m128 __A) | Stores the upper two SPFP values of A into P. |

SSE | LOADSTORE | REAL32 | __m128 _mm_loadl_pi(__m128 __A, __m64 const * __P) | Sets the lower two SPFP values with 64-bits of data loaded from P; the upper two values are passed through from A. |

SSE | LOADSTORE | REAL32 | void _mm_storel_pi(__m64 * __P, __m128 __A) | Stores the lower two SPFP values of A into P. |

SSE | LOADSTORE | REAL32 | __m128 _mm_load1_ps(float const * __P) | Create a vector with all four elements equal to *P. |

SSE | LOADSTORE | REAL32 | __m128 _mm_load_ps1(float const * __P) | Create a vector with all four elements equal to *P. |

SSE | LOADSTORE | REAL32 | __m128 _mm_load_ps(float const * __P) | Load four SPFP values from P. The address must be 16-byte aligned. |

SSE | LOADSTORE | REAL32 | __m128 _mm_loadu_ps(float const * __P) | Load four SPFP values from P. The address need not be 16-byte aligned. |

SSE | LOADSTORE | REAL32 | __m128 _mm_loadr_ps(float const * __P) | Load four SPFP values in reverse order. The address must be aligned. |

SSE | LOADSTORE | REAL32 | void _mm_store_ss(float * __P, __m128 __A) | Stores the lower SPFP value. |

SSE | LOADSTORE | REAL32 | void _mm_store_ps(float * __P, __m128 __A) | Store four SPFP values. The address must be 16-byte aligned. |

SSE | LOADSTORE | REAL32 | void _mm_storeu_ps(float * __P, __m128 __A) | Store four SPFP values. The address need not be 16-byte aligned. |

SSE | LOADSTORE | REAL32 | void _mm_store1_ps(float * __P, __m128 __A) | Store the lower SPFP value to all four elements. (Duplicate low value.) The address must be 16-byte aligned. |

SSE | LOADSTORE | REAL32 | void _mm_store_ps1(float * __P, __m128 __A) | Store the lower SPFP value to all four elements. (Duplicate low value.) The address must be 16-byte aligned. |

SSE | LOADSTORE | REAL32 | void _mm_storer_ps(float * __P, __m128 __A) | Store four SPFP values in reverse order. The address must be aligned. |

SSE3 | LOADSTORE | REAL32 | __m128 _mm_movehdup_ps(__m128 __X) | Given vector [f0 f1 f2 f3] in X, duplicate elements 1 and 3. Return [f1 f1 f3 f3]. |

SSE3 | LOADSTORE | REAL32 | __m128 _mm_moveldup_ps(__m128 __X) | Given vector [f0 f1 f2 f3] in X, duplicate elements 0 and 2. Return [f0 f0 f2 f2]. |

SSE | MATHOP | REAL32 | __m128 _mm_add_ss(__m128 __A, __m128 __B) | Basic math on low elements. Copy high bits of A. |

SSE | MATHOP | REAL32 | __m128 _mm_sub_ss(__m128 __A, __m128 __B) | Basic math on low elements. Copy high bits of A. |

SSE | MATHOP | REAL32 | __m128 _mm_mul_ss(__m128 __A, __m128 __B) | Basic math on low elements. Copy high bits of A. |

SSE | MATHOP | REAL32 | __m128 _mm_div_ss(__m128 __A, __m128 __B) | Basic math on low elements. Copy high bits of A. |

SSE | MATHOP | REAL32 | __m128 _mm_min_ss(__m128 __A, __m128 __B) | Basic math on low elements. Copy high bits of A. |

SSE | MATHOP | REAL32 | __m128 _mm_max_ss(__m128 __A, __m128 __B) | Basic math on low elements. Copy high bits of A. |

SSE | MATHOP | REAL32 | __m128 _mm_add_ps(__m128 __A, __m128 __B) | Basic math on all elements. |

SSE | MATHOP | REAL32 | __m128 _mm_sub_ps(__m128 __A, __m128 __B) | Basic math on all elements. |

SSE | MATHOP | REAL32 | __m128 _mm_mul_ps(__m128 __A, __m128 __B) | Basic math on all elements. |

SSE | MATHOP | REAL32 | __m128 _mm_div_ps(__m128 __A, __m128 __B) | Basic math on all elements. |

SSE | MATHOP | REAL32 | __m128 _mm_min_ps(__m128 __A, __m128 __B) | Basic math on all elements. |

SSE | MATHOP | REAL32 | __m128 _mm_max_ps(__m128 __A, __m128 __B) | Basic math on all elements. |

SSE | MATHOP | REAL32 | __m128 _mm_sqrt_ps(__m128 __A) | Get Square root of each element. |

SSE | MATHOP | REAL32 | __m128 _mm_sqrt_ss(__m128 __A) | Get square root of low component of A. High bits are unchanged. |

SSE | MATHOP | REAL32 | __m128 _mm_rcp_ps(__m128 __A) | Get reciprocal of each element. |

SSE | MATHOP | REAL32 | __m128 _mm_rcp_ss(__m128 __A) | Get reciprocal of low element. High bits are unchanged. |

SSE | MATHOP | REAL32 | __m128 _mm_rsqrt_ps(__m128 __A) | Get reciprocal of square root of each element. |

SSE | MATHOP | REAL32 | __m128 _mm_rsqrt_ss(__m128 __A) | Get reciprocal of square root of low component of A. High bits are unchanged. |

SSE3 | MATHOP | REAL32 | __m128 _mm_addsub_ps(__m128 __X, __m128 __Y) | Adds odd-numbered SPFP values of X with the corresponding SPFP values from Y; returns results in odd-numbered values. Subtracts even-numbered SPFP values from Y from the corresponding SPFP values in X; returns results in even-numbered values. |

SSE3 | MATHOP | REAL32 | __m128 _mm_hadd_ps(__m128 __X, __m128 __Y) | Horizontal addition across vectors. Returns [[Xf0+Xf1] [Xf2+Xf3] [Yf0+Yf1] [Yf2+Yf3]]. |

SSE3 | MATHOP | REAL32 | __m128 _mm_hsub_ps(__m128 __X, __m128 __Y) | Horizontal subtraction across vectors. Returns [[Xf0-Xf1] [Xf2-Xf3] [Yf0-Yf1] [Yf2-Yf3]]. |

SSE41 | MATHOP | REAL32 | __m128 _mm_dp_ps(__m128 __X, __m128 __Y, const int __M) | Dot product instructions with mask-defined summing and zeroing parts of result. |
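
A common use of these operations is reducing a vector to the sum of its four elements: on SSE3 two `_mm_hadd_ps` calls do it, but the same reduction can be written with baseline SSE shuffles, shown here so the sketch compiles without SSE3 support (the helper name is mine):

```c
#include <xmmintrin.h>

/* Sum the four elements of v using only SSE1 operations. */
float hsum_ps(__m128 v) {
    /* [b, a, d, c]: swap within pairs */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* [a+b, a+b, c+d, c+d] */
    __m128 sums = _mm_add_ps(v, shuf);
    /* bring the c+d pair into the low elements */
    shuf = _mm_movehl_ps(shuf, sums);
    /* low element now holds a+b+c+d */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

This pattern (shuffle, add, move-high-to-low, add) is the standard pre-SSE3 horizontal sum; on SSE3 hardware `_mm_hadd_ps(v, v)` applied twice produces the same result.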

SSE | SET | REAL32 | __m128 _mm_set_ss(float __F) | Create a vector with element 0 as F and the rest zero. |

SSE | SET | REAL32 | __m128 _mm_set1_ps(float __F) | Set all elements of vector to same value |

SSE | SET | REAL32 | __m128 _mm_set_ps1(float __F) | Set all elements of vector to same value |

SSE | SET | REAL32 | __m128 _mm_load_ss(float const * __P) | Create a vector with element 0 as *P and the rest zero. |

SSE | SET | REAL32 | __m128 _mm_setzero_ps(void ) | Create a vector of zeros. |

SSE | SET | REAL32 | __m128 _mm_set_ps(const float __Z, const float __Y, const float __X, const float __W) | Create the vector [W X Y Z] (element 0 = W, element 3 = Z). |

SSE | SET | REAL32 | __m128 _mm_setr_ps(float __Z, float __Y, float __X, float __W) | Create the vector [Z Y X W] in memory order (element 0 = Z). |
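
The argument order of `_mm_set_ps` (highest element first) versus `_mm_setr_ps` (memory order) regularly trips people up. A quick sketch to pin it down (the helper name is mine):

```c
#include <xmmintrin.h>

/* Store the vector and return element i as it lands in memory. */
float elem(__m128 v, int i) {
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```

`elem(_mm_set_ps(4,3,2,1), 0)` and `elem(_mm_setr_ps(1,2,3,4), 0)` both return 1.0f: the two calls build the same vector.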

SSE | SHUFFLE | REAL32 | __m128 _mm_shuffle_ps(__m128 __A, __m128 __B, int __mask) | Selects two of the four SPFP values from A into the low quadword of the result and two of the four SPFP values from B into the high quadword, as directed by the mask. The _MM_SHUFFLE macro builds the mask. |

SSE | SHUFFLE | REAL32 | __m128 _mm_unpackhi_ps(__m128 __A, __m128 __B) | Unpack and interleave high components of inputs. |

SSE | SHUFFLE | REAL32 | __m128 _mm_unpacklo_ps(__m128 __A, __m128 __B) | Unpack and interleave low components of inputs. |

SSE | SHUFFLE | REAL32 | __m128 _mm_movehl_ps(__m128 __A, __m128 __B) | Moves the upper two values of B into the lower two values of A. |

SSE | SHUFFLE | REAL32 | __m128 _mm_movelh_ps(__m128 __A, __m128 __B) | Moves the lower two values of B into the upper two values of A. |

SSE | SHUFFLE | REAL32 | void _MM_TRANSPOSE4_PS(__m128& row0, __m128& row1, __m128& row2, __m128& row3) | Transpose the 4x4 matrix composed of row[0-3]. (MACRO) |
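
The `_MM_TRANSPOSE4_PS` macro rewrites its four row arguments in place. A sketch of using it on a fixed 4x4 matrix (the helper name is mine):

```c
#include <xmmintrin.h>

/* Transpose a fixed 4x4 matrix and return element i of the new
   first row, which is the original first column: {1, 5, 9, 13}. */
float transposed_r0(int i) {
    __m128 r0 = _mm_setr_ps( 1,  2,  3,  4);
    __m128 r1 = _mm_setr_ps( 5,  6,  7,  8);
    __m128 r2 = _mm_setr_ps( 9, 10, 11, 12);
    __m128 r3 = _mm_setr_ps(13, 14, 15, 16);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* rows become columns */
    float out[4];
    _mm_storeu_ps(out, r0);
    return out[i];
}
```

Because the macro assigns back into its arguments, they must be plain `__m128` variables, not expressions.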

SSE41 | SHUFFLE | REAL32 | __m128 _mm_blend_ps(__m128 __X, __m128 __Y, const int __M) | Single precision floating point blend instructions - select data from 2 sources using constant/variable mask. |

SSE41 | SHUFFLE | REAL32 | __m128 _mm_blendv_ps(__m128 __X, __m128 __Y, __m128 __M) | Single precision floating point blend instructions - select data from 2 sources using constant/variable mask. |

SSE2 | BITLOGIC | REAL64 | __m128d _mm_and_pd(__m128d __A, __m128d __B) | Bitwise logic |

SSE2 | BITLOGIC | REAL64 | __m128d _mm_andnot_pd(__m128d __A, __m128d __B) | Bitwise logic |

SSE2 | BITLOGIC | REAL64 | __m128d _mm_or_pd(__m128d __A, __m128d __B) | Bitwise logic |

SSE2 | BITLOGIC | REAL64 | __m128d _mm_xor_pd(__m128d __A, __m128d __B) | Bitwise logic |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpeq_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmplt_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmple_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpgt_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpge_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpneq_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpnlt_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpnle_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpngt_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpnge_sd(__m128d __A, __m128d __B) | Compare low elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpord_sd(__m128d __A, __m128d __B) | Compare low elements. Unordered means one or both operands is a NaN. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpunord_sd(__m128d __A, __m128d __B) | Compare low elements. Unordered means one or both operands is a NaN. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpeq_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmplt_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmple_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpgt_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpge_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpneq_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpnlt_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpnle_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpngt_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpnge_pd(__m128d __A, __m128d __B) | Compare all elements. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpord_pd(__m128d __A, __m128d __B) | Compare all elements. Unordered means one or both operands is a NaN. |

SSE2 | COMPARE | REAL64 | __m128d _mm_cmpunord_pd(__m128d __A, __m128d __B) | Compare all elements. Unordered means one or both operands is a NaN. |

SSE2 | COMPARE | REAL64 | int _mm_comieq_sd(__m128d __A, __m128d __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_comilt_sd(__m128d __A, __m128d __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_comile_sd(__m128d __A, __m128d __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_comigt_sd(__m128d __A, __m128d __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_comige_sd(__m128d __A, __m128d __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_comineq_sd(__m128d __A, __m128d __B) | Compare elements. Throws exception on QNaN or SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_ucomieq_sd(__m128d __A, __m128d __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_ucomilt_sd(__m128d __A, __m128d __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_ucomile_sd(__m128d __A, __m128d __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_ucomigt_sd(__m128d __A, __m128d __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_ucomige_sd(__m128d __A, __m128d __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE2 | COMPARE | REAL64 | int _mm_ucomineq_sd(__m128d __A, __m128d __B) | Compare elements. Tolerates QNaN but throws exception on SNaN. |

SSE41 | CONVERT | REAL64 | __m128d _mm_round_pd(__m128d __V, const int __M) | Packed/scalar double precision floating point rounding. |

SSE41 | CONVERT | REAL64 | __m128d _mm_round_sd(__m128d __D, __m128d __V, const int __M) | Packed/scalar double precision floating point rounding. |

SSE41 | CONVERT | REAL64 | __m128d _mm_ceil_pd(__m128d V) | Packed/scalar double precision floating point rounding. |

SSE41 | CONVERT | REAL64 | __m128d _mm_ceil_sd(__m128d __D, __m128d __V) | Packed/scalar double precision floating point rounding. |

SSE41 | CONVERT | REAL64 | __m128d _mm_floor_pd(__m128d V) | Packed/scalar double precision floating point rounding. |

SSE41 | CONVERT | REAL64 | __m128d _mm_floor_sd(__m128d __D, __m128d __V) | Packed/scalar double precision floating point rounding. |

SSE41 | CONVERT | REAL32 | __m128 _mm_round_ps(__m128 __V, const int __M) | Packed/scalar single precision floating point rounding. |

SSE41 | CONVERT | REAL32 | __m128 _mm_round_ss(__m128 __D, __m128 __V, const int __M) | Packed/scalar single precision floating point rounding. |

SSE41 | CONVERT | REAL32 | __m128 _mm_ceil_ps(__m128 __V) | Packed/scalar single precision floating point rounding. |

SSE41 | CONVERT | REAL32 | __m128 _mm_ceil_ss(__m128 __D, __m128 __V) | Packed/scalar single precision floating point rounding. |

SSE41 | CONVERT | REAL32 | __m128 _mm_floor_ps(__m128 __V) | Packed/scalar single precision floating point rounding. |

SSE41 | CONVERT | REAL32 | __m128 _mm_floor_ss(__m128 __D, __m128 __V) | Packed/scalar single precision floating point rounding. |

SSE2 | EXTRACT | REAL64 | double _mm_cvtsd_f64(__m128d __A) | Extract lower value of DPFP. |

SSE2 | EXTRACT | REAL64 | int _mm_movemask_pd(__m128d __A) | Extract sign bits of all components into integer. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_move_sd(__m128d __A, __m128d __B) | Sets the low DPFP value of A from the low value of B. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_load_pd(double const * __P) | Load two DPFP values from P. The address must be 16-byte aligned. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_loadu_pd(double const * __P) | Load two DPFP values from P. The address need not be 16-byte aligned. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_load1_pd(double const * __P) | Create a vector with both elements equal to *P. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_load_pd1(double const * __P) | Create a vector with both elements equal to *P. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_load_sd(double const * __P) | Create a vector with element 0 as *P and the rest zero. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_loadr_pd(double const * __P) | Load two DPFP values in reverse order. The address must be aligned. |

SSE2 | LOADSTORE | REAL64 | void _mm_store_pd(double * __P, __m128d __A) | Store two DPFP values. The address must be 16-byte aligned. |

SSE2 | LOADSTORE | REAL64 | void _mm_storeu_pd(double * __P, __m128d __A) | Store two DPFP values. The address need not be 16-byte aligned. |

SSE2 | LOADSTORE | REAL64 | void _mm_store_sd(double * __P, __m128d __A) | Stores the lower DPFP value. |

SSE2 | LOADSTORE | REAL64 | void _mm_storel_pd(double * __P, __m128d __A) | Stores the lower DPFP value. |

SSE2 | LOADSTORE | REAL64 | void _mm_storeh_pd(double * __P, __m128d __A) | Stores the upper DPFP value. |

SSE2 | LOADSTORE | REAL64 | void _mm_store1_pd(double * __P, __m128d __A) | Store the lower DPFP value to both elements. (Duplicate the value.) The address must be 16-byte aligned. |

SSE2 | LOADSTORE | REAL64 | void _mm_store_pd1(double * __P, __m128d __A) | Store the lower DPFP value to both elements. (Duplicate the value.) The address must be 16-byte aligned. |

SSE2 | LOADSTORE | REAL64 | void _mm_storer_pd(double * __P, __m128d __A) | Store two DPFP values in reverse order. The address must be aligned. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_loadh_pd(__m128d __A, double const * __B) | Load double float from memory into high double; the low double is copied from A. The address need not be aligned. |

SSE2 | LOADSTORE | REAL64 | __m128d _mm_loadl_pd(__m128d __A, double const * __B) | Load double float from memory into low double; the high double is copied from A. The address need not be aligned. |

SSE2 | LOADSTORE | REAL64 | void _mm_stream_pd(double * __A, __m128d __B) | Write value to memory without polluting caches. The address must be 16-byte aligned. |

SSE3 | LOADSTORE | REAL64 | __m128d _mm_loaddup_pd(double const * __P) | Load vector [d0 d1] from address P, then duplicate element 0. Return [d0 d0]. Synonym for _mm_load1_pd(). |

SSE2 | MATHOP | REAL64 | __m128d _mm_add_sd(__m128d __A, __m128d __B) | Basic math on low elements. Copy high bits of A. |

SSE2 | MATHOP | REAL64 | __m128d _mm_sub_sd(__m128d __A, __m128d __B) | Basic math on low elements. Copy high bits of A. |

SSE2 | MATHOP | REAL64 | __m128d _mm_mul_sd(__m128d __A, __m128d __B) | Basic math on low elements. Copy high bits of A. |

SSE2 | MATHOP | REAL64 | __m128d _mm_div_sd(__m128d __A, __m128d __B) | Basic math on low elements. Copy high bits of A. |

SSE2 | MATHOP | REAL64 | __m128d _mm_min_sd(__m128d __A, __m128d __B) | Basic math on low elements. Copy high bits of A. |

SSE2 | MATHOP | REAL64 | __m128d _mm_max_sd(__m128d __A, __m128d __B) | Basic math on low elements. Copy high bits of A. |

SSE2 | MATHOP | REAL64 | __m128d _mm_add_pd(__m128d __A, __m128d __B) | Basic math on all elements. |

SSE2 | MATHOP | REAL64 | __m128d _mm_sub_pd(__m128d __A, __m128d __B) | Basic math on all elements. |

SSE2 | MATHOP | REAL64 | __m128d _mm_mul_pd(__m128d __A, __m128d __B) | Basic math on all elements. |

SSE2 | MATHOP | REAL64 | __m128d _mm_div_pd(__m128d __A, __m128d __B) | Basic math on all elements. |

SSE2 | MATHOP | REAL64 | __m128d _mm_min_pd(__m128d __A, __m128d __B) | Basic math on all elements. |

SSE2 | MATHOP | REAL64 | __m128d _mm_max_pd(__m128d __A, __m128d __B) | Basic math on all elements. |
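
The min/max operations compose into a branchless clamp, a common building block in numeric kernels. A sketch for the double precision versions (the helper names are mine):

```c
#include <emmintrin.h>

/* Clamp both doubles in v to the range [lo, hi]. */
__m128d clamp_pd(__m128d v, double lo, double hi) {
    return _mm_min_pd(_mm_max_pd(v, _mm_set1_pd(lo)), _mm_set1_pd(hi));
}

/* Convenience: clamp a single double through the vector unit. */
double clamp1(double x, double lo, double hi) {
    return _mm_cvtsd_f64(clamp_pd(_mm_set1_pd(x), lo, hi));
}
```

Because no comparison result ever reaches a branch, the clamp runs in constant time regardless of the input values.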

SSE2 | MATHOP | REAL64 | __m128d _mm_sqrt_pd(__m128d __A) | Square root of each element. |

SSE2 | MATHOP | REAL64 | __m128d _mm_sqrt_sd(__m128d __A, __m128d __B) | Get square root of low element of B; the high element is copied from A. |

SSE3 | MATHOP | REAL64 | __m128d _mm_addsub_pd(__m128d __X, __m128d __Y) | Adds odd-numbered DPFP values of X with the corresponding DPFP values from Y; returns results in odd-numbered values. Subtracts even-numbered DPFP values from Y from the corresponding DPFP values in X; returns results in even-numbered values. |

SSE3 | MATHOP | REAL64 | __m128d _mm_hadd_pd(__m128d __X, __m128d __Y) | Horizontal addition across vectors. Returns [[Xd0+Xd1] [Yd0+Yd1]]. |

SSE3 | MATHOP | REAL64 | __m128d _mm_hsub_pd(__m128d __X, __m128d __Y) | Horizontal subtraction across vectors. Returns [[Xd0-Xd1] [Yd0-Yd1]]. |

SSE41 | MATHOP | REAL64 | __m128d _mm_dp_pd(__m128d __X, __m128d __Y, const int __M) | Dot product instructions with mask-defined summing and zeroing parts of result. |

SSE2 | SET | REAL64 | __m128d _mm_set_sd(double __F) | Create a vector with element 0 as F and the rest zero. |

SSE2 | SET | REAL64 | __m128d _mm_set1_pd(double __F) | Set all elements of vector to same value |

SSE2 | SET | REAL64 | __m128d _mm_set_pd1(double __F) | Set all elements of vector to same value |

SSE2 | SET | REAL64 | __m128d _mm_set_pd(double __W, double __X) | Create a vector with the lower value X and upper value W. |

SSE2 | SET | REAL64 | __m128d _mm_setr_pd(double __W, double __X) | Create a vector with the lower value W and upper value X. |

SSE2 | SET | REAL64 | __m128d _mm_setzero_pd(void ) | Create a vector of zeros. |

SSE2 | SHUFFLE | REAL64 | __m128d _mm_shuffle_pd(__m128d __A, __m128d __B, const int __mask) | Select double float components. Bit 0 of the mask selects the low result element from A; bit 1 selects the high result element from B. |

SSE2 | SHUFFLE | REAL64 | __m128d _mm_unpackhi_pd(__m128d __A, __m128d __B) | Unpack and interleave high components of inputs. |

SSE2 | SHUFFLE | REAL64 | __m128d _mm_unpacklo_pd(__m128d __A, __m128d __B) | Unpack and interleave low components of inputs. |

SSE3 | SHUFFLE | REAL64 | __m128d _mm_movedup_pd(__m128d __X) | Duplicate low element. |

SSE41 | SHUFFLE | REAL64 | __m128d _mm_blend_pd(__m128d __X, __m128d __Y, const int __M) | Double precision floating point blend instructions - select data from 2 sources using constant/variable mask. |

SSE41 | SHUFFLE | REAL64 | __m128d _mm_blendv_pd(__m128d __X, __m128d __Y, __m128d __M) | Double precision floating point blend instructions - select data from 2 sources using constant/variable mask. |

Loading or storing SSE 128-bit values should be done on 16-byte aligned addresses except when using instructions that explicitly provide unaligned access (MOVUPS, MOVUPD, MOVDQU, LDDQU). The corresponding intrinsics are _mm_storeu_ps, _mm_storeu_pd, _mm_storeu_si128, _mm_loadu_ps, _mm_loadu_pd, _mm_loadu_si128, and _mm_lddqu_si128.
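
For example, reading four floats that start one element into a buffer requires the unaligned form; the aligned form may only be used when the address is genuinely 16-byte aligned (C11 `_Alignas` guarantees that here, and the function names are mine):

```c
#include <xmmintrin.h>

/* Sum four floats starting at p, which may have any float alignment. */
float sum4_at(const float *p) {
    __m128 v = _mm_loadu_ps(p);  /* unaligned load: always safe */
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}

float alignment_demo(void) {
    _Alignas(16) float buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    /* buf is 16-byte aligned, so the aligned load is legal here... */
    __m128 head = _mm_load_ps(buf);
    (void)head;
    /* ...but buf + 1 is not, so only _mm_loadu_ps may touch it. */
    return sum4_at(buf + 1);  /* 1 + 2 + 3 + 4 */
}
```

Using `_mm_load_ps` on `buf + 1` would raise a general protection fault at runtime, which is why the unaligned intrinsics exist at all.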

When moving to AVX instructions, see Intel Volume 1, Section 13.3, "(AVX) Memory Alignment":

- MOVDQA, MOVAPS, MOVAPD, MOVNTPS, MOVNTPD, MOVNTDQ, MOVNTDQA always require natural alignment.
- MOVDQU, MOVUPS, MOVUPD and LDDQU always allow unaligned addresses.
- Other instructions require alignment when using "Legacy SSE" encoding but relax the alignment requirements when using the new VEX encoding.

- Intel manuals:
  - *Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture* (order number 253665)
  - *Intel® 64 and IA-32 Architectures Software Developer's Manual, Volumes 2A, 2B & 2C: Instruction Set Reference* (order numbers 253666, 253667 and 326018)
  - *Intel® 64 and IA-32 Architectures Software Developer's Manual, Volumes 3A, 3B & 3C: System Programming Guide, Parts 1-3* (order numbers 253668, 253669 and 326019)

- Intel C++ Intrinsics Reference has great detail diagrams for each intrinsic.
- Markus PĆ¼schel's course slides
- Intel Math Kernel Library
- Intel compiler documentation
- Agner Fog's optimization manuals
- Suggestions from Stack Overflow
- Argument for using SSE over MMX
- SSSE3: fast popcount by Wojciech Muła
- SSE4.2 string functions

Added a reference to the *Intel C++ Intrinsics Reference* document.