2012-04-18

After experimenting with OpenCL, I became frustrated with the poor performance of the NVidia 9400M card and started playing with SSE vector instructions. While there is some good documentation on these instructions, it is spread out and tedious to navigate. This page is a combination of my notes, a large table containing all MMX and SSE instructions up to SSE4.2, and some JavaScript to filter the table.

Overview

The original Intel instruction set is called IA32. The early Pentium had eight integer General Purpose Registers (GPRs), only seven of them freely usable since ESP holds the stack pointer, and a stack of eight floating point unit (FPU) registers. An instruction placed its result into a single register or memory location. This is called a scalar operation.

Multi-Media Extensions (MMX) was Intel's first attempt at providing vector instructions. MMX re-uses the floating point unit's registers but treats them as an array, or vector, of integers. An MMX instruction can operate on eight 8-bit, four 16-bit, or two 32-bit integers packed into a single MMX register. For operations that could be performed in parallel, MMX could potentially make the code several times as fast. This idea is known as Single Instruction Multiple Data (SIMD). The downside to MMX was that the code could not do MMX "packed integer" operations and floating point operations at the same time. After using an MMX instruction, the programmer had to remember to clear the floating point registers with EMMS.
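
For example, adding four 16-bit integers with MMX intrinsics (covered later in this page) might look like the following sketch; the function and variable names are mine. Note the _mm_empty() at the end, which executes EMMS:

#include <mmintrin.h>

// Add four 16-bit integers element-wise with a single PADDW.
void add4_words(short *dst, const short *x, const short *y) {
	__m64 vx = *(const __m64 *)x;		// load four 16-bit values
	__m64 vy = *(const __m64 *)y;
	*(__m64 *)dst = _mm_add_pi16(vx, vy);
	_mm_empty();				// EMMS: return the registers to the FPU
}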

AMD invented the 3DNow! instruction set, which treated the FPU/MMX registers as vectors of 32-bit floating point values. This suffered from the same register overlap problem as MMX. Even though 3DNow! provided many floating point instructions, it did not provide more complex functions like trigonometric or logarithmic operations, so many programs still needed to switch modes. The 3DNow! instructions did not win the market away from Intel, and AMD discontinued 3DNow! in 2010 in favor of Intel's later design.

With the Streaming SIMD Extensions (SSE), Intel created an entirely new set of registers. XMM registers were 128 bits long and each could contain four 32-bit floating point values. Like 3DNow!, SSE provided instructions for doing basic floating point math but did not provide the more complex instructions from the IA32 FPU. Later, SSE2 added the ability to use the XMM registers as two 64-bit floating point values or as packed integers.
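
As a sketch of the SSE equivalent (again, the names are mine), four floats can be added in one ADDPS instruction, and no EMMS is needed because the XMM registers are independent of the FPU stack:

#include <xmmintrin.h>

// Add four single-precision floats element-wise with a single ADDPS.
void add4_floats(float *dst, const float *x, const float *y) {
	__m128 vx = _mm_loadu_ps(x);		// unaligned load of four floats
	__m128 vy = _mm_loadu_ps(y);
	_mm_storeu_ps(dst, _mm_add_ps(vx, vy));
}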

If the programmer did not use the MMX registers, then the FPU stack could be kept in floating point mode for access to the complex functions. If the program did not require complex floating point, the MMX registers could be used for scalar integer operations to reduce memory accesses. The extra size of the SSE registers still makes them more attractive for integer vector operations.

SSE2 made the instruction set more consistent and provided many conversion instructions for moving values between different vector formats and general purpose registers. Later SSE versions were refinements that added additional instructions, but did not significantly change the programming environment.
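
A sketch of one such SSE2 conversion, turning four packed 32-bit integers into four floats:

#include <emmintrin.h>

// Convert four 32-bit integers to four floats (CVTDQ2PS).
void int_to_float4(float *dst, const int *src) {
	__m128i vi = _mm_loadu_si128((const __m128i *)src);
	_mm_storeu_ps(dst, _mm_cvtepi32_ps(vi));
}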

AVX is the next generation. It expands the SSE registers to 256 bits each, which could double the throughput of SSE code. It also provides a new encoding scheme (VEX) that allows non-destructive three-operand instructions.
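
A sketch of the same addition with AVX (compile with -mavx); because of the three-operand VEX encoding, the result register is separate from both inputs:

#include <immintrin.h>

// Add eight single-precision floats element-wise with a single VADDPS.
void add8_floats(float *dst, const float *x, const float *y) {
	__m256 vx = _mm256_loadu_ps(x);
	__m256 vy = _mm256_loadu_ps(y);
	_mm256_storeu_ps(dst, _mm256_add_ps(vx, vy));	// neither input is clobbered
}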

Timeline

Version  Intel                                AMD
MMX      1996, Pentium P5 w/MMX; Pentium II   1997, K6
3DNow!   not invented here                    1998, K6-2
SSE      1999, Pentium III                    2001, K7 Palomino
SSE2     2001, Pentium 4, Xeon                2005, K8 core
SSE3     2004, Pentium 4 w/HT                 2005, K8 core
SSSE3    2006, Core 2; Xeon                   2011, Bobcat, Bulldozer
SSE4.1   2007, Penryn 45nm process            2011, Bulldozer
SSE4.2   2008, Nehalem Core i7                2011, Bulldozer
AVX      2011, Sandy Bridge                   2011, Bulldozer
AVX2     2013, Haswell                        no data

MMX should be avoided due to conflicts with the FPU. SSE3 has been supported by Intel since 2004 and by AMD since 2005, so it is reasonable to treat SSE3 as the baseline for programming in 2012. The next major jump in performance will come with AVX, which AMD also supports, but AVX has only been on the market since 2011.

Processor support

Before using vector instructions, you should check which instructions your current processor supports using the CPUID instruction. Below are two ways of formatting that information.


// Inline ASM for GCC.
#define cpuid(func,ax,bx,cx,dx) __asm__ __volatile__ ("cpuid": "=a" (ax), "=b" (bx), "=c" (cx), "=d" (dx) : "a" (func));
#define xgetbv(func,lo,hi) __asm__ __volatile__ ("xgetbv": "=a" (lo), "=d" (hi) : "c" (func));

// If we assume that each SSE level implies that all previous levels are supported, we can reduce the check to a single number.
int get_sse_level() {
	int a,b,c,d,e,f;
	cpuid(1,a,b,c,d);

	if ((c & 0x18000000) == 0x18000000) {	// OSXSAVE bit (27) and AVX bit (28)
		xgetbv(0,e,f);
		if ((e & 6) == 6) {		// OS saves both XMM and YMM state
			return 500;			// AVX
		}
	}
	if (c & (1 << 20)) return 420;	// SSE4.2
	if (c & (1 << 19)) return 410;	// SSE4.1
	if (c & (1 <<  9)) return 310;	// SSSE3
	if (c & (1 <<  0)) return 300;	// SSE3
	if (d & (1 << 26)) return 200;	// SSE2
	if (d & (1 << 25)) return 100;	// SSE
	if (d & (1 << 23)) return  10;	// MMX
	return 0;
}

// It is safer to look at the individual bits for each instruction group and optional instructions.
int get_sse_bits() {
	int a,b,c,d,e,f;
	cpuid(1,a,b,c,d);

	int bits = 0;
	if (d & (1 << 23)) bits |= 0x0001;	// MMX
	if (d & (1 << 24)) bits |= 0x0002;	// FXSAVE, FXRSTOR (always with SSE)
	if (d & (1 << 25)) bits |= 0x0004;	// SSE
	if (d & (1 << 26)) bits |= 0x0008;	// SSE2
	if (c & (1 <<  0)) bits |= 0x0010;	// SSE3
	if (c & (1 <<  9)) bits |= 0x0020;	// SSSE3
	if (c & (1 << 19)) bits |= 0x0040;	// SSE4.1
	if (c & (1 << 20)) bits |= 0x0080;	// SSE4.2
	if (c & (1 << 28)) {			// AVX
		if (c & (1 << 27)) {		// OSXSAVE: OS supports XGETBV
			xgetbv(0,e,f);
			if ((e & 6) == 6) {	// OS saves both XMM and YMM state
				bits |= 0x0100;		// AVX
			}
		}
	}
	if (c & (1 <<  3)) bits |= 0x01000;	// MONITOR,MWAIT (SSE3 option)
	if (d & (1 << 19)) bits |= 0x02000;	// CLFLUSH (SSE2 option)
	if (c & (1 <<  1)) bits |= 0x04000;	// PCLMULQDQ (SSE option)
	if (c & (1 << 12)) bits |= 0x08000;	// FMA (AVX option)
	if (c & (1 << 23)) bits |= 0x10000;	// POPCNT

	return bits;
}
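
A sketch of how the two functions above might be called at program start; the masks match the bit definitions in get_sse_bits:

#include <stdio.h>

int main(void) {
	printf("SSE level: %d\n", get_sse_level());
	int bits = get_sse_bits();
	if (bits & 0x0100) printf("AVX is usable\n");
	if (bits & 0x10000) printf("POPCNT is available\n");
	return 0;
}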
	

Intrinsics

When Intel introduced the MMX instruction set, they implemented special "intrinsic" functions in their compiler to manipulate vector values directly from C/C++ code. Microsoft and the GCC developers copied those intrinsics in their own compilers. For the MMX instructions, most assembler instructions have two intrinsics: one formed by prepending _m_ to the assembler mnemonic, and another beginning with _mm_ that is more descriptive. So although the assembler mnemonic PUNPCKHBW has an intrinsic _m_punpckhbw, most programmers would use the more descriptive _mm_unpackhi_pi8 instead. There are exceptions such as the bidirectional MOVD (_m_from_int = _mm_cvtsi32_si64 and _m_to_int = _mm_cvtsi64_si32) and composite intrinsics such as _mm_set_pi8 that correspond to more than one assembler instruction. For SSE, many of the _m_ forms were never defined.
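
A quick sketch of the two naming styles; both calls below compile to the same PUNPCKHBW instruction:

#include <mmintrin.h>

__m64 unpack_high_bytes(__m64 a, __m64 b) {
	__m64 r1 = _m_punpckhbw(a, b);		// terse form: _m_ plus the mnemonic
	__m64 r2 = _mm_unpackhi_pi8(a, b);	// descriptive form, same instruction
	(void)r1;				// identical results
	return r2;
}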

The GCC developers implemented a set of intrinsics that correspond to assembler instructions (PBLENDW = __builtin_ia32_pblendw128), but they also provide header files that duplicate the Intel intrinsics. You only need to include one header file: use the one that provides the latest SSE version that you use (see the example after the table). The current header files are:

Instruction set  Header file
MMX              mmintrin.h
SSE              xmmintrin.h
SSE2             emmintrin.h
SSE3             pmmintrin.h
SSSE3            tmmintrin.h
SSE4.1           smmintrin.h
SSE4.2           nmmintrin.h
AVX              immintrin.h
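
For example, a program whose newest intrinsics come from SSSE3 only needs tmmintrin.h (sketch; compile with -mssse3):

#include <tmmintrin.h>			// SSSE3 and all earlier intrinsics

// Absolute value of sixteen signed bytes (PABSB, introduced with SSSE3).
__m128i byte_abs(__m128i v) {
	return _mm_abs_epi8(v);
}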

Note that some of the intrinsics documented here have nothing to do with vector operations. Instructions such as PAUSE, POPCNT, and CRC32 were introduced by Intel within a group of SSE instructions, but they do not really belong there. Note also that some instructions were introduced with an SSE group but have their own CPUID bit, so you must check for them separately.

Intrinsics Reference

ISet  FunCat  DataType  Intrinsic  Description
MMX BITLOGIC INTMMX __m64 _mm_and_si64(__m64 __m1, __m64 __m2) Bit-wise AND the 64-bit values in M1 and M2.
MMX BITLOGIC INTMMX __m64 _m_pand(__m64 __m1, __m64 __m2) Bit-wise AND the 64-bit values in M1 and M2.
MMX BITLOGIC INTMMX __m64 _mm_andnot_si64(__m64 __m1, __m64 __m2) Bit-wise complement the 64-bit value in M1 and bit-wise AND it with the 64-bit value in M2.
MMX BITLOGIC INTMMX __m64 _m_pandn(__m64 __m1, __m64 __m2) Bit-wise complement the 64-bit value in M1 and bit-wise AND it with the 64-bit value in M2.
MMX BITLOGIC INTMMX __m64 _mm_or_si64(__m64 __m1, __m64 __m2) Bit-wise inclusive OR the 64-bit values in M1 and M2.
MMX BITLOGIC INTMMX __m64 _m_por(__m64 __m1, __m64 __m2) Bit-wise inclusive OR the 64-bit values in M1 and M2.
MMX BITLOGIC INTMMX __m64 _mm_xor_si64(__m64 __m1, __m64 __m2) Bit-wise exclusive OR the 64-bit values in M1 and M2.
MMX BITLOGIC INTMMX __m64 _m_pxor(__m64 __m1, __m64 __m2) Bit-wise exclusive OR the 64-bit values in M1 and M2.
MMX BITSHIFT INTMMX __m64 _mm_sll_pi16(__m64 __m, __m64 __count) Shift four 16-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _m_psllw(__m64 __m, __m64 __count) Shift four 16-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _mm_slli_pi16(__m64 __m, int __count) Shift four 16-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _m_psllwi(__m64 __m, int __count) Shift four 16-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _mm_sll_pi32(__m64 __m, __m64 __count) Shift two 32-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _m_pslld(__m64 __m, __m64 __count) Shift two 32-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _mm_slli_pi32(__m64 __m, int __count) Shift two 32-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _m_pslldi(__m64 __m, int __count) Shift two 32-bit values in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _mm_sll_si64(__m64 __m, __m64 __count) Shift the 64-bit value in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _m_psllq(__m64 __m, __m64 __count) Shift the 64-bit value in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _mm_slli_si64(__m64 __m, int __count) Shift the 64-bit value in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _m_psllqi(__m64 __m, int __count) Shift the 64-bit value in M left by COUNT.
MMX BITSHIFT INTMMX __m64 _mm_sra_pi16(__m64 __m, __m64 __count) Shift four 16-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _m_psraw(__m64 __m, __m64 __count) Shift four 16-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _mm_srai_pi16(__m64 __m, int __count) Shift four 16-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _m_psrawi(__m64 __m, int __count) Shift four 16-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _mm_sra_pi32(__m64 __m, __m64 __count) Shift two 32-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _m_psrad(__m64 __m, __m64 __count) Shift two 32-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _mm_srai_pi32(__m64 __m, int __count) Shift two 32-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _m_psradi(__m64 __m, int __count) Shift two 32-bit values in M right by COUNT; shift in the sign bit.
MMX BITSHIFT INTMMX __m64 _mm_srl_pi16(__m64 __m, __m64 __count) Shift four 16-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _m_psrlw(__m64 __m, __m64 __count) Shift four 16-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _mm_srli_pi16(__m64 __m, int __count) Shift four 16-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _m_psrlwi(__m64 __m, int __count) Shift four 16-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _mm_srl_pi32(__m64 __m, __m64 __count) Shift two 32-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _m_psrld(__m64 __m, __m64 __count) Shift two 32-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _mm_srli_pi32(__m64 __m, int __count) Shift two 32-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _m_psrldi(__m64 __m, int __count) Shift two 32-bit values in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _mm_srl_si64(__m64 __m, __m64 __count) Shift the 64-bit value in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _m_psrlq(__m64 __m, __m64 __count) Shift the 64-bit value in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _mm_srli_si64(__m64 __m, int __count) Shift the 64-bit value in M right by COUNT; shift in zeros.
MMX BITSHIFT INTMMX __m64 _m_psrlqi(__m64 __m, int __count) Shift the 64-bit value in M right by COUNT; shift in zeros.
MMX COMPARE INTMMX __m64 _mm_cmpeq_pi8(__m64 __m1, __m64 __m2) Compare eight 8-bit values. The result of the comparison is 0xFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _m_pcmpeqb(__m64 __m1, __m64 __m2) Compare eight 8-bit values. The result of the comparison is 0xFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _mm_cmpgt_pi8(__m64 __m1, __m64 __m2) Compare eight 8-bit values. The result of the comparison is 0xFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _m_pcmpgtb(__m64 __m1, __m64 __m2) Compare eight 8-bit values. The result of the comparison is 0xFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _mm_cmpeq_pi16(__m64 __m1, __m64 __m2) Compare four 16-bit values. The result of the comparison is 0xFFFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _m_pcmpeqw(__m64 __m1, __m64 __m2) Compare four 16-bit values. The result of the comparison is 0xFFFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _mm_cmpgt_pi16(__m64 __m1, __m64 __m2) Compare four 16-bit values. The result of the comparison is 0xFFFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _m_pcmpgtw(__m64 __m1, __m64 __m2) Compare four 16-bit values. The result of the comparison is 0xFFFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _mm_cmpeq_pi32(__m64 __m1, __m64 __m2) Compare two 32-bit values. The result of the comparison is 0xFFFFFFFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _m_pcmpeqd(__m64 __m1, __m64 __m2) Compare two 32-bit values. The result of the comparison is 0xFFFFFFFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _mm_cmpgt_pi32(__m64 __m1, __m64 __m2) Compare two 32-bit values. The result of the comparison is 0xFFFFFFFF if the test is true and zero if false.
MMX COMPARE INTMMX __m64 _m_pcmpgtd(__m64 __m1, __m64 __m2) Compare two 32-bit values. The result of the comparison is 0xFFFFFFFF if the test is true and zero if false.
MMX CONVERT INTMMX __m64 _mm_cvtsi32_si64(int __i) Convert I to a __m64 object. The integer is zero-extended to 64-bits.
MMX CONVERT INTMMX __m64 _m_from_int(int __i) Convert I to a __m64 object. The integer is zero-extended to 64-bits.
MMX CONVERT INTMMX __m64 _m_from_int64(long long __i) Convert I to a __m64 object.
MMX CONVERT INTMMX __m64 _mm_cvtsi64_m64(long long __i) Convert I to a __m64 object.
MMX CONVERT INTMMX __m64 _mm_cvtsi64x_si64(long long __i) Convert I to a __m64 object.
MMX CONVERT INTMMX __m64 _mm_set_pi64x(long long __i) Convert I to a __m64 object.
SSE CONVERT INTMMX __m64 _mm_cvtps_pi32(__m128 __A) Convert two lowest floats in vector to doubleword integers in MMX register. Round per MXCSR.
SSE CONVERT INTMMX __m64 _mm_cvt_ps2pi(__m128 __A) Convert two lowest floats in vector to doubleword integers in MMX register. Round per MXCSR.
SSE CONVERT INTMMX __m64 _mm_cvttps_pi32(__m128 __A) Convert two lowest floats in vector to doubleword integers in MMX register. Round by truncation.
SSE CONVERT INTMMX __m64 _mm_cvtt_ps2pi(__m128 __A) Convert two lowest floats in vector to doubleword integers in MMX register. Round by truncation.
SSE CONVERT INTMMX __m128 _mm_cvtpi32_ps(__m128 __A, __m64 __B) Convert two doubleword integers in B to floats and replace two low elements in A.
SSE CONVERT INTMMX __m128 _mm_cvt_pi2ps(__m128 __A, __m64 __B) Convert two doubleword integers in B to floats and replace two low elements in A.
SSE CONVERT INTMMX __m128 _mm_cvtpi16_ps(__m64 __A) Convert four signed word integers to floats.
SSE CONVERT INTMMX __m128 _mm_cvtpu16_ps(__m64 __A) Convert four unsigned word integers to floats.
SSE CONVERT INTMMX __m128 _mm_cvtpi8_ps(__m64 __A) Convert low four signed bytes to floats.
SSE CONVERT INTMMX __m128 _mm_cvtpu8_ps(__m64 __A) Convert low four unsigned bytes to floats.
SSE CONVERT INTMMX __m128 _mm_cvtpi32x2_ps(__m64 __A, __m64 __B) Convert four signed doubleword integers to floats.
SSE CONVERT INTMMX __m64 _mm_cvtps_pi16(__m128 __A) Convert the four SPFP values in A to four signed 16-bit integers.
SSE CONVERT INTMMX __m64 _mm_cvtps_pi8(__m128 __A) Convert the four SPFP values in A to four signed 8-bit integers.
SSE2 CONVERT INTMMX __m64 _mm_cvtpd_pi32(__m128d __A) Convert two double floats to doubleword integers. Round per MXCSR.
SSE2 CONVERT INTMMX __m64 _mm_cvttpd_pi32(__m128d __A) Convert two double floats to doubleword integers. Round by truncation.
SSE2 CONVERT INTMMX __m128d _mm_cvtpi32_pd(__m64 __A) Convert two doubleword integers to double floats.
SSE2 CONVERT INTMMX __m64 _mm_movepi64_pi64(__m128i __B) Move low quadword from XMM to MMX register.
SSE2 CONVERT INTMMX __m128i _mm_movpi64_epi64(__m64 __A) Move MMX register to low quadword of XMM register.
MMX EXTRACT INTMMX int _mm_cvtsi64_si32(__m64 __i) Convert the lower 32 bits of the __m64 object into an integer.
MMX EXTRACT INTMMX int _m_to_int(__m64 __i) Convert the lower 32 bits of the __m64 object into an integer.
MMX EXTRACT INTMMX long long _m_to_int64(__m64 __i) Convert the __m64 object to a 64-bit integer.
MMX EXTRACT INTMMX long long _mm_cvtm64_si64(__m64 __i) Convert the __m64 object to a 64-bit integer.
MMX EXTRACT INTMMX long long _mm_cvtsi64_si64x(__m64 __i) Convert the __m64 object to a 64-bit integer.
SSE EXTRACT INTMMX int _mm_extract_pi16(__m64 const __A, int const __N) Extracts one of the four words of A. The selector N must be immediate.
SSE EXTRACT INTMMX int _m_pextrw(__m64 const __A, int const __N) Extracts one of the four words of A. The selector N must be immediate.
SSE EXTRACT INTMMX int _mm_movemask_pi8(__m64 __A) Create an 8-bit mask of the signs of 8-bit values.
SSE EXTRACT INTMMX int _m_pmovmskb(__m64 __A) Create an 8-bit mask of the signs of 8-bit values.
SSE INSERT INTMMX __m64 _mm_insert_pi16(__m64 const __A, int const __D, int const __N) Inserts word D into one of four words of A. The selector N must be immediate.
SSE INSERT INTMMX __m64 _m_pinsrw(__m64 const __A, int const __D, int const __N) Inserts word D into one of four words of A. The selector N must be immediate.
SSE LOADSTORE INTMMX void _mm_stream_pi(__m64 * __P, __m64 __A) Write value to memory without polluting caches.
SSE MATHOP INTMMX __m64 _mm_max_pi16(__m64 __A, __m64 __B) Compute the element-wise maximum of signed 16-bit values.
SSE MATHOP INTMMX __m64 _m_pmaxsw(__m64 __A, __m64 __B) Compute the element-wise maximum of signed 16-bit values.
SSE MATHOP INTMMX __m64 _mm_max_pu8(__m64 __A, __m64 __B) Compute the element-wise maximum of unsigned 8-bit values.
SSE MATHOP INTMMX __m64 _m_pmaxub(__m64 __A, __m64 __B) Compute the element-wise maximum of unsigned 8-bit values.
SSE MATHOP INTMMX __m64 _mm_min_pi16(__m64 __A, __m64 __B) Compute the element-wise minimum of signed 16-bit values.
SSE MATHOP INTMMX __m64 _m_pminsw(__m64 __A, __m64 __B) Compute the element-wise minimum of signed 16-bit values.
SSE MATHOP INTMMX __m64 _mm_min_pu8(__m64 __A, __m64 __B) Compute the element-wise minimum of unsigned 8-bit values.
SSE MATHOP INTMMX __m64 _m_pminub(__m64 __A, __m64 __B) Compute the element-wise minimum of unsigned 8-bit values.
SSE MATHOP INTMMX __m64 _mm_mulhi_pu16(__m64 __A, __m64 __B) Multiply four unsigned 16-bit values in A by four unsigned 16-bit values in B and produce the high 16 bits of the 32-bit results.
SSE MATHOP INTMMX __m64 _m_pmulhuw(__m64 __A, __m64 __B) Multiply four unsigned 16-bit values in A by four unsigned 16-bit values in B and produce the high 16 bits of the 32-bit results.
SSE MATHOP INTMMX __m64 _mm_avg_pu8(__m64 __A, __m64 __B) Compute the rounded averages of the unsigned 8-bit values in A and B.
SSE MATHOP INTMMX __m64 _m_pavgb(__m64 __A, __m64 __B) Compute the rounded averages of the unsigned 8-bit values in A and B.
SSE MATHOP INTMMX __m64 _mm_avg_pu16(__m64 __A, __m64 __B) Compute the rounded averages of the unsigned 16-bit values in A and B.
SSE MATHOP INTMMX __m64 _m_pavgw(__m64 __A, __m64 __B) Compute the rounded averages of the unsigned 16-bit values in A and B.
SSE MATHOP INTMMX __m64 _mm_sad_pu8(__m64 __A, __m64 __B) Compute the sum of the absolute differences of the unsigned 8-bit values in A and B. Return the value in the lower 16-bit word; the upper words are cleared.
SSE MATHOP INTMMX __m64 _m_psadbw(__m64 __A, __m64 __B) Compute the sum of the absolute differences of the unsigned 8-bit values in A and B. Return the value in the lower 16-bit word; the upper words are cleared.
MMX MATHOP INTMMX __m64 _mm_add_pi8(__m64 __m1, __m64 __m2) Add the 8-bit values in M1 to the 8-bit values in M2.
MMX MATHOP INTMMX __m64 _m_paddb(__m64 __m1, __m64 __m2) Add the 8-bit values in M1 to the 8-bit values in M2.
MMX MATHOP INTMMX __m64 _mm_add_pi16(__m64 __m1, __m64 __m2) Add the 16-bit values in M1 to the 16-bit values in M2.
MMX MATHOP INTMMX __m64 _m_paddw(__m64 __m1, __m64 __m2) Add the 16-bit values in M1 to the 16-bit values in M2.
MMX MATHOP INTMMX __m64 _mm_add_pi32(__m64 __m1, __m64 __m2) Add the 32-bit values in M1 to the 32-bit values in M2.
MMX MATHOP INTMMX __m64 _m_paddd(__m64 __m1, __m64 __m2) Add the 32-bit values in M1 to the 32-bit values in M2.
MMX MATHOP INTMMX __m64 _mm_adds_pi8(__m64 __m1, __m64 __m2) Add the 8-bit values in M1 to the 8-bit values in M2 using signed saturated arithmetic.
MMX MATHOP INTMMX __m64 _m_paddsb(__m64 __m1, __m64 __m2) Add the 8-bit values in M1 to the 8-bit values in M2 using signed saturated arithmetic.
MMX MATHOP INTMMX __m64 _mm_adds_pi16(__m64 __m1, __m64 __m2) Add the 16-bit values in M1 to the 16-bit values in M2 using signed saturated arithmetic.
MMX MATHOP INTMMX __m64 _m_paddsw(__m64 __m1, __m64 __m2) Add the 16-bit values in M1 to the 16-bit values in M2 using signed saturated arithmetic.
MMX MATHOP INTMMX __m64 _mm_adds_pu8(__m64 __m1, __m64 __m2) Add the 8-bit values in M1 to the 8-bit values in M2 using unsigned saturated arithmetic.
MMX MATHOP INTMMX __m64 _m_paddusb(__m64 __m1, __m64 __m2) Add the 8-bit values in M1 to the 8-bit values in M2 using unsigned saturated arithmetic.
MMX MATHOP INTMMX __m64 _mm_adds_pu16(__m64 __m1, __m64 __m2) Add the 16-bit values in M1 to the 16-bit values in M2 using unsigned saturated arithmetic.
MMX MATHOP INTMMX __m64 _m_paddusw(__m64 __m1, __m64 __m2) Add the 16-bit values in M1 to the 16-bit values in M2 using unsigned saturated arithmetic.
MMX MATHOP INTMMX __m64 _mm_sub_pi8(__m64 __m1, __m64 __m2) Subtract the 8-bit values in M2 from the 8-bit values in M1.
MMX MATHOP INTMMX __m64 _m_psubb(__m64 __m1, __m64 __m2) Subtract the 8-bit values in M2 from the 8-bit values in M1.
MMX MATHOP INTMMX __m64 _mm_sub_pi16(__m64 __m1, __m64 __m2) Subtract the 16-bit values in M2 from the 16-bit values in M1.
MMX MATHOP INTMMX __m64 _m_psubw(__m64 __m1, __m64 __m2) Subtract the 16-bit values in M2 from the 16-bit values in M1.
MMX MATHOP INTMMX __m64 _mm_sub_pi32(__m64 __m1, __m64 __m2) Subtract the 32-bit values in M2 from the 32-bit values in M1.
MMX MATHOP INTMMX __m64 _m_psubd(__m64 __m1, __m64 __m2) Subtract the 32-bit values in M2 from the 32-bit values in M1.
MMX MATHOP INTMMX __m64 _mm_subs_pi8(__m64 __m1, __m64 __m2) Subtract the 8-bit values in M2 from the 8-bit values in M1 using signed saturating arithmetic.
MMX MATHOP INTMMX __m64 _m_psubsb(__m64 __m1, __m64 __m2) Subtract the 8-bit values in M2 from the 8-bit values in M1 using signed saturating arithmetic.
MMX MATHOP INTMMX __m64 _mm_subs_pi16(__m64 __m1, __m64 __m2) Subtract the 16-bit values in M2 from the 16-bit values in M1 using signed saturating arithmetic.
MMX MATHOP INTMMX __m64 _m_psubsw(__m64 __m1, __m64 __m2) Subtract the 16-bit values in M2 from the 16-bit values in M1 using signed saturating arithmetic.
MMX MATHOP INTMMX __m64 _mm_subs_pu8(__m64 __m1, __m64 __m2) Subtract the 8-bit values in M2 from the 8-bit values in M1 using unsigned saturating arithmetic.
MMX MATHOP INTMMX __m64 _m_psubusb(__m64 __m1, __m64 __m2) Subtract the 8-bit values in M2 from the 8-bit values in M1 using unsigned saturating arithmetic.
MMX MATHOP INTMMX __m64 _mm_subs_pu16(__m64 __m1, __m64 __m2) Subtract the 16-bit values in M2 from the 16-bit values in M1 using unsigned saturating arithmetic.
MMX MATHOP INTMMX __m64 _m_psubusw(__m64 __m1, __m64 __m2) Subtract the 16-bit values in M2 from the 16-bit values in M1 using unsigned saturating arithmetic.
MMX MATHOP INTMMX __m64 _mm_madd_pi16(__m64 __m1, __m64 __m2) Multiply four 16-bit values in M1 by four 16-bit values in M2 producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results.
MMX MATHOP INTMMX __m64 _m_pmaddwd(__m64 __m1, __m64 __m2) Multiply four 16-bit values in M1 by four 16-bit values in M2 producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results.
MMX MATHOP INTMMX __m64 _mm_mulhi_pi16(__m64 __m1, __m64 __m2) Multiply four signed 16-bit values in M1 by four signed 16-bit values in M2 and produce the high 16 bits of the 32-bit results.
MMX MATHOP INTMMX __m64 _m_pmulhw(__m64 __m1, __m64 __m2) Multiply four signed 16-bit values in M1 by four signed 16-bit values in M2 and produce the high 16 bits of the 32-bit results.
MMX MATHOP INTMMX __m64 _mm_mullo_pi16(__m64 __m1, __m64 __m2) Multiply four 16-bit values in M1 by four 16-bit values in M2 and produce the low 16 bits of the results.
MMX MATHOP INTMMX __m64 _m_pmullw(__m64 __m1, __m64 __m2) Multiply four 16-bit values in M1 by four 16-bit values in M2 and produce the low 16 bits of the results.
SSE2 MATHOP INTMMX __m64 _mm_add_si64(__m64 __m1, __m64 __m2) Add the 64-bit value in M1 to the 64-bit value in M2.
SSE2 MATHOP INTMMX __m64 _mm_sub_si64(__m64 __m1, __m64 __m2) Subtract the 64-bit value in M2 from the 64-bit value in M1.
SSE2 MATHOP INTMMX __m64 _mm_mul_su32(__m64 __A, __m64 __B) Multiply low unsigned doublewords and return quadword result.
SSSE3 MATHOP INTMMX __m64 _mm_abs_pi8(__m64 __X) Get absolute values of signed elements.
SSSE3 MATHOP INTMMX __m64 _mm_abs_pi16(__m64 __X) Get absolute values of signed elements.
SSSE3 MATHOP INTMMX __m64 _mm_abs_pi32(__m64 __X) Get absolute values of signed elements.
SSSE3 MATHOP INTMMX __m64 _mm_hadd_pi16(__m64 __X, __m64 __Y) Horizontal addition across vectors. Returns [[Xi0+Xi1] [Xi2+Xi3] [Yi0+Yi1] [Yi2+Yi3]].
SSSE3 MATHOP INTMMX __m64 _mm_hadd_pi32(__m64 __X, __m64 __Y) Horizontal addition across vectors. Returns [[Xi0+Xi1] [Yi0+Yi1]].
SSSE3 MATHOP INTMMX __m64 _mm_hadds_pi16(__m64 __X, __m64 __Y) Horizontal addition across vectors with signed saturation. Returns [[Xi0+Xi1] [Xi2+Xi3] [Yi0+Yi1] [Yi2+Yi3]].
SSSE3 MATHOP INTMMX __m64 _mm_hsub_pi16(__m64 __X, __m64 __Y) Horizontal subtraction across vectors. Returns [[Xi0-Xi1] [Xi2-Xi3] [Yi0-Yi1] [Yi2-Yi3]].
SSSE3 MATHOP INTMMX __m64 _mm_hsub_pi32(__m64 __X, __m64 __Y) Horizontal subtraction across vectors. Returns [[Xi0-Xi1] [Yi0-Yi1]].
SSSE3 MATHOP INTMMX __m64 _mm_hsubs_pi16(__m64 __X, __m64 __Y) Horizontal subtraction across vectors with signed saturation. Returns [[Xi0-Xi1] [Xi2-Xi3] [Yi0-Yi1] [Yi2-Yi3]].
SSSE3 MATHOP INTMMX __m64 _mm_maddubs_pi16(__m64 __X, __m64 __Y) Multiplies vertically each unsigned byte of X with the corresponding signed byte of Y, producing intermediate signed 16-bit integers. Each adjacent pair of signed words is added and the saturated result is returned.
SSSE3 MATHOP INTMMX __m64 _mm_mulhrs_pi16(__m64 __X, __m64 __Y) Multiplies vertically each signed 16-bit integer from X with the corresponding signed 16-bit integer of Y, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and packed.
SSSE3 MATHOP INTMMX __m64 _mm_sign_pi8(__m64 __X, __m64 __Y) Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y.
SSSE3 MATHOP INTMMX __m64 _mm_sign_pi16(__m64 __X, __m64 __Y) Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y.
SSSE3 MATHOP INTMMX __m64 _mm_sign_pi32(__m64 __X, __m64 __Y) Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y.
SSSE3 MATHOP INTMMX __m64 _mm_alignr_pi8(__m64 __X, __m64 __Y, int __N) Concatenates X and Y into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result.
MMX OTHER INTMMX void _mm_empty(void ) Empty the multimedia state.
MMX OTHER INTMMX void _m_empty(void ) Empty the multimedia state.
MMX SET INTMMX __m64 _mm_setzero_si64(void ) Creates a 64-bit zero.
MMX SET INTMMX __m64 _mm_set_pi32(int __i1, int __i0) Creates a vector of two 32-bit values; I0 is least significant.
MMX SET INTMMX __m64 _mm_set_pi16(short __w3, short __w2, short __w1, short __w0) Creates a vector of four 16-bit values; W0 is least significant.
MMX SET INTMMX __m64 _mm_set_pi8(char __b7, ... char __b0) Creates a vector of eight 8-bit values; B0 is least significant.
MMX SET INTMMX __m64 _mm_setr_pi32(int __i0, int __i1) Creates a vector of two 32-bit values; I0 is least significant.
MMX SET INTMMX __m64 _mm_setr_pi16(short __w0, short __w1, short __w2, short __w3) Creates a vector of four 16-bit values; W0 is least significant.
MMX SET INTMMX __m64 _mm_setr_pi8(char __b0, ... char __b7) Creates a vector of eight 8-bit values; B0 is least significant.
MMX SET INTMMX __m64 _mm_set1_pi32(int __i) Creates a vector of two 32-bit values, both elements containing I.
MMX SET INTMMX __m64 _mm_set1_pi16(short __w) Creates a vector of four 16-bit values, all elements containing W.
MMX SET INTMMX __m64 _mm_set1_pi8(char __b) Creates a vector of eight 8-bit values, all elements containing B.
MMX SHUFFLE INTMMX __m64 _mm_packs_pi16(__m64 __m1, __m64 __m2) Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with signed saturation.
MMX SHUFFLE INTMMX __m64 _m_packsswb(__m64 __m1, __m64 __m2) Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with signed saturation.
MMX SHUFFLE INTMMX __m64 _mm_packs_pi32(__m64 __m1, __m64 __m2) Pack the two 32-bit values from M1 in to the lower two 16-bit values of the result, and the two 32-bit values from M2 into the upper two 16-bit values of the result, all with signed saturation.
MMX SHUFFLE INTMMX __m64 _m_packssdw(__m64 __m1, __m64 __m2) Pack the two 32-bit values from M1 in to the lower two 16-bit values of the result, and the two 32-bit values from M2 into the upper two 16-bit values of the result, all with signed saturation.
MMX SHUFFLE INTMMX __m64 _mm_packs_pu16(__m64 __m1, __m64 __m2) Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with unsigned saturation.
MMX SHUFFLE INTMMX __m64 _m_packuswb(__m64 __m1, __m64 __m2) Pack the four 16-bit values from M1 into the lower four 8-bit values of the result, and the four 16-bit values from M2 into the upper four 8-bit values of the result, all with unsigned saturation.
MMX SHUFFLE INTMMX __m64 _mm_unpackhi_pi8(__m64 __m1, __m64 __m2) Interleave the four 8-bit values from the high half of M1 with the four 8-bit values from the high half of M2.
MMX SHUFFLE INTMMX __m64 _m_punpckhbw(__m64 __m1, __m64 __m2) Interleave the four 8-bit values from the high half of M1 with the four 8-bit values from the high half of M2.
MMX SHUFFLE INTMMX __m64 _mm_unpackhi_pi16(__m64 __m1, __m64 __m2) Interleave the two 16-bit values from the high half of M1 with the two 16-bit values from the high half of M2.
MMX SHUFFLE INTMMX __m64 _m_punpckhwd(__m64 __m1, __m64 __m2) Interleave the two 16-bit values from the high half of M1 with the two 16-bit values from the high half of M2.
MMX SHUFFLE INTMMX __m64 _mm_unpackhi_pi32(__m64 __m1, __m64 __m2) Interleave the 32-bit value from the high half of M1 with the 32-bit value from the high half of M2.
MMX SHUFFLE INTMMX __m64 _m_punpckhdq(__m64 __m1, __m64 __m2) Interleave the 32-bit value from the high half of M1 with the 32-bit value from the high half of M2.
MMX SHUFFLE INTMMX __m64 _mm_unpacklo_pi8(__m64 __m1, __m64 __m2) Interleave the four 8-bit values from the low half of M1 with the four 8-bit values from the low half of M2.
MMX SHUFFLE INTMMX __m64 _m_punpcklbw(__m64 __m1, __m64 __m2) Interleave the four 8-bit values from the low half of M1 with the four 8-bit values from the low half of M2.
MMX SHUFFLE INTMMX __m64 _mm_unpacklo_pi16(__m64 __m1, __m64 __m2) Interleave the two 16-bit values from the low half of M1 with the two 16-bit values from the low half of M2.
MMX SHUFFLE INTMMX __m64 _m_punpcklwd(__m64 __m1, __m64 __m2) Interleave the two 16-bit values from the low half of M1 with the two 16-bit values from the low half of M2.
MMX SHUFFLE INTMMX __m64 _mm_unpacklo_pi32(__m64 __m1, __m64 __m2) Interleave the 32-bit value from the low half of M1 with the 32-bit value from the low half of M2.
MMX SHUFFLE INTMMX __m64 _m_punpckldq(__m64 __m1, __m64 __m2) Interleave the 32-bit value from the low half of M1 with the 32-bit value from the low half of M2.
SSE SHUFFLE INTMMX __m64 _mm_shuffle_pi16(__m64 __A, int __N) Return a combination of the four 16-bit values in A. The selector must be an immediate.
SSE SHUFFLE INTMMX __m64 _m_pshufw(__m64 __A, int __N) Return a combination of the four 16-bit values in A. The selector must be an immediate.
SSE SHUFFLE INTMMX void _mm_maskmove_si64(__m64 __A, __m64 __N, char * __P) Conditionally store byte elements of A into P. The high bit of each byte in the selector N determines whether the corresponding byte from A is stored.
SSE SHUFFLE INTMMX void _m_maskmovq(__m64 __A, __m64 __N, char * __P) Conditionally store byte elements of A into P. The high bit of each byte in the selector N determines whether the corresponding byte from A is stored.
SSSE3 SHUFFLE INTMMX __m64 _mm_shuffle_pi8(__m64 __X, __m64 __Y) Permute bytes in X. For each byte in Y, if the high bit is set, the corresponding byte in X is zeroed out. Otherwise, the low bits of the Y byte specify the source of the byte in X.
SSE2 BITLOGIC INTSSE __m128i _mm_and_si128(__m128i __A, __m128i __B) Bitwise logic
SSE2 BITLOGIC INTSSE __m128i _mm_andnot_si128(__m128i __A, __m128i __B) Bitwise logic
SSE2 BITLOGIC INTSSE __m128i _mm_or_si128(__m128i __A, __m128i __B) Bitwise logic
SSE2 BITLOGIC INTSSE __m128i _mm_xor_si128(__m128i __A, __m128i __B) Bitwise logic
SSE2 BITSHIFT INTSSE __m128i _mm_slli_epi16(__m128i __A, int __B) Shift left logical (shift in zeros) by immediate count B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_slli_epi32(__m128i __A, int __B) Shift left logical (shift in zeros) by immediate count B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_slli_epi64(__m128i __A, int __B) Shift left logical (shift in zeros) by immediate count B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_slli_si128(__m128i __A, int __B) Shift the entire 128-bit register left by immediate count B bytes.
SSE2 BITSHIFT INTSSE __m128i _mm_srli_epi16(__m128i __A, int __B) Shift right logical (shift in zeros) by immediate count B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srli_epi32(__m128i __A, int __B) Shift right logical (shift in zeros) by immediate count B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srli_epi64(__m128i __A, int __B) Shift right logical (shift in zeros) by immediate count B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srli_si128(__m128i __A, int __B) Shift the entire 128-bit register right by immediate count B bytes.
SSE2 BITSHIFT INTSSE __m128i _mm_srai_epi16(__m128i __A, int __B) Shift right arithmetic (duplicate sign bit) by immediate count B. (No byte or 128-bit forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srai_epi32(__m128i __A, int __B) Shift right arithmetic (duplicate sign bit) by immediate count B. (No byte or 128-bit forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_sll_epi16(__m128i __A, __m128i __B) Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_sll_epi32(__m128i __A, __m128i __B) Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_sll_epi64(__m128i __A, __m128i __B) Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_sll_si128(__m128i __A, __m128i __B) Shift left logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srl_epi16(__m128i __A, __m128i __B) Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srl_epi32(__m128i __A, __m128i __B) Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srl_epi64(__m128i __A, __m128i __B) Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_srl_si128(__m128i __A, __m128i __B) Shift right logical (shift in zeros) by count in low 64 bits of B. (No byte forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_sra_epi16(__m128i __A, __m128i __B) Shift right arithmetic (duplicate sign bit) by count in low 64 bits of B. (No byte or 128-bit forms.)
SSE2 BITSHIFT INTSSE __m128i _mm_sra_epi32(__m128i __A, __m128i __B) Shift right arithmetic (duplicate sign bit) by count in low 64 bits of B. (No byte or 128-bit forms.)
SSE2 COMPARE INTSSE __m128i _mm_cmpeq_epi8(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmpeq_epi16(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmpeq_epi32(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmplt_epi8(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmplt_epi16(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmplt_epi32(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmpgt_epi8(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmpgt_epi16(__m128i __A, __m128i __B) Compare packed integers.
SSE2 COMPARE INTSSE __m128i _mm_cmpgt_epi32(__m128i __A, __m128i __B) Compare packed integers.
SSE41 COMPARE INTSSE __m128i _mm_cmpeq_epi64(__m128i __X, __m128i __Y) Packed integer 64-bit comparison; each result element is filled with ones if the test is true and zeros if false.
SSE41 COMPARE INTSSE int _mm_testz_si128(__m128i __M, __m128i __V) Packed integer 128-bit bitwise comparison. Return 1 if (__V & __M) == 0.
SSE41 COMPARE INTSSE int _mm_testc_si128(__m128i __M, __m128i __V) Packed integer 128-bit bitwise comparison. Return 1 if (__V & ~__M) == 0.
SSE41 COMPARE INTSSE int _mm_testnzc_si128(__m128i __M, __m128i __V) Packed integer 128-bit bitwise comparison. Return 1 if (__V & __M) != 0 && (__V & ~__M) != 0.
SSE42 COMPARE INTSSE __m128i _mm_cmpgt_epi64(__m128i __X, __m128i __Y) Packed integer 64-bit comparison; each result element is filled with ones if the test is true and zeros if false.
SSE2 CONVERT INTSSE __m128i _mm_move_epi64(__m128i __A) Copy the low 64 bits of A and zero the upper 64 bits.
SSE41 CONVERT INTSSE __m128i _mm_cvtepi8_epi32(__m128i __X) Packed integer sign-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepi16_epi32(__m128i __X) Packed integer sign-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepi8_epi64(__m128i __X) Packed integer sign-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepi32_epi64(__m128i __X) Packed integer sign-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepi16_epi64(__m128i __X) Packed integer sign-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepi8_epi16(__m128i __X) Packed integer sign-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepu8_epi32(__m128i __X) Packed integer zero-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepu16_epi32(__m128i __X) Packed integer zero-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepu8_epi64(__m128i __X) Packed integer zero-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepu32_epi64(__m128i __X) Packed integer zero-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepu16_epi64(__m128i __X) Packed integer zero-extension.
SSE41 CONVERT INTSSE __m128i _mm_cvtepu8_epi16(__m128i __X) Packed integer zero-extension.
SSE2 EXTRACT INTSSE int _mm_cvtsi128_si32(__m128i __A) Extract low 32-bit integer from vector.
SSE2 EXTRACT INTSSE long long _mm_cvtsi128_si64(__m128i __A) Extract low 64-bit integer from vector.
SSE2 EXTRACT INTSSE long long _mm_cvtsi128_si64x(__m128i __A) Extract low 64-bit integer from vector.
SSE2 EXTRACT INTSSE int _mm_movemask_epi8(__m128i __A) Extract sign bits of byte components into integer.
SSE2 EXTRACT INTSSE int _mm_extract_epi16(__m128i const __A, int const __N) Extract word at index N from vector. No sign extension.
SSE2 EXTRACT INTSSE void _mm_maskmoveu_si128(__m128i __A, __m128i __B, char * __C) Write bytes of A to memory based on mask in B. If high bit of a byte in B is set, byte is written. C may be unaligned.
SSE41 EXTRACT INTSSE int _mm_extract_epi8(__m128i __X, const int __N) Extract integer from packed integer array element of X selected by index N.
SSE41 EXTRACT INTSSE int _mm_extract_epi32(__m128i __X, const int __N) Extract integer from packed integer array element of X selected by index N.
SSE41 EXTRACT INTSSE long long _mm_extract_epi64(__m128i __X, const int __N) Extract integer from packed integer array element of X selected by index N.
SSE2 INSERT INTSSE __m128i _mm_insert_epi16(__m128i const __A, int const __D, int const __N) Insert word D into vector at index N.
SSE41 INSERT INTSSE __m128i _mm_insert_epi8(__m128i __D, int __S, const int __N) Insert integer, S, into packed integer array element of D selected by index N.
SSE41 INSERT INTSSE __m128i _mm_insert_epi32(__m128i __D, int __S, const int __N) Insert integer, S, into packed integer array element of D selected by index N.
SSE41 INSERT INTSSE __m128i _mm_insert_epi64(__m128i __D, long long __S, const int __N) Insert integer, S, into packed integer array element of D selected by index N.
SSE41 LOADSTORE INTSSE __m128i _mm_stream_load_si128(__m128i * __X) Load double quadword using non-temporal aligned hint.
SSE2 LOADSTORE INTSSE __m128i _mm_load_si128(__m128i const * __P) Load 128-bit integer from aligned address.
SSE2 LOADSTORE INTSSE __m128i _mm_loadu_si128(__m128i const * __P) Load 128-bit integer from unaligned address.
SSE2 LOADSTORE INTSSE void _mm_store_si128(__m128i * __P, __m128i __B) Store 128-bit integer at aligned address.
SSE2 LOADSTORE INTSSE void _mm_storeu_si128(__m128i * __P, __m128i __B) Store 128-bit integer at unaligned address.
SSE2 LOADSTORE INTSSE __m128i _mm_loadl_epi64(__m128i const * __P) Load 64-bit integer into low element of vector.
SSE2 LOADSTORE INTSSE void _mm_storel_epi64(__m128i * __P, __m128i __B) Store 64-bit integer into low element of vector.
SSE2 LOADSTORE INTSSE void _mm_stream_si32(int * __A, int __B) Write value to memory without polluting caches.
SSE2 LOADSTORE INTSSE void _mm_stream_si128(__m128i * __A, __m128i __B) Write value to memory without polluting caches.
SSE2 MATHOP INTSSE __m128i _mm_add_epi8(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_add_epi16(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_add_epi32(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_add_epi64(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_sub_epi8(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_sub_epi16(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_sub_epi32(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_sub_epi64(__m128i __A, __m128i __B) Integer math, wraparound
SSE2 MATHOP INTSSE __m128i _mm_adds_epi8(__m128i __A, __m128i __B) Integer math, signed saturation
SSE2 MATHOP INTSSE __m128i _mm_adds_epi16(__m128i __A, __m128i __B) Integer math, signed saturation
SSE2 MATHOP INTSSE __m128i _mm_subs_epi8(__m128i __A, __m128i __B) Integer math, signed saturation
SSE2 MATHOP INTSSE __m128i _mm_subs_epi16(__m128i __A, __m128i __B) Integer math, signed saturation
SSE2 MATHOP INTSSE __m128i _mm_adds_epu8(__m128i __A, __m128i __B) Integer math, unsigned saturation
SSE2 MATHOP INTSSE __m128i _mm_adds_epu16(__m128i __A, __m128i __B) Integer math, unsigned saturation
SSE2 MATHOP INTSSE __m128i _mm_subs_epu8(__m128i __A, __m128i __B) Integer math, unsigned saturation
SSE2 MATHOP INTSSE __m128i _mm_subs_epu16(__m128i __A, __m128i __B) Integer math, unsigned saturation
SSE2 MATHOP INTSSE __m128i _mm_madd_epi16(__m128i __A, __m128i __B) Multiply packed word integers and add adjacent doublewords.
SSE2 MATHOP INTSSE __m128i _mm_mullo_epi16(__m128i __A, __m128i __B) Multiply packed words and return low half of each result.
SSE2 MATHOP INTSSE __m128i _mm_mulhi_epi16(__m128i __A, __m128i __B) Multiply packed signed words and return high half of each result.
SSE2 MATHOP INTSSE __m128i _mm_mulhi_epu16(__m128i __A, __m128i __B) Multiply packed unsigned words and return high half of each result.
SSE2 MATHOP INTSSE __m128i _mm_mul_epu32(__m128i __A, __m128i __B) Multiply first and third unsigned doublewords and return quadword results.
SSE2 MATHOP INTSSE __m128i _mm_max_epi16(__m128i __A, __m128i __B) Get min/max of signed words.
SSE2 MATHOP INTSSE __m128i _mm_min_epi16(__m128i __A, __m128i __B) Get min/max of signed words.
SSE2 MATHOP INTSSE __m128i _mm_max_epu8(__m128i __A, __m128i __B) Get min/max of unsigned bytes.
SSE2 MATHOP INTSSE __m128i _mm_min_epu8(__m128i __A, __m128i __B) Get min/max of unsigned bytes.
SSE2 MATHOP INTSSE __m128i _mm_avg_epu8(__m128i __A, __m128i __B) Average unsigned components.
SSE2 MATHOP INTSSE __m128i _mm_avg_epu16(__m128i __A, __m128i __B) Average unsigned components.
SSE2 MATHOP INTSSE __m128i _mm_sad_epu8(__m128i __A, __m128i __B) Sum of absolute differences of low unsigned bytes is stored in low half. Sum of absolute differences of high unsigned bytes is stored in high half.
SSE41 MATHOP INTSSE __m128i _mm_min_epi8(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_max_epi8(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_min_epu16(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_max_epu16(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_min_epi32(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_max_epi32(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_min_epu32(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_max_epu32(__m128i __X, __m128i __Y) Min/max packed integer instructions.
SSE41 MATHOP INTSSE __m128i _mm_mullo_epi32(__m128i __X, __m128i __Y) Packed integer 32-bit multiplication with truncation of upper halves of results.
SSE41 MATHOP INTSSE __m128i _mm_mul_epi32(__m128i __X, __m128i __Y) Packed integer 32-bit multiplication of 2 pairs of operands with two 64-bit results.
SSE41 MATHOP INTSSE __m128i _mm_mpsadbw_epu8(__m128i __X, __m128i __Y, const int __M) Sum absolute 8-bit integer difference of adjacent groups of 4 byte integers in the first 2 operands. Starting offsets within operands are determined by the 3rd mask operand.
SSE41 MATHOP INTSSE __m128i _mm_minpos_epu16(__m128i __X) Return horizontal packed word minimum and its index in bits [15:0] and bits [18:16] respectively.
SSSE3 MATHOP INTSSE __m128i _mm_abs_epi8(__m128i __X) Get absolute values of signed elements.
SSSE3 MATHOP INTSSE __m128i _mm_abs_epi16(__m128i __X) Get absolute values of signed elements.
SSSE3 MATHOP INTSSE __m128i _mm_abs_epi32(__m128i __X) Get absolute values of signed elements.
SSSE3 MATHOP INTSSE __m128i _mm_hadd_epi16(__m128i __X, __m128i __Y) Horizontal addition across vectors. Returns [[Xi0+Xi1] [Xi2+Xi3] ... [Yi4+Yi5] [Yi6+Yi7]].
SSSE3 MATHOP INTSSE __m128i _mm_hadd_epi32(__m128i __X, __m128i __Y) Horizontal addition across vectors. Returns [[Xi0+Xi1] ... [Yi2+Yi3]].
SSSE3 MATHOP INTSSE __m128i _mm_hadds_epi16(__m128i __X, __m128i __Y) Horizontal addition across vectors with signed saturation. Returns [[Xi0+Xi1] [Xi2+Xi3] ... [Yi4+Yi5] [Yi6+Yi7]].
SSSE3 MATHOP INTSSE __m128i _mm_hsub_epi16(__m128i __X, __m128i __Y) Horizontal subtraction across vectors. Returns [[Xi0-Xi1] [Xi2-Xi3] ... [Yi4-Yi5] [Yi6-Yi7]].
SSSE3 MATHOP INTSSE __m128i _mm_hsub_epi32(__m128i __X, __m128i __Y) Horizontal subtraction across vectors. Returns [[Xi0-Xi1] ... [Yi2-Yi3]].
SSSE3 MATHOP INTSSE __m128i _mm_hsubs_epi16(__m128i __X, __m128i __Y) Horizontal subtraction across vectors with signed saturation. Returns [[Xi0-Xi1] [Xi2-Xi3] ... [Yi4-Yi5] [Yi6-Yi7]].
SSSE3 MATHOP INTSSE __m128i _mm_maddubs_epi16(__m128i __X, __m128i __Y) Multiplies vertically each unsigned byte of X with the corresponding signed byte of Y, producing intermediate signed 16-bit integers. Each adjacent pair of signed words is added and the saturated result is packed.
SSSE3 MATHOP INTSSE __m128i _mm_mulhrs_epi16(__m128i __X, __m128i __Y) Multiplies vertically each signed 16-bit integer from X with the corresponding signed 16-bit integer of Y, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and packed.
SSSE3 MATHOP INTSSE __m128i _mm_sign_epi8(__m128i __X, __m128i __Y) Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y.
SSSE3 MATHOP INTSSE __m128i _mm_sign_epi16(__m128i __X, __m128i __Y) Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y.
SSSE3 MATHOP INTSSE __m128i _mm_sign_epi32(__m128i __X, __m128i __Y) Multiply element in X by {1, 0, -1} depending on sign of corresponding element in Y.
SSSE3 MATHOP INTSSE __m128i _mm_alignr_epi8(__m128i __X, __m128i __Y, int __N) Concatenates X and Y into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result.
SSE2 SET INTSSE __m128i _mm_set_epi64x(long long __q1, long long __q0) Create vector from elements, lowest element last.
SSE2 SET INTSSE __m128i _mm_set_epi64(__m64 __q1, __m64 __q0) Create vector from elements, lowest element last.
SSE2 SET INTSSE __m128i _mm_set_epi32(int __q3, int __q2, int __q1, int __q0) Create vector from elements, lowest element last.
SSE2 SET INTSSE __m128i _mm_set_epi16(short __q7, ... short __q0) Create vector from elements, lowest element last.
SSE2 SET INTSSE __m128i _mm_set_epi8(char __q15, ... char __q00) Create vector from elements, lowest element last.
SSE2 SET INTSSE __m128i _mm_setr_epi64(__m64 __q0, __m64 __q1) Create vector from elements, lowest element first.
SSE2 SET INTSSE __m128i _mm_setr_epi32(int __q0, int __q1, int __q2, int __q3) Create vector from elements, lowest element first.
SSE2 SET INTSSE __m128i _mm_setr_epi16(short __q0, ... short __q7) Create vector from elements, lowest element first.
SSE2 SET INTSSE __m128i _mm_setr_epi8(char __q00, ... char __q15) Create vector from elements, lowest element first.
SSE2 SET INTSSE __m128i _mm_setzero_si128(void ) Create a vector of zeros.
SSE2 SET INTSSE __m128i _mm_cvtsi32_si128(int __A) Set low doubleword of vector to A and clear high bits.
SSE2 SET INTSSE __m128i _mm_cvtsi64_si128(long long __A) Set low quadword of vector to A and clear high bits.
SSE2 SET INTSSE __m128i _mm_cvtsi64x_si128(long long __A) Set low quadword of vector to A and clear high bits.
SSE2 SET INTSSE __m128i _mm_set1_epi64x(long long __A) Set all components of the vector to A.
SSE2 SET INTSSE __m128i _mm_set1_epi64(__m64 __A) Set all components of the vector to A.
SSE2 SET INTSSE __m128i _mm_set1_epi32(int __A) Set all components of the vector to A.
SSE2 SET INTSSE __m128i _mm_set1_epi16(short __A) Set all components of the vector to A.
SSE2 SET INTSSE __m128i _mm_set1_epi8(char __A) Set all components of the vector to A.
SSE2 SHUFFLE INTSSE __m128i _mm_packs_epi16(__m128i __A, __m128i __B) Pack eight words from each operand into sixteen bytes using signed saturation.
SSE2 SHUFFLE INTSSE __m128i _mm_packs_epi32(__m128i __A, __m128i __B) Pack four doublewords from each operand into eight words using signed saturation.
SSE2 SHUFFLE INTSSE __m128i _mm_packus_epi16(__m128i __A, __m128i __B) Pack eight words from each operand into sixteen bytes using unsigned saturation.
SSE2 SHUFFLE INTSSE __m128i _mm_unpackhi_epi8(__m128i __A, __m128i __B) Unpack and interleave high components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_unpackhi_epi16(__m128i __A, __m128i __B) Unpack and interleave high components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_unpackhi_epi32(__m128i __A, __m128i __B) Unpack and interleave high components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_unpackhi_epi64(__m128i __A, __m128i __B) Unpack and interleave high components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_unpacklo_epi8(__m128i __A, __m128i __B) Unpack and interleave low components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_unpacklo_epi16(__m128i __A, __m128i __B) Unpack and interleave low components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_unpacklo_epi32(__m128i __A, __m128i __B) Unpack and interleave low components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_unpacklo_epi64(__m128i __A, __m128i __B) Unpack and interleave low components of operands.
SSE2 SHUFFLE INTSSE __m128i _mm_shufflehi_epi16(__m128i __A, int __B) Shuffle high words of input based on fields of __B.
SSE2 SHUFFLE INTSSE __m128i _mm_shufflelo_epi16(__m128i __A, int __B) Shuffle low words of input based on fields of __B.
SSE2 SHUFFLE INTSSE __m128i _mm_shuffle_epi32(__m128i __A, int __B) Shuffle doublewords of input based on fields of __B.
SSE41 SHUFFLE INTSSE __m128i _mm_blend_epi16(__m128i __X, __m128i __Y, const int __M) Integer blend instructions - select words from 2 sources using a constant mask.
SSE41 SHUFFLE INTSSE __m128i _mm_blendv_epi8(__m128i __X, __m128i __Y, __m128i __M) Integer blend instructions - select bytes from 2 sources using a variable mask.
SSE41 SHUFFLE INTSSE __m128i _mm_packus_epi32(__m128i __X, __m128i __Y) Pack 8 double words from 2 operands into 8 words of result with unsigned saturation.
SSSE3 SHUFFLE INTSSE __m128i _mm_shuffle_epi8(__m128i __X, __m128i __Y) Permute bytes in X. For each byte in Y, if the high bit is set, the corresponding byte in X is zeroed out. Otherwise, the low bits of the Y byte specify the source of the byte in X.
SSE42 STRING INTSSE __m128i _mm_cmpistrm(__m128i __X, __m128i __Y, const int __M) Intrinsics for text/string processing.
SSE42 STRING INTSSE int _mm_cmpistri(__m128i __X, __m128i __Y, const int __M) Intrinsics for text/string processing.
SSE42 STRING INTSSE __m128i _mm_cmpestrm(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) Intrinsics for text/string processing.
SSE42 STRING INTSSE int _mm_cmpestri(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) Intrinsics for text/string processing.
SSE42 STRING INTSSE int _mm_cmpistra(__m128i __X, __m128i __Y, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpistrc(__m128i __X, __m128i __Y, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpistro(__m128i __X, __m128i __Y, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpistrs(__m128i __X, __m128i __Y, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpistrz(__m128i __X, __m128i __Y, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpestra(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpestrc(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpestro(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpestrs(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE42 STRING INTSSE int _mm_cmpestrz(__m128i __X, int __LX, __m128i __Y, int __LY, const int __M) Intrinsics for text/string processing and reading values of EFlags.
SSE2 CASTING MIXED __m128 _mm_castpd_ps(__m128d __A) Type conversion for compiler. No value modification.
SSE2 CASTING MIXED __m128i _mm_castpd_si128(__m128d __A) Type conversion for compiler. No value modification.
SSE2 CASTING MIXED __m128d _mm_castps_pd(__m128 __A) Type conversion for compiler. No value modification.
SSE2 CASTING MIXED __m128i _mm_castps_si128(__m128 __A) Type conversion for compiler. No value modification.
SSE2 CASTING MIXED __m128 _mm_castsi128_ps(__m128i __A) Type conversion for compiler. No value modification.
SSE2 CASTING MIXED __m128d _mm_castsi128_pd(__m128i __A) Type conversion for compiler. No value modification.
SSE CONVERT MIXED int _mm_cvtss_si32(__m128 __A) Convert lowest float in vector to integer. Round per MXCSR.
SSE CONVERT MIXED int _mm_cvt_ss2si(__m128 __A) Convert lowest float in vector to integer. Round per MXCSR.
SSE CONVERT MIXED int _mm_cvttss_si32(__m128 __A) Convert lowest float in vector to integer. Round by truncation.
SSE CONVERT MIXED int _mm_cvtt_ss2si(__m128 __A) Convert lowest float in vector to integer. Round by truncation.
SSE CONVERT MIXED long long _mm_cvtss_si64(__m128 __A) Convert lowest float in vector to quadword integer. Round per MXCSR.
SSE CONVERT MIXED long long _mm_cvtss_si64x(__m128 __A) Convert lowest float in vector to quadword integer. Round per MXCSR.
SSE CONVERT MIXED long long _mm_cvttss_si64(__m128 __A) Convert lowest float in vector to quadword integer. Round by truncation.
SSE CONVERT MIXED long long _mm_cvttss_si64x(__m128 __A) Convert lowest float in vector to quadword integer. Round by truncation.
SSE CONVERT MIXED __m128 _mm_cvtsi32_ss(__m128 __A, int __B) Convert B to a float and replace low element in A.
SSE CONVERT MIXED __m128 _mm_cvt_si2ss(__m128 __A, int __B) Convert B to a float and replace low element in A.
SSE CONVERT MIXED __m128 _mm_cvtsi64_ss(__m128 __A, long long __B) Convert B to a float and replace low element in A.
SSE CONVERT MIXED __m128 _mm_cvtsi64x_ss(__m128 __A, long long __B) Convert B to a float and replace low element in A.
SSE2 CONVERT MIXED __m128d _mm_cvtepi32_pd(__m128i __A) Convert two lowest doubleword integers to double floats.
SSE2 CONVERT MIXED __m128 _mm_cvtepi32_ps(__m128i __A) Convert four doubleword integers to four single floats.
SSE2 CONVERT MIXED __m128i _mm_cvtpd_epi32(__m128d __A) Convert two double floats to lowest doubleword integers. High half is cleared. Round per MXCSR.
SSE2 CONVERT MIXED __m128 _mm_cvtpd_ps(__m128d __A) Convert two double floats to lowest single floats. High half is cleared.
SSE2 CONVERT MIXED __m128i _mm_cvttpd_epi32(__m128d __A) Convert two double floats to lowest doubleword integers. High half is cleared. Round by truncation.
SSE2 CONVERT MIXED __m128i _mm_cvtps_epi32(__m128 __A) Convert four single floats to doubleword integers. Round per MXCSR.
SSE2 CONVERT MIXED __m128i _mm_cvttps_epi32(__m128 __A) Convert four single floats to doubleword integers. Round by truncation. (See the rounding example after the table.)
SSE2 CONVERT MIXED __m128d _mm_cvtps_pd(__m128 __A) Convert two lowest single floats to double floats.
SSE2 CONVERT MIXED int _mm_cvtsd_si32(__m128d __A) Convert lowest double float to integer. Round per MXCSR.
SSE2 CONVERT MIXED long long _mm_cvtsd_si64(__m128d __A) Convert lowest double float to integer. Round per MXCSR.
SSE2 CONVERT MIXED long long _mm_cvtsd_si64x(__m128d __A) Convert lowest double float to integer. Round per MXCSR.
SSE2 CONVERT MIXED int _mm_cvttsd_si32(__m128d __A) Convert lowest double float to integer. Round by truncation.
SSE2 CONVERT MIXED long long _mm_cvttsd_si64(__m128d __A) Convert lowest double float to integer. Round by truncation.
SSE2 CONVERT MIXED long long _mm_cvttsd_si64x(__m128d __A) Convert lowest double float to integer. Round by truncation.
SSE2 CONVERT MIXED __m128 _mm_cvtsd_ss(__m128 __A, __m128d __B) Convert lowest double float of B to lowest single float of result. Upper elements are copied from A.
SSE2 CONVERT MIXED __m128d _mm_cvtsi32_sd(__m128d __A, int __B) Convert signed doubleword integer to lowest double float. High bits are copied from A.
SSE2 CONVERT MIXED __m128d _mm_cvtsi64_sd(__m128d __A, long long __B) Convert signed quadword integer to lowest double float. High bits are copied from A.
SSE2 CONVERT MIXED __m128d _mm_cvtsi64x_sd(__m128d __A, long long __B) Convert signed quadword integer to lowest double float. High bits are copied from A.
SSE2 CONVERT MIXED __m128d _mm_cvtss_sd(__m128d __A, __m128 __B) Convert lowest single float of B to lowest double float of result. High bits are copied from A.
SSE FPUMODE OTHER unsigned int _mm_getcsr(void ) Get/set contents of the MXCSR control/status register.
SSE FPUMODE OTHER void _mm_setcsr(unsigned int __I) Get/set contents of the MXCSR control/status register.
SSE FPUMODE OTHER unsigned int _MM_GET_EXCEPTION_STATE(void ) Read bits from the control register.
SSE FPUMODE OTHER unsigned int _MM_GET_EXCEPTION_MASK(void ) Read bits from the control register.
SSE FPUMODE OTHER unsigned int _MM_GET_ROUNDING_MODE(void ) Read bits from the control register.
SSE FPUMODE OTHER unsigned int _MM_GET_FLUSH_ZERO_MODE(void ) Read bits from the control register.
SSE FPUMODE OTHER void _MM_SET_EXCEPTION_STATE(unsigned int __mask) Set bits in the control register.
SSE FPUMODE OTHER void _MM_SET_EXCEPTION_MASK(unsigned int __mask) Set bits in the control register.
SSE FPUMODE OTHER void _MM_SET_ROUNDING_MODE(unsigned int __mode) Set bits in the control register.
SSE FPUMODE OTHER void _MM_SET_FLUSH_ZERO_MODE(unsigned int __mode) Set bits in the control register.
SSE3 FPUMODE OTHER void _MM_SET_DENORMALS_ZERO_MODE(int mode) Set bits in the control register.
SSE3 FPUMODE OTHER int _MM_GET_DENORMALS_ZERO_MODE(void ) Read bits from the control register.
SSE42 MATHOP OTHER int _mm_popcnt_u32(unsigned int __X) Count the number of bits set to 1.
SSE42 MATHOP OTHER long long _mm_popcnt_u64(unsigned long long __X) Count the number of bits set to 1.
SSE42 MATHOP OTHER unsigned int _mm_crc32_u8(unsigned int __C, unsigned char __V) Accumulate CRC32 (polynomial 0x11EDC6F41) value.
SSE42 MATHOP OTHER unsigned int _mm_crc32_u16(unsigned int __C, unsigned short __V) Accumulate CRC32 (polynomial 0x11EDC6F41) value.
SSE42 MATHOP OTHER unsigned int _mm_crc32_u32(unsigned int __C, unsigned int __V) Accumulate CRC32 (polynomial 0x11EDC6F41) value.
SSE42 MATHOP OTHER unsigned long long _mm_crc32_u64(unsigned long long __C, unsigned long long __V) Accumulate CRC32 (polynomial 0x11EDC6F41) value.
SSE MEMORY OTHER void _mm_prefetch(void const * __P, enum _mm_hint __I) Loads one cache line from address P to a location "closer" to the processor. The selector I specifies the type of prefetch operation. Possibly unnecessary on the Pentium 4 and later, which already perform speculative prefetches.
SSE MEMORY OTHER void _mm_sfence(void ) Store fence. Serializing instruction for cache manipulation.
SSE2 MEMORY OTHER void _mm_clflush(void const * __A) Flush cache line. Note: check CPUID bit for availability.
SSE2 MEMORY OTHER void _mm_lfence(void ) Load fence. Serializing instruction for cache manipulation.
SSE2 MEMORY OTHER void _mm_mfence(void ) Memory fence. Serializing instruction for cache manipulation.
SSE3 MEMORY OTHER void _mm_monitor(void const * __P, unsigned int __E, unsigned int __H) Specify address to watch for future _mm_mwait() call.
SSE3 MEMORY OTHER void _mm_mwait(unsigned int __E, unsigned int __H) Enter a low power state and wait for store operation at address specified by _mm_monitor() or certain system-defined events. Used for power management or multiprocessor synchronization.
SSE3 LOADSTORE INTSSE __m128i _mm_lddqu_si128(__m128i const * __P) Load value from unaligned address where cache line splits are a performance problem.
SSE OTHER OTHER void _mm_pause(void ) Pause the processor for an implementation-specific amount of time; typically used in spin-wait loops. May save power.
SSE BITLOGIC REAL32 __m128 _mm_and_ps(__m128 __A, __m128 __B) Bitwise logic. (Combined with compares, these enable branchless selection; see the example after the table.)
SSE BITLOGIC REAL32 __m128 _mm_andnot_ps(__m128 __A, __m128 __B) Bitwise logic.
SSE BITLOGIC REAL32 __m128 _mm_or_ps(__m128 __A, __m128 __B) Bitwise logic.
SSE BITLOGIC REAL32 __m128 _mm_xor_ps(__m128 __A, __m128 __B) Bitwise logic.
SSE COMPARE REAL32 __m128 _mm_cmpeq_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmplt_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmple_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpgt_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpge_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpneq_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpnlt_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpnle_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpngt_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpnge_ss(__m128 __A, __m128 __B) Compare low elements.
SSE COMPARE REAL32 __m128 _mm_cmpord_ss(__m128 __A, __m128 __B) Compare low elements. Unordered means at least one operand is NaN.
SSE COMPARE REAL32 __m128 _mm_cmpunord_ss(__m128 __A, __m128 __B) Compare low elements. Unordered means at least one operand is NaN.
SSE COMPARE REAL32 __m128 _mm_cmpeq_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmplt_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmple_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpgt_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpge_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpneq_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpnlt_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpnle_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpngt_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpnge_ps(__m128 __A, __m128 __B) Compare all elements.
SSE COMPARE REAL32 __m128 _mm_cmpord_ps(__m128 __A, __m128 __B) Compare all elements. Unordered means at least one operand is NaN.
SSE COMPARE REAL32 __m128 _mm_cmpunord_ps(__m128 __A, __m128 __B) Compare all elements. Unordered means at least one operand is NaN.
SSE COMPARE REAL32 int _mm_comieq_ss(__m128 __A, __m128 __B) Compare elements. Throws exception on QNaN or SNaN.
SSE COMPARE REAL32 int _mm_comilt_ss(__m128 __A, __m128 __B) Compare elements. Throws exception on QNaN or SNaN.
SSE COMPARE REAL32 int _mm_comile_ss(__m128 __A, __m128 __B) Compare elements. Throws exception on QNaN or SNaN.
SSE COMPARE REAL32 int _mm_comigt_ss(__m128 __A, __m128 __B) Compare elements. Throws exception on QNaN or SNaN.
SSE COMPARE REAL32 int _mm_comige_ss(__m128 __A, __m128 __B) Compare elements. Throws exception on QNaN or SNaN.
SSE COMPARE REAL32 int _mm_comineq_ss(__m128 __A, __m128 __B) Compare elements. Throws exception on QNaN or SNaN.
SSE COMPARE REAL32 int _mm_ucomieq_ss(__m128 __A, __m128 __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE COMPARE REAL32 int _mm_ucomilt_ss(__m128 __A, __m128 __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE COMPARE REAL32 int _mm_ucomile_ss(__m128 __A, __m128 __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE COMPARE REAL32 int _mm_ucomigt_ss(__m128 __A, __m128 __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE COMPARE REAL32 int _mm_ucomige_ss(__m128 __A, __m128 __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE COMPARE REAL32 int _mm_ucomineq_ss(__m128 __A, __m128 __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE EXTRACT REAL32 int _mm_movemask_ps(__m128 __A) Extract sign bits of all components into integer.
SSE EXTRACT REAL32 float _mm_cvtss_f32(__m128 __A) Extract low element of vector.
SSE EXTRACT REAL32 __m128 _mm_move_ss(__m128 __A, __m128 __B) Sets the low SPFP value of A from the low value of B.
SSE41 EXTRACT REAL32 int _mm_extract_ps(__m128 __X, const int __N) Extract binary representation of single precision float from packed single precision array element of X selected by index N.
SSE41 INSERT REAL32 __m128 _mm_insert_ps(__m128 __D, __m128 __S, const int __N) Insert single precision float into packed single precision array element selected by index N. The bits [7-6] of N define S index, the bits [5-4] define D index, and bits [3-0] define zeroing mask for D.
SSE LOADSTORE REAL32 void _mm_stream_ps(float * __P, __m128 __A) Write value to memory without polluting caches. The address must be 16-byte aligned.
SSE LOADSTORE REAL32 __m128 _mm_loadh_pi(__m128 __A, __m64 const * __P) Sets the upper two SPFP values with 64-bits of data loaded from P; the lower two values are passed through from A.
SSE LOADSTORE REAL32 void _mm_storeh_pi(__m64 * __P, __m128 __A) Stores the upper two SPFP values of A into P.
SSE LOADSTORE REAL32 __m128 _mm_loadl_pi(__m128 __A, __m64 const * __P) Sets the lower two SPFP values with 64-bits of data loaded from P; the upper two values are passed through from A.
SSE LOADSTORE REAL32 void _mm_storel_pi(__m64 * __P, __m128 __A) Stores the lower two SPFP values of A into P.
SSE LOADSTORE REAL32 __m128 _mm_load1_ps(float const * __P) Create a vector with all four elements equal to *P.
SSE LOADSTORE REAL32 __m128 _mm_load_ps1(float const * __P) Create a vector with all four elements equal to *P.
SSE LOADSTORE REAL32 __m128 _mm_load_ps(float const * __P) Load four SPFP values from P. The address must be 16-byte aligned.
SSE LOADSTORE REAL32 __m128 _mm_loadu_ps(float const * __P) Load four SPFP values from P. The address need not be 16-byte aligned.
SSE LOADSTORE REAL32 __m128 _mm_loadr_ps(float const * __P) Load four SPFP values in reverse order. The address must be aligned.
SSE LOADSTORE REAL32 void _mm_store_ss(float * __P, __m128 __A) Stores the lower SPFP value.
SSE LOADSTORE REAL32 void _mm_store_ps(float * __P, __m128 __A) Store four SPFP values. The address must be 16-byte aligned.
SSE LOADSTORE REAL32 void _mm_storeu_ps(float * __P, __m128 __A) Store four SPFP values. The address need not be 16-byte aligned.
SSE LOADSTORE REAL32 void _mm_store1_ps(float * __P, __m128 __A) Store the lower SPFP value into all four elements. (Duplicate low value.) The address must be 16-byte aligned.
SSE LOADSTORE REAL32 void _mm_store_ps1(float * __P, __m128 __A) Store the lower SPFP value into all four elements. (Duplicate low value.) The address must be 16-byte aligned.
SSE LOADSTORE REAL32 void _mm_storer_ps(float * __P, __m128 __A) Store four SPFP values in reverse order. The address must be aligned.
SSE3 LOADSTORE REAL32 __m128 _mm_movehdup_ps(__m128 __X) Given vector [f0 f1 f2 f3] in X, duplicate elements 1 and 3. Returns [f1 f1 f3 f3].
SSE3 LOADSTORE REAL32 __m128 _mm_moveldup_ps(__m128 __X) Given vector [f0 f1 f2 f3] in X, duplicate elements 0 and 2. Returns [f0 f0 f2 f2].
SSE MATHOP REAL32 __m128 _mm_add_ss(__m128 __A, __m128 __B) Basic math on low elements. Copy high bits of A.
SSE MATHOP REAL32 __m128 _mm_sub_ss(__m128 __A, __m128 __B) Basic math on low elements. Copy high bits of A.
SSE MATHOP REAL32 __m128 _mm_mul_ss(__m128 __A, __m128 __B) Basic math on low elements. Copy high bits of A.
SSE MATHOP REAL32 __m128 _mm_div_ss(__m128 __A, __m128 __B) Basic math on low elements. Copy high bits of A.
SSE MATHOP REAL32 __m128 _mm_min_ss(__m128 __A, __m128 __B) Basic math on low elements. Copy high bits of A.
SSE MATHOP REAL32 __m128 _mm_max_ss(__m128 __A, __m128 __B) Basic math on low elements. Copy high bits of A.
SSE MATHOP REAL32 __m128 _mm_add_ps(__m128 __A, __m128 __B) Basic math on all elements.
SSE MATHOP REAL32 __m128 _mm_sub_ps(__m128 __A, __m128 __B) Basic math on all elements.
SSE MATHOP REAL32 __m128 _mm_mul_ps(__m128 __A, __m128 __B) Basic math on all elements.
SSE MATHOP REAL32 __m128 _mm_div_ps(__m128 __A, __m128 __B) Basic math on all elements.
SSE MATHOP REAL32 __m128 _mm_min_ps(__m128 __A, __m128 __B) Basic math on all elements.
SSE MATHOP REAL32 __m128 _mm_max_ps(__m128 __A, __m128 __B) Basic math on all elements.
SSE MATHOP REAL32 __m128 _mm_sqrt_ps(__m128 __A) Get square root of each element.
SSE MATHOP REAL32 __m128 _mm_sqrt_ss(__m128 __A) Get square root of low component of A. High bits are unchanged.
SSE MATHOP REAL32 __m128 _mm_rcp_ps(__m128 __A) Get approximate reciprocal of each element.
SSE MATHOP REAL32 __m128 _mm_rcp_ss(__m128 __A) Get approximate reciprocal of low element. High bits are unchanged.
SSE MATHOP REAL32 __m128 _mm_rsqrt_ps(__m128 __A) Get approximate reciprocal of square root of each element.
SSE MATHOP REAL32 __m128 _mm_rsqrt_ss(__m128 __A) Get approximate reciprocal of square root of low component of A. High bits are unchanged.
SSE3 MATHOP REAL32 __m128 _mm_addsub_ps(__m128 __X, __m128 __Y) Adds the odd-numbered SPFP values of X and Y and subtracts the even-numbered SPFP values of Y from those of X. Returns [[X0-Y0] [X1+Y1] [X2-Y2] [X3+Y3]].
SSE3 MATHOP REAL32 __m128 _mm_hadd_ps(__m128 __X, __m128 __Y) Horizontal addition across vectors. Returns [[Xf0+Xf1] [Xf2+Xf3] [Yf0+Yf1] [Yf2+Yf3]].
SSE3 MATHOP REAL32 __m128 _mm_hsub_ps(__m128 __X, __m128 __Y) Horizontal subtraction across vectors. Returns [[Xf0-Xf1] [Xf2-Xf3] [Yf0-Yf1] [Yf2-Yf3]].
SSE41 MATHOP REAL32 __m128 _mm_dp_ps(__m128 __X, __m128 __Y, const int __M) Dot product instructions with mask-defined summing and zeroing parts of result.
SSE SET REAL32 __m128 _mm_set_ss(float __F) Create a vector with element 0 as F and the rest zero.
SSE SET REAL32 __m128 _mm_set1_ps(float __F) Set all elements of vector to same value.
SSE SET REAL32 __m128 _mm_set_ps1(float __F) Set all elements of vector to same value.
SSE SET REAL32 __m128 _mm_load_ss(float const * __P) Create a vector with element 0 as *P and the rest zero.
SSE SET REAL32 __m128 _mm_setzero_ps(void ) Create a vector of zeros.
SSE SET REAL32 __m128 _mm_set_ps(const float __Z, const float __Y, const float __X, const float __W) Create a vector with W as element 0, X as element 1, Y as element 2, and Z as element 3.
SSE SET REAL32 __m128 _mm_setr_ps(float __Z, float __Y, float __X, float __W) Create a vector with Z as element 0, Y as element 1, X as element 2, and W as element 3 (reversed argument order).
SSE SHUFFLE REAL32 __m128 _mm_shuffle_ps(__m128 __A, __m128 __B, int __mask) Selects two of the four SPFP values from A (chosen by bits [3:0] of the mask) for the low half of the result, and two of the four SPFP values from B (chosen by bits [7:4]) for the high half.
SSE SHUFFLE REAL32 __m128 _mm_unpackhi_ps(__m128 __A, __m128 __B) Unpack and interleave high components of inputs.
SSE SHUFFLE REAL32 __m128 _mm_unpacklo_ps(__m128 __A, __m128 __B) Unpack and interleave low components of inputs.
SSE SHUFFLE REAL32 __m128 _mm_movehl_ps(__m128 __A, __m128 __B) Moves the upper two values of B into the lower two values of A.
SSE SHUFFLE REAL32 __m128 _mm_movelh_ps(__m128 __A, __m128 __B) Moves the lower two values of B into the upper two values of A.
SSE SHUFFLE REAL32 void _MM_TRANSPOSE4_PS(__m128& row0, __m128& row1, __m128& row2, __m128& row3) Transpose the 4x4 matrix composed of row[0-3]. (MACRO)
SSE41 SHUFFLE REAL32 __m128 _mm_blend_ps(__m128 __X, __m128 __Y, const int __M) Single precision floating point blend instructions - select data from 2 sources using constant/variable mask.
SSE41 SHUFFLE REAL32 __m128 _mm_blendv_ps(__m128 __X, __m128 __Y, __m128 __M) Single precision floating point blend instructions - select data from 2 sources using constant/variable mask.
SSE2 BITLOGIC REAL64 __m128d _mm_and_pd(__m128d __A, __m128d __B) Bitwise logic.
SSE2 BITLOGIC REAL64 __m128d _mm_andnot_pd(__m128d __A, __m128d __B) Bitwise logic.
SSE2 BITLOGIC REAL64 __m128d _mm_or_pd(__m128d __A, __m128d __B) Bitwise logic.
SSE2 BITLOGIC REAL64 __m128d _mm_xor_pd(__m128d __A, __m128d __B) Bitwise logic.
SSE2 COMPARE REAL64 __m128d _mm_cmpeq_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmplt_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmple_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpgt_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpge_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpneq_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpnlt_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpnle_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpngt_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpnge_sd(__m128d __A, __m128d __B) Compare low elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpord_sd(__m128d __A, __m128d __B) Compare low elements. Unordered means at least one operand is NaN.
SSE2 COMPARE REAL64 __m128d _mm_cmpunord_sd(__m128d __A, __m128d __B) Compare low elements. Unordered means at least one operand is NaN.
SSE2 COMPARE REAL64 __m128d _mm_cmpeq_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmplt_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmple_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpgt_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpge_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpneq_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpnlt_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpnle_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpngt_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpnge_pd(__m128d __A, __m128d __B) Compare all elements.
SSE2 COMPARE REAL64 __m128d _mm_cmpord_pd(__m128d __A, __m128d __B) Compare all elements. Unordered means at least one operand is NaN.
SSE2 COMPARE REAL64 __m128d _mm_cmpunord_pd(__m128d __A, __m128d __B) Compare all elements. Unordered means at least one operand is NaN.
SSE2 COMPARE REAL64 int _mm_comieq_sd(__m128d __A, __m128d __B) Compare elements. Throws exception on QNaN or SNaN.
SSE2 COMPARE REAL64 int _mm_comilt_sd(__m128d __A, __m128d __B) Compare elements. Throws exception on QNaN or SNaN.
SSE2 COMPARE REAL64 int _mm_comile_sd(__m128d __A, __m128d __B) Compare elements. Throws exception on QNaN or SNaN.
SSE2 COMPARE REAL64 int _mm_comigt_sd(__m128d __A, __m128d __B) Compare elements. Throws exception on QNaN or SNaN.
SSE2 COMPARE REAL64 int _mm_comige_sd(__m128d __A, __m128d __B) Compare elements. Throws exception on QNaN or SNaN.
SSE2 COMPARE REAL64 int _mm_comineq_sd(__m128d __A, __m128d __B) Compare elements. Throws exception on QNaN or SNaN.
SSE2 COMPARE REAL64 int _mm_ucomieq_sd(__m128d __A, __m128d __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE2 COMPARE REAL64 int _mm_ucomilt_sd(__m128d __A, __m128d __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE2 COMPARE REAL64 int _mm_ucomile_sd(__m128d __A, __m128d __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE2 COMPARE REAL64 int _mm_ucomigt_sd(__m128d __A, __m128d __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE2 COMPARE REAL64 int _mm_ucomige_sd(__m128d __A, __m128d __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE2 COMPARE REAL64 int _mm_ucomineq_sd(__m128d __A, __m128d __B) Compare elements. Tolerates QNaN but throws exception on SNaN.
SSE41 CONVERT REAL64 __m128d _mm_round_pd(__m128d __V, const int __M) Packed/scalar double precision floating point rounding.
SSE41 CONVERT REAL64 __m128d _mm_round_sd(__m128d __D, __m128d __V, const int __M) Packed/scalar double precision floating point rounding.
SSE41 CONVERT REAL64 __m128d _mm_ceil_pd(__m128d __V) Packed/scalar double precision floating point rounding.
SSE41 CONVERT REAL64 __m128d _mm_ceil_sd(__m128d __D, __m128d __V) Packed/scalar double precision floating point rounding.
SSE41 CONVERT REAL64 __m128d _mm_floor_pd(__m128d __V) Packed/scalar double precision floating point rounding.
SSE41 CONVERT REAL64 __m128d _mm_floor_sd(__m128d __D, __m128d __V) Packed/scalar double precision floating point rounding.
SSE41 CONVERT REAL32 __m128 _mm_round_ps(__m128 __V, const int __M) Packed/scalar single precision floating point rounding.
SSE41 CONVERT REAL32 __m128 _mm_round_ss(__m128 __D, __m128 __V, const int __M) Packed/scalar single precision floating point rounding.
SSE41 CONVERT REAL32 __m128 _mm_ceil_ps(__m128 __V) Packed/scalar single precision floating point rounding.
SSE41 CONVERT REAL32 __m128 _mm_ceil_ss(__m128 __D, __m128 __V) Packed/scalar single precision floating point rounding.
SSE41 CONVERT REAL32 __m128 _mm_floor_ps(__m128 __V) Packed/scalar single precision floating point rounding.
SSE41 CONVERT REAL32 __m128 _mm_floor_ss(__m128 __D, __m128 __V) Packed/scalar single precision floating point rounding.
SSE2 EXTRACT REAL64 double _mm_cvtsd_f64(__m128d __A) Extract low element of vector.
SSE2 EXTRACT REAL64 int _mm_movemask_pd(__m128d __A) Extract sign bits of all components into integer.
SSE2 LOADSTORE REAL64 __m128d _mm_move_sd(__m128d __A, __m128d __B) Sets the low DPFP value of A from the low value of B.
SSE2 LOADSTORE REAL64 __m128d _mm_load_pd(double const * __P) Load two DPFP values from P. The address must be 16-byte aligned.
SSE2 LOADSTORE REAL64 __m128d _mm_loadu_pd(double const * __P) Load two DPFP values from P. The address need not be 16-byte aligned.
SSE2 LOADSTORE REAL64 __m128d _mm_load1_pd(double const * __P) Create a vector with both elements equal to *P.
SSE2 LOADSTORE REAL64 __m128d _mm_load_pd1(double const * __P) Create a vector with both elements equal to *P.
SSE2 LOADSTORE REAL64 __m128d _mm_load_sd(double const * __P) Create a vector with element 0 as *P and the rest zero.
SSE2 LOADSTORE REAL64 __m128d _mm_loadr_pd(double const * __P) Load two DPFP values in reverse order. The address must be aligned.
SSE2 LOADSTORE REAL64 void _mm_store_pd(double * __P, __m128d __A) Store two DPFP values. The address must be 16-byte aligned.
SSE2 LOADSTORE REAL64 void _mm_storeu_pd(double * __P, __m128d __A) Store two DPFP values. The address need not be 16-byte aligned.
SSE2 LOADSTORE REAL64 void _mm_store_sd(double * __P, __m128d __A) Stores the lower DPFP value.
SSE2 LOADSTORE REAL64 void _mm_storel_pd(double * __P, __m128d __A) Stores the lower DPFP value.
SSE2 LOADSTORE REAL64 void _mm_storeh_pd(double * __P, __m128d __A) Stores the upper DPFP value.
SSE2 LOADSTORE REAL64 void _mm_store1_pd(double * __P, __m128d __A) Store the lower DPFP value into both elements. (Duplicate the value.) The address must be 16-byte aligned.
SSE2 LOADSTORE REAL64 void _mm_store_pd1(double * __P, __m128d __A) Store the lower DPFP value into both elements. (Duplicate the value.) The address must be 16-byte aligned.
SSE2 LOADSTORE REAL64 void _mm_storer_pd(double * __P, __m128d __A) Store two DPFP values in reverse order. The address must be aligned.
SSE2 LOADSTORE REAL64 __m128d _mm_loadh_pd(__m128d __A, double const * __B) Load double float from memory into the high element; the low element is copied from A. The address need not be aligned.
SSE2 LOADSTORE REAL64 __m128d _mm_loadl_pd(__m128d __A, double const * __B) Load double float from memory into the low element; the high element is copied from A. The address need not be aligned.
SSE2 LOADSTORE REAL64 void _mm_stream_pd(double * __A, __m128d __B) Write value to memory without polluting caches. The address must be 16-byte aligned.
SSE3 LOADSTORE REAL64 __m128d _mm_loaddup_pd(double const * __P) Load double d0 from address P and duplicate it. Returns [d0 d0]. Synonym for _mm_load1_pd().
SSE2 MATHOP REAL64 __m128d _mm_add_sd(__m128d __A, __m128d __B) Basic math on low elements. Copy high bits of A.
SSE2 MATHOP REAL64 __m128d _mm_sub_sd(__m128d __A, __m128d __B) Basic math on low elements. Copy high bits of A.
SSE2 MATHOP REAL64 __m128d _mm_mul_sd(__m128d __A, __m128d __B) Basic math on low elements. Copy high bits of A.
SSE2 MATHOP REAL64 __m128d _mm_div_sd(__m128d __A, __m128d __B) Basic math on low elements. Copy high bits of A.
SSE2 MATHOP REAL64 __m128d _mm_min_sd(__m128d __A, __m128d __B) Basic math on low elements. Copy high bits of A.
SSE2 MATHOP REAL64 __m128d _mm_max_sd(__m128d __A, __m128d __B) Basic math on low elements. Copy high bits of A.
SSE2 MATHOP REAL64 __m128d _mm_add_pd(__m128d __A, __m128d __B) Basic math on all elements.
SSE2 MATHOP REAL64 __m128d _mm_sub_pd(__m128d __A, __m128d __B) Basic math on all elements.
SSE2 MATHOP REAL64 __m128d _mm_mul_pd(__m128d __A, __m128d __B) Basic math on all elements.
SSE2 MATHOP REAL64 __m128d _mm_div_pd(__m128d __A, __m128d __B) Basic math on all elements.
SSE2 MATHOP REAL64 __m128d _mm_min_pd(__m128d __A, __m128d __B) Basic math on all elements.
SSE2 MATHOP REAL64 __m128d _mm_max_pd(__m128d __A, __m128d __B) Basic math on all elements.
SSE2 MATHOP REAL64 __m128d _mm_sqrt_pd(__m128d __A) Square root of each element.
SSE2 MATHOP REAL64 __m128d _mm_sqrt_sd(__m128d __A, __m128d __B) Get square root of low element of B and copy the high element from A.
SSE3 MATHOP REAL64 __m128d _mm_addsub_pd(__m128d __X, __m128d __Y) Adds the odd-numbered DPFP value of X and Y and subtracts the even-numbered DPFP value of Y from that of X. Returns [[X0-Y0] [X1+Y1]].
SSE3 MATHOP REAL64 __m128d _mm_hadd_pd(__m128d __X, __m128d __Y) Horizontal addition across vectors. Returns [[Xd0+Xd1] [Yd0+Yd1]].
SSE3 MATHOP REAL64 __m128d _mm_hsub_pd(__m128d __X, __m128d __Y) Horizontal subtraction across vectors. Returns [[Xd0-Xd1] [Yd0-Yd1]].
SSE41 MATHOP REAL64 __m128d _mm_dp_pd(__m128d __X, __m128d __Y, const int __M) Dot product instructions with mask-defined summing and zeroing parts of result.
SSE2 SET REAL64 __m128d _mm_set_sd(double __F) Create a vector with element 0 as F and the rest zero.
SSE2 SET REAL64 __m128d _mm_set1_pd(double __F) Set all elements of vector to same value.
SSE2 SET REAL64 __m128d _mm_set_pd1(double __F) Set all elements of vector to same value.
SSE2 SET REAL64 __m128d _mm_set_pd(double __W, double __X) Create a vector with the lower value X and upper value W.
SSE2 SET REAL64 __m128d _mm_setr_pd(double __W, double __X) Create a vector with the lower value W and upper value X.
SSE2 SET REAL64 __m128d _mm_setzero_pd(void ) Create a vector of zeros.
SSE2 SHUFFLE REAL64 __m128d _mm_shuffle_pd(__m128d __A, __m128d __B, const int __M) Select double float components: bit 0 of the mask selects which element of A becomes the low element of the result; bit 1 selects which element of B becomes the high element.
SSE2 SHUFFLE REAL64 __m128d _mm_unpackhi_pd(__m128d __A, __m128d __B) Unpack and interleave high components of inputs.
SSE2 SHUFFLE REAL64 __m128d _mm_unpacklo_pd(__m128d __A, __m128d __B) Unpack and interleave low components of inputs.
SSE3 SHUFFLE REAL64 __m128d _mm_movedup_pd(__m128d __X) Duplicate low element.
SSE41 SHUFFLE REAL64 __m128d _mm_blend_pd(__m128d __X, __m128d __Y, const int __M) Double precision floating point blend instructions - select data from 2 sources using constant/variable mask.
SSE41 SHUFFLE REAL64 __m128d _mm_blendv_pd(__m128d __X, __m128d __Y, __m128d __M) Double precision floating point blend instructions - select data from 2 sources using constant/variable mask.
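
As a worked example of how the COMPARE and BITLOGIC rows combine, here is a minimal sketch of branchless per-element selection. The function name select_min is mine; for a minimum specifically, _mm_min_ps does the job in one instruction, but the compare-and-mask pattern works for any predicate.

    #include <xmmintrin.h>  /* SSE */

    /* result[i] = (a[i] < b[i]) ? a[i] : b[i], with no branches.
       The compare yields all-ones where the predicate holds and
       all-zeros elsewhere; the bitwise logic merges the sources. */
    static __m128 select_min(__m128 a, __m128 b)
    {
        __m128 mask = _mm_cmplt_ps(a, b);          /* 0xFFFFFFFF where a < b */
        return _mm_or_ps(_mm_and_ps(mask, a),      /* keep a where mask set  */
                         _mm_andnot_ps(mask, b));  /* keep b elsewhere       */
    }

On SSE4.1 the three logic instructions collapse into a single blend: _mm_blendv_ps(b, a, mask).

The CONVERT rows distinguish "round per MXCSR" from "round by truncation". A small sketch of the difference, assuming the default MXCSR rounding mode (round-to-nearest-even):

    #include <stdio.h>
    #include <xmmintrin.h>  /* SSE */

    int main(void)
    {
        /* cvt honors the MXCSR rounding mode; cvtt always truncates. */
        printf("%d %d\n", _mm_cvtss_si32(_mm_set_ss(2.5f)),
                          _mm_cvttss_si32(_mm_set_ss(2.5f)));  /* 2 2 */
        printf("%d %d\n", _mm_cvtss_si32(_mm_set_ss(3.5f)),
                          _mm_cvttss_si32(_mm_set_ss(3.5f)));  /* 4 3 */
        return 0;
    }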

Alignment

Loading or storing SSE 128-bit values should be done on 16-byte (128-bit) aligned addresses except when using instructions that explicitly provide unaligned access (MOVUPS, MOVUPD, MOVDQU, LDDQU). The corresponding intrinsics are _mm_loadu_ps, _mm_loadu_pd, _mm_loadu_si128, _mm_lddqu_si128, _mm_storeu_ps, _mm_storeu_pd, and _mm_storeu_si128. A short sketch follows.
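
A minimal sketch of the distinction, using the GCC/Clang alignment attribute (C11 alignas would also work); the buffer and function names are illustrative:

    #include <xmmintrin.h>

    float buf[4] __attribute__((aligned(16)));  /* 16-byte aligned storage */

    void demo(const float *p)          /* alignment of p not guaranteed */
    {
        __m128 a = _mm_load_ps(buf);   /* MOVAPS: faults on a misaligned address */
        __m128 u = _mm_loadu_ps(p);    /* MOVUPS: accepts any address */
        _mm_store_ps(buf, _mm_add_ps(a, u));
    }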

When moving to AVX instructions, see Intel Volume 1, Section 13.3, "(AVX) Memory Alignment".

References

2012-04-23

Added a reference to the Intel C++ Intrinsics Reference document.