fft_simd

A simple demo shows how to use the SIMD,Single Instruction Multiple Data, to optimize and accelerate the FFT algorithm.

Task 1. Implement standard DFT algorithm with C++. See task1_solution
Task 2. Rewrite the DFT with Cooley–Tukey FFT algorithm. See task2_solution
Task 3. Ues SIMD to accelerate the FFT alogrithm. See blews in this page.
Task 4. Make a time consuming contrast among these algorithms.See the table as blews:

Algorithm Time consum

Standard dft 606 ms

FFT 74 ms (out-off-place algorithm) ,7ms (in-place)

FFT with SIMD 1.011 ms

Introduction to SIMD

Referecnr the wiki of SIMD.

SIMD instructions chosen

We use SSE2 in this project. SSE2 (Streaming SIMD Extensions 2) is one of the Intel SIMD, it supports two double-precision-floate operations as its MMX rigister are extended from 64 to 128 bits.

SIMD instructions exercise

To be familiar with SSE2 instructions, we use SSE2 to compute the complex multiplication as an exercise.

Complex multiplication can be described as follows:

$$ if A = a+bj,B=c+dj,then,A*B=(ac-bd)+(bc+ad)*j $$

We use a figure to summary and described the complex multiplication using SSE2:

You can find the code in the complex_mul(double * A, double * B) function in the project.

Using SIMD to accelerate the FFT

The FFT algorithm was introduced in task2_solution, reviewing the codes, we can find that the combing part consum most time. So we can use SIMD to rewrite this part. Blew is the orginal code:

			...
            // combine the odd and even parts
			complex<double> e = X[k];			// even
			complex<double> o = X[k + N / 2];   // odd
												// w is the "twiddle-factor"
			complex<double> w = exp(complex<double>(0, -2.*PI*k / N));
			X[k] = e + w * o;
			X[k + N / 2] = e - w * o;
			...

Well, parameters e,o and w are complex type, and our goal becomes using SIMD to process a complex addition, a complex subtraction and a complex multiplication.

First, using a struct type ,comp,to replace the C++ built-in complex type.

struct comp {
	double real =0;
	double imag = 0;
};

The comp struct pack two double-type value, real part and imag part. So we can use a comp-pointer to access a complex number. In fact, a comp takes up 8 bytes,128 bit, and it can be loaded into a SSE2's MMX rigister.

Rewrite the original codes with SSE2 instructions.

			__m128d o = _mm_load_pd( (double *)&X[k + N/2 ] );   // odd
			double cc = cos(-2.*PI*k / N);
			double ss = sin(-2.*PI*k / N);
			__m128d wr =  _mm_set_pd( cc,cc );		// cc 
			__m128d wi =  _mm_set_pd( ss, ss );		// dd 
			// compute the w*o
			wr = _mm_mul_pd(o,wr);					// ac|bc
			__m128d n1 = _mm_shuffle_pd(o,o,_MM_SHUFFLE2(0,1) ); // invert
			wi = _mm_mul_pd(n1, wi);				// bd|ad
			n1 = _mm_sub_pd(wr, wi);				// ac-bd|x
			wr = _mm_add_pd(wr, wi);				// x|bc+ad
			n1 = _mm_shuffle_pd(n1,wr,_MM_SHUFFLE2(1,0));// select ac-bd|bc+ad

			o = _mm_load_pd((double *)&X[k]);		// load even part
			wr = _mm_add_pd(o, n1);					// compute even part, X_e + w * X_o;
			wi = _mm_sub_pd(o, n1);					// compute odd part,  X_e - w * X_o;
			_mm_store_pd((double *)&X[k],wr);
			_mm_store_pd((double *)&X[k + N/2],wi);

We use 4 MMX rigisters, and cost 4 load/store , 2 set ,6 ops( +-*/) , 2 reorder operations. It's obvious that the time consuming of 1024-point FFT was decreased from 7ms to 1.01ms !

Some useful link:.

FFT:

https://en.wikipedia.org/wiki/Fast_Fourier_transform Introduction to FFT

https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm Cooley–Tukey FFT algorithm

SIMD:

SIMD and sse2 introduction:

https://www.intel.com/content/www/us/en/support/articles/000005779/processors.html

Intel(R) C++ Compiler Intrinsics Reference - Intel® Software:

https://software.intel.com/sites/default/files/ee/11/6373

Streaming SIMD Extensions 2 Instructions:

https://msdn.microsoft.com/zh-cn/library/kcwz153a%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396

Intrinsics Guide:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#cats=Load&expand=3296

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
SIMD		SIMD
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
SIMD.sln		SIMD.sln
complex_mul.png		complex_mul.png
task1_solution.md		task1_solution.md
task2_solution.md		task2_solution.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIMD

SIMD

.gitattributes

.gitattributes

.gitignore

.gitignore

README.md

README.md

SIMD.sln

SIMD.sln

complex_mul.png

complex_mul.png

task1_solution.md

task1_solution.md

task2_solution.md

task2_solution.md

Repository files navigation

fft_simd

Introduction to SIMD

SIMD instructions chosen

SIMD instructions exercise

Using SIMD to accelerate the FFT

Some useful link:.

About

Releases

Packages

Languages

Algorithm	Time consum
Standard dft	606 ms
FFT	74 ms (out-off-place algorithm) ,7ms (in-place)
FFT with SIMD	1.011 ms

jagger2048/fft_simd

Folders and files

Latest commit

History

Repository files navigation

fft_simd

Introduction to SIMD

SIMD instructions chosen

SIMD instructions exercise

Using SIMD to accelerate the FFT

Some useful link:.

About

Topics

Resources

Stars

Watchers

Forks

Languages