Monte-Carlo OpenMP benchmark is extremely slow due to use of the non-thread-safe rand(). Patch included to speed it up 170x! #6

baryluk opened this issue Dec 13, 2018 · 0 comments

baryluk commented Dec 13, 2018

As in the title:

user@debian:~/FinanceBench/Monte-Carlo/OpenMP$ make
g++ -O3 -march=native -fopenmp monteCarloEngine.c -o monteCarloEngine.exe
user@debian:~/FinanceBench/Monte-Carlo/OpenMP$ ./monteCarloEngine.exe
Number of Samples: 400000

Run on CPU using OpenMP
Processing time on CPU using OpenMP: 33599.273438 (ms)
Average Price (CPU computation): 8.096899

Run on CPU
Processing time on CPU: 4020.650879 (ms)
Average Price (CPU computation): 8.085914

Speedup Using OpenMP: 0.119665

user@debian:~/FinanceBench/Monte-Carlo/OpenMP$

gcc 8.2.0, amd64

I have 16 cores and 32 hardware threads (AMD ThreadRipper 2950X).

While the benchmark was running I noticed two things:

  1. Only 800% of the CPU is used (8 threads on 8 cores) instead of 3200%.
  2. About 85% of each core's time is spent in the kernel, probably on futexes; strace indeed shows a lot of live spinning on futex (I guess this is a GCC OpenMP implementation detail). See the command below.
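To see that quickly, strace's -f (follow threads) and -c (summary) flags print a per-syscall count for the whole run; in a run like this, futex should dominate both the call count and the time:

strace -f -c ./monteCarloEngine.exe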

Instead of the OpenMP version being about 8 times faster, it is actually about 8-9 times slower than the single-threaded code path! This is horrendous.

If I edit the #pragma omp in monteCarloKernelsCpu.c to use 32 threads (why is 8 hardcoded??!?), it indeed starts to use 32 threads and 3200% of CPU. However, the time spent in the kernel grows to 95% on each core! Speedup: 0.109, so even worse.
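For context, the pre-patch kernel boils down to something like the following (a rough sketch reconstructed from the symptoms above, not the exact FinanceBench source, including the guess that getPathCpu() took no seed argument): every iteration ends up calling rand(), whose single hidden state glibc protects with a lock, so the "parallel" loop mostly contends on that lock.

// Sketch only: the gist of the original OpenMP kernel, assuming it looked roughly like this.
#pragma omp parallel for num_threads(8)
for (int numSample = 0; numSample < numSamples; numSample++)
{
	float path[SEQUENCE_LENGTH];
	initializePathCpu(path);
	// getPathCpu() internally calls rand(); all 8 threads serialize on rand()'s internal lock.
	getPathCpu(path, numSample, dt, optionStructs[0]);
	samplePrices[numSample] = getPriceCpu(path[SEQUENCE_LENGTH-1]);
	sampleWeights[numSample] = DEFAULT_SEQ_WEIGHT;
}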

Solution: do not use rand(), and remove all the unnecessary OpenMP machinery:

void monteCarloGpuKernelCpuOpenMP(float* const __restrict samplePrices, float* const __restrict sampleWeights, const float* __restrict times, const float dt, const monteCarloOptionStruct* const __restrict optionStructs, const int numSamples)
{
	// One base seed for the run; each thread derives its own private seed from it,
	// so rand_r() never touches shared state and needs no locking.
	unsigned int seed = time(NULL);
	#pragma omp parallel
	{
		unsigned int my_id = omp_get_thread_num();
		unsigned int my_seed = seed + my_id;
		#pragma omp for schedule(static, 1000)
		for (size_t numSample = 0; numSample < numSamples; numSample++)
		{
			// Declare and initialize the path.
			float path[SEQUENCE_LENGTH];
			initializePathCpu(path);

			const int optionStructNum = 0;

			getPathCpu(path, numSample, dt, optionStructs[optionStructNum], &my_seed);
			const float price = getPriceCpu(path[SEQUENCE_LENGTH-1]);
		
			samplePrices[numSample] = price;
			sampleWeights[numSample] = DEFAULT_SEQ_WEIGHT;
		}
	}
}

In getPathCpu:

void getPathCpu(float* path, size_t sampleNum, float dt, monteCarloOptionStruct optionStruct, unsigned int* seedp)
{
	path[0] = getProcessValX0Cpu(optionStruct);

	for (size_t i = 1; i < SEQUENCE_LENGTH; i++)
	{
		float t = i * dt;
		// rand_r() only touches the caller-provided state, so it is safe to call concurrently.
		float randVal = ((float)rand_r(seedp)) / ((float)RAND_MAX);
		float inverseCumRandVal = compInverseNormDistCpu(randVal);
		path[i] = processEvolveCpu(t, path[i-1], dt, inverseCumRandVal, optionStruct);
	}
}
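Side note: rand_r() is POSIX (and marked obsolescent in POSIX.1-2008), so on platforms without it the same per-thread-state trick works with any tiny generator. A minimal sketch, where xorshift32 and nextUniform are illustrative names, not part of FinanceBench:

static inline unsigned int xorshift32(unsigned int* state)
{
	// Classic xorshift step; the state must start non-zero (time(NULL) + thread id is fine).
	unsigned int x = *state;
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return *state = x;
}

static inline float nextUniform(unsigned int* state)
{
	// Keep the top 24 bits so the value fits a float mantissa; result is in [0, 1).
	return (xorshift32(state) >> 8) * (1.0f / 16777216.0f);
}

getPathCpu() would then call nextUniform(seedp) instead of dividing rand_r(seedp) by RAND_MAX.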

And the single-threaded CPU kernel, adjusted to match:

void monteCarloGpuKernelCpu(float* samplePrices, float* sampleWeights, float* times, float dt, monteCarloOptionStruct* optionStructs, int numSamples)
{
	unsigned int seed = time(NULL);
	for (size_t numSample = 0; numSample < numSamples; numSample++)
	{
		//declare and initialize the path
		float path[SEQUENCE_LENGTH];
		initializePathCpu(path);

		int optionStructNum = 0;

		getPathCpu(path, numSample, dt, optionStructs[optionStructNum], &seed);
		float price = getPriceCpu(path[SEQUENCE_LENGTH-1]);
	
		samplePrices[numSample] = price;
		sampleWeights[numSample] = DEFAULT_SEQ_WEIGHT;
	}
}

Result?

Processing time on CPU using OpenMP: 188.975006 (ms)
Processing time on CPU: 3519.522949 (ms)

Speedup Using OpenMP: 18.624277
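(The reported speedup is simply the single-threaded time divided by the OpenMP time: 3519.52 ms / 188.98 ms ≈ 18.6.)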

The computations are still correct (actually, finally correct, now that the threads no longer race on rand()'s shared state).

So, in total, my patch makes the OpenMP version about 171 times faster than before (the new speedup of 18.62 versus the 0.109 measured with 32 threads: 18.62 / 0.109 ≈ 171)!
