chrisgregg

Looking for a sanity check, re: performance of OpenCL samples on CPU -vs- GPU

OpenCL runs significantly faster on my 2.66GHz Core 2 Duo than on my Radeon 4350 GPU

Hi, all,

  I'm beginning a CPU/GPU research project, and I decided to use OpenCL because of its ability to run code on both platforms.  I have a 2.66GHz Core 2 Duo running Ubuntu 9.04 and a Radeon 4350 GPU.  I've installed the OpenCL beta driver and the ATI Stream SDK (v2, beta 4), and I've started looking at some benchmarks for the sample code.

  Interestingly, for the OpenCL samples from the SDK I've tried (MatrixMultiplication, MatrixTranspose, MersenneTwister, and a couple of others), the code mostly runs significantly faster on the CPU than on the GPU.  The first part of the attached snippet shows the results from multiplying two 2048x2048 matrices together: the CPU beats the GPU by a factor of 1.16.  For MersenneTwister, the CPU is more than 8x faster.

So, I'm curious: (1) is this an expected result?  (2) Should I be rewriting the code to better take the GPU's architecture into account?

As the title of the post says, I'm looking for a sanity check here, to make sure I'm not doing something screwy with the mini-benchmarks.  Thanks!

-Chris

 

$ ./MatrixMultiplication --device gpu -x 2048 -y 2048 -z 2048 -t -q
MatrixA      MatrixB      Time(sec)
2048x2048    2048x2048    95.0832

$ ./MatrixMultiplication --device cpu -x 2048 -y 2048 -z 2048 -t -q
MatrixA      MatrixB      Time(sec)
2048x2048    2048x2048    81.8752

$ ./MersenneTwister -q -t --device cpu -x 1000000
Generated Numbers    Time(sec)    Numbers/sec
2000000              0.506        3.95257e+06

$ ./MersenneTwister -q -t --device gpu -x 1000000
Generated Numbers    Time(sec)    Numbers/sec
2000000              4.133        483910

6 Replies
n0thing

Interesting results indeed! Here is what I am getting on a Phenom X4 9650 and a Radeon 5770 on Vista 32-bit SP2. Your GPU has only 2 SIMD units compared to my card's 10, though your dual-core processor is much faster than my quad, ahem.

The reported time includes both setup time and kernel time, and as you can see, the kernel time on the GPU is much lower than on the CPU. You need to run the kernel for long enough to see a substantial speedup. One thing you can do is run your kernel for a number of iterations, since there is a size limit on a single buffer allocation on the GPU (128 MB, I think).

Note that kernel time also includes the transfers over PCIe bus.


MatrixMultiplication.exe --device cpu -x 2048 -y 2048 -z 2048 -t -q
MatrixA      MatrixB      Time       KernelTime
2048x2048    2048x2048    140.612    139.926

MatrixMultiplication.exe --device gpu -x 2048 -y 2048 -z 2048 -t -q
MatrixA      MatrixB      Time       KernelTime
2048x2048    2048x2048    3.3639     0.826539

MersenneTwister.exe -q -t --device cpu -x 1000000
Generated Numbers    Time        kernelTime    Numbers/sec
2000000              0.976359    0.1577        2.04843e+006

MersenneTwister.exe -q -t --device gpu -x 1000000
Generated Numbers    Time        kernelTime    Numbers/sec
2000000              1.89666     0.0705628     1.05449e+006


Did you use profiling information?

/* queue must have been created with CL_QUEUE_PROFILING_ENABLE */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, &event);
clWaitForEvents(1, &event);

cl_ulong profiling_start, profiling_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &profiling_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &profiling_end, NULL);


Originally posted by: nou did you use profiling information?


(edit)

Looking at the original stock code, yes, profiling was turned on.

 


I thought I'd post a little update on this.  Once I delved into the code a bit more, I found that the default block size was 8.  After I changed it (and modified the code so it no longer gave me an error when the block size was set too high), many of the examples run much faster on the GPU than before.

 


In another thread it was suggested that the group size should be equal to the wavefront size, which is 64 for the 48xx and 58xx series.


Interesting -- thanks for the information.  I guess my GPU is performing as well as it can.  I'll try the suggestion to run the kernel multiple times to see what happens.  I'll also take a look at the KernelTime calculations.  Thanks for the tips!
