Dear people,
I have a simple kernel that computes an integer matrix x vector product. I need to multiply each matrix by 64k vectors, and I need to do that for 1k matrices. Each matrix is 64 x 2 (int).
When I run it with OpenCL it takes 1.3 times as long as with Brook+ on the same system with the same code (minor changes to get it compiled), so I kindly ask if someone could point me to things I can do to improve OpenCL performance.
Things that I did:
- I create the buffers only once in the program and use clEnqueueWriteBuffer to update them.
- I have only one command queue, so I enqueue the buffer updates, enqueue the kernel execution, enqueue the reading of the results (clEnqueueReadBuffer), and wait only for the event associated with clEnqueueReadBuffer.
- The host pointers are 256-byte aligned (posix_memalign).
- The kernel is fully vectorized (I use int4 as the data type).
- globalThreads is 16384 and localThreads is 32 or 64 (there was no difference in timing between the two values).
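For reference, the 256-byte host alignment mentioned in the list can be sketched in C as below. This is only a minimal sketch: the helper names alloc_aligned_ints and is_host_aligned and the HOST_ALIGNMENT macro are made up for illustration, and the buffer size is up to the caller.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign under strict modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HOST_ALIGNMENT 256  /* bytes, the alignment stated in the post */

/* Hypothetical helper: allocate `count` ints on a 256-byte boundary.
   Returns NULL on failure; release the buffer with free(). */
int *alloc_aligned_ints(size_t count) {
    void *p = NULL;
    assert(count > 0);
    if (posix_memalign(&p, HOST_ALIGNMENT, count * sizeof(int)) != 0)
        return NULL;
    return (int *)p;
}

/* Hypothetical helper: returns 1 if `p` sits on a HOST_ALIGNMENT boundary. */
int is_host_aligned(const void *p) {
    return ((uintptr_t)p % HOST_ALIGNMENT) == 0;
}
```

A pointer obtained this way is what would be passed as the host pointer to clEnqueueWriteBuffer and clEnqueueReadBuffer.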
System hardware:
Processor Q6600
Motherboard Intel DG41RQ with 4GB 800MHz DDR2 Ram
Video card Ati Radeon HD 4350
System software:
OS OpenSuse 11.0 (no updates, vanilla install)
Driver for OpenCL: Catalyst 9.12 + hotfix
SDK for OpenCL: SDK 2.0 x86_64
Driver for Brook+: Catalyst 9.11
SDK for Brook+: SDK 1.4 beta x86_64
By the way, I took the IL code generated by OpenCL and put it in the Stream Kernel Analyzer, and it says that the OpenCL code is 10x slower than the Brook+ IL code. So my question is: does it make sense to do that comparison, or is the comparison not valid? Thanks in advance for any advice on improving OpenCL performance.
best regards,
Alfonso Lopez
Well, if you use local memory from OpenCL, then it is slower on 4xxx cards.
Thanks for pointing that out. Yes, I know; I have read that local memory is emulated in global memory on 4xxx cards, so I am not using local memory.
link: http://aphnetworks.com/news/2009/12/24/amd-ati-radeon-hd-4000-will-have-limited-opencl-performance
By the way, the profiler in Visual Studio 2008 reports a kernel time of 2.1 milliseconds for that kernel (ALUPacking 81.6%), so I should have approximately 2 seconds of kernel computing for 1k matrices. But the entire process takes 13.5 seconds to run, so I suspect a really bad memory transfer speed.
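To put numbers on that estimate, here is a minimal sketch in C of the time budget. It assumes one kernel launch per matrix (1000 launches), and the function names are made up for illustration. With the figures above (2.1 ms per launch, 13.5 s total) it gives about 2.1 s of kernel time, about 11.4 s not explained by the kernel, and so roughly 11.4 ms of overhead per launch.

```c
#include <assert.h>

/* Total kernel time in seconds for `launches` runs of `kernel_ms` each. */
double total_kernel_seconds(double kernel_ms, int launches) {
    assert(launches > 0);
    return kernel_ms * launches / 1000.0;
}

/* Wall-clock time not accounted for by kernel execution, in seconds. */
double unexplained_seconds(double wall_s, double kernel_ms, int launches) {
    return wall_s - total_kernel_seconds(kernel_ms, launches);
}

/* The same gap spread evenly over each launch, in milliseconds. */
double overhead_ms_per_launch(double wall_s, double kernel_ms, int launches) {
    return unexplained_seconds(wall_s, kernel_ms, launches) * 1000.0 / launches;
}
```

The gap itself does not say where the time goes; transfers, kernel compilation and host-side work would all land in it.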
best regards,
Alfonso
Did you retrieve profiling information from the event? Because 10 seconds could be spent on compilation of the kernels.
There was a post from Micah some time ago: when running OpenCL kernels, HD4xxx will not cache data reads, so Brook+ should be faster on memory-intensive kernels.
BTW, what about DirectCompute? Is there something special to do to allow caching?
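A minimal sketch of how the event timestamps could be broken down, in plain C. In OpenCL the four values come from clGetEventProfilingInfo with CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, in nanoseconds, and the queue has to be created with CL_QUEUE_PROFILING_ENABLE. The struct and helper names here are made up for illustration, and the OpenCL calls themselves are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Timestamps in nanoseconds for one command (write, kernel or read),
   filled in by the caller from clGetEventProfilingInfo. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} event_times_ns;

/* Time the command actually spent executing, in milliseconds. */
double exec_ms(const event_times_ns *t) {
    assert(t->end >= t->start);
    return (double)(t->end - t->start) / 1e6;
}

/* Time between enqueueing the command and its start, in milliseconds. */
double wait_ms(const event_times_ns *t) {
    assert(t->start >= t->queued);
    return (double)(t->start - t->queued) / 1e6;
}
```

Collecting exec_ms for the write, kernel and read events separately would show whether the transfers or the kernel dominate. Kernel compilation (clBuildProgram) runs on the host and is not covered by event timestamps, so it would need a host-side timer.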
Thanks nou, I will check the compile time; maybe this is the bottleneck.
Thanks Eduardo, yes, this is a memory-intensive kernel, so I will try upgrading to a 5xxx card to check.
best regards,
Alfonso
Originally posted by: eduardoschardong
There was a post from Micah some time ago: when running OpenCL kernels, HD4xxx will not cache data reads, so Brook+ should be faster on memory-intensive kernels.
BTW, what about DirectCompute? Is there something special to do to allow caching?
So does this imply that global memory is cached on the 5870? Where did you find this information?