I have a simple kernel that computes an integer matrix x vector product. Each matrix is 64 x 2 (int) and has to be multiplied by 64k vectors, and I need to do that for 1k matrices.
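To make the computation concrete, here is a minimal CPU reference sketch of the product in plain C, which could also serve to check the GPU results. The post does not state the storage layout or the vector length, so this assumes a row-major 64 x 2 matrix and 2-element vectors, giving 64 results per vector; the names matvec_ref and matvec_batch_ref are made up for illustration and are not the actual kernel.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 64   /* matrix rows, from the post */
#define COLS 2    /* matrix columns, from the post */

/* Assumption: row-major matrix, COLS-element input vector,
   ROWS-element output vector. */
static void matvec_ref(const int *m, const int *v, int *out)
{
    assert(m != NULL && v != NULL && out != NULL);
    for (int r = 0; r < ROWS; ++r) {
        int acc = 0;
        for (int c = 0; c < COLS; ++c)
            acc += m[r * COLS + c] * v[c];
        out[r] = acc;
    }
}

/* Multiply one matrix by n_vec vectors stored back to back.
   out must hold n_vec * ROWS ints. */
static void matvec_batch_ref(const int *m, const int *vecs, int *out,
                             size_t n_vec)
{
    for (size_t i = 0; i < n_vec; ++i)
        matvec_ref(m, vecs + i * COLS, out + i * ROWS);
}
```

Calling matvec_batch_ref once per matrix with n_vec = 65536 would reproduce the whole workload on the CPU, under the layout assumptions above.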
When I run it with OpenCL it takes about 1.3 times as long as with Brook+ on the same system with the same code (minor changes to get it to compile), so I kindly ask if someone could point me to things I can do to improve the OpenCL performance.
Things I have already done:
- I create the buffers only once in the program and use clEnqueueWriteBuffer to update them.
- I have only one command queue: I enqueue the buffer updates, the kernel execution, and the reading of the results (clEnqueueReadBuffer), and wait only for the event associated with clEnqueueReadBuffer.
- The host pointers are 256-byte aligned (posix_memalign).
- The kernel is fully vectorized (I use int4 as the data type).
- globalThreads is 16384; I tried localThreads of 32 and 64 (there was no difference in timing between the two values).
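The host-side setup in the list above can be sketched roughly as follows. This is only an illustration in plain C with no OpenCL calls: the helper names are hypothetical, and it assumes that "64k" means 65536 vectors and that each work-item handles four vectors through int4, which would account for the 16384 global threads.

```c
/* Request the POSIX declarations unless the build already did. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HOST_ALIGN 256u   /* alignment of the host pointers, from the post */

/* Allocate `count` ints on a HOST_ALIGN-byte boundary.
   Returns NULL on failure; release with free(). */
static int *alloc_aligned_ints(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, HOST_ALIGN, count * sizeof(int)) != 0)
        return NULL;
    return (int *)p;
}

/* Non-zero if p lies on an `align`-byte boundary. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Work-items needed when each one handles `lanes` vectors
   (assumption: lanes = 4 for an int4 kernel). */
static size_t global_size(size_t n_vec, size_t lanes)
{
    assert(lanes > 0);
    return (n_vec + lanes - 1) / lanes;
}

/* Round n up to a multiple of `multiple`; OpenCL 1.x expects the
   global size to be evenly divisible by the local size. */
static size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}
```

With these assumptions, global_size(65536, 4) gives 16384, which is already a multiple of both 32 and 64, so no padding is needed for either local size.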
System:
Motherboard: Intel DG41RQ with 4GB 800MHz DDR2 RAM
Video card: ATI Radeon HD 4350
OS: openSUSE 11.0 (no updates, vanilla install)
Driver for OpenCL: Catalyst 9.12 + hotfix
SDK for OpenCL: SDK 2.0 x86_64
Driver for Brook+: Catalyst 9.11
SDK for Brook+: SDK 1.4 beta x86_64
By the way, I took the IL code generated by OpenCL and put it into the Stream Kernel Analyzer, which reports that the OpenCL code is 10x slower than the Brook+ IL code. So my question is: does it make sense to do that comparison, or is the comparison not valid? Thanks in advance for any advice on improving the OpenCL performance.