Brook+ still better?

Discussion created by afo on Jan 13, 2010
Latest reply on Jan 17, 2010 by ryta1203
Is Brook+ 1.3x faster than OpenCL?

Dear people,

I have a simple kernel that computes an integer matrix x vector product. Each matrix has to be multiplied by 64k vectors, and I need to do that for 1k matrices. Each matrix is 64 x 2 (int).
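To make the workload concrete, here is a minimal CPU reference sketch in plain C. It is not the actual kernel. It assumes the matrix has 64 rows and 2 columns stored row-major, and that each input vector has 2 elements, so each product yields 64 results; the post does not state the vector length, so that part is a guess. The name matvec_batch is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS 64   /* matrix rows, from the "64 x 2" in the post */
#define COLS 2    /* matrix columns; ASSUMED to equal the vector length */

/* Multiply one ROWS x COLS integer matrix (row-major) by n_vectors input
 * vectors of COLS elements each. out receives n_vectors * ROWS results,
 * one block of ROWS values per input vector. */
void matvec_batch(const int32_t *matrix, const int32_t *vectors,
                  size_t n_vectors, int32_t *out)
{
    assert(matrix != NULL && vectors != NULL && out != NULL);

    for (size_t v = 0; v < n_vectors; ++v) {
        const int32_t *x = vectors + v * COLS;  /* current input vector */
        int32_t *y = out + v * ROWS;            /* its output block */

        for (size_t r = 0; r < ROWS; ++r) {
            int32_t acc = 0;
            for (size_t c = 0; c < COLS; ++c)
                acc += matrix[r * COLS + c] * x[c];
            y[r] = acc;
        }
    }
}
```

Calling matvec_batch once per matrix with n_vectors = 65536, and repeating that for each of the 1k matrices, gives a result the GPU output can be checked against.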

With OpenCL the run takes 1.3 times as long as with Brook+ on the same system with the same code (minor changes to get it to compile), so I kindly ask if someone could point me to things I can do to improve OpenCL performance.

Things that I did:

- I create the buffers only once in the program and use clEnqueueWriteBuffer to update them.

- I have only one command queue so I enqueue the buffer updates, enqueue the kernel execution, enqueue the reading of the results (clEnqueueReadBuffer) and only wait for the event associated with clEnqueueReadBuffer.

- the host pointers are 256-byte aligned (posix_memalign)

- the kernel is fully vectorized (I use int4 as the data type)

- globalThreads is 16384 and localThreads is 32 or 64 (there was no difference in timing between the two values)
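The alignment and launch-size points above can be sketched in plain C without any OpenCL calls. This is only an illustration under assumptions: the helper names alloc_aligned_ints and work_group_count are hypothetical, and the buffer size in the usage note is a guess at the input size, not taken from the real program.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign in <stdlib.h> */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HOST_ALIGNMENT 256  /* bytes, as stated in the post */

/* Allocate count 32-bit integers on a 256-byte boundary.
 * Returns NULL on failure; release the buffer with free(). */
int32_t *alloc_aligned_ints(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, HOST_ALIGNMENT, count * sizeof(int32_t)) != 0)
        return NULL;
    assert(((uintptr_t)p % HOST_ALIGNMENT) == 0);
    return p;
}

/* Number of work-groups in a 1-D launch. OpenCL 1.x requires the global
 * size to be evenly divisible by the local size, so that is checked here. */
size_t work_group_count(size_t global_threads, size_t local_threads)
{
    assert(local_threads != 0 && global_threads % local_threads == 0);
    return global_threads / local_threads;
}
```

With globalThreads at 16384, a localThreads of 32 gives 512 work-groups and 64 gives 256. If each work-item handles four vectors through int4, 64k vectors would account for exactly 16384 work-items, but that mapping is an inference from the numbers rather than something the post states.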

System hardware:

Processor: Q6600

Motherboard: Intel DG41RQ with 4GB 800MHz DDR2 RAM

Video card: ATI Radeon HD 4350


System software:

OS: openSUSE 11.0 (no updates, vanilla install)

Driver for OpenCL: Catalyst 9.12 + hotfix

SDK for OpenCL: SDK 2.0 x86_64

Driver for Brook+: Catalyst 9.11

SDK for Brook+: SDK 1.4 beta x86_64

By the way, I took the IL code generated by OpenCL and put it in the Stream Kernel Analyzer, and it reports that the OpenCL code is 10x slower than the Brook+ IL code. So my question is: does it make sense to make that comparison, or is the comparison not valid? Thanks in advance for any advice on improving OpenCL performance.

best regards,

Alfonso Lopez