6 Replies Latest reply on Jan 17, 2010 2:16 PM by ryta1203

    Brook+ still better?

    afo
      Brook+ is 1.3x better than OpenCL?

      Dear people,

      I have a simple kernel that computes an integer matrix-vector product. I need to multiply each matrix by 64k vectors, and I need to do that for 1k matrices. Each matrix is 64 x 2 (int).
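
      A minimal CPU-side sketch of the computation described above, useful as a reference when checking GPU results. It assumes the matrix is stored row-major as 64 rows by 2 columns and that each vector has 2 elements, so each product yields 64 ints; the post does not state the layout, and the names `matvec` and `matvec_batch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 64 /* matrix rows (the post says the matrix is 64 x 2) */
#define COLS 2  /* matrix columns; assumed to equal the vector length */

/* y = M * x for one row-major ROWS x COLS int matrix and one COLS-element vector. */
static void matvec(const int *m, const int *x, int *y)
{
    assert(m != NULL && x != NULL && y != NULL);
    for (int r = 0; r < ROWS; ++r) {
        int acc = 0;
        for (int c = 0; c < COLS; ++c)
            acc += m[r * COLS + c] * x[c];
        y[r] = acc;
    }
}

/* Multiply one matrix by nvec vectors stored back to back in xs.
 * The nvec results (ROWS ints each) are written back to back into ys. */
static void matvec_batch(const int *m, const int *xs, int *ys, size_t nvec)
{
    for (size_t v = 0; v < nvec; ++v)
        matvec(m, xs + v * COLS, ys + v * ROWS);
}
```

      For the workload in the post, `matvec_batch` would be called once per matrix (1k times) with `nvec` set to 65536.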

      With OpenCL the run takes 1.3 times as long as with Brook+ on the same system with the same code (minor changes to get it to compile), so I kindly ask if someone could point me to things that I can do to improve OpenCL performance.

      Things that I did:

      - I create the buffers only once in the program and use clEnqueueWriteBuffer to update them.

      - I have only one command queue, so I enqueue the buffer updates, enqueue the kernel execution, enqueue the reading of the results (clEnqueueReadBuffer), and only wait for the event associated with clEnqueueReadBuffer.

      - The host pointers are 256-byte aligned (posix_memalign).

      - The kernel is fully vectorized (I use int4 as the data type).

      - globalThreads is 16384, and I tried localThreads of 32 and 64 (there was no difference in timing between the two values).
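
      Two of the points above can be checked on the host without touching the GPU. The sketch below allocates a 256-byte-aligned buffer with posix_memalign and works out the launch geometry for the sizes quoted. The helper names are hypothetical, and the assumption that each work-item processes exactly one int4 is a guess, not something the post states.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* request the posix_memalign declaration */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `bytes` of host memory aligned to 256 bytes; returns NULL on failure.
 * The caller releases the buffer with free(). */
static void *alloc_aligned_256(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 256, bytes) != 0)
        return NULL;
    return p;
}

/* Nonzero if pointer p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Number of work-groups in a 1-D launch, or 0 if `local` does not divide
 * `global` evenly (OpenCL 1.0 rejects such a launch). */
static size_t num_work_groups(size_t global, size_t local)
{
    if (local == 0 || global % local != 0)
        return 0;
    return global / local;
}

/* Scalar ints covered when every work-item handles one int4 (assumption). */
static size_t ints_covered(size_t global)
{
    return global * 4;
}
```

      With globalThreads of 16384, a local size of 32 gives 512 work-groups and a local size of 64 gives 256. Under the one-int4-per-work-item assumption the launch covers 65536 ints, which matches the 64k figure quoted earlier.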

      System hardware:

      Processor Q6600

      Motherboard Intel DG41RQ with 4GB 800MHz DDR2 Ram

      Video card ATI Radeon HD 4350

       

      System software:

      OS openSUSE 11.0 (no updates, vanilla install)

      Driver for OpenCL: Catalyst 9.12 + hotfix

      SDK for OpenCL: SDK 2.0 x86_64

      Driver for Brook+: Catalyst 9.11

      SDK for Brook+: SDK 1.4 beta x86_64

      By the way, I took the IL code generated by OpenCL and put it in the Stream Kernel Analyzer, and it reports that the OpenCL code is 10x slower than the Brook+ IL code. So my question is: does it make sense to do that comparison, or is the comparison not valid? Thanks in advance for any advice to improve the OpenCL performance.

      best regards,

      Alfonso Lopez