Archives Discussions

abeamud · ‎02-22-2013

Hi all:

I'm using PyOpenCL 2012.1 with a Radeon HD 6450 and the drivers included in Ubuntu 12.04 (fglrx_8.960).

With the attached Python script (it is only a test), the GPU gives me poor performance (about 40x slower) when I change the vector size by just two elements:

Time for ASIZE: 29120 [GPU]: 0.296946 s

Time for ASIZE: 29120 [CPU]: 0.354775 s

Time for ASIZE: 29122 [GPU]: 11.4285 s

Time for ASIZE: 29122 [CPU]: 0.429958 s

Is it a problem with my graphics card?

Thanks

nou · ‎02-22-2013

The problem is your global size. 29120 is divisible by 64, so you get optimal performance. But 29122 factorizes as 2 * 14561, so it can run only with a local size of 2.
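The arithmetic behind this can be checked with a few lines of plain Python. This is only a sketch of the divisibility constraint (the local size must divide the global size), not the heuristic the AMD runtime actually uses. The function names are made up for illustration, and the upper limit of 256 is an assumption taken from the work-group sizes mentioned elsewhere in this thread.

```python
def dividing_sizes(global_size, candidates=(64, 128, 192, 256)):
    """Return the candidate local sizes that divide global_size evenly."""
    return [c for c in candidates if global_size % c == 0]


def largest_divisor(global_size, limit=256):
    """Largest divisor of global_size that does not exceed limit.

    limit=256 is an assumed maximum work-group size, not a queried value.
    """
    for d in range(min(limit, global_size), 0, -1):
        if global_size % d == 0:
            return d


print(dividing_sizes(29120))   # [64]
print(dividing_sizes(29122))   # []
print(largest_divisor(29122))  # 2
```

With 29122 none of the wavefront-friendly sizes fit, and the only divisor above 1 that stays under the limit is 2.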

abeamud · ‎02-25-2013

Is this value (64) defined by the hardware or by the OpenCL framework? How can I get it?

Thank you for your response.

himanshu_gautam · ‎02-25-2013

Check out the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE property, which you can read with the clGetKernelWorkGroupInfo() API.

That will help. You need to query your kernel object to get it.
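In PyOpenCL the same query goes through the kernel's `get_work_group_info` method. The sketch below wraps it in a helper and adds a small rounding function. Both function names are hypothetical, `ctx` is an assumed context variable, and padding the global size is a common workaround rather than something suggested in this thread.

```python
def preferred_work_group_multiple(kernel, device):
    """Return CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE for kernel on device."""
    # Imported lazily so round_up() below works without PyOpenCL installed.
    import pyopencl as cl
    return kernel.get_work_group_info(
        cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device)


def round_up(global_size, multiple):
    """Smallest multiple of `multiple` that is >= global_size."""
    return ((global_size + multiple - 1) // multiple) * multiple


# Usage sketch (needs PyOpenCL and an OpenCL device; ctx is assumed):
#   multiple = preferred_work_group_multiple(prg.test, ctx.devices[0])
#   padded = round_up(29122, multiple)   # 29184 if multiple == 64
```

If the global size is padded this way, the kernel has to skip the extra work-items itself, for example by passing the real length as an argument and returning early when `get_global_id(0)` is past it.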

nou · ‎02-25-2013

It is defined by the hardware. Most AMD cards need 64; low-end AMD cards use 32. NVIDIA uses 32, and judging by Intel's OpenCL programming guide for their accelerator card, it seems to use a width of 16 or 32.

himanshu_gautam · ‎02-22-2013

I think nou answered it right.

Although I am not familiar with pyOpenCL, I believe the following line launches the kernel

exec_evt = prg.test(queue, a.shape, None, a_buf, b_buf, dest_buf)

a.shape == global size == 29120 or 29122

None == local size ==> the runtime finds a suitable local size itself (is this correct?)

As nou put it, 29122 is not divisible by 64, 128, 192 or 256.

Also, since 14561 is a prime number, 2 is the only local size larger than 1 that is available.

+ Enabling profiling will slow down your operations. Try using external timers to measure the time. You might get better numbers.
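An external timer can be as simple as wall-clock time around the launch plus a wait for completion. The helper below is a minimal sketch using only the standard library. The name `wall_time` and the best-of-N strategy are assumptions for illustration, not part of the original script.

```python
import time


def wall_time(fn, repeats=3):
    """Call fn() `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Usage sketch with the launch line from this thread; .wait() blocks until
# the kernel has finished, so the measurement covers the whole execution:
#   t = wall_time(lambda: prg.test(queue, a.shape, None,
#                                  a_buf, b_buf, dest_buf).wait())
```

Taking the best of a few runs keeps one-time costs, such as the first launch of a freshly built kernel, out of the number.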

german · ‎02-23-2013

On HD5xxx/HD6xxx the global size has to be divisible by 64, 128, 192 or 256 for optimal performance.

HD7xxx series (GCN architecture) supports partial launches. You should have the same performance for 29120 or 29122.

Strange behaviour with different vector sizes