

abeamud
Journeyman III

Strange behaviour with different vector sizes

Hi all:

I'm using PyOpenCL 2012.1 with a Radeon HD 6450 and the drivers included in Ubuntu 12.04 (fglrx_8.960).

With the attached Python script (it is only a test), the GPU gives me poor performance (about 40x slower) when I merely add two elements to the vector size:

Time for ASIZE: 29120 [GPU]: 0.296946 s

Time for ASIZE: 29120 [CPU]: 0.354775 s

Time for ASIZE: 29122 [GPU]: 11.4285 s

Time for ASIZE: 29122 [CPU]: 0.429958 s

Is this a problem with my graphics card?

Thanks

nou
Exemplar

The problem is your global size. 29120 is divisible by 64, so you get optimal performance. But 29122 factorizes as 2 * 14561, so it can run only with a local size of 2.


Is this value (64) defined by the hardware or by the OpenCL framework? How can I get this value?

Thank you for your response.


Check out the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE property, which you can query with the clGetKernelWorkGroupInfo() API.

That will help. You need to query your kernel object to get it.

It is defined by the hardware. Most AMD cards need 64; low-end AMD cards have 32. NVIDIA uses 32, and judging from the Intel OpenCL programming guide for their accelerator card, it seems to use a width of 16/32.
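
For PyOpenCL users, the query mentioned above might look roughly like the sketch below. It assumes you already have a built kernel object and its device; the helper names `preferred_multiple` and `is_aligned` are made up for illustration and are not part of PyOpenCL.

```python
def preferred_multiple(kernel, device):
    """Return CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE for a kernel/device pair."""
    # Imported lazily so that is_aligned() below works without PyOpenCL installed.
    import pyopencl as cl
    return kernel.get_work_group_info(
        cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device)


def is_aligned(global_size, multiple):
    """True if global_size splits into whole groups of `multiple` work-items."""
    return global_size % multiple == 0


# Hypothetical usage with the objects from the original script:
#   multiple = preferred_multiple(prg.test, queue.device)
print(is_aligned(29120, 64), is_aligned(29122, 64))  # prints: True False
```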

himanshu_gautam
Grandmaster

I think nou answered it right.

Although I am not familiar with pyOpenCL, I believe the following line launches the kernel

exec_evt = prg.test(queue, a.shape, None, a_buf, b_buf, dest_buf)

a.shape == global size == 29120 or 29122

None == local size ==> the runtime finds a suitable local size (is this correct?)

As nou put it, 29122 is not divisible by 64, 128, 192 or 256.

Also, since 14561 is a prime number, the only divisors of 29122 are 1, 2, 14561 and 29122, so 2 is the only practical option for the local size.
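
The arithmetic above can be checked with a small sketch. Note that `pick_local_size` is a hypothetical model of how a runtime might choose a local size when None is passed, not the driver's actual heuristic; the wavefront width of 64 and the limit of 256 are taken from the numbers quoted in this thread.

```python
def pick_local_size(global_size, wavefront=64, max_local=256):
    """Model: pick a local size that evenly divides global_size (assumed heuristic)."""
    if global_size < 1:
        raise ValueError("global_size must be positive")
    # Prefer the largest multiple of the wavefront width (256, 192, 128, 64)
    # that divides the global size.
    for local in range(max_local - max_local % wavefront, 0, -wavefront):
        if global_size % local == 0:
            return local
    # Otherwise fall back to the largest plain divisor that fits (at worst 1).
    for local in range(min(max_local, global_size), 0, -1):
        if global_size % local == 0:
            return local


print(pick_local_size(29120))  # prints: 64
print(pick_local_size(29122))  # prints: 2
```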

+ Enabling Profiling will slow down your operations. Try to use external timers to measure time. You might get better numbers.
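
An external timer, as suggested above, could be sketched like this. The helper name `wall_time` is made up for illustration; the commented usage assumes the kernel launch line quoted earlier and uses the event's wait() method to block until the kernel has finished.

```python
import time


def wall_time(run):
    """Call run() and return (result, elapsed seconds) from a wall-clock timer."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


# Hypothetical usage with the launch from the original script; wait() blocks
# until the kernel completes, so the elapsed time covers the whole run:
#   _, seconds = wall_time(
#       lambda: prg.test(queue, a.shape, None, a_buf, b_buf, dest_buf).wait())
result, seconds = wall_time(lambda: sum(range(1000)))
print(result, seconds >= 0.0)  # prints: 499500 True
```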

german
Staff

On HD5xxx/HD6xxx the global size has to be divisible by 64, 128, 192 or 256 for optimal performance.

HD7xxx series (GCN architecture) supports partial launches. You should have the same performance for 29120 or 29122.
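
On HD5xxx/HD6xxx, one common workaround (not something proposed in this thread) is to round the global size up to the next multiple of 64 and have the kernel ignore the extra work-items. A minimal sketch, assuming the kernel is changed to take the real element count `n` as an extra argument; `round_up` is a hypothetical helper:

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


n = 29122
padded = round_up(n, 64)
print(padded, padded % 64)  # prints: 29184 0

# The kernel would then need a bounds check (assumed change to the kernel source):
#   if (get_global_id(0) >= n) return;
# and the launch would use the padded size with an explicit local size, e.g.:
#   prg.test(queue, (padded,), (64,), a_buf, b_buf, dest_buf, numpy.int32(n))
```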