Archives Discussions

ntrolls · ‎12-29-2009

Hi, is there anyone who is running OpenCL on HD4850? I cannot use a local work group size larger than 64, regardless of what my kernel is, whereas the device tells me the maximum size of work group is 1024. What am I missing?

MicahVillmow · ‎12-29-2009

Are you using a barrier?

ntrolls · ‎12-29-2009

Yes. Local memory fence for tiled matrix multiplication. Would that be why?

hazeman · ‎12-29-2009

Originally posted by: ntrolls Hi, is there anyone who is running OpenCL on HD4850? I cannot use a local work group size larger than 64, regardless of what my kernel is, whereas the device tells me the maximum size of work group is 1024. What am I missing?

With v2.0, the group size is limited to 64 on 4xxx cards (some problems with barriers on RV7xx). Generally, OpenCL on the 4xxx series is more along the lines of "it works well enough to be advertised, but forget about using it for any reasonable computations".

ntrolls · ‎12-29-2009

Originally posted by: hazeman With v2.0, the group size is limited to 64 on 4xxx cards (some problems with barriers on RV7xx). Generally, OpenCL on the 4xxx series is more along the lines of "it works well enough to be advertised, but forget about using it for any reasonable computations".

But then why would CL_DEVICE_MAX_WORK_GROUP_SIZE return 1024...?

MicahVillmow · ‎12-29-2009

Yes, the barrier on the 4XXX series is a software barrier which can cause problems in corner cases. If you want to work around it, please use __attribute__((reqd_work_group_size(X, Y, Z))) on your kernel and we will compile for exactly that group size.

ntrolls · ‎12-29-2009

Thanks a million - you're the first person who has shed real light on this! I never would have guessed such a thing.

I'm running this on Snow Leopard 10.6.2 with a Java wrapper. I added __attribute__((reqd_work_group_size(16, 16, 1))) at the beginning of my kernel code, and it still complains that 16x16 is an invalid work group size.

I think I'm almost there... any idea?

MicahVillmow · ‎12-29-2009

Try 256, 1, 1 instead of 16, 16, 1.
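
A minimal sketch of where the attribute goes, assuming the kernel source is handed to the runtime as a string (as it would be from a C host or a Java wrapper). The kernel name `reverse_tile`, its arguments, and the 256-element tile are hypothetical and only illustrate the syntax; the attribute itself is the one quoted in this thread. The qualifier is attached to the kernel function declaration, and the local work size given at enqueue time has to match it exactly. Whether a particular implementation then accepts that size is still up to its compiler.

```c
#include <stddef.h>

/* OpenCL C source for a hypothetical kernel. The host passes this string
 * to the OpenCL runtime, which compiles it for the target device. */
const char *const kernel_source =
    "__kernel __attribute__((reqd_work_group_size(256, 1, 1)))\n"
    "void reverse_tile(__global const float *in, __global float *out)\n"
    "{\n"
    "    __local float tile[256];\n"
    "    size_t lid = get_local_id(0);\n"
    "    size_t gid = get_global_id(0);\n"
    "    tile[lid] = in[gid];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE); /* all loads land before any read */\n"
    "    out[gid] = tile[255 - lid];\n"
    "}\n";

/* Local work size to pass at enqueue time; it must equal the sizes
 * named in the reqd_work_group_size attribute above. */
const size_t local_work_size[3] = {256, 1, 1};
```

The global work size used with this kernel would also need to be a multiple of 256 in the first dimension.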

MicahVillmow · ‎12-29-2009

ntrolls,
There is a difference between the largest size that the device can support and the largest that a particular kernel can support.
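
To make that distinction concrete: the per-kernel limit is what `clGetKernelWorkGroupInfo` reports for `CL_KERNEL_WORK_GROUP_SIZE`, while `CL_DEVICE_MAX_WORK_GROUP_SIZE` is only the device-wide ceiling. Below is a rough sketch in plain C (no OpenCL headers required) of the checks a host program could run before enqueueing a kernel. The function name `check_local_size` and the status names are made up for illustration and are not OpenCL error codes.

```c
#include <stddef.h>

/* Outcome of the check; hypothetical names, not OpenCL error codes. */
typedef enum {
    LOCAL_SIZE_OK = 0,
    LOCAL_SIZE_EXCEEDS_KERNEL_LIMIT,
    LOCAL_SIZE_NOT_DIVISOR,
    LOCAL_SIZE_REQD_MISMATCH
} local_size_status;

/*
 * Validates a proposed local work size.
 *   kernel_max: per-kernel limit (the CL_KERNEL_WORK_GROUP_SIZE value),
 *               which may be smaller than the device-wide maximum.
 *   reqd:       sizes from the kernel's reqd_work_group_size attribute,
 *               or NULL if the kernel declares none.
 */
local_size_status check_local_size(size_t work_dim,
                                   const size_t *global,
                                   const size_t *local,
                                   size_t kernel_max,
                                   const size_t *reqd)
{
    size_t total = 1;
    for (size_t i = 0; i < work_dim; ++i) {
        /* Each global dimension must be a multiple of the local one. */
        if (local[i] == 0 || global[i] % local[i] != 0)
            return LOCAL_SIZE_NOT_DIVISOR;
        /* A required size, if declared, must be matched exactly. */
        if (reqd != NULL && local[i] != reqd[i])
            return LOCAL_SIZE_REQD_MISMATCH;
        total *= local[i];
    }
    /* The work-group as a whole must fit the kernel's own limit. */
    if (total > kernel_max)
        return LOCAL_SIZE_EXCEEDS_KERNEL_LIMIT;
    return LOCAL_SIZE_OK;
}
```

With a kernel limit of 64, for example, a 16x16 local size (256 work-items) is rejected while 8x8 passes, which is consistent with the limit of 64 reported at the start of this thread.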

ntrolls · ‎12-29-2009

Originally posted by: MicahVillmow ntrolls, There is a difference between the largest size that the device can support and the largest that a particular kernel can support.

Yes, I know. But I even tried a kernel that does not do anything (it simply returns) and still could not assign 16x16 local work group size - don't know if this little experiment makes any sense, but there it is for what it's worth.

And no... (256,1,1) still does not work.

MicahVillmow · ‎12-29-2009

ntrolls,
You should test this on our drivers; it should work there. We can't help much with Snow Leopard, as that is all handled by Apple.

ntrolls · ‎12-29-2009

Thanks Micah, you've been a great help. So Apple's OpenCL implementation on ATI cards is completely independent? That's interesting...

Well, for now, I will steer clear of the tiled algorithm for matrix multiplication.

tdeneau · ‎10-13-2010

ntrolls,

Just a note that if you're interested in accessing OpenCL from Java, you may want to take a look at the Aparapi tool at

http://developer.amd.com/aparapi

Aparapi allows you to write your parallel kernel code in Java.

Local work group size on HD 4850