
ntrolls
Journeyman III

Local work group size on HD 4850

Hi, is there anyone who is running OpenCL on an HD 4850? I cannot use a local work group size larger than 64, regardless of what my kernel is, even though the device reports a maximum work group size of 1024. What am I missing?

12 Replies

Are you using a barrier?

Yes. Local memory fence for tiled matrix multiplication. Would that be why?

hazeman
Adept II

Originally posted by: ntrolls Hi, is there anyone who is running OpenCL on an HD 4850? I cannot use a local work group size larger than 64, regardless of what my kernel is, even though the device reports a maximum work group size of 1024. What am I missing?

 

With v2.0, the group size is limited to 64 on 4xxx cards (there are some problems with the barrier on RV7xx). Generally, OpenCL for the 4xxx series is more along the lines of "it works enough to be advertised, but forget about using it for any reasonable computations".

 


With v2.0, the group size is limited to 64 on 4xxx cards (there are some problems with the barrier on RV7xx). Generally, OpenCL for the 4xxx series is more along the lines of "it works enough to be advertised, but forget about using it for any reasonable computations".


But then why would CL_DEVICE_MAX_WORK_GROUP_SIZE return 1024...?


Yes, the barrier on the 4XXX series is a software barrier which can cause problems in corner cases. If you want to work around it, please use __attribute__((reqd_work_group_size(X, Y, Z))) on your kernel and we will compile for exactly that group size.
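For example, the workaround Micah describes would look something like this in the kernel source (the kernel name, signature, and body here are placeholders, not from the thread):

```c
// Hypothetical kernel: the attribute tells the compiler the exact
// work-group size this kernel will always be launched with, so it
// can compile for that size and sidestep the RV7xx software-barrier
// corner cases.
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void tiled_matmul(__global const float *A,
                  __global const float *B,
                  __global float *C,
                  const int width)
{
    __local float tileA[16][16];
    // ... tile loads, barrier(CLK_LOCAL_MEM_FENCE), accumulate ...
}
```

Note that once this attribute is present, the local size passed to clEnqueueNDRangeKernel must match it exactly, or the launch fails with CL_INVALID_WORK_GROUP_SIZE.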


Thanks a million - you're the first person who has shed real light on this! I never would have guessed such a thing.

I'm running this on Snow Leopard 10.6.2 with a Java wrapper. I added __attribute__((reqd_work_group_size(16, 16, 1))) at the beginning of my kernel code, and it still complains that 16x16 is an invalid work group size.

I think I'm almost there... any idea?


Try 256, 1, 1 instead of 16, 16, 1.

ntrolls,
There is a difference between the largest size that the device can support and the largest that a particular kernel can support.
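The per-kernel limit Micah mentions can be queried at run time with clGetKernelWorkGroupInfo. A sketch of the host-side check, assuming `kernel` and `device` have already been created and with error handling elided:

```c
// Query the largest work-group size this particular kernel supports
// on this device; it can be far smaller than the device-wide
// CL_DEVICE_MAX_WORK_GROUP_SIZE (e.g. 64 vs. 1024 on an HD 4850).
size_t kernel_wg_size = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_wg_size),
                         &kernel_wg_size, NULL);
printf("kernel max work-group size: %zu\n", kernel_wg_size);
```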

Originally posted by: MicahVillmow ntrolls, There is a difference between the largest size that the device can support and the largest that a particular kernel can support.


Yes, I know. But I even tried a kernel that does not do anything (it simply returns) and still could not assign 16x16 local work group size - don't know if this little experiment makes any sense, but there it is for what it's worth.

And no... (256,1,1) still does not work.


ntrolls,
You should test this on our drivers; it should work. We can't help that much with Snow Leopard, as that is all handled by Apple.

Thanks Micah, you've been a great help. So Apple's OpenCL implementation on ATI cards is completely independent? That's interesting...

Well, for now, I will stay clear of the tiled algorithm for matrix multiplication.


ntrolls,

Just a note that if you're interested in accessing OpenCL from Java, you may want to take a look at the Aparapi tool at

http://developer.amd.com/aparapi

Aparapi allows you to write your parallel kernel code in Java.

 
