Hi, is there anyone who is running OpenCL on HD4850? I cannot use a local work group size larger than 64, regardless of what my kernel is, whereas the device tells me the maximum size of work group is 1024. What am I missing?
Yes. Local memory fence for tiled matrix multiplication. Would that be why?
Originally posted by: ntrolls Hi, is there anyone who is running OpenCL on HD4850? I cannot use a local work group size larger than 64, regardless of what my kernel is, whereas the device tells me the maximum size of work group is 1024. What am I missing?
With v2.0, the group size is limited to 64 on 4xxx cards (there are some problems with barriers on RV7xx). Generally, OpenCL on the 4xxx series is more along the lines of "it works well enough to be advertised, but forget about using it for any reasonable computations".
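To put a number on what that limit means for a tiled kernel, here is a rough sketch of the arithmetic only. The helper name `max_square_tile_side` is made up for illustration and is not part of OpenCL. It assumes you want a square tile with a power-of-two side, and that the limit you pass in is whatever the runtime actually enforces (64 in the case described above).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): returns the largest
 * power-of-two side s such that an s x s work group still fits
 * within max_group_size work items. */
size_t max_square_tile_side(size_t max_group_size)
{
    assert(max_group_size >= 1);
    size_t side = 1;
    /* Double the side while the doubled square still fits. */
    while ((side * 2) * (side * 2) <= max_group_size)
        side *= 2;
    return side;
}
```

With a limit of 64 this gives a side of 8, so a 16x16 tile (256 work items) would not fit, while a limit of 1024 would allow up to 32x32. Whether an 8x8 tile with barriers behaves correctly on RV7xx is a separate question that this arithmetic does not answer.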
But then why would CL_DEVICE_MAX_WORK_GROUP_SIZE return 1024...?
Thanks a million - you're the first person to shed real light on this! I never would have guessed such a thing.
I'm running this on Snow Leopard 10.6.2 with a Java wrapper. I added __attribute__((reqd_work_group_size(16, 16, 1))) at the beginning of my kernel code and it still complains that 16x16 is an invalid work group size.
I think I'm almost there... any idea?
Originally posted by: MicahVillmow ntrolls, There is a difference between the largest size that the device can support and the largest that a particular kernel can support.
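A small sketch of that distinction, in case it helps. The device-wide figure comes from clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE, and the per-kernel figure from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE. The function below is a hypothetical helper, not an OpenCL API: it takes those two numbers as plain arguments and checks a requested local size against the smaller one, plus the OpenCL 1.x requirement that each global size be a multiple of the local size. It does not check the per-dimension CL_DEVICE_MAX_WORK_ITEM_SIZES limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call).
 * device_max: value reported for CL_DEVICE_MAX_WORK_GROUP_SIZE
 * kernel_max: value reported for CL_KERNEL_WORK_GROUP_SIZE for this kernel
 * Returns true if `local` looks acceptable for this kernel and NDRange. */
bool local_size_fits(size_t work_dim,
                     const size_t *global,
                     const size_t *local,
                     size_t device_max,
                     size_t kernel_max)
{
    assert(work_dim >= 1 && work_dim <= 3);
    /* The effective limit is the smaller of the two reported values. */
    size_t limit = kernel_max < device_max ? kernel_max : device_max;
    size_t items = 1;
    for (size_t d = 0; d < work_dim; ++d) {
        /* Each global size must be an exact multiple of the local size. */
        if (local[d] == 0 || global[d] % local[d] != 0)
            return false;
        items *= local[d];
    }
    /* Total work items per group must not exceed the effective limit. */
    return items <= limit;
}
```

For example, with a device maximum of 1024 but a kernel maximum of 64, a 16x16 local size is rejected and an 8x8 one passes. So the number worth printing when a local size is refused is the per-kernel one.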
Yes, I know. But I even tried a kernel that does not do anything (it simply returns) and still could not assign 16x16 local work group size - don't know if this little experiment makes any sense, but there it is for what it's worth.
And no... (256,1,1) still does not work.
Thanks Micah, you've been a great help. So Apple's OpenCL implementation on ATI cards is completely independent? That's interesting...
Well, for now, I will steer clear of the tiled algorithm for matrix multiplication.
ntrolls,
Just a note that if you're interested in accessing OpenCL from Java, you may want to take a look at the Aparapi tool at
http://developer.amd.com/aparapi
Aparapi allows you to write your parallel kernel code in Java.