Could you share your code so I can understand the issue better?
Typically, you should launch as many work-items as there are elements in the output matrix.
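To make that concrete, here is a minimal sketch in plain C (not your code, and not a real OpenCL program) that imitates a 2D index space with two nested loops. The function names `matmul_work_item` and `matmul_2d_range` are made up for illustration, and row-major storage is an assumption. In an actual kernel, `row` and `col` would come from `get_global_id(1)` and `get_global_id(0)`, and the loops would be replaced by the global size passed to `clEnqueueNDRangeKernel`.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one hypothetical work-item: it computes exactly one element
   of C = A * B. A is n x k, B is k x m, C is n x m, all stored
   row-major in flat (one-dimensional) buffers. */
static void matmul_work_item(const float *a, const float *b, float *c,
                             size_t k, size_t m, size_t row, size_t col)
{
    float acc = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        /* Element (row, i) of A lives at a[row * k + i],
           element (i, col) of B lives at b[i * m + col]. */
        acc += a[row * k + i] * b[i * m + col];
    }
    c[row * m + col] = acc;
}

/* Sequential stand-in for a 2D index space of n x m work-items.
   On a device the iterations are independent and may run in any
   order or concurrently; here they simply run one after another. */
void matmul_2d_range(const float *a, const float *b, float *c,
                     size_t n, size_t k, size_t m)
{
    assert(a && b && c);
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < m; ++col) {
            matmul_work_item(a, b, c, k, m, row, col);
        }
    }
}
```

The point is that the number of (row, col) pairs equals the number of output elements, and each pair writes to its own slot of the flat output buffer, so no two work-items collide.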
We did not see an update. Were you able to figure this out?
Hi everybody, sorry for the long silence. I've been deeply involved in this project because it was part of my (let's call it) thesis at a company, so I worked hard to get it done, and then I had some private problems to deal with.
In the end I solved the issue by working on a one-dimensional work space, modifying the kernel slightly, and then moving the program to a better machine equipped with an AMD GPU with 20 compute units, which is far faster than the previous one. I'm still not clear on the workflow when memory buffers are used with a two- or three-dimensional work space. Since memory buffers are by definition accessed as one-dimensional arrays, I don't see how to proceed, unless the memory buffers are replaced by image objects, which can be accessed as two- or three-dimensional data, if I understood correctly and if they can be used for numerical computation purposes.
By workflow I mean, for example, a matrix multiplication launched on a 2D work space with memory buffers: are all the elements of the "matrix" computed concurrently, or just row by row? It seems to me that, in this case, it makes no sense to execute a matrix multiplication on a 2D grid when the memory buffers are accessed as one-dimensional objects.
Anyway, I'm now going to face new problems, which I'll talk about in another post to avoid "overloading" this one and to "balance requests" among the different posts.
Unfortunately I cannot share the code at all right now because of company policy. I hope the restrictions will be lifted in the near future, since they greatly limit my ability to ask for help.
Sorry again for not updating the status of my post.
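For anyone reading with the same doubt, the usual way to reconcile a 2D work space with a flat buffer is plain index arithmetic. The small C sketch below only illustrates that arithmetic; it is not the kernel from this project, the names `cell_t`, `cell_from_flat_id` and `flat_id_from_cell` are hypothetical, and row-major layout is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* A (row, col) position inside a matrix that is `width` columns wide. */
typedef struct {
    size_t row;
    size_t col;
} cell_t;

/* 1D work space: recover the 2D position from a flat work-item id. */
cell_t cell_from_flat_id(size_t gid, size_t width)
{
    assert(width > 0);
    cell_t cell;
    cell.row = gid / width;
    cell.col = gid % width;
    return cell;
}

/* 2D work space: compute the flat buffer offset from a (row, col) pair. */
size_t flat_id_from_cell(size_t row, size_t col, size_t width)
{
    assert(col < width);
    return row * width + col;
}
```

Under this assumption both work spaces address the same elements of the same one-dimensional buffer; the 2D form just hands the kernel `row` and `col` directly instead of making it divide. The execution model also does not promise a row-by-row order in either case: work-items are independent and the device schedules them as it sees fit.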