I am working on an Acer Aspire laptop with an Intel Core i7-720QM CPU and an ATI Mobility Radeon HD 5650 (5 compute units, maximum work-item size of 256 per dimension), using OpenCL 1.2.
I need to perform a very long array multiplication between two matrices, A = 10x172032 and B = 35x172032. It is not a standard matrix multiplication: each of the ten rows of A has to be multiplied element-wise with every row of B, and so on for each row of B (in reality it's more complex, involving the short2 data format, but the problem I'm facing now is the enormous length). I can multiply a single row without problems (172032 work-items), but given the limited performance of my card I guess it's impossible to do the whole A x B process with a single kernel invocation, so I would have liked to at least multiply the ten rows of A by a single row of B at the same time.
1) I understood that buffer objects are accessed as one-dimensional arrays, but flattening the whole A matrix onto one dimension would require 10x172032 work-items, which is not feasible at all. In fact, just out of curiosity, I tried it and the laptop got stuck.
2) Then I thought of working in a 2-dimensional space (y = 10, x = 172032). I tried with smaller matrices (10x10), just trying to access the rows of A at the same time, but every line (on the y index) accesses the same row at the same time. Maybe this was a completely stupid choice, but it's not easy to understand how the process works across different dimensions.
3) So I thought the only way to multiply every row of A by a single row of B at the same time might be to use work-groups with synchronization commands, exploiting the 5 compute units of my card. But this would require cutting each single row of 172032 elements into smaller rows of 256 elements, is that correct?
I would like someone to give me some pointers on the correct path to follow, since my thoughts are fighting each other at the moment.
Thank you in advance.
Hi everybody, sorry for the long silence, but I've been deeply involved in this project since it was part of my (let's call it) thesis inside a company, so I worked hard to fulfil my duties and then had some other private matters to deal with.
In the end I solved the issue by working in a one-dimensional work space, modifying the kernel slightly and then moving the program to a better machine equipped with an AMD GPU with 20 compute units, extremely faster than the previous one. I'm still not clear on the workflow when using memory buffers with a two- or three-dimensional work space: if memory buffers are by definition accessed as one-dimensional buffers, there seems to be no way forward, unless memory buffers are replaced by image buffers, which can be accessed as two- or three-dimensional objects, if I understood correctly and if it's possible to use them for numerical computation purposes.
By workflow I mean, for example, a 2D work space matrix multiplication invoked with memory buffers: is the computation of all the elements of the "matrix" done concurrently, or just row by row? In that case it seems to me that it makes no sense to execute a matrix multiplication on a 2D grid with memory buffers accessed as one-dimensional objects.
Anyway, now I'm going to face new problems that I'll discuss in another post, to avoid "overloading" this one and to "balance requests" among the different posts.
Unfortunately I cannot share any code right now due to company policy. I hope there will be fewer restrictions in the near future, since this limits my possibilities to search for help a lot.
Sorry again for not updating the status of my post earlier.