I am working on an Acer Aspire laptop with an Intel Core i7-720QM CPU and an ATI Mobility Radeon HD 5650, using OpenCL 1.2. The GPU reports 5 compute units and a maximum work-item size of 256 per dimension.
I need to multiply two sets of very long arrays, stored as the matrices A (10x172032) and B (35x172032). This is not a standard matrix multiplication: each of the ten rows of A is multiplied by one row of B, and that is repeated for every row of B. (In reality it's more complex, involving the short2 data format, but the problem I'm facing now is the enormous length.) I can do the multiplication for one row (172032 work-items) without problems. Given the limited performance of my card, I guess it's not realistic to do the whole process (A x B) in a single kernel invocation, so I would like at least to multiply the ten rows of A by a single row of B at the same time.
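To make the operation concrete, here is a minimal plain C sketch of the step for a single row of B. It assumes that "multiplication" means an element-wise product and uses plain int data instead of short2; the function name multiply_rows and its parameters are only illustrative, not taken from my actual code.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for ONE row of B (assumption: element-wise product,
 * int data instead of short2).
 *   a   : rows_a x n matrix, stored row-major in a flat array
 *   b   : one row of B, n elements
 *   out : rows_a x n result, stored row-major
 */
void multiply_rows(const int *a, const int *b, int *out,
                   size_t rows_a, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);

    for (size_t i = 0; i < rows_a; ++i) {       /* each row of A */
        for (size_t k = 0; k < n; ++k) {        /* each element of that row */
            out[i * n + k] = a[i * n + k] * b[k];
        }
    }
}
```

In my case rows_a would be 10 and n would be 172032, and the whole thing would be repeated for each of the 35 rows of B.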
1) I understand that buffer objects are accessed as one-dimensional arrays, but flattening the A matrix onto one dimension would require 10x172032 work-items, and that does not seem feasible at all. In fact, I tried it just out of curiosity and the laptop froze.
2) Then I thought of working in a 2-dimensional space (y=10, x=172032). I tried it with smaller matrices (10x10), just attempting to access the rows of A at the same time, but every line (along the y index) accesses the same row. Maybe this was a completely stupid choice, but it's not easy to understand how the process works across different dimensions.
3) So I thought that the only way to multiply every row of A by a single row of B at the same time could be to use work-groups with synchronization commands, exploiting the 5 compute units of my card. But this would require splitting each row of 172032 elements into smaller chunks of 256 elements. Is that correct?
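The index arithmetic behind points 1) to 3) can be sketched in plain C as below. This is only a model of the addressing, not kernel code: the helper names flat_index and num_groups are made up for illustration, and I am assuming a row-major layout in which, inside a 2-D kernel, the row would come from get_global_id(1) and the column from get_global_id(0). If the buffer index ignored the row term, every y line would read the same row, which looks like the symptom in 2).

```c
#include <assert.h>
#include <stddef.h>

/* Flat buffer index of element (row, col) in a row-major matrix.
 * In a 2-D kernel: row ~ get_global_id(1), col ~ get_global_id(0). */
size_t flat_index(size_t row, size_t col, size_t width)
{
    assert(col < width);
    return row * width + col;
}

/* Number of work-groups needed to cover n work-items along one
 * dimension when each work-group holds local_size work-items. */
size_t num_groups(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return (n + local_size - 1) / local_size;   /* round up */
}
```

With these numbers, element 5 of row 1 of A sits at flat index 1 * 172032 + 5 = 172037, and one row of 172032 elements split into chunks of 256 gives 172032 / 256 = 672 work-groups.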
I would appreciate some pointers on the right path to follow, since my ideas are contradicting each other at the moment.
Thank you in advance.