Hi everybody,

I am working on an Acer Inspire laptop with an Intel Core i7-720QM CPU and an ATI Mobility Radeon HD 5650 (5 compute units, maximum work-item size of 256 per dimension), using OpenCL 1.2.

I need to multiply very long arrays: matrix A is 10x172032 and matrix B is 35x172032. This is not a standard matrix multiplication but a multiplication of each of the ten rows of A by one row of B, repeated for every row of B (in reality it's more complex and involves the short2 data format, but the problem I'm facing now is the enormous length). I can do the multiplication for one row (172032 work-items) without problems, but given the limited performance of my card I guess it's practically impossible to do the whole process (A x B) with a single kernel invocation, so I would like at least to multiply the ten rows of A by a single row of B at the same time.
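
For reference, the operation described above could be sketched on the CPU roughly as follows. This is only an illustration under stated assumptions: plain float data instead of short2, an element-wise product, and row-major storage. The function name `multiply_rows_by_row` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference (assumes float data and an element-wise product).
 * a     : rows_a x n matrix, stored row-major
 * b_row : one row of B, n elements
 * out   : rows_a x n result, stored row-major */
void multiply_rows_by_row(const float *a, size_t rows_a, size_t n,
                          const float *b_row, float *out)
{
    assert(a != NULL && b_row != NULL && out != NULL);
    for (size_t i = 0; i < rows_a; ++i) {      /* each row of A */
        for (size_t k = 0; k < n; ++k) {       /* each element of that row */
            out[i * n + k] = a[i * n + k] * b_row[k];
        }
    }
}
```

Calling it once per row of B (35 times here) would cover the full A x B process; a kernel would replace the two loops with one work-item per output element.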

1) I understand that buffer objects are accessed as one-dimensional arrays, but laying the A matrix out along one dimension would require 10x172032 work-items, and that's not feasible at all. In fact, just out of curiosity I tried it, but the laptop hangs.

2) Then I thought of working in a 2-dimensional space (y=10, x=172032). I tried with smaller matrices (10x10), just trying to access the rows of A at the same time, but every line (on the y-index) accesses the same row at the same time. Maybe this was a completely stupid choice, but it's not easy to understand how the process works across different dimensions.
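
One possible cause of every y-line reading the same row (a guess, since the kernel is not shown) is that the buffer index is built from the x-index only. In a 2-D range the flat, row-major index normally combines both ids. The sketch below emulates that on the host; in a kernel, `x` would come from `get_global_id(0)` and `y` from `get_global_id(1)`. The names `flat_index` and `emulate_2d_range` are hypothetical, and float data is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major flat index of a 2-D work-item: x = column, y = row. */
size_t flat_index(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* Host-side emulation of a 2-D range: one loop iteration per work-item.
 * Each work-item multiplies one element of A by the matching element of b_row. */
void emulate_2d_range(const float *a, const float *b_row, float *out,
                      size_t width, size_t height)
{
    assert(a != NULL && b_row != NULL && out != NULL);
    for (size_t y = 0; y < height; ++y) {       /* would be get_global_id(1) */
        for (size_t x = 0; x < width; ++x) {    /* would be get_global_id(0) */
            size_t i = flat_index(x, y, width);
            out[i] = a[i] * b_row[x];           /* b_row is indexed by x only */
        }
    }
}
```

If the index used `x` alone, every value of `y` would land on row 0, which matches the symptom described.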

3) So I thought that the only way to multiply every row of A by a single row of B at the same time could be to use work-groups and synchronization commands, exploiting the 5 compute units of my card. But this would require cutting the single row of 172032 elements into smaller rows of 256 elements. Is that correct?
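
As a point of reference, the per-dimension maximum work-item size reported by the device limits the size of one work-group (the local size), not the global size, and the runtime itself divides the global range into work-groups. In OpenCL 1.x the global size must be a multiple of the local size when a local size is given. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups needed to cover n work-items with the given local size. */
size_t work_group_count(size_t n, size_t local)
{
    assert(local > 0);
    return (n + local - 1) / local;
}

/* Smallest multiple of local that is >= n; usable as a padded global size.
 * A kernel launched with a padded size must skip ids >= n. */
size_t round_up_to_multiple(size_t n, size_t local)
{
    return work_group_count(n, local) * local;
}
```

With a local size of 256, one row of 172032 elements maps to 672 work-groups with no padding, since 672 x 256 = 172032.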

I would like someone to give me just some indications about the correct path to follow, since my thoughts are fighting each other at the moment.

Thank you in advance,

Marco

Hi Marco,

Could you share your code so I can understand the issue better?

Typically, you should spawn as many work-items as there are elements in the output matrix.
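
A rough sketch of what that means for a 1-D launch, assuming the output for one row of B is a 10x172032 row-major matrix: each work-item recovers its output row and column from its global id. The type `cell_t` and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t row;   /* row of A (and of the output) */
    size_t col;   /* position along the long dimension */
} cell_t;

/* Total work-items for one launch: one per output element. */
size_t output_elements(size_t rows_a, size_t n)
{
    return rows_a * n;
}

/* Map a 1-D global id (get_global_id(0) in a kernel) to an output cell. */
cell_t cell_from_global_id(size_t gid, size_t n)
{
    assert(n > 0);
    cell_t c = { gid / n, gid % n };
    return c;
}
```

Under these assumptions, 10 rows of 172032 elements give 1720320 work-items per row of B.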

Regards,

Ravi