Hi,
I tried matrix addition (h = 1024, w = 1024). The global work size is {h/4, w/4}, and I use a 2-dimensional NDRange.
The kernel code:
__kernel void add(__global float4 *c, __global float4 *a, __global float4 *b, int h, int w)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    w = w/4;   /* row width in float4 elements */
    c[i*w+j] = a[i*w+j] + b[i*w+j];
}
The problem is that the result is not correct: it does not add all the elements in the matrix. Can someone point out what's wrong with the kernel code?
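I think that the problem lies in the global work size: shouldn't it be {h, w/4} instead? With float4 there are h*w/4 elements to process, but {h/4, w/4} launches only h*w/16 work-items, so your algorithm with that size calculates only the top quarter of the matrix.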
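EDIT: And another thing: if you expect i to be the horizontal and j the vertical coordinate, the expression c[i*w+j] = a[i*w+j] + b[i*w+j] should have the i and j variables swapped: c[j*w+i] = a[j*w+i] + b[j*w+i]. In that case the global work size would have to be {w/4, h}, so that i runs over the float4 columns and j over the rows.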
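To make the coverage argument concrete, here is a minimal host-side sketch in plain C (not OpenCL) that emulates a 2D NDRange with two loops and counts how many float4 slots the index i*(w/4)+j reaches. The helper name count_covered is made up for illustration, and it assumes w is a multiple of 4.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of OpenCL: emulates a 2D NDRange of size
   {gs0, gs1} on the host and counts the distinct float4 slots written by
   the index expression i*w4 + j, where w4 = w/4. */
size_t count_covered(size_t gs0, size_t gs1, size_t h, size_t w)
{
    assert(w % 4 == 0);                     /* assumption: w divisible by 4 */
    size_t w4 = w / 4;                      /* float4 elements per row */
    size_t total = h * w4;                  /* float4 elements in the matrix */
    unsigned char *hit = calloc(total, 1);  /* one flag per float4 slot */
    size_t covered = 0;

    if (!hit)
        return 0;                           /* allocation failed */

    for (size_t i = 0; i < gs0; ++i) {      /* stands in for get_global_id(0) */
        for (size_t j = 0; j < gs1; ++j) {  /* stands in for get_global_id(1) */
            size_t idx = i * w4 + j;
            if (idx < total && !hit[idx]) {
                hit[idx] = 1;
                ++covered;
            }
        }
    }
    free(hit);
    return covered;
}
```

For h = w = 1024, count_covered(h/4, w/4, h, w) gives 65536 of the 262144 float4 elements (one quarter), while count_covered(h, w/4, h, w) gives all 262144.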
Try writing a simple single-float version first. Also, you don't need to pass h and w; use get_global_size() instead.
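A rough sketch of the indexing such a scalar version would use, emulated on the host in plain C so it can be checked without an OpenCL device. The names add_item and run_add are hypothetical; gid0, gid1 and gsize1 stand in for get_global_id(0), get_global_id(1) and get_global_size(1), and the assumed global work size is {h, w} with one work-item per float.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item in the scalar version. The row stride comes from
   the global size of dimension 1 instead of a w argument. */
void add_item(float *c, const float *a, const float *b,
              size_t gid0, size_t gid1, size_t gsize1)
{
    size_t idx = gid0 * gsize1 + gid1;  /* row-major: row * width + column */
    c[idx] = a[idx] + b[idx];
}

/* Host-side stand-in for enqueueing a 2D NDRange of size {h, w}. */
void run_add(float *c, const float *a, const float *b, size_t h, size_t w)
{
    assert(c && a && b);
    for (size_t i = 0; i < h; ++i)        /* dimension 0: rows */
        for (size_t j = 0; j < w; ++j)    /* dimension 1: columns */
            add_item(c, a, b, i, j, w);
}
```

Once this version gives the right result for every element, switching to float4 only changes the element type and shrinks dimension 1 of the global work size to w/4.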
I guess that if you use row-major ordering the algorithm is correct, but if you're using the float4 data type you have to use float4 indexing. Maybe this will work?
__kernel void add(__global float4 *c, __global float4 *a, __global float4 *b)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    /* Row stride in float4 elements: the size of the dimension j runs over. */
    int w = get_global_size(1);
    c[i*w+j].x = a[i*w+j].x + b[i*w+j].x;
    c[i*w+j].y = a[i*w+j].y + b[i*w+j].y;
    c[i*w+j].z = a[i*w+j].z + b[i*w+j].z;
    c[i*w+j].w = a[i*w+j].w + b[i*w+j].w;
}