Hi all,
Imagine that I have a kernel like:
__kernel void sum1 (__global int* a, __global int* b, __global int* c)
{
    int tid = 4*get_global_id(0);
    for (int i = 0; i < 4; ++i)
        c[tid+i] = a[tid+i] + b[tid+i];
}
and I want to vectorize it. So my question is, will this new kernel
__kernel void sum2 (__global int* a, __global int* b, __global int* c)
{
    int tid = get_global_id(0);
    /* vload4/vstore4 take an offset in units of 4 ints, so tid is not scaled here */
    vstore4(vload4(tid,a) + vload4(tid,b), tid, c);
}
run as fast as this one (with a and b converted to cl_int4 on the host side):
__kernel void sum3 (__global int4* a, __global int4* b, __global int4* c)
{
    int tid = get_global_id(0);
    c[tid] = a[tid] + b[tid];
}
? In other words, do I need to change my host code to store all my arrays as vector types (which may mean extra copies for the scalar-to-vector conversion, depending on the data structure) and change the kernel argument types, or will vload4/vstore4 be equally well optimized in terms of memory reads/writes, vectorized computation, and register use?
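To make the comparison concrete, here is a minimal plain-C sketch (not OpenCL) of the indexing I am assuming. It relies on vload4(offset, p) reading p[4*offset] through p[4*offset+3] and vstore4 writing the same range, which is how I read the spec. The names int4_emu, emu_vload4, emu_vstore4, emu_add, emu_sum1, emu_sum2 and emu_same_result are hypothetical helpers made up for this illustration. It only checks that sum1 and sum2 touch the same elements and produce the same values; it says nothing about speed, which is the actual question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for OpenCL's int4: four contiguous ints. */
typedef struct { int s[4]; } int4_emu;

/* Emulates vload4(offset, p): reads p[4*offset] .. p[4*offset + 3]. */
static int4_emu emu_vload4(size_t offset, const int *p)
{
    int4_emu v;
    memcpy(v.s, p + 4 * offset, sizeof v.s);
    return v;
}

/* Emulates vstore4(v, offset, p): writes p[4*offset] .. p[4*offset + 3]. */
static void emu_vstore4(int4_emu v, size_t offset, int *p)
{
    memcpy(p + 4 * offset, v.s, sizeof v.s);
}

/* Component-wise addition, like int4 + int4 in OpenCL C. */
static int4_emu emu_add(int4_emu x, int4_emu y)
{
    int4_emu r;
    for (int i = 0; i < 4; ++i)
        r.s[i] = x.s[i] + y.s[i];
    return r;
}

/* Body of sum1 for the work-item with global id gid. */
static void emu_sum1(size_t gid, const int *a, const int *b, int *c)
{
    size_t tid = 4 * gid;
    for (int i = 0; i < 4; ++i)
        c[tid + i] = a[tid + i] + b[tid + i];
}

/* Body of sum2 for the work-item with global id gid. */
static void emu_sum2(size_t gid, const int *a, const int *b, int *c)
{
    emu_vstore4(emu_add(emu_vload4(gid, a), emu_vload4(gid, b)), gid, c);
}

/* Returns 1 if sum1 and sum2 fill the output identically, 0 otherwise. */
static int emu_same_result(void)
{
    enum { ITEMS = 3, N = 4 * ITEMS };
    int a[N], b[N], c1[N], c2[N];
    for (int i = 0; i < N; ++i) {
        a[i] = i;
        b[i] = 10 * i;
    }
    for (size_t gid = 0; gid < ITEMS; ++gid) {
        emu_sum1(gid, a, b, c1);
        emu_sum2(gid, a, b, c2);
    }
    return memcmp(c1, c2, sizeof c1) == 0;
}
```

If this reading is right, sum2 on a scalar array and sum3 on the same memory viewed as int4 address exactly the same bytes, so what I am really asking is whether the compiler and hardware treat the two forms the same way.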