What is the advantage of vload4 over four separate scalar memory accesses?
Suppose I am loading data from local memory. Below are two kernels that read the same four consecutive ints per work-item. As far as I understand, the second kernel should exhibit no bank conflicts.
Does the first one have bank conflicts? My reasoning is that if a whole vload4 is executed in one clock, then the accesses within a half wave should conflict.
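__kernel void kernel1(__local const int *localBuffer) {  // localBuffer passed in as a __local pointer (assumed)
    // vload4's offset is counted in int4 units, so this reads localBuffer[4*i] .. localBuffer[4*i + 3]
    size_t i = get_global_id(0);
    int4 test = vload4(i, localBuffer);
}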
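__kernel void kernel2(__local const int *localBuffer) {  // localBuffer passed in as a __local pointer (assumed)
    int4 test;
    int start = get_global_id(0) * 4;  // index of the first of four consecutive ints
    test.x = localBuffer[start];
    test.y = localBuffer[start + 1];
    test.z = localBuffer[start + 2];
    test.w = localBuffer[start + 3];
}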