Looking at the generic kernel for multiplying two vectors:
__kernel void mul(__global const float* a, __global const float* b, __global float* c)
{
    // One work-item per element: work-item gid handles element gid.
    size_t gid = get_global_id(0);
    c[gid] = a[gid] * b[gid];
}
Is it possible to get coalesced reads/writes here? If so, how? If I understand the indexing correctly, within each compute unit the local_id increases in steps of one, and this happens on all compute units concurrently. Is that right?
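To make my mental model concrete, here is a plain C sketch (host-side arithmetic only, not OpenCL) of how I think the indices relate in dimension 0. It assumes a zero global offset, so the global ID is group ID times work-group size plus local ID. The helper names global_id, byte_offset and group_is_contiguous are made up for illustration. It only checks that neighbouring work-items touch neighbouring floats; whether a device actually merges those accesses into one memory transaction is hardware-specific and is the part I am unsure about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: global ID of a work-item in dimension 0,
   assuming the kernel was enqueued with a zero global offset. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Byte offset that work-item gid touches in each of a, b and c
   when the kernel indexes the buffers as a[gid], b[gid], c[gid]. */
size_t byte_offset(size_t gid)
{
    return gid * sizeof(float);
}

/* Returns 1 if every pair of neighbouring work-items in the given
   work-group accesses neighbouring floats, 0 otherwise. */
int group_is_contiguous(size_t group_id, size_t local_size)
{
    for (size_t lid = 0; lid + 1 < local_size; ++lid) {
        size_t here = byte_offset(global_id(group_id, local_size, lid));
        size_t next = byte_offset(global_id(group_id, local_size, lid + 1));
        if (next - here != sizeof(float))
            return 0;
    }
    return 1;
}
```

For example, with a work-group size of 64, global_id(2, 64, 3) is 131, and group_is_contiguous returns 1 for any group, because the kernel uses gid directly as the array index with stride one.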