I have a workgroup of size N, and a local array "buffer" of size N.
For each work item in the group, with local id k, I want to calculate
the sum S of all array items with index less than k.
S = 0;
for (int i = 0; i < k; ++i)
    S += buffer[i];
Currently, I calculate this naively, as above. Is there a more efficient way of
doing this, where intermediate sums are stored back into buffer, for example?
If you sum every adjacent pair at the start, radix style, you can reuse those partial sums later. That way each thread needs only about half as many adds.
In practice, though, the inter-thread communication this requires goes through local memory, and that is likely to be slow.
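To make the pairing idea concrete, here is a minimal sketch in plain C that simulates one level of it on the host. It is not a kernel: the function name, the separate "pairs" scratch array and the int element type are assumptions for illustration. In an actual kernel the first loop would be done by the work items in parallel, followed by a barrier on local memory, and the body of the second loop is what the work item with local id k would run.

```c
#include <stddef.h>

/* Hypothetical host-side simulation of one level of pair sums.
 * buffer: n input values (stands in for the local array)
 * pairs:  scratch space for n/2 partial sums (an assumed extra array)
 * out[k]: sum of buffer[0] .. buffer[k-1], so out[0] is 0
 */
void prefix_with_pair_sums(const int *buffer, int *pairs, int *out, size_t n)
{
    /* Step 1: sum adjacent pairs once. In a kernel, work item j < n/2
     * would compute pairs[j], and a barrier would follow. */
    for (size_t j = 0; j < n / 2; ++j)
        pairs[j] = buffer[2 * j] + buffer[2 * j + 1];

    /* Step 2: each "work item" k adds the pair sums that lie entirely
     * below k, plus one leftover element when k is odd. */
    for (size_t k = 0; k < n; ++k) {
        int s = 0;
        for (size_t j = 0; j < k / 2; ++j)
            s += pairs[j];
        if (k % 2 == 1)
            s += buffer[k - 1];
        out[k] = s;
    }
}
```

The inner loop runs k/2 times instead of k, which is where the halving comes from. Repeating the same trick on the pair sums gives the usual tree-shaped scan, at the cost of one more synchronization per level.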