The kernel I'm working with takes a 2-D data structure A and some other parameters as input, calculates a result, and then accumulates that value into a single voxel of a large 3-D data structure B at the very end:
// calculations using A
B[outIdx] += result;
where A and B both reside in global memory.
My problem is that the read followed by a write caused by the "+=" seems to be a significant bottleneck in the execution. Replacing that statement with only a read or only a write results in a more than 5x speedup. However, I need to accumulate the result. Is there a better way to do this that doesn't incur the severe penalty of the back-to-back read and write?
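To make the three variants being compared concrete, here is a minimal sketch in plain C of what a single thread does in each case. It is not the actual kernel: no GPU framework is named above, so this is written as ordinary host code. The names `compute_result`, `accumulate`, `write_only`, `read_only`, `width`, `row`, and `b_len` are hypothetical, the row sum merely stands in for the real calculations using A, and the exact form of the read-only variant is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "calculations using A": sums one row of A,
 * which is stored row-major with `width` elements per row. */
float compute_result(const float *A, size_t width, size_t row)
{
    float sum = 0.0f;
    for (size_t col = 0; col < width; ++col)
        sum += A[row * width + col];
    return sum;
}

/* Original statement: one load and one store on the same voxel of B. */
void accumulate(const float *A, size_t width, size_t row,
                float *B, size_t b_len, size_t outIdx)
{
    assert(outIdx < b_len);
    float result = compute_result(A, width, row);
    B[outIdx] += result;
}

/* Write-only variant: B[outIdx] is stored but never loaded. */
void write_only(const float *A, size_t width, size_t row,
                float *B, size_t b_len, size_t outIdx)
{
    assert(outIdx < b_len);
    float result = compute_result(A, width, row);
    B[outIdx] = result;
}

/* Read-only variant (assumed form): B[outIdx] is loaded but never stored.
 * The sum is returned so that the load and the calculation stay observable. */
float read_only(const float *A, size_t width, size_t row,
                const float *B, size_t b_len, size_t outIdx)
{
    assert(outIdx < b_len);
    float result = compute_result(A, width, row);
    return B[outIdx] + result;
}
```

One caveat when timing variants like these: if a variant leaves `result` or the loaded value unused, a compiler is free to drop the corresponding work, so the measured difference may include more than just the memory access.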