Best practices for accumulation?

Discussion created by zkhan on Aug 31, 2010
Latest reply on Sep 1, 2010 by genaganna
How to handle: global_result += kernel_result

The kernel I'm working with takes a 2-D data structure A and some other parameters as input, calculates a result, and then, as its very last step, accumulates that value into a single voxel of a large 3-D data structure B.

// calculations using A

B[outIdx] += result;

where A and B are both in global memory.
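
For concreteness, here is a minimal host-side sketch of that pattern in plain C. It is only a model of the access pattern, not the actual kernel: the function name accumulate_voxel, its parameters, the float element type, and the simple sum standing in for "calculations using A" are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel body: reduce the 2-D input A
   (rows x cols, row-major) to a single value, then accumulate that
   value into one element of B. The sum is a placeholder for the real
   "calculations using A", which are not shown in this post. */
void accumulate_voxel(const float *A, size_t rows, size_t cols,
                      float *B, size_t outIdx)
{
    float result = 0.0f;                 /* kept in a private variable */

    for (size_t i = 0; i < rows * cols; ++i)
        result += A[i];                  /* reads of A only */

    B[outIdx] += result;                 /* one read of B followed by one write to B */
}
```

The point of the sketch is that B is touched exactly once per invocation, and that single "+=" is both a load from B and a store to B at the same address.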

My problem is that the read followed by a write to the same location, which the "+=" implies, appears to be a significant bottleneck in the execution. Replacing that statement with only a read or only a write results in a more than 5x speedup. However, I need to accumulate the result. Is there a better way to do this that doesn't incur the severe penalty of the back-to-back read and write?