I'm doing several statistics/reductions on a labeled image to compute, per label, the bounding box, area, and minimum and maximum values of the corresponding pixels in a source image (about 4 million pixels each for the label and source images). The maximum number of labels is generally close to a tenth of that. Labels can be distributed across the image in any way, but large portions of the image will often concentrate on just a few labels.
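To make the target concrete, here's a minimal NumPy model of the statistics I'm after (the array sizes and random data are just stand-ins; the real thing runs on ~4 Mpixel images on the GPU):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64                              # small stand-in for a ~4 Mpixel frame
labels = rng.integers(0, 8, size=(H, W))   # label image
source = rng.random((H, W))                # source image

ys, xs = np.mgrid[0:H, 0:W]
stats = {}
for lab in np.unique(labels):
    mask = labels == lab
    stats[int(lab)] = {
        "area": int(mask.sum()),
        # bounding box as (ymin, xmin, ymax, xmax)
        "bbox": (int(ys[mask].min()), int(xs[mask].min()),
                 int(ys[mask].max()), int(xs[mask].max())),
        # min/max over the source pixels carrying this label
        "min": float(source[mask].min()),
        "max": float(source[mask].max()),
    }
```

The per-label loop is of course the part that maps to scattered atomics on the GPU.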
So far I have two implementations:
One works directly on the global datastore, using atomic max/min/inc operations. This achieves 30 ms/frame (sucks!).
The other assumes a small known upper bound on the label count and has each workgroup do those atomic reductions in LDS memory first. Each workgroup then merges its partial results into the GDS via atomic max/min/inc. A higher upper bound will start to limit wavefront occupancy and also increases the number of global writes. This achieves 2-3 ms per frame with a maximum of 512 labels.
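In case it helps to see the second scheme spelled out, here's a rough Python model of the two-level reduction (the 512-label cap is from my setup; CHUNK and the array sizes are made up, and `ufunc.at` stands in for the LDS atomics):

```python
import numpy as np

MAX_LABELS = 512     # assumed small upper bound on the label count
CHUNK = 1024         # stand-in for one workgroup's share of pixels

rng = np.random.default_rng(1)
labels = rng.integers(0, MAX_LABELS, size=8192)
source = rng.random(8192)

# global tables, initialised to the reduction identities
g_min = np.full(MAX_LABELS, np.inf)
g_max = np.full(MAX_LABELS, -np.inf)
g_cnt = np.zeros(MAX_LABELS, dtype=np.int64)

for start in range(0, labels.size, CHUNK):
    l = labels[start:start + CHUNK]
    s = source[start:start + CHUNK]
    # per-workgroup "LDS" tables
    l_min = np.full(MAX_LABELS, np.inf)
    l_max = np.full(MAX_LABELS, -np.inf)
    l_cnt = np.zeros(MAX_LABELS, dtype=np.int64)
    np.minimum.at(l_min, l, s)   # models atomic_min into LDS
    np.maximum.at(l_max, l, s)   # models atomic_max into LDS
    np.add.at(l_cnt, l, 1)       # models atomic_inc into LDS
    # one merge per label per workgroup into global memory
    g_min = np.minimum(g_min, l_min)
    g_max = np.maximum(g_max, l_max)
    g_cnt += l_cnt
```

The point of the LDS stage is that the heavy atomic contention stays on-chip; only MAX_LABELS merges per workgroup touch global memory, which is why the global write count grows with the label cap.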
So the small upper bound is achievable in some situations, but I'd really like to know whether there's a better strategy. I'd rather not keep exploring willy-nilly more than I already have without first checking whether there are any suggestions for dealing with this kind of scattered atomic write/increment problem.
One idea I had was splitting things up spatially to distribute the atomic operations over different areas of memory. This didn't help at all. I also tried varying the stride between elements using odd sizes, which likewise seemed to have no effect.
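For clarity, this is roughly what I meant by the spatial splitting, sketched in Python (NUM_COPIES and the strip-to-copy mapping are made up for illustration): each spatial region atomically updates its own copy of the tables, and the copies are merged in a final pass. In practice it made no measurable difference for me.

```python
import numpy as np

NUM_COPIES = 4      # hypothetical: one table copy per image strip
NUM_LABELS = 512
STRIP = 1024        # pixels handled per strip in this toy model

rng = np.random.default_rng(2)
labels = rng.integers(0, NUM_LABELS, size=NUM_COPIES * STRIP)

# one count table per spatial region, so concurrent atomics from
# different regions land in different areas of memory
cnt = np.zeros((NUM_COPIES, NUM_LABELS), dtype=np.int64)
for copy in range(NUM_COPIES):
    strip = labels[copy * STRIP:(copy + 1) * STRIP]
    np.add.at(cnt[copy], strip, 1)   # models atomic_inc into this copy

total = cnt.sum(axis=0)              # final merge pass over the copies
```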