With occupancy at 3% (I didn't even know this was possible) you are going to be extremely slow, read coalescing or not. SI devices don't have it because they don't need it: given appropriate memory access patterns, they naturally produce "packed" writes.
You have probably taken a CPU thread and slapped it into a work-item (WI). This is not what a WI is supposed to do, especially for complex problems. Check the VGPR usage, SGPR usage, ScratchRegs and ISA size (find this at the end of the disassembly tab).
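
To see why VGPR usage matters, here is a rough sketch of how register pressure alone caps the number of wavefronts in flight. The limits used (256 VGPRs per SIMD lane, at most 10 wavefronts per SIMD) are my assumptions for SI/GCN hardware, and the function names are made up for illustration. It ignores register allocation granularity, SGPR, LDS and scratch limits, so treat the result as an upper bound and not as what the profiler will report.

```python
# Rough upper bound on wavefront occupancy from VGPR usage alone.
# ASSUMPTIONS (not from the thread): 256 VGPRs per SIMD lane and a
# maximum of 10 wavefronts per SIMD, as on SI/GCN-class hardware.
MAX_WAVES_PER_SIMD = 10
VGPRS_PER_SIMD_LANE = 256


def waves_limited_by_vgprs(vgprs_per_work_item: int) -> int:
    """Wavefronts per SIMD that fit in the VGPR file (hypothetical helper)."""
    if vgprs_per_work_item <= 0:
        # A kernel using no VGPRs is not limited by them.
        return MAX_WAVES_PER_SIMD
    fit = VGPRS_PER_SIMD_LANE // vgprs_per_work_item
    return min(MAX_WAVES_PER_SIMD, fit)


def occupancy_percent(vgprs_per_work_item: int) -> float:
    """Occupancy as a percentage of the assumed 10-wave maximum."""
    waves = waves_limited_by_vgprs(vgprs_per_work_item)
    return 100.0 * waves / MAX_WAVES_PER_SIMD


if __name__ == "__main__":
    for vgprs in (24, 32, 64, 84, 128, 200):
        print(vgprs, waves_limited_by_vgprs(vgprs), occupancy_percent(vgprs))
```

Under these assumptions, a kernel that needs more than 128 VGPRs per work-item leaves room for only one wavefront per SIMD, which is the kind of thing that happens when a whole CPU thread's worth of state is packed into a single WI.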
I wrote "low vector and scalar unit occupancy" to refer to VALUBusy and SALUBusy, which are low, and not to kernel occupancy, which is ~30%.
I have inserted numerical values in the array index calculations, and there was a marked