With an occupancy of 3% (I didn't even know this was possible) your kernel is going to be extremely slow, read coalescing or not. SI devices don't have it because they don't need it: given appropriate memory access patterns, they naturally produce "packed" writes.
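To make "appropriate memory access patterns" a bit more concrete, here is a small sketch in plain Python (not kernel code) that compares the element index each lane of one wavefront touches on the same step under two layouts. The function names, the 16-element chunk and the 64-wide wavefront are my own assumptions for illustration, not something taken from your kernel.

```python
WAVEFRONT_SIZE = 64  # assumption: work-items per wavefront on SI/GCN hardware


def chunked_index(work_item, step, items_per_work_item):
    """CPU-style layout: each work-item owns one contiguous chunk of elements."""
    return work_item * items_per_work_item + step


def interleaved_index(work_item, step, total_work_items):
    """GPU-style layout: neighbouring work-items touch neighbouring elements."""
    return step * total_work_items + work_item


def lane_stride(indices):
    """Return the common distance between adjacent lanes, or None if irregular."""
    strides = {b - a for a, b in zip(indices, indices[1:])}
    return strides.pop() if len(strides) == 1 else None


lanes = range(WAVEFRONT_SIZE)
# Hypothetical example: every work-item processes 16 elements.
chunked = [chunked_index(wi, 0, 16) for wi in lanes]
interleaved = [interleaved_index(wi, 0, WAVEFRONT_SIZE) for wi in lanes]

print(lane_stride(chunked))      # 16: lanes are scattered, one element per chunk
print(lane_stride(interleaved))  # 1: lanes hit consecutive elements ("packed")
```

A stride of 1 across the lanes is the case where the writes come out "packed". The chunked layout is what you typically end up with when a CPU loop is moved into a kernel unchanged.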
You have probably taken the body of a CPU thread and slapped it into a single work-item (WI). That is not what a WI is supposed to do, especially for complex problems. Check the VGPR usage, SGPR usage, ScratchRegs and ISA size (find this at the end of the disassembly tab).
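As a rough guide to why the VGPR number matters, here is a back-of-the-envelope sketch in Python of how register usage caps the number of wavefronts a SIMD can keep in flight. The constants (256 VGPRs per lane, at most 10 wavefronts per SIMD, allocation in blocks of 4) are what I believe holds for SI/GCN parts, so treat them as assumptions and check them against the documentation for your device. The function name is made up for this example. It also ignores SGPR, LDS, scratch and work-group size limits, so the profiler may report something lower than this estimate.

```python
MAX_WAVES_PER_SIMD = 10    # assumption: hardware cap on SI/GCN
VGPRS_PER_SIMD_LANE = 256  # assumption: vector register budget per lane
VGPR_GRANULARITY = 4       # assumption: registers are allocated in blocks of 4


def waves_per_simd(vgprs_used):
    """Estimate how many wavefronts fit on one SIMD given per-work-item VGPR usage."""
    if not 1 <= vgprs_used <= VGPRS_PER_SIMD_LANE:
        raise ValueError("vgprs_used must be between 1 and 256")
    # Round the usage up to the allocation granularity.
    allocated = -(-vgprs_used // VGPR_GRANULARITY) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // allocated)


for vgprs in (24, 64, 128, 200):
    waves = waves_per_simd(vgprs)
    print(vgprs, "VGPRs ->", waves, "waves per SIMD")
```

Under these assumptions a kernel has to stay at or below 24 VGPRs to reach the full 10 wavefronts, and anything above 128 leaves a single wavefront per SIMD. If ScratchRegs is non-zero on top of that, the kernel is spilling registers to memory, which costs even more.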