I am doing double-precision floating-point computations on a 280X. According to the AMD Programming
Guide, the SI chips do not do 64-bit read coalescing, and I am getting very low vector and scalar unit
occupancy: between 3% and 4% according to CodeXL, which also indicates a lot of waiting for memory.
Is it at all possible to alleviate this problem?
With occupancy at 3% (I didn't even know this was possible) you are going to be extremely slow, read coalescing or not. SI devices don't have it because they don't need it: given appropriate memory access patterns, they naturally produce "packed" writes.
You have probably taken a CPU thread and slapped it into a work-item (WI). This is not what a WI is supposed to do, especially for complex problems. Check the VGPR usage, SGPR usage, ScratchRegs and ISA size (you can find these at the end of the disassembly tab).
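To make "appropriate memory access patterns" a bit more concrete, here is a rough sketch in Python that only models the index arithmetic, not the real hardware. It counts how many memory segments a single 64-wide wavefront touches when each work-item loads one double. The 64-byte segment size and the name `segments_touched` are assumptions picked for illustration; they are not taken from the AMD guide or from CodeXL.

```python
# Hypothetical model: count how many memory segments one wavefront touches
# when every work-item loads a single double. The segment size and the
# function name are illustrative assumptions, not measured hardware data.

WAVEFRONT_SIZE = 64   # work-items per wavefront on SI (GCN) devices
DOUBLE_BYTES = 8      # size of one double
SEGMENT_BYTES = 64    # ASSUMED segment granularity, for illustration only


def segments_touched(stride_in_doubles: int) -> int:
    """Return the number of distinct segments read by one wavefront.

    Work-item `lid` reads the element at index `lid * stride_in_doubles`.
    """
    segments = set()
    for lid in range(WAVEFRONT_SIZE):
        byte_address = lid * stride_in_doubles * DOUBLE_BYTES
        segments.add(byte_address // SEGMENT_BYTES)
    return len(segments)


if __name__ == "__main__":
    # Adjacent work-items read adjacent doubles.
    print("stride 1: ", segments_touched(1))
    # Each work-item owns its own row of 16 doubles (CPU-thread style).
    print("stride 16:", segments_touched(16))
```

Under these assumptions the stride-1 layout touches 8 segments per wavefront, while the stride-16 layout touches 64, one per work-item. That is the kind of difference that would show up as time spent waiting for memory, although the real numbers depend on the actual cache and channel layout.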
I wrote "low vector and scalar unit occupancy" to refer to VALUBusy and SALUBusy,
which are low, and not to kernel occupancy, which is ~30%.
I have inserted numerical values in the array index calculations, and there was a marked