I am implementing a very serial algorithm.
The algorithm is very memory intensive.
Work-item utilization is around 25% and occupancy is around 37.5%.
There is no register spilling.
Each work item i processes N(i) bytes of data. What I am finding is that
reducing N(i) by a factor of 5 only reduces the kernel time
by around 20%.
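As a sanity check, the observed scaling can be used to estimate how much of the kernel time does not depend on N(i) at all. This assumes a simple hypothetical two-parameter model t(N) = t_fixed + c*N (not output from any profiler); under that assumption, a 5x reduction in N with only a 20% time reduction implies roughly 75% of the runtime is a fixed cost (e.g. launch overhead, memory latency, or serial setup):

```python
# Hypothetical linear model of kernel time: t(N) = t_fixed + c * N.
# Given t(N/5) = 0.8 * t(N), solve for the fixed fraction f = t_fixed / t(N):
#   observed = f + (1 - f) * (1 / reduction_factor)
#   =>  f = (observed - 1/reduction_factor) / (1 - 1/reduction_factor)

def fixed_fraction(reduction_factor: float, observed_ratio: float) -> float:
    """Fraction of kernel time that does not scale with N, under t(N) = t_fixed + c*N.

    reduction_factor: factor by which N was reduced (here 5).
    observed_ratio: t(N / reduction_factor) / t(N) (here 0.8 for a 20% drop).
    """
    r = 1.0 / reduction_factor
    return (observed_ratio - r) / (1.0 - r)

print(fixed_fraction(5, 0.8))  # -> 0.75, i.e. ~75% of the time is N-independent
```

If that fixed fraction is real and not a modeling artifact, shrinking the per-item workload further will hit diminishing returns quickly; the model can be checked by timing the kernel at several more values of N(i) and seeing whether the points fall on a line with a large intercept.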
What could be causing this kind of effect? What is the best way of troubleshooting this situation?