I am implementing a highly serial, memory-intensive algorithm.
Work item utilization is around 25% and occupancy is around 37.5%.
There is no register spilling.
Each work item i processes N(i) bytes of data. What I am finding is that
if I reduce N(i) by a factor of 5, the kernel time only
drops by around 20%.
What could be causing this kind of effect, and what is the best way of troubleshooting it?
I would suggest checking memory utilization first. Since the algorithm is memory intensive, reducing the workload may not produce the desired speedup unless memory utilization also improves by the same factor.
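One way to reason about the numbers you reported: a simple Amdahl-style model where only a fraction f of the kernel time scales with N(i) and the rest is fixed cost (memory latency, bandwidth saturation, launch overhead). This is a rough sketch, not a profile of your actual kernel, but plugging in your measurements (5x less work, ~20% less time) suggests roughly how much of the runtime actually depends on the per-item workload:

```python
# Hypothetical model: kernel_time = (1 - f) + f / workload_ratio,
# where f is the fraction of runtime that scales with N(i) and
# (1 - f) is fixed overhead that does not shrink with the workload.
# The inputs below are taken from the question: reducing N(i) by 5x
# left the kernel time at ~80% of the original.

def scaling_fraction(workload_ratio, time_ratio):
    """Solve time_ratio = (1 - f) + f / workload_ratio for f."""
    return (1.0 - time_ratio) / (1.0 - 1.0 / workload_ratio)

f = scaling_fraction(workload_ratio=5.0, time_ratio=0.8)
print(f)  # 0.25
```

If this model holds even approximately, only about a quarter of your kernel time scales with N(i); the other ~75% is something fixed, which is consistent with the kernel being bound by memory latency or bandwidth rather than by the amount of per-item work. Profiling memory throughput (and comparing it against the device's peak) would confirm or rule that out.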