    OpenCL kernel scales poorly when the amount of data is reduced by an order of magnitude


      I am implementing a very serial algorithm.

      The algorithm is very memory intensive.

      Work-item utilization is around 25% and occupancy is around 37.5%.

      There is no register spilling.


      Each work item i processes N(i) bytes of data. What I am finding is that if I reduce N(i) by a factor of 5, the kernel time only goes down by around 20%.
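
      To put a number on this, here is a rough back-of-the-envelope sketch. It assumes (this is my assumption, not something I have measured) that the kernel time splits into a fixed part that does not depend on N(i) and a part that is proportional to N(i). The function name fixed_fraction and its parameters are made up for illustration.

```python
def fixed_fraction(data_reduction: float, time_ratio: float) -> float:
    """Estimate the share of the original kernel time that does not scale with N(i).

    Assumed model (hypothetical, not measured): T = F + D, where F is a fixed
    cost and D is proportional to N(i). After dividing N(i) by data_reduction,
    the new time is F + D / data_reduction = time_ratio * T.
    """
    if data_reduction <= 1.0:
        raise ValueError("data_reduction must be greater than 1")
    inv = 1.0 / data_reduction
    # Normalise T = 1, so F + D = 1 and F + D * inv = time_ratio.
    # Solving the two equations gives F = (time_ratio - inv) / (1 - inv).
    return (time_ratio - inv) / (1.0 - inv)


# Numbers from this question: N(i) reduced 5x, kernel time drops ~20% (ratio 0.8).
print(f"{fixed_fraction(5.0, 0.8):.2f}")  # prints 0.75
```

      Under that simple model, roughly 75% of the original kernel time would be independent of N(i), which is why I suspect something other than the per-item data volume dominates.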


      What could be causing this kind of effect? What is the best way to troubleshoot this situation?