Kernel runs slower for local workgroup size greater than 64

Discussion created by gallickgunner on Jan 10, 2019
Latest reply on Jan 17, 2019 by dipak

Hi bros, I'm a CS undergraduate student and I recently wrote a GPU path tracer using OpenCL. If you don't know what path tracing is, it's basically a method of generating photorealistic images by shooting rays through every pixel and applying light transport algorithms.


So the main reason I opened this discussion is that I noticed something strange. From what I've gathered online, increasing the local workgroup size, i.e. the number of work-items in a workgroup, usually increases performance, more so if the size is a power of two and a multiple of the wavefront size. I know the hardware groups work-items into groups of 64 threads called wavefronts.
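To make the sizing rule concrete, here is a minimal sketch of the arithmetic, assuming the 64-wide wavefront mentioned above. The helper names `wavefront_layout` and `is_power_of_two` are made up for illustration and are not part of OpenCL.

```python
WAVEFRONT_SIZE = 64  # wavefront width as stated in the post (assumption for other hardware)

def wavefront_layout(local_size, wavefront_size=WAVEFRONT_SIZE):
    """Return (wavefronts, idle_lanes) for one workgroup of local_size work-items."""
    # Ceiling division: a partially filled wavefront still occupies a whole wavefront.
    wavefronts = -(-local_size // wavefront_size)
    # Lanes in the last wavefront that have no work-item mapped to them.
    idle_lanes = wavefronts * wavefront_size - local_size
    return wavefronts, idle_lanes

def is_power_of_two(n):
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0

if __name__ == "__main__":
    for local_size in (64, 100, 128, 256):
        wavefronts, idle = wavefront_layout(local_size)
        print(local_size, wavefronts, idle, is_power_of_two(local_size))
```

A size such as 100 needs 2 wavefronts and leaves 28 lanes idle, which is why multiples of 64 are usually recommended.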


Before I talk about the behaviour I noticed in my path tracer, I'd like to understand some basic architectural things.

  1. Can there be more than 1 workgroup active on 1 CU at any given instant?
  2. The GCN white paper states that all 4 SIMD units can have a different wavefront active at any instant. It further states that all 4 SIMDs can execute 1 operation simultaneously; however, it later states that, out of the 7 different types of instructions, each SIMD can execute a unique one. To be more specific, here is the quote:

    The CU front-end can decode and issue seven different types of instructions: branches, scalar ALU or memory, vector ALU, vector memory, local data share, global data share or export, and special instructions. Only issue one instruction of each type can be issued at a time per SIMD, to avoid oversubscribing the execution pipelines. To preserve in-order execution, each instruction must also come from a different wavefront

    This means that in one clock cycle only 1 SIMD is allowed to do a memory read/write operation. Am I correct?

  3. If the 16 work-items in a SIMD, or more accurately in a quarter wavefront, all access the same memory location, does the read get coalesced, or is the memory accessed 16 times serially?


My path tracer gives me the highest FPS when I set the local workgroup size to 64. Increasing it further reduces FPS. I really want to understand why this is the case. I have 2 arrays, one of spheres and one of planes, in constant memory. Every work-item within the workgroup needs to perform an intersection test with every sphere, one by one. This means each work-item will be accessing the same index, let's say,



before trying to check intersection with the next one. The only difference between the 2 cases is that for a local workgroup size of 64 we would have 1 wavefront per workgroup, whereas for 256 we would have 4 wavefronts per workgroup. Why does the setting with 4 wavefronts run slower?
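For anyone who wants to picture the access pattern I'm describing, here is a rough CPU-side sketch of it in Python. It is not my actual kernel: the function names, the `(center, radius)` sphere layout and the ray-sphere test are assumptions for illustration only. The point is just that the sphere index in the outer loop does not depend on the work-item, so every work-item reads the same array element at each step.

```python
import math

def hit_sphere(origin, direction, center, radius):
    """Nearest positive hit distance along a normalized ray, or None if it misses."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-4:  # small epsilon to skip hits at or behind the origin
            return t
    return None

def trace_workgroup(rays, spheres):
    """Emulate one workgroup: each entry of rays stands in for one work-item."""
    best = [None] * len(rays)
    # Outer loop over spheres: the index i is identical for all work-items.
    for i, (center, radius) in enumerate(spheres):
        # Inner loop stands in for the work-items running in lockstep.
        for lane, (origin, direction) in enumerate(rays):
            t = hit_sphere(origin, direction, center, radius)
            if t is not None and (best[lane] is None or t < best[lane][0]):
                best[lane] = (t, i)
    return best

if __name__ == "__main__":
    spheres = [((0.0, 0.0, 5.0), 1.0), ((0.0, 0.0, 10.0), 2.0)]
    rays = [((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
            ((0.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
    print(trace_workgroup(rays, spheres))
```

The sketch says nothing about how the hardware services those identical reads; that is exactly what I'm asking about.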