2 Replies Latest reply on Dec 17, 2013 4:58 PM by nuke1234

    vgpr allocation tahiti 7990

    nuke1234

      Hello!

       

      I am trying to allocate two uint16[1024] arrays in private memory (128KB). From what I understand, the SI Tahiti has 256KB of vgprs. When I try to compile the kernel for SI Tahiti with CodeXL, I get an insufficient resources error. I am able to compile the kernel with two uint16[512] arrays, and it uses only 92 vgprs. Is there an internal limit on how many vgprs can be used by one kernel?

       

      best regards

      Karl

        • Re: vgpr allocation tahiti 7990
          sunsetquest

          Hi Karl, some random notes that I hope might help. (I'm not an OpenCL programmer, so maybe someone else can help more.)

           

          Each CU has 64K vgprs (aka 256KB), and this is shared between 4 sets (at the minimum) of 64 threads, so that leaves 256 per thread. Another way to look at it: 256KB / 4 bytes_per_reg = 64K vgprs. 64K vgprs / 4 wavefronts = 16K vgprs per wavefront. 16K vgprs / 64 threads = 256 regs. (Also, 256 vgprs/thread is the hardware limit.)
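The arithmetic above can be written out as a small Python sketch. The constants are just the figures quoted in this post (256KB register file per CU, 4-byte registers, at least 4 wavefronts per CU, 64 threads per wavefront); they are assumptions for illustration, not values queried from the hardware, and `vgprs_per_thread` is a made-up helper name.

```python
# Figures quoted in the post above; assumptions, not values read from the device.
VGPR_FILE_BYTES_PER_CU = 256 * 1024  # 256KB of vector registers per CU
BYTES_PER_VGPR = 4                   # one 32-bit register per lane
MIN_WAVEFRONTS_PER_CU = 4            # register file is split at least 4 ways
THREADS_PER_WAVEFRONT = 64


def vgprs_per_thread(file_bytes=VGPR_FILE_BYTES_PER_CU,
                     bytes_per_reg=BYTES_PER_VGPR,
                     wavefronts=MIN_WAVEFRONTS_PER_CU,
                     threads=THREADS_PER_WAVEFRONT):
    """Return the number of 32-bit registers available to one thread."""
    total_regs = file_bytes // bytes_per_reg   # 64K registers per CU
    per_wavefront = total_regs // wavefronts   # 16K registers per wavefront
    return per_wavefront // threads            # registers per thread


print(vgprs_per_thread())  # 256
```

So one thread can hold at most 256 x 4 bytes = 1KB in vgprs under these assumptions.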

           

          With that said, I am not sure how "two uint16[512]" fit in vgprs, because that would be at least 1024 regs (2 x 512), well over the 256 limit. Also, vgprs cannot always be accessed via an index, so maybe this array is in shared memory (32KB max).
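To put numbers on the array sizes, here is a rough Python sketch. It assumes that uint16 in the question means the OpenCL 16-component uint vector type (64 bytes per element), which is what the 128KB figure in the question suggests; if a 16-bit integer was meant, the counts would be much smaller. `private_array_footprint` is a hypothetical helper name.

```python
# Assumption: "uint16" is the OpenCL vector type of 16 x 32-bit uints.
UINT16_VECTOR_BYTES = 16 * 4
BYTES_PER_VGPR = 4
VGPR_LIMIT_PER_THREAD = 256  # limit quoted in this thread


def private_array_footprint(num_arrays, elements_per_array):
    """Return (bytes, 32-bit registers) needed per thread for the arrays."""
    total_bytes = num_arrays * elements_per_array * UINT16_VECTOR_BYTES
    return total_bytes, total_bytes // BYTES_PER_VGPR


for elements in (512, 1024):
    size_bytes, regs = private_array_footprint(2, elements)
    fits = regs <= VGPR_LIMIT_PER_THREAD
    print(elements, size_bytes // 1024, "KB", regs, "regs", "fits" if fits else "does not fit")
```

Under that reading, neither size comes close to fitting in 256 registers per thread, so the arrays presumably live somewhere other than vgprs.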

           

          Performance note: using 92 vgprs will probably not result in good performance. At 256 vgprs per thread per SIMD, this only fits 2 wavefronts per SIMD (8 wavefronts/CU), so it will not be able to hide latencies very well. It's best to have 16-32 wavefronts in most situations. Using anything over 128 vgprs will always hurt performance. If your code has lots of main memory access or is a large kernel, it's best to keep it at 64 or less if possible.
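A quick way to estimate how many wavefronts fit for a given vgpr count is sketched below in Python. It assumes 4 SIMDs per CU, 256 vgprs per thread on each SIMD, and a cap of 10 wavefronts per SIMD, and it ignores register allocation granularity and any limits from sgprs or LDS, so treat the results as rough estimates. `wavefronts_per_cu` is a hypothetical helper name.

```python
# Assumed GCN-style limits; not queried from the device.
MAX_VGPRS_PER_THREAD = 256
MAX_WAVES_PER_SIMD = 10
SIMDS_PER_CU = 4


def wavefronts_per_cu(vgprs_used):
    """Estimate resident wavefronts per CU when limited only by vgprs."""
    if not 1 <= vgprs_used <= MAX_VGPRS_PER_THREAD:
        raise ValueError("vgprs_used must be between 1 and 256")
    per_simd = min(MAX_WAVES_PER_SIMD, MAX_VGPRS_PER_THREAD // vgprs_used)
    return per_simd * SIMDS_PER_CU


for vgprs in (32, 64, 92, 128, 129):
    print(vgprs, "vgprs ->", wavefronts_per_cu(vgprs), "wavefronts/CU")
```

With these assumptions, 92 vgprs gives 8 wavefronts per CU, 64 vgprs gives 16, and anything above 128 drops to 4 (one per SIMD).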

          • Re: vgpr allocation tahiti 7990
            nuke1234

            Thank you for your answer. I now use __global mem (as I need 128KB per thread) and try to hide latency via parallel execution of wavefronts (don't know if I got the terminology right). Anyhow, the OpenCL compiler does pretty much what it wants, is not predictable, and uses way too many vgprs... but I guess I will find my way somehow.