nuke1234
Journeyman III

vgpr allocation tahiti 7990

Hello!

I am trying to allocate two uint16[1024] arrays in private memory (128KB). From what I understand, the SI Tahiti has 256KB of VGPRs. When I try to compile the kernel for SI Tahiti with CodeXL, I get an insufficient resources error. I am able to compile the kernel with two uint16[512] arrays, and it uses only 92 VGPRs. Is there an internal limit on how many VGPRs one kernel can use?

Best regards

Karl

sunsetquest
Adept II

Hi Karl, some random notes that I hope might help. (I'm not an OpenCL programmer, so maybe someone else can help more.)

Each CU has 64K VGPRs (i.e. 256KB), and this is shared between at least 4 sets of 64 threads, which leaves 256 per thread. Another way to look at it: 256KB / 4 bytes per register = 64K VGPRs. 64K VGPRs / 4 wavefronts = 16K VGPRs per wavefront. 16K VGPRs / 64 threads = 256 registers. (256 VGPRs per thread is also the hardware limit.)
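The same arithmetic as a small sketch, in case it helps to play with the numbers. The constants only restate the figures quoted above; the function name `vgprs_per_thread` is made up for illustration and is not part of any AMD tool.

```python
# Back-of-the-envelope VGPR budget for one compute unit (CU).
# Constants restate the figures from the post; nothing is queried from hardware.
VGPR_FILE_BYTES_PER_CU = 256 * 1024  # 256KB of VGPRs per CU
BYTES_PER_VGPR = 4                   # one VGPR holds one 32-bit value per thread
MIN_WAVEFRONTS_PER_CU = 4            # at least 4 sets of 64 threads share the file
THREADS_PER_WAVEFRONT = 64


def vgprs_per_thread(wavefronts_per_cu=MIN_WAVEFRONTS_PER_CU):
    """Registers each thread can get if the file is split evenly."""
    total_vgprs = VGPR_FILE_BYTES_PER_CU // BYTES_PER_VGPR  # 64K registers
    per_wavefront = total_vgprs // wavefronts_per_cu        # 16K at 4 wavefronts
    return per_wavefront // THREADS_PER_WAVEFRONT


print(vgprs_per_thread())    # 256
print(vgprs_per_thread(16))  # 64
```

With more wavefronts resident on the CU, the share per thread shrinks proportionally.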

With that said, I am not sure how "two uint16[512]" fit in VGPRs, because that would be at least 1024 registers (2 × 512), and more if uint16 is the 16-wide vector type. Also, VGPRs cannot always be accessed by index, so maybe this array is in shared memory (32KB max).
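A quick size check, as a sketch. It assumes `uint16` here is the OpenCL vector type (16 lanes of 32-bit uint), which is what the 128KB figure in the question implies; the helper names are hypothetical.

```python
# Private-memory footprint of the arrays, per work-item.
# Assumption: uint16 is the OpenCL 16-wide vector of 32-bit unsigned ints.
LANES_PER_UINT16 = 16
BYTES_PER_LANE = 4
HW_VGPR_LIMIT = 256  # per-thread limit quoted above


def array_dwords(num_arrays, elements):
    """32-bit values (one VGPR each) needed to hold the arrays."""
    return num_arrays * elements * LANES_PER_UINT16


def array_bytes(num_arrays, elements):
    return array_dwords(num_arrays, elements) * BYTES_PER_LANE


for elements in (512, 1024):
    dwords = array_dwords(2, elements)
    kib = array_bytes(2, elements) // 1024
    print(f"2 x uint16[{elements}]: {dwords} dwords, {kib}KB, "
          f"fits in VGPRs: {dwords <= HW_VGPR_LIMIT}")
```

Under that assumption neither size comes anywhere near fitting in 256 registers per thread, so the 92 VGPRs reported for the smaller kernel cannot be holding the arrays themselves.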

Performance note: using 92 VGPRs will probably not give good performance. That fits only about 8 wavefronts per CU (2 per SIMD), so it will not hide latencies very well. It's best to have 16-32 wavefronts per CU in most situations. Using anything over 128 VGPRs will always hurt performance. If your code has lots of main memory accesses or is a large kernel, it's best to keep it at 64 or fewer if possible.
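A rough way to estimate wavefronts per CU from a VGPR count, as a sketch only. It assumes 4 SIMDs per CU, a 256-register budget per thread on each SIMD, and a cap of 10 wavefronts per SIMD; it ignores allocation granularity and any SGPR or LDS limits, so the real compiler or profiler numbers may differ. The function names are made up for illustration.

```python
# Occupancy estimate from VGPR usage alone (assumptions listed above).
VGPR_BUDGET_PER_THREAD = 256  # registers per thread available on one SIMD
MAX_WAVES_PER_SIMD = 10       # assumed hardware cap
SIMDS_PER_CU = 4              # assumed SIMD count per CU


def waves_per_simd(vgprs_used):
    if not 1 <= vgprs_used <= VGPR_BUDGET_PER_THREAD:
        raise ValueError("vgprs_used must be between 1 and 256")
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET_PER_THREAD // vgprs_used)


def waves_per_cu(vgprs_used):
    return SIMDS_PER_CU * waves_per_simd(vgprs_used)


for vgprs in (24, 64, 92, 128, 129):
    print(vgprs, "VGPRs ->", waves_per_cu(vgprs), "wavefronts/CU")
```

This shows why 64 and 128 are the interesting thresholds: going from 64 to 65 registers, or from 128 to 129, drops a whole wavefront per SIMD.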

nuke1234
Journeyman III

Thank you for your answer. I now use __global memory (as I need 128KB per thread) and try to hide latency via parallel execution of wavefronts (I don't know if I got the terminology right). Anyhow, the OpenCL compiler pretty much does what it wants, is not predictable, and uses way too many VGPRs... but I guess I will find my way somehow.
