
iya
Journeyman III

Using all 256 registers?

Hello,

I've written a generator for .il code, and would like to use as many registers as possible. My hardware is a 4850.

As I understand it, limiting the group size to the wavefront size of 64 should be the only requirement, but in neither OpenCL nor IL have I ever succeeded in getting the compiler to allocate more than 122 GPRs. A group size of 256 gets up to 63.

Am I forgetting something or is it a current compiler limitation?

NumWavefrontPerSIMD = 2 seems to be the problem. Is there a way to limit this to 1?

; ----------------- CS Data ------------------------
; Input Semantic Mappings
; No input mappings
GprPoolSize = 0
CodeLen = 11808;Bytes
PGM_END_CF = 0; words(64 bit)
PGM_END_ALU = 0; words(64 bit)
PGM_END_FETCH = 0; words(64 bit)
MaxScratchRegsNeeded = 3
;AluPacking = 0.0
;AluClauses = 0
;PowerThrottleRate = 0.0
; texResourceUsage[0] = 0x00000000
; texResourceUsage[1] = 0x00000000
; texResourceUsage[2] = 0x00000000
; texResourceUsage[3] = 0x00000000
; fetch4ResourceUsage[0] = 0x00000000
; fetch4ResourceUsage[1] = 0x00000000
; fetch4ResourceUsage[2] = 0x00000000
; fetch4ResourceUsage[3] = 0x00000000
; texSamplerUsage = 0x00000000
; constBufUsage = 0x00000000
ResourcesAffectAlphaOutput[0] = 0x00000000
ResourcesAffectAlphaOutput[1] = 0x00000000
ResourcesAffectAlphaOutput[2] = 0x00000000
ResourcesAffectAlphaOutput[3] = 0x00000000
;SQ_PGM_RESOURCES = 0x3000027A
SQ_PGM_RESOURCES:NUM_GPRS = 122
SQ_PGM_RESOURCES:STACK_SIZE = 2
SQ_PGM_RESOURCES:FETCH_CACHE_LINES = 0
SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1
; CS Setup Mode = Fast (i.e setup R0.x)
; NumThreadPerGroup = 64
; NumWavefrontPerSIMD = 2
; IsMaxNumWavePerSIMD = true
; SetBufferForNumGroup = false
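For anyone reading the dump above, the packed SQ_PGM_RESOURCES word can be cross-checked against the decoded fields. The sketch below is only an illustration: the bit positions (NUM_GPRS in bits 0-7, STACK_SIZE in bits 8-15) are an assumption inferred from the values printed in this dump, not taken from the register reference, and `decode_pgm_resources` is a hypothetical helper name.

```python
# Assumed field layout, inferred from the dump above (not from official docs):
#   bits 0-7  -> NUM_GPRS
#   bits 8-15 -> STACK_SIZE
NUM_GPRS_MASK = 0xFF
STACK_SIZE_SHIFT = 8
STACK_SIZE_MASK = 0xFF


def decode_pgm_resources(value):
    """Return (num_gprs, stack_size) extracted from a packed register word."""
    num_gprs = value & NUM_GPRS_MASK
    stack_size = (value >> STACK_SIZE_SHIFT) & STACK_SIZE_MASK
    return num_gprs, stack_size


num_gprs, stack_size = decode_pgm_resources(0x3000027A)
print(num_gprs, stack_size)  # matches NUM_GPRS = 122 and STACK_SIZE = 2 in the dump
```

The remaining high bits (0x30000000) are left undecoded here because their layout cannot be recovered from the dump alone.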

2 Replies
the729
Journeyman III

Hi, iya

AFAIK, although each thread processor has 256 registers, the maximum number of private GPRs that a single thread can use is 123. This is because the hardware's ISA uses only 7 bits for GPR addressing, and (according to the documentation) at least 4 GPRs are reserved as clause temporary registers.

Therefore, if not limited by the LDS usage, you will get NumWaveFrontPerSIMD > 1.


In order to get full utilization of the GPU, two wavefronts need to execute in parallel. The compiler thus is limited to allocating half of the registers available for a single wavefront so that at least two wavefronts can always be executed.
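The numbers reported in this thread fit that explanation reasonably well. A rough sketch of the arithmetic, assuming 256 registers per thread processor and a wavefront size of 64 (both from the posts above), and assuming the compiler always keeps at least two wavefronts resident; `gpr_ceiling` is a hypothetical helper, and it gives only an upper bound, since a few registers appear to be reserved on top of the split:

```python
REGISTERS_PER_THREAD = 256  # per thread processor, as stated in the thread
WAVEFRONT_SIZE = 64         # threads per wavefront on this hardware


def gpr_ceiling(group_size, min_wavefronts=2):
    """Upper bound on GPRs per thread when registers are split across wavefronts."""
    # Ceiling division: wavefronts needed to hold one work-group.
    wavefronts_per_group = -(-group_size // WAVEFRONT_SIZE)
    # Assumption: at least `min_wavefronts` must always be able to run.
    wavefronts = max(min_wavefronts, wavefronts_per_group)
    return REGISTERS_PER_THREAD // wavefronts


print(gpr_ceiling(64))   # ceiling for a group size of 64
print(gpr_ceiling(256))  # ceiling for a group size of 256
```

Under these assumptions the ceiling is 128 for a group size of 64 and 64 for a group size of 256, against the observed 122 and 63. The gap is consistent with some registers being held back (for example the clause temporaries mentioned above), although the exact reserved count is not established in this thread.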