
Raistmer
Adept II

RD_SCRATCH and MEM_SCRATCH_WRITE

Do they mean register spilling into memory?

AFAIK there are 256 128-bit registers available per work-item.
What limit does the compiler use to decide when to start spilling registers?
And how can I prevent this?
19 Replies

Raistmer,
There are 64x256 128-bit registers per SIMD. If your work-group is larger than a single wavefront, then those registers are split between the wavefronts. If # wavefronts * # registers per work-item > ~240, registers start spilling.
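The rule of thumb above can be sketched as a quick host-side check in C (a sketch, not AMD's actual allocator: the ~240-register budget and the 64-wide wavefront are the figures quoted in this thread, and the real compiler limit may differ):

```c
#include <stdbool.h>

/* Figures quoted in this thread (assumptions, not official limits): */
#define REG_BUDGET 240   /* usable 128-bit registers per SIMD lane before spilling */
#define WAVEFRONT  64    /* wavefront width on HD4870 */

/* How many wavefronts a work-group of the given size occupies. */
static int wavefronts_per_group(int work_group_size) {
    return (work_group_size + WAVEFRONT - 1) / WAVEFRONT;  /* round up */
}

/* Spilling starts once all wavefronts of one group together need
   more registers than the per-SIMD budget. */
static bool will_spill(int work_group_size, int regs_per_work_item) {
    return wavefronts_per_group(work_group_size) * regs_per_work_item > REG_BUDGET;
}
```

With a 64-work-item group and ~138 registers per work-item this predicts no spilling (1 * 138 <= 240), while a default 256-work-item group predicts spilling (4 * 138 > 240).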

Please explain the situation for a particular example:
I use an execution domain of 32x10. My GPU is an HD4870.
It has 10 SIMDs.
The profiler shows that 10 wavefronts were executed, i.e. 1 wavefront per SIMD.
There are only 320 work-items/threads.
If I use 128 float4 registers as a private array plus a few more (~10 more, say), why do registers get spilled?

If the # wavefronts * # registers per work-item > ~240, registers start spilling.


Shouldn't it be # wavefronts / # SIMDs * # registers per work-item?
Why the total number of wavefronts, and not the number of wavefronts per SIMD?

Raistmer,
# wavefronts per work-group * # registers per work-item gives the total number of registers required to execute the kernel. For example, if you have 256 work-items in your work-group (the default setting), which is 4 wavefronts on the 4870, with ~138 registers per work-item, the chip would need ~550 registers per SIMD to execute without spilling. That is more than twice what is available, so registers are spilled to memory. The only way around this is to redesign your kernel to use fewer registers, or to limit the number of work-items per work-group via a kernel attribute.
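Going the other way, the remedy mentioned above (limiting work-items per work-group) can be estimated from the same assumed numbers (a sketch; 240 registers and a 64-wide wavefront are the figures quoted in this thread, not official limits):

```c
/* Largest work-group size (in whole wavefronts of 64 work-items) that
   stays inside an assumed ~240-register-per-lane budget. Returns 0 if
   even a single wavefront would spill. */
static int max_group_size_without_spill(int regs_per_work_item) {
    int max_wavefronts = 240 / regs_per_work_item;  /* integer division */
    return max_wavefronts * 64;
}
```

For ~138 registers per work-item this yields 64, i.e. exactly one wavefront per group, which is the kind of limit a reqd_work_group_size attribute can enforce.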

Understood, thanks a lot!

BTW, does that mean that with the default settings such a kernel will execute on only 3 SIMDs (the first 2 get 4 wavefronts each, the third the remaining 2) and the other 7 SIMDs will stay idle?
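The scheduling scenario in the question above can be checked with a one-line calculation (a sketch under the question's own assumption that each work-group runs on a single SIMD):

```c
/* Number of SIMDs that receive work when total_wavefronts are packed
   into work-groups of group_wavefronts each, one group per SIMD. */
static int simds_used(int total_wavefronts, int group_wavefronts) {
    return (total_wavefronts + group_wavefronts - 1) / group_wavefronts;  /* ceiling */
}
```

10 wavefronts in groups of 4 give ceil(10/4) = 3 busy SIMDs, leaving 7 of the HD4870's 10 SIMDs idle, as the question suggests.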

Wasn't the work-group size limited to 64 on 47xx Radeons?


Originally posted by: Lev

Wasn't the work-group size limited to 64 on 47xx Radeons?



We're talking about a 48xx GPU here.

I am sorry, is the 48xx based on Evergreen?


Lev,
That is correct. I forgot about that detail.

4xxx series cards are R7xx based, 5xxx cards are Evergreen based.

In that case, the 47xx and 48xx seem to be equal in this respect.


Then no registers should be spilled, right? If the group size is already 64 and far fewer than 256 float4 registers are requested...
The question remains, then.

You cannot index registers; they have names, not indices. At least in this SDK release. Memory locations have addresses, which can be indexed.

I have heard that Evergreen could index registers in theory, but not the 47xx/48xx. And this is not implemented in the compiler yet anyway.


Lev,
There is a hardware feature called the address register, a[], that allows you to index into registers. However, it is only used if you have a small array and the compiler heuristics determine that this approach can be applied. 128 x float4 is quite a large amount of data to keep in registers, so I'm assuming that the compiler heuristics do not allow an array of this size to be placed in registers.
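A sketch of the two cases in OpenCL C (hypothetical kernels, not code from this thread; whether the small array really stays in registers via a[] is up to the compiler heuristics described above):

```c
// Small private array: the compiler MAY keep this in registers and
// index it through the a[] address register (heuristic-dependent).
__kernel void small_array(__global float4 *out) {
    float4 t[4];
    for (int i = 0; i < 4; ++i)
        t[i] = (float4)(i);
    out[get_global_id(0)] = t[get_global_id(0) & 3];
}

// Large private array: 128 x float4 is beyond what the heuristics
// accept, so it goes to scratch memory (register spilling) instead.
__kernel void large_array(__global float4 *out) {
    float4 t[128];
    for (int i = 0; i < 128; ++i)
        t[i] = (float4)(i);
    out[get_global_id(0)] = t[get_global_id(0) & 127];
}
```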

Does that address register work on the 47xx/48xx?


Let's not mix different threads.
The question about the private array was answered in another thread; the answer is: "a 128-element array will currently always be spilled".
But I see signs of register spilling in another kernel, where I use only ~20 registers,
and the "array" is emulated via named variables, i.e. float4 d0, d1, ..., d15; instead of float4 d[16];.
If the group size is already 64, then I don't understand why registers are spilled in this kernel too...
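The "array emulated via named variables" trick mentioned above, sketched as an OpenCL C fragment (hypothetical code, not the poster's kernel; `i` stands for whatever selector the real kernel computes):

```c
// Indexable form -- at this size it is currently always spilled to scratch:
//     float4 d[16];
// Emulated form -- named variables that live in ordinary registers:
float4 d0, d1, d2, d3;          /* ... continue up to d15 */

// The price: reads and writes need explicit selection instead of d[i],
// e.g. for the first four elements:
float4 v = (i == 0) ? d0
         : (i == 1) ? d1
         : (i == 2) ? d2
         :            d3;
```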

OK, I did an experiment:
Without the attribute, SKA shows 64 GPRs and 172 scratch registers for this kernel on the HD4870.
With __attribute__((reqd_work_group_size(64, 1, 1))) it shows 69 GPRs (why the difference?) and 0 scratch registers!
So setting the attribute did indeed help, very good.
But what the default is for my GPU remains a mystery.
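The experiment above, as a kernel sketch (a hypothetical kernel; only the attribute line is taken from the post):

```c
// Telling the compiler the work-group is exactly 64 work-items -- one
// wavefront on the HD4870 -- lets it budget the registers for a single
// wavefront per group, which took this kernel from 172 scratch
// registers down to 0 in the SKA run reported above.
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void my_kernel(__global float4 *out) {
    /* ... register-heavy kernel body ... */
    out[get_global_id(0)] = (float4)(0.0f);
}
```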

BTW, did your kernel work? It may use 256 threads by default and not work on your card.


My kernel works correctly.
With the attribute set to __attribute__((reqd_work_group_size(32, 1, 1))) it does indeed work faster, but without it, it produces correct results too.