I have a problem with the register usage of one of my kernels again.
This time I have a kernel which needs roughly 2500 bytes of register space.
I am running on a Radeon HD6450, where each compute unit has 256 KB of register space available.
The kernel has the following line before the actual __kernel... definition:
__attribute__((reqd_work_group_size(64, 1, 1))).
My aim here is to let the compiler use the maximum number of registers, because I will execute the kernel with only 128 work-items, i.e. 2 wavefronts, and each wavefront should run on its own compute unit.
The problem now is that the kernel spills registers, which it shouldn't, as far as I can see:
- 2500 bytes of register space per work-item
- 64 work-items per wavefront
- gives 160,000 bytes of register space per wavefront
With 256 KB available per compute unit, there is more than enough, so I don't understand the spilling.
Can anyone point out where I've gone wrong?
I believe there is also a limit of 256 registers per thread. If they are all 128-bit registers, that gives you 4 KB per thread and you should be fine. But if you're using 32-bit scalar registers, you might get into trouble.
Thanks, dmeiser, for your response. Where can I find that information about the register limit per thread?
I am using float2 variables, so not scalar, but not 128-bit either.
I've been conducting some more tests. The results are as follows.
I've experimented with splitting my variables: some reside in registers, some in the on-chip local store.
I have found a configuration in which my kernel uses 118 VGPRs and 31,232 bytes of LDS, with no spilled registers.
The first thing is that this algorithm runs much slower than the variant without LDS usage (about a factor of 2 slower). I'm not sure why. Perhaps because I have local-size set to 32, whereas in the non-LDS code local-size is 64?
The other thing is that when I increase the size of one array (in the "LDS + registers" variant) by 1, the kernel spills a massive number of registers (96) and uses only 17 VGPRs.
I don't understand why the compiler does not use the 118 registers as before and spill just one. Also, some registers seem to be missing: 96 + 17 = 113, whereas it should be 118 + 1 = 119 in total.
A lot of confusion here. I hope I was able to express myself in a somewhat understandable way.
OK, it appears the 256-register limit applies to the GCN architecture; apparently the limit for Cayman is 128. There is some information on the architecture of the Cayman SIMDs in the following article:
Perhaps because I have local-size set to 32, whereas in the non-LDS-code local-size is 64? <-- Yes, you are only using half of your device, as half of each wavefront is inactive.
I don't think it works exactly like that, because the CU has 16 stream cores, each processing one work-item. So to fully use my device, I only need a work-group size of at least 16; the rest is just there to hide latencies, isn't it?
So I agree that I will definitely see some performance degradation when I set the work-group size to 32, but I don't think performance will drop to exactly half of what it was before.
The wavefront size is 64; that is what matters. Since your work-group size is 32, you are only filling half a wavefront, so the other half of the wavefront sits idle.
No, the entire register file is not available to a work-item; only 1/N of it is, where N is the wavefront size for the device. Then half of what is left (usually 256 registers on pre-GCN hardware) is available per work-item.