
KNeumann
Adept II

register usage in kernel again...

Hi,

I have a problem with the register usage of one of my kernels again.

Now I have a kernel which needs roughly 2500 bytes of register space.

I am running on a Radeon HD 6450, where each compute unit has 256 KB of register space available.

The kernel has the following line before the actual __kernel... definition:

__attribute__((reqd_work_group_size(64, 1, 1))).

My aim here is to let the compiler use the maximum number of registers, because I will launch the kernel with only 128 work-items, i.e. 2 wavefronts, and each wavefront should run on its own compute unit.
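For reference, the setup looks roughly like this (kernel name, argument and buffer are placeholders, not the actual code):

    /* Sketch of the kernel declaration; name and arguments are placeholders. */
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void my_kernel(__global float2 *data)
    {
        /* roughly 2500 bytes of private (register) data per work-item live here */
    }

    /* Host side: 128 work-items in total, 64 per work-group, i.e. 2 wavefronts.
       queue and kernel are assumed to have been created beforehand. */
    size_t global_size = 128;
    size_t local_size  = 64;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, NULL);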

The problem is that the kernel spills registers, which, as far as I can see, it shouldn't.

Because:

- 2500 bytes of register space per work-item
- 64 work-items per wavefront
- gives 160,000 bytes of register space per wavefront

and with 256 KB (262,144 bytes) available per compute unit, that should be more than enough, so I don't understand the spilling.
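Written out as a quick check (this only looks at the per-CU total, using the numbers above):

    /* Quick per-CU budget check with the numbers from above. */
    #include <stdio.h>

    int main(void)
    {
        const int bytes_per_work_item = 2500;        /* estimated register bytes per work-item */
        const int wavefront_size      = 64;          /* work-items per wavefront               */
        const int reg_file_per_cu     = 256 * 1024;  /* register file per compute unit (bytes) */

        const int bytes_per_wavefront = bytes_per_work_item * wavefront_size;  /* 160,000 */
        printf("per wavefront: %d bytes, per CU: %d bytes\n",
               bytes_per_wavefront, reg_file_per_cu);      /* 160,000 vs 262,144 */
        return 0;
    }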

Can anyone point out where I'm going wrong?

Thanks

dmeiser
Elite

I believe there is also a limit of 256 registers per thread. If they are all 128-bit registers, that would give you 4 KB per thread and you should be fine. But if you're using 32-bit scalar registers, you might run into trouble.
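Spelled out, assuming such a 256-registers-per-thread cap (whether that number applies here is exactly the open question):

    const int cap_regs       = 256;            /* assumed per-thread register cap       */
    const int cap_bytes_128b = cap_regs * 16;  /* 4096 bytes -> 2500 bytes would fit    */
    const int cap_bytes_32b  = cap_regs * 4;   /* 1024 bytes -> 2500 bytes would spill  */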


Thanks, dmeiser, for your response. Where can I find that information about the register limit per thread?

I am using float2 variables, not scalar, but also not 128-bit.

I've been conducting some more tests. The results are as follows.

I've played around with splitting my variables: some reside in registers, some in on-chip local store (LDS).
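The kind of split meant here, sketched on a made-up private array (names and sizes are placeholders, not my actual kernel):

    /* Before: a private array that the compiler maps to registers (or spills). */
    __kernel void before(__global float2 *data)
    {
        float2 tmp[32];                       /* lives in registers / scratch */
        /* ... work on tmp ... */
    }

    /* After: the same array moved to on-chip local memory, one slice per work-item. */
    __attribute__((reqd_work_group_size(32, 1, 1)))
    __kernel void after(__global float2 *data)
    {
        __local float2 tmp[32 * 32];          /* 32 elements for each of the 32 work-items */
        __local float2 *mine = tmp + get_local_id(0) * 32;
        /* ... work on mine[0..31] instead of a private array ... */
    }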

I have found a configuration in which my kernel uses 118 VGPRs and 31,232 bytes of LDS, with no spilled registers.

The first thing is that this variant runs much slower than the one without LDS usage (about a factor of 2 slower). I'm not sure why. Perhaps because I have the local size set to 32, whereas in the non-LDS code the local size is 64?

Another thing: when I increase the size of one array (in the "LDS + registers" variant) by 1, the kernel spills a massive number of registers (96) and uses only 17 VGPRs.

I don't understand why the compiler doesn't use 118 registers like before and spill just one. Also, some registers seem to be missing: 96 + 17 = 113, whereas I would expect 118 + 1 = 119 in total.
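One way to cross-check such numbers from the host, rather than relying only on the compiler statistics, is clGetKernelWorkGroupInfo (how much of the spill shows up in CL_KERNEL_PRIVATE_MEM_SIZE is implementation-dependent, so treat this as a rough sanity check):

    /* Query per-kernel resource usage as reported by the runtime.
       kernel and device are assumed to exist already. */
    cl_ulong local_mem = 0, private_mem = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);
    printf("LDS: %lu bytes, private/scratch: %lu bytes\n",
           (unsigned long)local_mem, (unsigned long)private_mem);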

A lot of confusion here

I hope I was able to express myself in a somewhat understandable way.


OK, it appears the 256-register limit applies to the GCN architecture. Apparently the limit for Cayman is 128. There is some information on the architecture of the Cayman SIMDs in the following article:

http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=5
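With that 128-register limit, the numbers from the first post would work out as follows (assuming 128-bit registers, 16 bytes each, on this VLIW hardware):

    const int cayman_cap_regs  = 128;                   /* per-thread register cap (pre-GCN) */
    const int cayman_cap_bytes = cayman_cap_regs * 16;  /* 2048 bytes per work-item          */
    /* 2500 bytes requested > 2048 bytes available per thread -> the excess is spilled. */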

Cheers


Perhaps because I have the local size set to 32, whereas in the non-LDS code the local size is 64? <-- Yes, you are only using half of your device, as half of each wavefront is inactive.


I don't think that's exactly right, because the CU has 16 stream cores, each processing one work-item. So to fully use my device, I need a work-group size of at least 16; the rest is just there to hide latencies, isn't it?

So I agree that I will definitely see some performance degradation when I set the work-group size to 32, but I don't think the performance will be exactly half of what it was before.


The wavefront size is 64; that is what matters here. Since your work-group size is 32, you are only using half a wavefront, so the other half of the wavefront is idle.


This is architecture-dependent. Usually the limit per thread is half of the register file available to a work-item.


Sorry, I think I misunderstood.

What's the register limit per work-item then?


No, the entire register file is not available to a work-item; only 1/N is, where N is the wavefront size for the device. Then half of what is left (usually 256 registers on pre-GCN hardware) is available per work-item.
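Plugging in the numbers from this thread to illustrate that rule (256 KB register file per CU, wavefront size 64, 128-bit registers):

    const int reg_file_per_cu = 256 * 1024;                        /* bytes                 */
    const int wavefront_size  = 64;
    const int share_per_item  = reg_file_per_cu / wavefront_size;  /* 4096 bytes = 256 regs */
    const int usable_per_item = share_per_item / 2;                /* 2048 bytes = 128 regs */
    /* 128 registers of 16 bytes matches the Cayman limit mentioned earlier. */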


Ok I see.

And what about the other half?


They are used to launch a second wavefront on the device in parallel to the first wavefront to hide latency. The latency on pre-GCN hardware was 8 cycles, and each wavefront takes 4 cycles to execute all 64 work-items.

Thank you so much for that clarification.

It makes more sense to me now.
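Put as numbers, the two-wavefront scheme described above (a sketch using the pre-GCN figures stated in this thread: 16 stream cores per CU, wavefront size 64, 8-cycle latency):

    const int lanes           = 16;                      /* stream cores per CU (from above)  */
    const int wavefront_size  = 64;
    const int cycles_per_wave = wavefront_size / lanes;  /* 4 cycles to issue 64 work-items   */
    const int latency_cycles  = 8;                       /* stated ALU latency (pre-GCN)      */
    /* Two wavefronts back to back: 2 * 4 = 8 cycles of issue, covering the 8-cycle latency. */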


MicahVillmow wrote:

They are used to launch a second wavefront on the device in parallel to the first wavefront to hide latency. The latency on pre-GCN hardware was 8 cycles, and each wavefront takes 4 cycles to execute all 64 work-items.

Has this changed with GCN, then? In the APP OpenCL Programming Guide this information is still present (bottom of page 4-45) without any mention of changes for GCN hardware.


We are working on the documentation update for GCN.