

Reducing VGPRs usage II

I have a kernel that gets only 20% occupancy due to VGPR usage. Currently, I am using 97 VGPRs as reported by CodeXL.

If I can get this down to 84 registers, then CodeXL reports that I get a bump in occupancy.

Is it realistic to reduce VGPRs from 97 to 84, i.e. a reduction of around 13%?
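For reference, the 20% figure and the threshold at 84 registers can be reproduced with a little arithmetic. This is only a sketch: the constants (256 VGPRs per SIMD lane, at most 10 wavefronts per SIMD, allocation in blocks of 4 registers) are the commonly documented GCN values and are assumed here rather than taken from this thread, and the function names are made up for illustration.

```python
# Sketch of GCN occupancy as limited by VGPRs only (LDS and SGPRs ignored).
# Assumed hardware constants, not stated in the thread:
VGPRS_PER_SIMD = 256      # vector registers available to each lane of a SIMD
MAX_WAVES_PER_SIMD = 10   # cap on resident wavefronts per SIMD
VGPR_GRANULARITY = 4      # registers are allocated in blocks of 4


def waves_per_simd(vgprs_used: int) -> int:
    """Wavefronts that fit on one SIMD for a given per-work-item VGPR count."""
    # Round the request up to the allocation granularity.
    allocated = -(-vgprs_used // VGPR_GRANULARITY) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)


def occupancy_percent(vgprs_used: int) -> int:
    """Occupancy as a percentage of the 10-wave maximum."""
    return 100 * waves_per_simd(vgprs_used) // MAX_WAVES_PER_SIMD


if __name__ == "__main__":
    for v in (97, 85, 84, 64):
        print(v, "VGPRs:", waves_per_simd(v), "waves,", occupancy_percent(v), "%")
```

Under these assumptions, 97 VGPRs gives 2 waves per SIMD (20%), and 84 is the largest count that still fits 3 waves (30%), which matches what CodeXL reports.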

11 Replies

I have a feeling that VGPRs are not the bottleneck here.

vregs<=128 is not that bad! That way the processor can always choose between 2 threads.

Do you have a lot of memory IO? You can read only 1 dword for every 30 or more ALU instructions.

Do you have enough workitems? (a multiple of 4*no_of_streams would be good)

Kernel size: Is the inner loop of your kernel below 32KB?


Thanks. I do have a lot of memory IO in this kernel: it is an image processing application.

Also, LDS usage is around 2.5 K.

Kernel size is only 9K.

My workgroup size is 64. What do you mean by no_of_streams? 

By the way, I am running this on an HD 7700, so not a lot of graphics horsepower to take advantage of.

Perhaps occupancy will improve with GCN 1.2


Even though GPU RAM runs at hundreds of GByte/s, it is very slow compared to the Compute Units. They would eat RAM at a rate of many TByte/s if they could.

When you do sequential reads on a 7970, the peak is 1 dword read for every 30-40 ALU instructions.

On the 7770 things are a bit worse, as you have 1/3 of the processing power and 1/4 of the memory bandwidth.
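A back-of-the-envelope version of that ratio, as a sketch. The card specs in the table are approximate published figures that I am assuming for illustration (they are not from this thread), and the model is deliberately crude: one ALU instruction per lane per clock, and peak rather than sustained bandwidth.

```python
# Rough "ALU instructions per dword read" estimate from peak specs.
# The numbers below are approximate published specs, assumed for illustration.
CARDS = {
    # name: (stream processors, core clock in MHz, memory bandwidth in GB/s)
    "HD 7970": (2048, 925, 264),
    "HD 7770": (640, 1000, 72),
}


def alu_per_dword(lanes: int, clock_mhz: float, bandwidth_gbs: float) -> float:
    """ALU instructions issued in the time it takes to stream one dword."""
    alu_per_sec = lanes * clock_mhz * 1e6      # one instruction per lane per clock
    dwords_per_sec = bandwidth_gbs * 1e9 / 4   # 4 bytes per dword
    return alu_per_sec / dwords_per_sec


if __name__ == "__main__":
    for name, spec in CARDS.items():
        print(f"{name}: about {alu_per_dword(*spec):.1f} ALU instructions per dword")
```

With these assumed figures the ratio comes out near 29 for the 7970 and near 36 for the 7770. Sustained bandwidth is lower than peak, so real ratios would be somewhat higher.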

So in these memory-intensive tasks, try to do as much as possible on the GPU. If you don't need to calculate that much, try to compress the data, as the decompression math is essentially free.

Other things are ok:

At <=128regs, 64wfsize, the max LDS is 32KB.

9K code size easily fits in the 32KB cache.

no_of_streams: the total number of streams on your GPU; on the 7770 it is 10{cu}*64=640. GCN requires at least 4x that to operate properly, especially when you have long-running coinminer kernels.  For small kernels (I think that's the case now), give it millions of workitems!
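The rule of thumb above, written out as a small sketch. The helper names are hypothetical; the 64 lanes per CU and the 4x factor come from the post, and rounding the global size up to a multiple of the workgroup size is just the usual way to keep the launch evenly divisible.

```python
def min_work_items(compute_units: int, lanes_per_cu: int = 64, factor: int = 4) -> int:
    """Smallest work-item count suggested by the 4 * no_of_streams rule."""
    return compute_units * lanes_per_cu * factor


def round_up_global_size(n: int, workgroup_size: int = 64) -> int:
    """Round a work-item count up to the next multiple of the workgroup size."""
    return -(-n // workgroup_size) * workgroup_size


if __name__ == "__main__":
    # HD 7770: 10 compute units, as stated in the post.
    print(min_work_items(10))          # lower bound from the rule of thumb
    print(round_up_global_size(1000))  # padded global size for 1000 items
```

For the 10-CU card in question the lower bound is 2560 work items, so an image-sized launch is comfortably above it.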

>Perhaps occupancy will improve with GCN 1.2

No. As far as I know, GCN 1.2 introduced flat memory addressing and a few new instructions, but no significant change in the architecture.

GCN 3 (or whatever the proper name is), i.e. the new Fury cards, doubled the memory bandwidth by using that cool on-chip HBM memory, but also doubled the processing power, so still aim for "1 dword read per 30+ ALU instructions".

Thanks, I really appreciate this information.  Regarding bandwidth, didn't 1.2 introduce delta color compression? Should help with bandwidth.

I can't seem to get much information on how this compression can be used for compute.

I am looking forward to 2016 and the 1 TB/s bandwidth with HBM 2.  With HBM2, global memory bandwidth matches local memory bandwidth.

Do we still need local memory, in this case? I realize that latency will be much lower for local memory.

I like the idea of compressing the data and uncompressing in the kernel.  What sort of compression were you thinking of? It would have to be lossless, of course.


The compression depends on the data, of course. You can send uints and access any of their bits on the GPU, and so on...
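One way to read "send uints and access their bits" is plain bit packing: store several narrow samples in each 32-bit word on the host, then pull out the one you need with a shift and a mask in the kernel. The sketch below shows the idea in Python for clarity; the function names are hypothetical, and the same shift-and-mask expression would carry over to OpenCL C. It assumes samples never straddle a word boundary.

```python
def pack(values, bits):
    """Pack unsigned samples of width `bits` into a list of 32-bit words."""
    per_word = 32 // bits            # samples stored in each word
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, v in enumerate(values[start:start + per_word]):
            word |= (v & mask) << (slot * bits)
        words.append(word)
    return words


def unpack_one(words, index, bits):
    """Extract sample `index` with one shift and one mask, as a kernel would."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    word = words[index // per_word]
    return (word >> ((index % per_word) * bits)) & mask


if __name__ == "__main__":
    pixels = [0, 255, 17, 128, 3]
    packed = pack(pixels, 8)         # 5 bytes of pixels in 2 dwords
    print([unpack_one(packed, i, 8) for i in range(len(pixels))])
```

With 8-bit pixels this reads a quarter as many dwords as one pixel per uint, at the cost of a couple of ALU instructions per sample.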

Don't forget that local memory bandwidth scales with processing power, and also as you mentioned it has a LOT less latency.

Accessing 1 byte in RAM takes the same time as accessing an aligned 256 bytes.

>Do we still need local memory, in this case?

Of course! It is a way to communicate inside a workgroup. Also it has a crossbar, that can be used to directly transfer data across the wavefront.


Thanks. Can you give more detail on "You can send uints...." ? Do you mean using a lossless compression algorithm such as zip to compress the data?

Also, you mentioned LDS bandwidth scaling with processing power:  do you mean that 390X LDS bandwidth > 7700 LDS bandwidth ?

Also, the crossbar you mentioned, is this used when all work items access a single LDS location? i.e. broadcast?


By the way, I have noticed the following phenomenon:  moving a block of code into an inline method can reduce register pressure.  Perhaps this helps the compiler properly scope the variables in the block.


>"moving a block of code into an inline method can reduce register pressure."

Now you are actively practicing Black Magic.


Yes, the compiler should be aggressively inlining methods, but I have seen this with my own eyes!!


By the time your code reaches the GPU, everything is inlined; only loops and ifs remain as control elements, with no function calls. That is all the HD2xxx..HD6xxx cards were capable of. When the HD7xxx came into the picture, function calls became possible, but the amd_il compiler still inlined everything.

This high-level inlining and unrolling only fine-tunes a high-level optimizer that sits just below OpenCL. I think that is where the big algorithmic optimizations are decided, and where the final VGPR usage and some other things originate.

(At least this was true a few years ago, I don't know how it works now with HSA...)


Interesting.  I wish the optimizer was not such a black box:  if only AMD engineers could provide some heuristics about managing registers.

Otherwise, I may need to sell my soul in order to bump occupancy a bit higher.