
rwelsch
Adept I

Optimizing code for GCN

Hey,

I have some basic questions concerning GCN and optimizing my code for this architecture.

As far as I understand there is a limit of 10 wave-fronts per SIMD unit and a Compute Unit is composed of 4 SIMD units; so I can run 40 wave-fronts on one compute unit in parallel, right?

And there is also a limit of 16 work-groups per Compute Unit if the work-group size is greater than 1 wave-front (64 work items). Does that mean that I can run 40 work-groups on one Compute Unit if my local work-group size is 64 (= 1 wave-front)?

The other limiting resources are then 25 registers per work item (64 kB of registers for 10 wave-fronts) and 1.6 kB of local memory (64 kB for 40 work-groups). Is this correct, or am I missing something?

Thanks for all the help, I'm just getting started with GPU computing and OpenCL, etc.

Ralph

9 Replies

Hi Ralph,

Welcome to the GPGPU community.

Just to clarify a bit, the numbers you are referring to are the recommended numbers for a GCN GPU. In most cases the actual numbers will depend on the problem you are working on.

rwelsch wrote:

Hey,

I have some basic questions concerning GCN and optimizing my code for this architecture.

As far as I understand there is a limit of 10 wave-fronts per SIMD unit and a Compute Unit is composed of 4 SIMD units; so I can run 40 wave-fronts on one compute unit in parallel, right?
Actually only a quad-wavefront (16 work-items) can run in parallel on a Compute Unit. All these extra work-items are introduced to hide the latencies of memory I/O, for data as well as code.

And there is also a limit of 16 work-groups per Compute Unit if the work-group size is greater than 1 wave-front (64 work items). Does that mean that I can run 40 work-groups on one Compute Unit if my local work-group size is 64 (= 1 wave-front)?
Although I have not tested it, if there is a limit of 16 workgroups per CU, then that means you cannot schedule more than 16 workgroups for that CU. It may be a constraint regarding schedule handling and not how many work-items a CU can handle.

The other limiting resources are then 25 registers per work item (64 kB of registers for 10 wave-fronts) and 1.6 kB of local memory (64 kB for 40 work-groups). Is this correct, or am I missing something?

Actually you can have more LDS per workgroup than just 1.6 kB. But in that case, fewer workgroups will be scheduled on the CU. It should not affect you considerably up to some point, unless you are working on an extremely memory-bound problem.

Thanks for all the help, I'm just getting started with GPU computing and OpenCL, etc.

Ralph

Message was edited by: Himanshu Gautam
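
(As a rough illustration of the occupancy arithmetic discussed above, here is a minimal, hedged sketch in plain C. The 64 kB LDS per CU, 64 kB of VGPRs per SIMD, 10-wavefront-per-SIMD and 16/40-work-group figures are the ones quoted in this thread; the exact limits enforced by a real device and driver may differ.)

/* Back-of-the-envelope occupancy arithmetic for one GCN CU, using the figures
   quoted in this thread: 64 kB of LDS per CU, 64 kB of VGPRs per SIMD
   (256 VGPRs per work item), at most 10 wavefronts per SIMD, 4 SIMDs per CU,
   and at most 16 work-groups per CU unless a work-group is a single wavefront. */
#include <stdio.h>

static int imin(int a, int b) { return a < b ? a : b; }

static int wavefronts_per_cu(int vgprs_per_item, int lds_bytes_per_group,
                             int wavefronts_per_group)
{
    int waves_per_simd = imin(10, 256 / vgprs_per_item);      /* VGPR limit  */
    int waves_by_vgpr  = 4 * waves_per_simd;                  /* 4 SIMDs/CU  */

    int groups_by_lds  = (64 * 1024) / lds_bytes_per_group;   /* LDS limit   */
    int group_limit    = (wavefronts_per_group == 1) ? 40 : 16;
    int groups         = imin(groups_by_lds, group_limit);

    return imin(waves_by_vgpr, groups * wavefronts_per_group);
}

int main(void)
{
    /* Ralph's numbers: 25 VGPRs per work item, 1.6 kB LDS, 64-work-item groups. */
    printf("%d wavefronts resident per CU\n", wavefronts_per_cu(25, 1638, 1));
    return 0;
}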


gautam.himanshu wrote:

Hi Ralph,

Welcome to the GPGPU community.

Just to clarify a bit, the numbers you are referring to are the recommended numbers for a GCN GPU. In most cases the actual numbers will depend on the problem you are working on.

Thanks for the help. My problem is in large part memory bound, so I'm trying to hide latency as much as possible.

gautam.himanshu wrote:

As far as I understand there is a limit of 10 wave-fronts per SIMD unit and a Compute Unit is composed of 4 SIMD units; so I can run 40 wave-fronts on one compute unit in parallel, right?
Actually only a quad-wavefront (16 work-items) can run in parallel on a Compute Unit. All these extra work-items are introduced to hide the latencies of memory I/O, for data as well as code.

You mean that only a quad-wavefront can run on a SIMD-unit, right? And so 4 quad-wavefronts can run on one CU.

What my question was really aiming at was: how many wave-fronts can be scheduled on a SIMD unit or a CU? I think on Fermi cards there is a limit of 1536 threads per multiprocessor, and to my understanding it is 40*64 = 2560 work items on GCN.

gautam.himanshu wrote:

And there is also a limit of 16 work-groups per Compute Unit if the work-group size is greater than 1 wave-front (64 work items). Does that mean that I can run 40 work-groups on one Compute Unit if my local work-group size is 64 (= 1 wave-front)?
Although I have not tested it, if there is a limit of 16 workgroups per CU, then that means you cannot schedule more than 16 workgroups for that CU. It may be a constraint regarding schedule handling and not how many work-items a CU can handle.

I have these numbers from the AMD APP programming guide. And if the limit is there, it is of course a limitation on how many work-items a CU can handle. And I do not really understand why this limitation only applies when the work-group is larger than one wave-front.

For my code it may be best to have as many workgroups as possible running on one CU, while not losing any occupancy/latency hiding. That's why I asked if I can have 40 work-groups of 64 work items each running on one CU.

gautam.himanshu wrote:

The other limiting resources are then 25 registers per work item (64 kB of registers for 10 wave-fronts) and 1.6 kB of local memory (64 kB for 40 work-groups). Is this correct, or am I missing something?

Actually you can have more LDS per workgroup than just 1.6 kB. But in that case, fewer workgroups will be scheduled on the CU. It should not affect you considerably up to some point, unless you are working on an extremely memory-bound problem.

Yeah, ok. But as I am mainly memory bound, the limitation to get the best latency hiding is 25 registers per work item and 1.6 kB of LDS per workgroup, right?

best regards,

Ralph

realhet
Miniboss

Hi!

In my experiments it turned out that one can use at most 64 vector and 105 scalar registers for optimal performance.

I guess that the 10 wavefront/CU limit is just a queue limit; the actual execution works on 4 wavefronts in parallel. That's why you get no speedup when reducing the used VRegs from 64 to 32 or below. If you have really long kernels (1000 instructions or so), then it's enough to have 4 wavefronts in each CU. If you have smaller kernels, then that 10-element wavefront queue comes into play, because the thing that schedules the wavefronts will not update the queues on 'every clock cycle'.

Also the 4 SIMD units will work on the same wavefronts, but there can be 4 different program execution paths. You can try to set wavefront size to 16, and everything will be 4x slower.

I feel like we need 4x..8x more work-items in general, compared to Evergreen architecture.

On Evergreen there were 16 threads (4x VLIW) and there were 2 of these working in parallel.

On GCN we have 64 threads, and 4 of these are issued in parallel (per CU).

Just an example:

SP Mandelbrot made in a way that every thread gets its new task when finishing the previous task, via buffer_atomic_add_rtn (from a UAV). So basically it's 99.9% math and the rest is memory IO.

In this way you can have only 1 worker thread and even that will calculate all the pixels of the Mandelbrot image.

Then I experimented with different numbers of threads and VRegCounts:

Best performance was at 32768 threads. (That's 16x more than the stream cores in the 7970.)

And I was able to push the VRegCount up to 64.

The measured performance  (based on the 10 inner loop instructions) was 99.3% of the ideal performance.

My earlier guess about the minimum required work-items was incorrect: I thought it was enough to have 4x more threads than there are stream cores in the GPU, but that was only true when running ALU-exclusive tests. When even a small amount of memory IO comes in, it can be 1.5x faster to put in 8x more threads (32K).
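
(A minimal OpenCL C sketch of the "fetch the next task through an atomic counter" pattern realhet describes; he used buffer_atomic_add_rtn at the ISA level, and atomic_inc on a global counter is the closest OpenCL-level equivalent. The kernel and buffer names, and the Mandelbrot math itself, are only illustrative.)

/* Persistent-thread style Mandelbrot: each work-item keeps pulling the index
   of the next unprocessed pixel from a shared atomic counter until the image
   is exhausted, so the number of launched threads is independent of the
   image size. */
__kernel void mandelbrot_persistent(__global uint *next_pixel,   /* shared work counter */
                                    __global uint *image,
                                    uint width, uint height,
                                    uint max_iter)
{
    for (;;) {
        uint p = atomic_inc(next_pixel);          /* grab the next task */
        if (p >= width * height)
            return;                               /* no work left       */

        float cx = ((float)(p % width)  / width)  * 3.5f - 2.5f;
        float cy = ((float)(p / width)  / height) * 2.0f - 1.0f;

        float x = 0.0f, y = 0.0f;
        uint it = 0;
        while (x * x + y * y < 4.0f && it < max_iter) {
            float xn = x * x - y * y + cx;
            y = 2.0f * x * y + cy;
            x = xn;
            ++it;
        }
        image[p] = it;
    }
}

With this pattern the number of resident work-items can be tuned independently of the image size, which is exactly what realhet varied (from 1 thread up to 32768).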


> Also the 4 SIMD units will work on the same wavefronts, but there can be 4 different program execution paths. You can try to set wavefront size to 16, and everything will be 4x slower.

No, the 4 SIMD units work on different wavefronts.  Each SIMD works on 64 threads at a time, just like on previous chips.  This can easily be verified by dispatching a single wavefront per CU, not 1/4 of a wavefront per workgroup.

If you are ALU bound, then you only need 4 wavefronts per CU to keep things busy.  Just make sure you are *really* ALU bound.  For example, if you dispatch 4 wavefronts per CU for the whole dispatch, then it's easy to become instruction cache limited, unless your code does lots of tight loops inside the cache.  Unrolled loops are more likely to cause instruction cache misses.

We are working on updating the APP SDK Programmer's guide to give more details for GCN-based chips.
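
(To make the instruction-cache point concrete, here is a hedged OpenCL C sketch: the same computation once as a tight rolled loop and once manually unrolled by 4. The kernels and the unroll factor are illustrative assumptions, not taken from the thread; the rolled version keeps the loop body small enough to stay resident in the instruction cache, while aggressive unrolling multiplies the code size.)

/* Rolled: a few instructions in the loop body, friendly to the I-cache. */
__kernel void axpy_rolled(__global float *y, __global const float *x,
                          float a, uint n_per_item)
{
    uint base = get_global_id(0) * n_per_item;
    for (uint i = 0; i < n_per_item; ++i)
        y[base + i] = a * x[base + i] + y[base + i];
}

/* Unrolled by 4 (n_per_item assumed to be a multiple of 4): roughly 4x the
   code in the loop body, so large kernels built this way miss more often. */
__kernel void axpy_unrolled4(__global float *y, __global const float *x,
                             float a, uint n_per_item)
{
    uint base = get_global_id(0) * n_per_item;
    for (uint i = 0; i < n_per_item; i += 4) {
        y[base + i + 0] = a * x[base + i + 0] + y[base + i + 0];
        y[base + i + 1] = a * x[base + i + 1] + y[base + i + 1];
        y[base + i + 2] = a * x[base + i + 2] + y[base + i + 2];
        y[base + i + 3] = a * x[base + i + 3] + y[base + i + 3];
    }
}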

Thanks for "busting my myth" and making things clear.

So it's a requirement to have 4*64 work-items/CU, and the VReg usage just has to allow that in a *WEIRD way.

Now it also became clear to me how the data-dependent instructions are handled: while the CU is still calculating work-items 48..63 of the first instruction, it can start to calculate work-items 0..15 of the second instruction.

I also had the false theory that each SIMD can check the relevant parts of the exec register, and if all 16 bits are zero it will execute something else from another wavefront. (Somehow I thought that conditional execution is either 0 or 16, like on the VLIW chips, but it's now in a 0-or-64 style controlled by the s_cbranch_execz instruction.)

Unroll vs. instruction cache -> I've been aware of that since the HD 4xxx; it's really important.

* Having a million wavefronts, I did some tests in the past trying to understand how VReg usage works. I tested 1000-instruction-long unrolls with different instructions:

- 4 small v_ instructions plus 1 small s_ instruction -> I can use all 256 VRegs

- 4 big v_ instructions interleaved with 3 big s_ instructions -> I can use only 64 VRegs; if I use 256, it becomes 4x slower.


> So it's a requirement to have 4*64 workitems/CU, and the VReg usage just have to allow that in a *WEIRD way.

Each SIMD has its own VGPRs and SGPRs. So if you have 4 wavefronts per CU, then each wavefront will have the full complement of registers available. This is documented in the AMD SDK Programmer's guide. Of course, if you schedule more than four wavefronts per CU, then as many wavefronts as can be allocated based on resource limitations will be scheduled to each CU.


Thanks, Jeff, for the explanation. That was also the way I assumed things were working.

jeff_golds wrote:

If you are ALU bound, then you only need 4 wavefronts per CU to keep things busy.  Just make sure you are *really* ALU bound.  For example, if you dispatch 4 wavefronts per CU for the whole dispatch, then it's easy to become instruction cache limited, unless your code does lots of tight loops inside the cache.  Unrolled loops are more likely to cause instruction cache misses.

Alright, but as I said, I am mainly memory bound, and so I'm still wondering if I can run 40 work-groups, each consisting of 1 wavefront, on 1 CU, or if I have to run fewer work-groups of larger size (e.g. 10 work-groups, 256 work items each) to get optimal latency hiding.

jeff_golds wrote:

We are working on updating the APP SDK Programmer's guide to give more details for GCN-based chips.

Great. That would be very helpful.

best regards,

Ralph


rwelsch wrote:
Alright, but as I said, I am mainly memory bound, and so I'm still wondering if I can run 40 work-groups, each consisting of 1 wavefront, on 1 CU, or if I have to run fewer work-groups of larger size (e.g. 10 work-groups, 256 work items each) to get optimal latency hiding.
If you schedule 40 workgroups of one wavefront each to each CU, then that will completely fill all the resources, assuming you are not limited by other factors (such as local memory usage, registers, etc.). Another way to fill the CU is to schedule 10 workgroups of four wavefronts per CU.
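
(A hedged host-side sketch of the two launch shapes in this answer: 64-work-item groups, i.e. one wavefront per group, versus 256-work-item groups, i.e. four wavefronts per group. The queue/kernel handles and the 32-CU count of a Tahiti-class GPU such as the HD 7970 are assumptions for illustration.)

#include <CL/cl.h>

/* Launch enough work to give every CU its full complement of 40 wavefronts,
   either as 40 one-wavefront groups or as 10 four-wavefront groups per CU. */
static cl_int launch(cl_command_queue queue, cl_kernel kernel, int small_groups)
{
    const size_t num_cus = 32;                       /* e.g. HD 7970 (Tahiti) */
    const size_t local   = small_groups ? 64 : 256;  /* 1 or 4 wavefronts     */
    const size_t global  = num_cus * 40 * 64;        /* 40 wavefronts per CU  */

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}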
lily33
Journeyman III

Good tips, thanks for sharing!
