HD 7970 LDS Bank Conflicts

Hello everyone,

I want to write a kernel that uses 4KB of local memory per work item. Since only 32KB of LDS is available per work group, the work group size can be at most 8. Now, if I were to use the full 64KB of LDS, I would have to schedule 2 work groups per CU. But if I want to eliminate or reduce LDS bank conflicts, do I have to consider both work groups while writing the code? Or are the work groups' memory accesses independent of each other, i.e. does the local memory access pattern of one work group not affect the other work group in the CU?

Also note that each work group will only use 8 of the 32 available channels. Will the second work group in the CU use the same set of channels as the first work group, or an entirely different set?
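For concreteness, the layout I have in mind looks roughly like this (a simplified sketch, not my actual kernel; the names SLICE, scratch and per_item_scratch are placeholders):

// Sketch only: a work group of 8, each work item owning a contiguous
// 1024-float (4KB) slice of LDS, for 8 * 4KB = 32KB per work group.
#define SLICE 1024                               /* 1024 floats = 4KB     */

__kernel __attribute__((reqd_work_group_size(8, 1, 1)))
void per_item_scratch(__global const float *in, __global float *out)
{
    __local float scratch[8 * SLICE];            /* 32KB of LDS           */
    int lid = get_local_id(0);
    __local float *mine = scratch + lid * SLICE; /* this work item's 4KB  */

    // Each work item fills and then works on its own slice.
    for (int j = 0; j < SLICE; ++j)
        mine[j] = in[get_global_id(0) * SLICE + j];
    barrier(CLK_LOCAL_MEM_FENCE);

    out[get_global_id(0)] = mine[0];             /* placeholder result    */
}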

Please Reply.

Thanks,

Sayantan

11 Replies
sh2

Local memory is not supposed to be used that way. If you dispatch fewer than 256 work items per CU, the GPU will be underutilized anyway, so bank conflicts are not important in this case.


sh, are you sure about that 256? I might have missed that, but anyway I think wavefronts are still 64 work items on the 7970. Thus 64 work items per CU should be enough to utilize the hardware, yet not enough to hide latencies (but hiding latencies has nothing to do with workgroup size, more to do with NDRange sizing).


gat3way wrote:

sh, are you sure about that 256? I might have missed that, but anyway I think wavefronts are still 64 work items on the 7970. Thus 64 work items per CU should be enough to utilize the hardware, yet not enough to hide latencies (but hiding latencies has nothing to do with workgroup size, more to do with NDRange sizing).

That is not true. In the "Graphics Core Next" architecture one compute unit is built out of 4 SIMD units (with 16 cores each), and a wave-front is only executed on one SIMD unit, so one needs to schedule 4 wave-fronts for each compute unit to keep the hardware busy (and more to hide latency).
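For the arithmetic (a back-of-the-envelope sketch; the constants are the ones quoted in this thread, not queried from the device):

/* One GCN CU: 4 SIMD units, each executing 64-wide wavefronts. */
#define SIMDS_PER_CU      4
#define WAVEFRONT_SIZE    64
/* Minimum work items per CU just to give every SIMD unit one wavefront:
 * 4 * 64 = 256; more are needed to actually hide latency. */
#define MIN_ITEMS_PER_CU  (SIMDS_PER_CU * WAVEFRONT_SIZE)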

Hi rwelsch,

I read about that in the APP SDK guide, yet one thing is really unclear to me: in that case, specifying a workgroup size of 256 would be ideal, as it would guarantee the best occupancy. Yet almost any ALU-intensive kernel I have runs noticeably faster (around 5-6% faster on a 7970) when I use a workgroup size of 64 instead of 256, provided that my NDRange is large enough and my GPR usage is low enough that I can have enough wavefronts per CU. This is counter-intuitive in a way; one would expect the best occupancy with 256 work items per group. I am really wondering what the explanation is.


At the moment I don't see why it is counter-intuitive. If you have a large enough NDRange, you can get the best occupancy with a workgroup size of 64 (then you can schedule 40 work-groups on each CU) and with a workgroup size of 256 (then you can schedule 10 work-groups on each CU). But maybe I'm missing something.
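In numbers (a sketch of the figures above, assuming the usual GCN limit of 10 wavefronts per SIMD unit, i.e. 40 per CU):

/* 64-wide wavefronts, at most 40 resident per CU (10 per SIMD unit). */
#define MAX_WAVES_PER_CU        40
#define WAVES_PER_GROUP(wg)     ((wg) / 64)
#define GROUPS_PER_CU(wg)       (MAX_WAVES_PER_CU / WAVES_PER_GROUP(wg))
/* GROUPS_PER_CU(64) == 40, GROUPS_PER_CU(256) == 10 */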

To my understanding each wave-front is really handled independently by the scheduler on each CU. So if you have 1 workgroup with 4 wavefronts it should be the same as 4 workgroups with 1 wavefront each.

It may even be better to have 1 wave-front per workgroup, because when using barriers one does not have synchronisation problems between the wave-fronts (that may explain your speedup). Or the generated code is better optimized, or something like that.


Well, I was thinking that using a larger workgroup would somehow reduce the wavefront scheduling latency (but if those 4 SIMD units operate independently, that would not be the case; still I am wondering how local memory is kept consistent between those 4 wavefronts within a workgroup). Most of my ALU-intensive kernels do not involve any local memory usage or barriers, so I believe it doesn't have anything to do with barriers (unless the compiler implicitly inserts them for some reason; I need to check the ISA dump).


gat3way wrote:

still I am wondering how local memory is kept consistent between those 4 wavefronts within a workgroup

I always thought that local memory does not have to be consistent between threads unless one uses barriers.


This is why we should be careful when we use the term "thread" for something that isn't really a thread. Barriers are of course needed between threads because threads are independent, but threads on the 7970 are really wavefronts (we often use "thread" to mean work item, but it's a question of whether you look at programmer visibility or actual execution). In the OpenCL spec, barriers are needed for any communication between work items, but these barriers may be dropped (by the compiler, preferably) for workgroups of size 64.
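For illustration, a hedged sketch (not from the original posts) of what that looks like: a small work-group reduction with the spec-required barrier, written so the whole group is a single wavefront on the 7970:

// Hypothetical example: the barrier is required by the OpenCL spec for any
// communication through __local memory; with reqd_work_group_size(64,1,1)
// the group is one wavefront on a 7970, so the compiler may drop it.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void reduce64(__global const float *in, __global float *out)
{
    __local float tmp[64];
    int lid = get_local_id(0);
    tmp[lid] = in[get_global_id(0)];

    for (int stride = 32; stride > 0; stride >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);   /* spec-required; droppable for 1 wave */
        if (lid < stride)
            tmp[lid] += tmp[lid + stride];
    }
    if (lid == 0)
        out[get_group_id(0)] = tmp[0];
}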

To the original question: workgroups of size 64 are usually the best choice unless you have a good reason to do something else, because that allows barriers to be dropped and gives maximum flexibility to the scheduler. Synchronisation between wavefronts may be performed across the CU, and I *think* that barriers are at the CU level, hence the scheduler may, if you do have a 256-WI work group, schedule one wave on each SIMD unit within the CU. Given that synchronisation, memory consistency of LDS is achieved in the same way as on previous architectures. You certainly need four waves at a minimum to fill the CU, and really more, because while a wave is executing a scalar instruction or memory operation it is not executing a vector instruction, and hence that SIMD unit would lie idle. The scheduler will issue a scalar instruction, a vector instruction, a memory instruction and others simultaneously, but it will do so from different waves.

Hey,

thanks Lee for the clarification. So within a wavefront the LDS is consistent, but between wave-fronts one needs barriers for synchronisation.

But now coming back to the original question about LDS bank conflicts: have you tried using the profiler to nail down whether there are any LDSBankConflicts when multiple work-groups have the same access pattern?

The last slide of this presentation may be helpful: http://developer.amd.com/afds/assets/presentations/2620_final.pdf. But as it is quite old, I'm not sure whether it is still up to date.


LDS bank conflicts are only related to accesses within a workgroup. Any workgroup scheduled on a CU will have a part of the LDS reserved for it. If the LDS requirement of each workgroup is larger, then fewer workgroups can be scheduled on a CU.

Section 4.11.2.3: In addition to registers, shared memory can also serve to limit the active wavefronts/compute unit. Each compute unit has 32k of LDS, which is shared among all active work-groups. LDS is allocated on a per-work-group granularity, so it is possible (and useful) for multiple wavefronts to share the same local memory allocation. However, large LDS allocations eventually limits the number of workgroups that can be active.
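Applied to the numbers in this thread (a sketch; the 64KB-per-CU figure is the HD 7970's, while the quoted guide section describes a 32KB part):

/* LDS-limited occupancy for the questioner's case (assumed figures). */
#define LDS_PER_CU         (64 * 1024)   /* HD 7970 LDS per CU           */
#define LDS_PER_WORKGROUP  (32 * 1024)   /* 8 work items * 4KB each      */
/* At most LDS_PER_CU / LDS_PER_WORKGROUP = 2 work-groups can be resident
 * per CU, i.e. only 2 partial wavefronts -- far below the 40-wave limit. */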


Bank conflicts are a half-wavefront problem, not a workgroup or even a full-wavefront problem. The way it works is that every cycle, 16 lanes of requests are made by one of SIMD 0 and 1 and by one of SIMD 2 and 3 in the CU. Those requests are serviced as 32 lanes per cycle by the LDS interface, and hence conflicts (I think) can occur across those 32 lanes and 32 banks.

Of course, bank conflicts that delay one wave can affect another wave on another SIMD unit in the CU because they create a pipeline bubble.
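A minimal sketch of the bank arithmetic being described (assuming 32 banks, each 4 bytes wide; not vendor code, just the mapping):

/* Bank index for an LDS byte address, assuming 32 banks x 4 bytes. */
uint lds_bank(uint byte_addr)
{
    return (byte_addr >> 2) & 31;        /* dword address modulo 32 banks */
}
/* If each work item's 4KB is a contiguous slice (as sketched in the
 * question), lane i reads float index i*1024 + j, and (i*1024 + j) % 32
 * is the same for every lane (1024 % 32 == 0): all lanes hit one bank.
 * Interleaving the slices (index j*8 + lane) spreads lanes across banks. */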
