
gpgpucoder
Journeyman III

relative importance of bank conflicts in local vs global per optimization guide

Hello, something is not clear to me from going through the optimization guide. My question is whether bank and channel conflicts are primarily a concern when using OpenCL local memory, or whether they also apply to global memory? Also, from the optimization guide, is any of the bank-conflict tuning advice less important on SI-type hardware?

1 Solution

With access to global memory you may encounter bank and channel conflicts.

With LDS only bank conflicts are relevant. LDS access is the simpler case, there are 32 banks interleaved by 1 DWORD granularity.

If all threads of a wavefront read the same address there is no issue, since the HW can broadcast the data to all threads. We encounter bank conflicts when several threads read from the same bank but at different addresses. In other words, the following expression determines which bank is going to be used for an address: (address >> 2) % 32.

Tzachi


6 Replies
jason
Adept III

Based on what I've found on GCN cards, stick to words (uint32/int32) and the conflicts mostly go away for most common problems in global memory and local memory. From what I've read, and it seems to apply, there are techniques for distributing memory accesses over all channels, and they do make improvements, but these always seem to refer solely to global memory.

For local memory I was not able to figure out if anything other than making sure to use uint32/int32 can make a difference despite some hours of trying things.

Thanks, I'm crunching 4-byte floats. Had you tried those as well, in the same way as 4-byte ints, and not seen similar effects?



Thanks, Tzachi.

In my first rev of what I'm developing, I used if (get_local_id(0)==0 && get_local_id(1)==0) to pull my data into a __local cache buffer, and all other threads stall at a memory barrier while this initialization occurs. The initialization is two for-loops over a rectangular data set. At this point my functionality seems pretty solid, and I can use the data in the way I expect.

Now I want to rework this to let each work-item do its part in loading the __local buffer, as I believe my current approach loses performance compared to other practices. That said, I haven't yet seriously studied it with CodeXL, and I will not be surprised to see bank conflicts. Before I go much further, I wonder if you have some advice on a good approach; I am also keeping your previous advice in mind.

I first want to try setting all the work-items free. The size of the local cache buffer is proportional to my workgroup size, but it is somewhat larger, and there is not a 1-to-1 correspondence between the work-items and the data. So some work-items will have to do extra work around the edges of the rectangle, and each work-item may grab more than one element depending on a parameter to the kernel. The left and right edges I don't expect to be too difficult. The top and bottom edges are making me ponder: there could be quite a few rows that need to be grabbed. In this case I'm thinking the work-items in the top and bottom rows will do that work. What about the top work-item, for instance, subsequently initializing data[COLXY], data[COLXY-1], data[COLXY-2]?

My other potential approach is calling one or more of the async_work_group_*copy* functions.

Any advice where to go first or approaches to avoid?

Thanks.


After some profiling, I'm only seeing 0.5% LDSBankConflict in CodeXL. Do I need to chase that? And, as hoped, before -> after: MemUnitStalled 15% -> 4.2%; MemUnitBusy 86% -> 57%. That seems a good result to me.


0.5% is excellent, I would not spend time trying to optimize it.
