abc
Journeyman III

Question: OpenCL bank conflicts and AMD's 64-thread wavefront size

DISCLAIMER: I am using a Gizmo Board 2 and I'm not sure of all the specs of Kabini. I assume the 64k LDS cache equates to local memory and that this cache is divided into 32 four-byte-wide banks, each 512 entries of 4 bytes deep. I also assume the wavefront size is 64 threads and that each compute unit processes 64 threads at a time, with the exception of one of them, since the maximum number of shader cores is 80. I assume each compute unit has a 16x4 wavefront thread processing arrangement, except one which is 16x1, because 16 + 64 = 80 and that matches the number of shader cores.

So originally I thought bank conflicts could only happen if you dealt with variables that weren't a multiple of 32 bits, and if you were explicitly accessing the same position in local memory within the same wavefront, but that appears not to be the case. It appears that this actually applies to every access within the same bank column within the same wavefront. This confuses me: it would mean that if I was doing a simple global-to-local memory copy for the sake of increasing performance, with each thread in a wavefront copying one index from global memory to local memory, I would have no choice but to induce a bank conflict, e.g.:

Kernel example( global buffer, local localbuff)
{
     localbuff[local_id] = buffer[global_id]
     ....
}

This would induce 2 bank conflicts per bank, or at least in a simplified view it would, because when reading the AMD documentation here: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/open...  it states that some GPUs are capable of doing two 4-byte accesses per cycle, but not really, because they then backtrack oddly, saying that due to the number of instructions a thread can execute at a time it isn't actually possible to do this (really confusing wording on their part). What are they actually trying to say here?

In my code I do something like this:

Kernel example( global buffer, local localbuff)
{
     localbuff[local_id * 2] = buffer[global_id * 2]
     localbuff[(local_id * 2) + 1] = buffer[(global_id * 2) + 1]
     ....
}

Would doing this instead:

Kernel example( global buffer, local localbuff)
{
     localbuff[local_id] = buffer[global_id]
     localbuff[local_id + local_size] = buffer[global_id + global_size]
     ....
}

actually result in a performance increase, given that my goal is to get 256 4-byte integers into my local memory, my local size is 128, and my global size is half my data size n (a power of two)?

Finally, AMD also says that with the 32 bank divisions, the part of the address that determines which bank a byte is placed in is bits 6:2, i.e. the 2nd through 6th bits. I guess that makes sense: the lowest two bits correspond to the byte position within a bank's 4-byte word, and the 5 bits above them address the 2^5 = 32 banks. Am I correct in assuming that this shouldn't affect how you index to avoid bank conflicts, though? Indexing into a local buffer of integers, index 0 would access bank 0, 32 would access bank 0, 63 would access bank 31, 129 would access bank 1, etc., correct?
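To make sure I'm reading that right, here is a tiny host-side sketch of the mapping as I understand it (my own illustration, not from AMD's docs; bank_of_word is just a hypothetical helper name):

/* My reading of the addressing: bits 1:0 select the byte within a bank's
   4-byte word, and bits 6:2 select one of the 32 banks. */
unsigned bank_of_word(unsigned word_index)
{
    unsigned byte_address = word_index * 4;   /* indexing 4-byte integers */
    return (byte_address >> 2) & 31;          /* bits 6:2 -> bank 0..31 */
}

/* bank_of_word(0) == 0, bank_of_word(32) == 0,
   bank_of_word(63) == 31, bank_of_word(129) == 1 */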

Additionally, they seem to imply local memory isn't LDS, which is even more confusing because everyone appears to consider local memory to be LDS, and people talk about bank conflicts there all the time.

Finally, and this is really annoying, people everywhere seem to talk about half warps/wavefronts and half warp/wavefront accesses, regardless of GPU. Is there something I'm missing here? Half warps aren't a special physical hardware feature, are they? They don't actually have specific properties, right? (I've looked everywhere and it doesn't appear to be an actual thing; here's an example of a mention: http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming )

EDIT: I've found people claiming that when two different threads of the same wavefront read or write the same 4-byte word of a bank there is no bank conflict, but that if the actual column address is different a conflict does occur. Is this true?

1 Solution
maxdz8
Elite

It appears to me you have several misunderstandings about what's going on here.

DISCLAIMER: I am using a Gizmo Board 2, and I'm not sure of all the specs of Kabini, I assume 64k LDS cache

There's no such thing as an "LDS cache". A cache is one thing; the LDS is another. The LDS is just memory, while a cache holds a copy of global memory that is automatically updated according to some policy.

Besides, for OpenCL you have access to only 32 KiB AFAIK (I don't know whether that has changed with recent drivers, but I don't think so).

I assume each compute unit has 16x4 wave front thread processing arrangement, except one is 16x1 because 16 + 64 is 80 and that matches up with the number of shader cores.

As you have noted in your later post, each SIMD unit actually processes 16 work-items per clock for 4 successive clocks. For future reference, note that at the manufacturing level the kind of division you mention makes no sense: in general, when something is defective, the whole CU is culled.

So originally I thought bank conflicts could only happen if you dealt with variables that weren't a multiple of 32 bits, and if you were explicitly accessing the same position in local memory within the same wavefront, but that appears not to be the case.

Not at all (at least by my reading of English). If you think of your LDS as a 32xN matrix of dwords, conflicts happen when you hit the same column. Reading from exactly the same address is a fast operation, as the LDS has broadcast functionality... but AFAIK all 64 WIs must access the same address.

(I'd like someone to check this last statement... perhaps it's all the WIs in a clock? In half a wavefront? I am being conservative).
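To illustrate what I mean by hitting the same column, here is a quick sketch of mine (assuming 32 banks of 4-byte words and a 64-WI work-group; conflict_demo and out are made-up names):

__attribute__((reqd_work_group_size(64, 1, 1)))
kernel void conflict_demo(global int *out)
{
    local int rowwise[64];          // one dword per WI
    local int columnwise[64 * 32];  // 32-dword stride between WIs

    size_t lid = get_local_id(0);

    rowwise[lid] = (int)lid;          // stride 1: consecutive WIs hit consecutive banks, no conflict
    columnwise[lid * 32] = (int)lid;  // stride 32: every WI lands in the same column (bank), conflicts

    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = rowwise[lid] + columnwise[lid * 32];
}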

Kernel example( global buffer, local localbuff)
{
     localbuff[local_id] = buffer[global_id]
     ....
}

This would induce 2 bank conflicts per bank

It's impossible to conclude this, as the example is untyped. That's leaving aside the fact that you forgot the pointer declarations, and that I have no idea how the linear global_id and local_id are obtained, although for the sake of being concise I'll make reasonable assumptions.
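Just to pin down those assumptions, here is one typed reading of your snippet (int data, 1D linear IDs; the declarations and the barrier are my additions, not yours):

kernel void example(global const int *buffer, local int *localbuff)
{
    size_t global_id = get_global_id(0);   // linear 1D global id
    size_t local_id  = get_local_id(0);    // linear 1D id within the work-group

    localbuff[local_id] = buffer[global_id];
    barrier(CLK_LOCAL_MEM_FENCE);          // make the copy visible to the whole work-group
    // ...
}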

... or at least in a simplified view it would, because when reading the AMD documentation here: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/open...  it states that some GPUs are capable of doing two 4-byte accesses per cycle, but not really, because they then backtrack oddly, saying that due to the number of instructions a thread can execute at a time it isn't actually possible to do this (really confusing wording on their part). What are they actually trying to say here?

I also found that difficult and somewhat contradictory. A user really proficient in GCN ISA tried to explain it, but honestly I'm not sure I understood. I support your request for a cleanup/rewording.

actually result in a performance increase, given that my goal is to get 256 4-byte integers into my local memory, my local size is 128, and my global size is half my data size n (a power of two)?

I can say from experience that NPOT strides are quite convenient and pretty much never 32-dword aligned. By contrast, using power-of-two strides basically guarantees you'll wrap on the 32-dword boundary. Again, for such a synthetic benchmark I don't see much point in talking about performance. In general, LDS layout is not as easy as "I'll just add +1" or "let's scale everything by 2".
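For reference, this is the kind of "+1" stride padding I'm referring to (again my own sketch, assuming 32 banks of 4-byte words and a 64-WI work-group; not a recipe to copy blindly):

__attribute__((reqd_work_group_size(64, 1, 1)))
kernel void stride_demo(global float *out)
{
    local float tile_pow2[64 * 32];    // row stride of 32 dwords (power of two)
    local float tile_padded[64 * 33];  // row stride padded to 33 dwords (NPOT)

    size_t lid = get_local_id(0);

    // walking down a column: with the 32-dword stride every WI wraps onto the
    // same bank; with the padded 33-dword stride consecutive WIs spread out
    tile_pow2[lid * 32] = 0.0f;
    tile_padded[lid * 33] = 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = tile_pow2[lid * 32] + tile_padded[lid * 33];
}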

Finally, AMD also says that with the 32 bank divisions, the part of the address that determines which bank a byte is placed in is bits 6:2, i.e. the 2nd through 6th bits. I guess that makes sense: the lowest two bits correspond to the byte position within a bank's 4-byte word, and the 5 bits above them address the 2^5 = 32 banks. Am I correct in assuming that this shouldn't affect how you index to avoid bank conflicts, though? Indexing into a local buffer of integers, index 0 would access bank 0, 32 would access bank 0, 63 would access bank 31, 129 would access bank 1, etc., correct?

No, because there's no guarantee a generic LDS buffer is allocated starting from bank 0.

In my experience, they are concatenated, so if you have two buffers, local uint blah[19] and local uint meh[32], then meh[0] is on bank 19.
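Concretely (my own sketch of that layout; the bank numbers assume the first buffer happens to start at bank 0, which as said above is not guaranteed):

__attribute__((reqd_work_group_size(16, 1, 1)))
kernel void layout_demo(global uint *out)
{
    // two LDS buffers, allocated back to back
    local uint blah[19];   // blah[0] -> bank 0, ..., blah[18] -> bank 18
    local uint meh[32];    // meh[0]  -> bank 19, meh[12] -> bank 31, meh[13] wraps to bank 0

    size_t lid = get_local_id(0);
    blah[lid] = (uint)lid;             // just touching both buffers
    meh[lid]  = (uint)lid;
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = blah[0] + meh[0];
}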

However, this does not affect the modulo arithmetic of the buffer addressing, so in terms of access conflicts you are correct. Sort of.

Additionally they seem to imply Local memory isn't LDS, which is even more confusing because every one appears to consider local LDS and people talk about bank conflicts there all the time.

Do they? Where? Maybe they consider "LDS" to be the whole physical construct, including the access dispatchers.

Half warps aren't a special physical hardware feature, are they?

I won't blame you for being confused. There are really four entities to consider:

  1. the 1/4 WF running at the same clock;
  2. the 1/2 WF involved in LDS bank conflicts (note: your GPU might still be able to not stall even when LDS access collide);
  3. the WF running on a tick (4 clocks) for native sharing of LDS and no-sync;
  4. the full Work Group.

Half wavefronts absolutely exist when it comes to the LDS controller (and perhaps other parts we don't know about as well). They don't exist anywhere else, so their existence is debatable in general.

That's of course for GCN.

EDIT: I've found people claiming that when two different threads of the same wavefront read or write the same 4-byte word of a bank there is no bank conflict, but that if the actual column address is different a conflict does occur. Is this true?

Where? I write my addressing so that the column is always different within the same half-WF, and I observe 0 conflicts. Two WIs from the same WF can indeed do that with no conflict if they come from different half-WFs.


dipak
Big Boss

Regarding LDS bank conflicts, I would like to share the following statements from an old thread:

by LeeHowes on 07-Jun-2012 22:49 (Re: HD 7970 LDS Bank Conflicts )

"Bank conflicts are a half-wavefront issue problem, not a workgroup or even full wavefront problem. The way it works is that every cycle 16 lanes of requests are made by both one of SIMD 0 and 1 and one of or SIMD 2 and 3 in the CU. Those requests are serviced as 32-lanes per cycle by the LDS interface and hence conflicts (I think) can occur accross those 32 lanes and 32 banks."

For more detail about the LDS and wavefronts, you may check this document: https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

Regards,

abc
Journeyman III

While this doesn't answer all my questions, it is exactly what I wanted to know about bank conflicts. I read the AMD optimization guide and started to realize halfway through that the GCN architecture doesn't process all 64 stream processor commands in one cycle, but it wasn't explicit about what actually happens to prevent bank conflicts if each one accesses one bank. I've also read that banks aren't just 4 bytes wide in GCN local memory, which is why consecutive float2 accesses supposedly shouldn't cause bank conflicts but float4 will; I've found threads here about that, but they never explain the bank width, whereas on virtually all NVIDIA cards it is simply 32 bits, so it's simple to understand.
