
OpenCL bank conflicts and AMD's 64-thread wavefront size

Question asked by abc on Jul 28, 2015
Latest reply on Aug 7, 2015 by maxdz8

DISCLAIMER: I am using a Gizmo Board 2, and I'm not sure of all the specs of Kabini. I assume the 64 KB LDS is what local memory accesses go to, and that it is divided into 32 banks, each 4 bytes wide and 512 words deep (32 x 512 x 4 bytes = 64 KB). I also assume the wavefront size is 64 threads and that each compute unit processes 64 threads at one time, with the exception of one of them, since the maximum number of shader cores is 80. That is, I assume each compute unit has a 16x4 wavefront thread processing arrangement, except one that is 16x1, because 64 + 16 is 80 and that matches the number of shader cores.

 

So originally I thought bank conflicts could only happen if you dealt with variables whose size wasn't a multiple of 32 bits, and if you were explicitly accessing the same position in local memory within the same wavefront, but that appears not to be the case. It appears that this actually applies to every access to the same bank column within the same wavefront. This confuses me: it would mean that a simple copy from global to local memory for the sake of increasing performance, with each thread in a wavefront copying one element, would have no choice but to induce a bank conflict, since 64 threads share 32 banks. For example:

 

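__kernel void example(__global const int *buffer, __local int *localbuff)
{
     size_t local_id = get_local_id(0);
     size_t global_id = get_global_id(0);
     // each work-item copies one 4-byte element from global to local memory
     localbuff[local_id] = buffer[global_id];
     // ....
}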

 

This would induce a 2-way bank conflict on every bank, or at least in simplified terms it would. When I read the AMD documentation here: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guid… it states that some GPUs are capable of doing two 4-byte accesses per cycle, but then it backtracks oddly, saying that due to the number of instructions a thread can execute at a time it isn't actually possible to do this (really confusing wording on their part). What the heck are they actually trying to say here?

 

In my code I do something like this:

 

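__kernel void example(__global const int *buffer, __local int *localbuff)
{
     size_t local_id = get_local_id(0);
     size_t global_id = get_global_id(0);
     // each work-item copies two adjacent elements (stride-2 indexing)
     localbuff[local_id * 2] = buffer[global_id * 2];
     localbuff[(local_id * 2) + 1] = buffer[(global_id * 2) + 1];
     // ....
}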

 

Would doing this:

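__kernel void example(__global const int *buffer, __local int *localbuff)
{
     size_t local_id = get_local_id(0);
     size_t global_id = get_global_id(0);
     size_t local_size = get_local_size(0);
     size_t global_size = get_global_size(0);
     // each work-item copies two elements that are one full block apart
     localbuff[local_id] = buffer[global_id];
     localbuff[local_id + local_size] = buffer[global_id + global_size];
     // ....
}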

 

actually result in a performance increase if my goal is to get a total of 256 4-byte integers into my local memory, my local size is 128, and my global size is half my data size, n (a power of two)?
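One way to reason about the two copy patterns is to count, on the host, how many accesses from one group of lanes fall into the same bank. The sketch below is plain C, not kernel code, and it rests on assumptions that are not confirmed for Kabini: 32 banks of 4-byte words, consecutive words mapping to consecutive banks, and conflicts being checked per group of 32 lanes (half a wavefront). The name worst_bank_load and its parameters are made up for this illustration.

```c
#include <assert.h>

#define NUM_BANKS 32   /* assumption: 32 banks, each 4 bytes wide */

/*
 * Worst-case number of accesses that land in a single bank when `lanes`
 * consecutive work-items each touch the 4-byte word at index
 * lane * stride + offset.
 * A result of 1 means conflict free; k means a k-way conflict.
 */
static int worst_bank_load(int lanes, int stride, int offset)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;

    assert(lanes > 0 && stride > 0 && offset >= 0);

    for (int lane = 0; lane < lanes; ++lane) {
        int word = lane * stride + offset;  /* index into the int buffer */
        int bank = word % NUM_BANKS;        /* consecutive words, consecutive banks */
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under those assumptions, worst_bank_load(32, 1, 0) is 1 for the localbuff[local_id] pattern, and the second store at offset local_size = 128 behaves the same way, while worst_bank_load(32, 2, 0) is 2 for the localbuff[local_id * 2] pattern. If all 64 lanes were checked together instead, even the stride-1 pattern would give 2. This only counts bank collisions in a simplified model; it says nothing about measured performance.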

 

Finally, AMD also says that with 32 banks, the bank is selected by bits 6:2 of the byte address (the 2nd through 6th bits). I guess that makes sense: the lowest two bits correspond to the byte position within a bank's 4-byte word, and the 5 bits above them address the 2^5 = 32 banks. Am I correct in assuming that this shouldn't affect how I index to avoid bank conflicts, though? Indexing into a local buffer of integers, position 0 would access bank 0, 32 would access bank 0, 63 would access bank 31, 129 would access bank 1, etc., correct?
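The bit selection described above can be written out as a small host-side C sketch. It assumes exactly what the paragraph assumes (32 banks, 4-byte words, bank taken from byte-address bits 6:2) and that the local buffer starts at byte address 0; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32 banks of 4-byte words, bank selected by address bits 6:2. */
static unsigned bank_of_byte_address(uint32_t byte_addr)
{
    /* Drop bits 1:0 (byte within the word), keep the next 5 bits. */
    return (byte_addr >> 2) & 0x1Fu;
}

/* Bank touched by element `index` of a buffer of 4-byte integers. */
static unsigned bank_of_int_index(uint32_t index)
{
    assert(index <= UINT32_MAX / 4u);   /* index * 4 must not overflow */
    return bank_of_byte_address(index * 4u);
}
```

With this mapping, integer indices 0 and 32 both give bank 0, index 63 gives bank 31, and index 129 gives bank 1, which is the same as taking the index modulo 32.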

 

Additionally, they seem to imply that local memory isn't LDS, which is even more confusing because everyone appears to treat local memory and LDS as the same thing, and people talk about bank conflicts there all the time.

 

Finally, and this is really annoying, people everywhere seem to talk about half warps/wavefronts and half-warp/wavefront accesses, regardless of GPU. Is there something I'm missing here? Half warps aren't a special physical hardware thing, are they? They don't actually have specific properties, right? (I looked everywhere and it doesn't appear to be an actual thing; here's an example of a mention: http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming )

 

 

EDIT: I've found people claiming that no bank conflict occurs when two different threads of the same wavefront read or write the same 4-byte word in a bank, but that one does occur if they access the same bank at different addresses. Is this true?
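To make the claim in the EDIT concrete, here is a host-side C sketch that models it: accesses to the same 4-byte word are counted once, and only distinct words in the same bank count as a conflict. This models the claim as stated, it does not confirm that the hardware behaves this way. The 32-bank layout and the function name are assumptions for illustration.

```c
#include <assert.h>

#define NUM_BANKS 32   /* assumption: 32 banks, each 4 bytes wide */
#define MAX_LANES 64

/*
 * Largest number of DISTINCT 4-byte words requested from any one bank.
 * word[lane] is the word index accessed by that lane.
 * 1 means conflict free under the "same word is free" claim.
 */
static int worst_distinct_words_per_bank(const unsigned *word, int lanes)
{
    int worst = 0;

    assert(lanes > 0 && lanes <= MAX_LANES);

    for (unsigned bank = 0; bank < NUM_BANKS; ++bank) {
        unsigned seen[MAX_LANES];
        int n_seen = 0;

        for (int lane = 0; lane < lanes; ++lane) {
            int dup = 0;
            if (word[lane] % NUM_BANKS != bank)
                continue;                     /* lane hits another bank */
            for (int k = 0; k < n_seen; ++k)
                if (seen[k] == word[lane]) { dup = 1; break; }
            if (!dup)
                seen[n_seen++] = word[lane];  /* new word in this bank */
        }
        if (n_seen > worst)
            worst = n_seen;
    }
    return worst;
}
```

In this model, 32 lanes all accessing word 5 give a result of 1 (no conflict), while 32 lanes accessing words 5, 37, 69, ... (all in bank 5) give 32.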
