
firespot
Adept I

__local memory optimization

Hi,

I'd like to hear your comments on whether the suggested kernel modification can realistically (based on theory) be expected to improve the current situation, which shows a considerable bottleneck (profiled). The example here is much simplified (and changes to the real code are not so easy), but its basic conditions match the real code, which runs on a Tahiti. I try to give specific numbers wherever possible.

Consider a kernel like:

__kernel void my_kernel(__global double * Input1, __global someStruct * Input2, ...)
{
    for (int i = 0; i < m; ++i)
    {
        double Val = Input1[someIndexToBeCalculated];
        someStruct x1 = Input2[anotherIndexToBeCalculated];
        someStruct x2 = Input2[anotherIndexToBeCalculated + 1];
        ...
    }
}

Input1 is, logically speaking, a 2-D array with dimensions {n, k} in row-major layout (with k very large, several tens of thousands). A total of n * k work-items are launched, in work-groups of size 64. All work-items within a work-group of this kernel, however, access only values from a single column, i.e. at most n different values; all accesses to Input1 are thus by necessity fully non-coalesced (they are fully coalesced in other kernels, which is why the layout of Input1 is not changed). n is a program-wide constant of about 120; m varies at runtime but is on average around 100. Thus a work-group makes ~6400 accesses to Input1 although there are no more than 120 distinct values of relevance to the work-group in there.
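To make the stated access pattern concrete (an illustrative comment only; row and col are placeholders for whatever the real index calculation produces):

// Row-major layout: element (row, col) of the logical {n, k} array lives at
//   Input1[row * k + col]
// All work-items of one work-group share a single col, so their reads touch
// Input1[col], Input1[k + col], Input1[2*k + col], ... -- a stride of
// k doubles (k * 8 bytes), which cannot be coalesced.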

An object of someStruct is 40 bytes, and Input2 is 1-dimensional with about 200 values (let's call that figure q). Access is neither fully coalesced nor random; on average, though, work-items access indices within a range considerably smaller than 200 (say 10-40), often exactly the same element (accessing the same element is, IIRC, still a non-coalesced access to __global memory on a GCN device [?]).

I am now thinking of first placing all n relevant values from Input1 and all values of Input2 into local memory (as 1-D arrays of size n and q, respectively) and then using those for the accesses; a sketch follows after the list. My conclusion is that for one work-group:

-) copying the data from Input1 makes 120 non-coalesced accesses to __global memory and then ~6400 accesses to __local memory, mostly without bank conflicts; that should be much faster than 6400 plain non-coalesced __global memory accesses.

-) copying the data from Input2 makes 200 coalesced accesses to __global memory and then ~2 * 6400 accesses to __local memory, again mostly without bank conflicts; that should again be faster than 2 * 6400 semi-coalesced __global accesses.

note: using __constant memory is not an option, for several reasons.
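Roughly, the modification I have in mind looks like this (an illustrative sketch only; N_CONST/Q_CONST mirror the figures above, and k/col stand for the row stride and the work-group's column, which the real code computes):

#define N_CONST 120   // n: distinct Input1 values per work-group
#define Q_CONST 200   // q: number of someStruct elements in Input2

__kernel void my_kernel(__global double * Input1, __global someStruct * Input2, ...)
{
    __local double lInput1[N_CONST];        // one column of Input1
    __local someStruct lInput2[Q_CONST];    // all of Input2

    int lid = get_local_id(0);              // 0..63 in a 64-wide work-group

    // Cooperative copy: the 64 work-items load strided shares of the data.
    for (int i = lid; i < N_CONST; i += 64)
        lInput1[i] = Input1[i * k + col];   // 120 non-coalesced global reads
    for (int i = lid; i < Q_CONST; i += 64)
        lInput2[i] = Input2[i];             // 200 coalesced global reads

    barrier(CLK_LOCAL_MEM_FENCE);           // staging complete before first use

    for (int i = 0; i < m; ++i)
    {
        double Val = lInput1[someIndexToBeCalculated];
        someStruct x1 = lInput2[anotherIndexToBeCalculated];
        someStruct x2 = lInput2[anotherIndexToBeCalculated + 1];
        ...
    }
}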

First question: is that assumption qualitatively speaking correct? (I know that in practice nothing but trying it out will really help at the end of the day...)

Second, how much can GCN devices reduce the latency caused by non-coalesced access? AFAIK they switch wavefronts while a memory request is pending (and with n * k > 1,000,000 lots of wavefronts are running), but I suppose this is still not as efficient as accessing __local memory?

Finally, to what extent can the allocated __local memory be detrimental to performance because it competes with other resources? In total about 10 KB is needed for the stated figures (mostly for Input2). AFAIK a Tahiti has 64 KB of LDS per compute unit (with at most 32 KB usable by a single work-group), so does that mean I am just plain fine, or do multiple wavefronts run in (semi-)parallel and compete here? And can register spill-over be an issue (does spill-over go to __local or __global memory?) if each work-item also needs around 500-1000 bytes of private (per-work-item) data?

Thanks!

1 Solution
dipak
Big Boss

Hi,

Please find my answers below.


First question: is that assumption qualitatively speaking correct...


Minimizing global memory accesses is always preferable when the same global memory is accessed multiple times, especially for non-coalesced accesses. LDS is the obvious alternative for this purpose, and proper usage of LDS can boost performance significantly. One needs to take special care in the placement of data in LDS so that bank conflicts are avoided during LDS access.
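As an illustration of the placement issue (a generic sketch, not specific to your kernel): GCN LDS is organized as 32 banks of 4 bytes each, so an access pattern whose stride is a multiple of 32 dwords makes all work-items collide on one bank; a small padding breaks the pattern.

// Classic padding example: a 32x32 tile of floats accessed column-wise.
__local float tile[32][32];        // tile[lid][c]: address lid*32 + c dwords,
                                   // bank (lid*32 + c) % 32 == c for every lid
                                   // -> 32-way bank conflict
__local float tilePadded[32][33];  // tilePadded[lid][c]: bank (lid*33 + c) % 32
                                   // == (lid + c) % 32 -> all banks distinct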


Second, how much can GCN devices reduce the latency caused by non-coalesced access?


The actual numbers depend on the chip/card and its memory organization (say, the number of memory controllers, memory channels, banks, their sizes, etc.). Normally, global memory is partitioned into multiple channels and banks. If two memory access requests are directed to the same channel controller or to the same memory bank, the hardware serializes the accesses. Often, a large power-of-two stride results in such channel/bank conflicts; the size of the power-of-two stride that causes a specific type of conflict depends on the chip.
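For example (a hypothetical mapping for illustration only; the actual channel-selection bits differ between chips), suppose 8 channels are interleaved every 256 bytes:

// Hypothetical channel mapping, for illustration:
unsigned channel_of(unsigned long byteAddr)
{
    return (unsigned)(byteAddr >> 8) & 0x7;  // bits 10:8 select 1 of 8 channels
}
// Every access base + i * 2048 then lands on one channel (2048 = 8 * 256),
// so the requests serialize. A column read with a byte stride of k * 8 behaves
// this way whenever k * 8 is a multiple of the channel-interleave period.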


I suppose this is still not as efficient as accessing __local memory?


Yes, that may be the case. Say that, due to some large stride, all work-items in a wavefront access data in the same memory channel; then all the requests are serialized. Moreover, other in-flight wavefronts, possibly from the same work-group or kernel, may exhibit the same behavior. So the total number of in-flight wavefronts (on a GCN device, at most 10 in-flight wavefronts per SIMD and 40 in total per CU) may not be sufficient to hide the latency.
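To connect this to the figures in your question (rough arithmetic, taking k ≈ 50,000 for concreteness): n * k ≈ 6,000,000 work-items is ~94,000 wavefronts program-wide, but latency hiding only ever sees the wavefronts resident on a CU, i.e. at most 40. A very large global work size therefore does not by itself guarantee that serialized accesses are hidden.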


Finally, to what extent can the allocated __local memory be detrimental to performance because it competes with other resources?


Please see Table 2.2, "Effect of LDS Usage on Wavefronts/CU", in Chapter 2 of the AMD OpenCL Programming Optimization Guide. It shows how LDS usage can limit the number of wavefronts per compute unit.
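As a rough worked example with your stated figures (assuming 64 KB of LDS per CU and that a 64-work-item work-group is exactly one wavefront):

    64 KB per CU / ~10 KB per work-group = 6 resident work-groups per CU
    6 work-groups * 1 wavefront each     = 6 in-flight wavefronts (of max 40)

So the ~10 KB allocation would noticeably cap occupancy; whether that hurts in practice depends on how much latency the remaining wavefronts can still hide.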


And can register spill-over be an issue (does spill-over go to __local or __global memory?) if each work-item also needs around 500-1000 bytes of private (per-work-item) data?


Excessive register usage causes register spilling, and a part of global memory is used to store the spilled registers; spills go to __global memory (scratch), not to __local memory.

Regards,
