AnsweredAssumed Answered

__local  memory optimization

Question asked by firespot on Jan 2, 2015
Latest reply on Jan 6, 2015 by dipak



I'd like to hear your comments if the suggested kernel modification is realistically (based on theory) expected to improve the current situation which shows a considerable bottleneck (profiled); The example here is much simplified (and changes to the real code are not so easy), but it's basic conditions meet the real code, to be run on a Tahiti. I try to give specific numbers whenever possible.

Consider a kernel like:


__kernel void my_kernel(__global double * Input1, __global someStruct * Input2, ...)


for (int i = 0; i < m; ++i)


  double Val = Input1[someIndexToBeCalculated];


  someStruct x1 = Input2[anotherIndexToBeCalculated];

  someStruct x2 = Input2[anotherIndexToBeCalculated + 1];





Input1 is logically speaking a 2-D array with dimensions {n, k} in row-major outline (with k very large, several tens-thousands). A total of n * k work-items are launched, in work-group sizes of 64. All work-items within a work-group of this kernel however access only values from a given column, that is at most n different value; all accesses to Input1 are thus by necessity fully non-coalesced (they are fully coalesced in other kernels though, that's why the outline of Input 1 is not changed). n is a program-wide constant of about 120; m varies at runtime but is on average around 100; Thus a work-group makes ~ 6400 accesses to Input1 although there are not more than 120 different values of relevance for the work-group in there.

An object of someStruct has 40 bytes and Input2 is 1-dimensional of about 200 values (let's call that figure q); access is neither fully coalesced nor random, on average though work-items will access through indices within a of range considerably less than 200 (say 10-40), often exactly the same element (same element means, though IIRC, non-coalesced accessed for __global memory on a GCN device [?]).


I am now thinking of placing all n values from Input1 and all values of Input2 first into local memory (both as 1-D array of size n and q, respectively) and then use that for accesses. My conclusions is that for one work-group:

-) copying the data from Input1 makes 120 non-coalesced accesses to __global memory and then ~6400 to __local, mostly coalesced; that should be much faster than 6400 plain non-coalesced plain __global memory accesses.

-) copying the data from Input2 makes 200 coalesced accesses to __global memory and then ~ 2 * 6400 to __local, mostly coalesced; that should again be faster than 2 * 6400 semi-coalesced __global accesses.

note: using __constant memory is not an option, for several reasons.


First question: is that assumption qualitatively speaking correct (I know in practice nothing but trying it out won't really help in the end of the day ...).


Second, how much can GCN devices reduce latency due to non-coalesced access? AFAIK they switch wavefronts if a memory request is waiting (and with n * k > 1000000 lots of wavefronts are running), but I suppose this is still not as efficient as accessing __local memory?


Finally, in how far can the allocated __local memory be detrimental to performance because it may exceed other resources? In total there's about 10Kb needed for the stated figures (mostly for Input2), AFAIK a Tahiti has 32Kb per compute unit, does that mean I am just plain fine or do multiple wavefronts run in (semi-) parallel and compete here? And can register spill-over be an issue (does spill-over go to __local or __global memory?) if each work-item needs around 500-1000 bytes of __local memory?


Thanks !