Archives Discussions

cantallo · ‎07-11-2012

I want to share this "Bug" because OpenCL specs are all but clear on that point.

I wanted to deinterlace complex-valued signal and put it into a __local array of vectors (type float4)

__local float4 re[16],im[16];

__private float4 d;

(...)

d=vload4(get_global_id(0),...);

if ((get_global_id(0)&1)==0)

{

re[get_local_id(0)>>1].lo=d.even;

im[get_local_id(0)>>1].lo=d.odd:

}

else

{

re[get_local_id(0)>>1].hi=d.even;

im[get_local_id(0)>>1].hi=d.odd:

}

(...)

Coded like that, all .x .y components of the re & im arrays where uninitialized because assignment to the low (resp. high) half of a vector in local memory is in fact (onV7900) implemented as a full vector read by the thread, modification of the vector half and write of the full vector. I.e. the code above is implemented in a way similar to:

__local float4 re[16],im[16];

__private float4 d,tmp;

(...)

d=vload4(get_global_id(0),...);

if ((get_global_id(0)&1)==0)

{

tmp=re[get_local_id(0)>>1];

tmp.lo=d.even;

re[get_local_id(0)>>1]=tmp;

tmp=im[get_local_id(0)>>1];

tmp.lo=d.odd;

im[get_local_id(0)>>1]=tmp:

}

else

{

tmp=re[get_local_id(0)>>1];

tmp.hi=d.even;

re[get_local_id(0)>>1]=tmp;

tmp=im[get_local_id(0)>>1];

tmp.hi=d.odd:

im[get_local_id(0)>>1]=tmp;

}

(...)

which fails because any even thread and the next odd thread are executed simultaneously (there are on the same 1/4 wavefront).

putting a barrier on local memory between even and odd thread execution solves the problem:

__local float4 re[16],im[16];

__private float4 d;

(...)

d=vload4(get_global_id(0),...);

if ((get_global_id(0)&1)==0)

{

re[get_local_id(0)>>1].lo=d.even;

im[get_local_id(0)>>1].lo=d.odd:

}

barrier(CLK_LOCAL_MEM_FENCE);

if ((get_global_id(0)&1)==1)

{

re[get_local_id(0)>>1].hi=d.even;

im[get_local_id(0)>>1].hi=d.odd:

}

(...)

I think this point may be usefull to other:

if v is e.g. float4, v.hi is a float2 for reading but not as an lvalue

v.hi=something; is simply a short for v=select(v,(float4)(something,something),(unit4)(0,0,-1,-1));

LeeHowes · ‎07-11-2012

Reading the OpenCL and C99 specifications I'm not sure the problem is that v.hi is not an l-value in the way you want. The spec says that vectors are counted as aggregate types for aliasing according to C99 rules - and so if that were the case then arbitrary structs and arrays would surely suffer the same way.

I think the pertinent issue here is divisibility of addressing of vectors.

6.1.7 says:

It is an error to take the address of a vector element and will result in a compilation error.

and while the examples don't say this explicitly, the implication is that that is because there is no guarantee that addressing parts of the vector actually makes sense. .hi on a vector in memory would require that it could read just .hi, and while that is of course possible on the architecture, it would be less efficient to do so within the terms of the strict interpretation of the specification.

I'll try to see if this interpretation really makes sense.

gbilotta · ‎07-13-2012

Precisely: the .hi and .lo are valid l-values, but there is no guarantee that they can be accessed independently from the other half of the vector. This depends on both the vector width of the data and the vector width of the registers where it's stored.

For example, if hardware has 4-wide vector registers and your data is 8-wide, you can access .hi and .lo independently from eac other _in parallel_. If the hardware is 4-wide and the data is 4-wide, then each write access _must_ be turned into read-write, effectively the equivalent of something like a = ((a & 0xffff0000) | newlowbits) for scalar values.

What is happening in the OP's example is that he's updating the high and low half of the same vector data from two work-items concurrently, on a hardware where the vector registers are as wide as the data he's juggling. This results in only one of the write succeeding, because both happen at the same time.

Aside from the use of barriers, that will probably result in a performance hit, there are some alternatives that the OP can try to solve the issue, exploiting the fact that tthe odd and even numbered work-items will always be in the same wavefront.

Essentially, what he has to do is to ensure that the update is not happening on the same time. If the compiler is not "too smart" and does not do reordering inside the blocks, he can try something as simple as:

if ((get_global_id(0)&1)==0)
 {
 re[get_local_id(0)>>1].lo=d.even;
 im[get_local_id(0)>>1].lo=d.odd;
 }
else
 {
 im[get_local_id(0)>>1].hi=d.odd;
 re[get_local_id(0)>>1].hi=d.even;
 }

(notice how the im/re assignments are swapped in the second half); or maybe something like

if ((get_global_id(0)&1)==0)
 {
 re[get_local_id(0)>>1].lo=d.even;
 }
else
 {
 im[get_local_id(0)>>1].hi=d.odd;
 }

if ((get_global_id(0)&1)==0)
 {
 im[get_local_id(0)>>1].lo=d.odd;
 }
else
 {
 re[get_local_id(0)>>1].hi=d.even;
 }

(both are untested because I don't have an AMD card presently)

cantallo · ‎07-17-2012

No, clearly just swapping the two assignments would not work (I did not test, but I am pretty sure) because instruction in highly pipelined architectures as GPU and modern CPU have a long fly time and in this precise case if and else blocks are both executed by all threads successively (but each block instructions are masked "as no-ops" on half of the threads).

That's why we have barrier instructions (indeed, it is also because a single group may be executed on successive wavefronts in case it is larger than 64 threads).

On Nvidia hardware, where local memory is off-chip, a local memory access lasts a few cycles hence with your coding the last instruction will start before the first is completed.

gbilotta · ‎07-17-2012

cantallo wrote:
No, clearly just swapping the two assignments would not work (I did not test, but I am pretty sure) because instruction in highly pipelined architectures as GPU and modern CPU have a long fly time and in this precise case if and else blocks are both executed by all threads successively (but each block instructions are masked "as no-ops" on half of the threads).

Actually, if the GPU pipeline was really that long and the two blocks were executed (masked) sequentially, the OP would not be seeing any problem, because when the second block got to finally be executed, it would already be able to access the data updated from the first block. But this is obviously not the case, so either the local memory access has many-cycle latencies in AMD hardware (more than the block pipeline), or the compiler is doing some very smart predication usage that allows different lvalues to be updated concurrently (which is what I think is actually happening).

That's why we have barrier instructions (indeed, it is also because a single group may be executed on successive wavefronts in case it is larger than 64 threads).

Actually, it is _mostly_ to allow multi-wavefront groups to synchronize. If you have single-wavefront groups, you can get by without any barriers for local memory access, as long as you mark your __local storage volatile (I'm noticing right now that the OP didn't do that though) to prevent the compiler from optimizing consecutive accesses. I'm wondering if just marking the storage volatile would fix the OP problem, now (assuming he's using single-wavefront groups).

On Nvidia hardware, where local memory is off-chip, a local memory access lasts a few cycles hence with your coding the last instruction will start before the first is completed.

Note that what OpenCL calls ‘local’, CUDA calls 'shared', and the shared memory in nvidia hardware is on-chip, not off-chip, even though it has a two-cycle latency in Fermi and Kepler hardware (used to be one-cycle in Tesla (1.0) hardware). The local memory in CUDA/NVIDIA speak is actually a section of global memory reserved for private thread usage, so its access is extremely slow (not a few cycles, but hundreds of them).

notzed · ‎07-17-2012

Actually, if the GPU pipeline was really that long and the two blocks were executed (masked) sequentially, the OP would not be seeing any problem, because when the second block got to finally be executed, it would already be able to access the data updated from the first block. But this is obviously not the case, so either the local memory access has many-cycle latencies in AMD hardware (more than the block pipeline), or the compiler is doing some very smart predication usage that allows different lvalues to be updated concurrently (which is what I think is actually happening).

It's a lot simpler than that.

OpenCL has a different memory model to C which allows for much more agressive optimisation. It is similar to using 'restrict' on all accesses, allowing any memory access to be cached in registers indefinitely. Without explicit barriers there is no guarantee that memory reads (or writes) occur when they are listed in the code (unlike say, java, which has an explicitly ordered memory model).

In addition to synchronising threads (somewhat as a side-effect), the barriers force all writes to be completed before continuing, and also mark any existing reads as invalid forcing new reads as well.

It's easy enough to see what the compiler is actually doing with AMD gpu's since you can look at the isa.

notzed · ‎07-16-2012

I think what you're trying to do isn't much use anyway, you do realise that the compiler will generate both cases of the if statement which every thread will 'execute', and just mask execution of half of them - you will not get both halves executed concurrently. The reason you need the barrier is not because each part runs at the same time, it is because each odd/even thread pair has it's own copy of the value in registers and the barrier forces a flush and a re-read of any local references (you don't have enough local threads for the barrier to make any difference to the thread scheduling). For the same reason I'm pretty certain gbilotta's suggestion will not work either.

As for vectors: you've explicitly told the compiler you want a vector, so your description of the behaviour is exactly what you should expect.

What you probably want to do is just use a flat float array, and just write (or read) the values interleaved one element at a time. Doing this will require only 4x 32-bit writes and 4x32-bit reads (i.e. write interleaved, barrier, read linear), whereas your method will require 12x32-bit reads and 8x32-writes and 2 barriers to work properly (i.e. read modify write, barrier, read modify write, barrier, read). Accessing local memory with float4 will cause bank conflicts anyway (although your problem looks too narrow to hit those, or to fully utilise the processor).

cantallo · ‎07-17-2012

Thanks for you answer, I will try your suggestion of using a flat local array.

I am a bit puzzled by your last remark about bank conflict (I thought that -as I remember having read somewhere- on AMD hardware, the local memory is implemented in the register file which is organized as 4×32 bits wide banks).

I assumed that flat local array just use one of the 4 32-bits storage place per entry (bank) (otherwise an access such as float4 w=local_array[get_local_id(0)]; would result in a bank conflict).

Do you have better information on the AMD hardware ? (I tested my code on H7970 and V7900. The funny thing is that -at present- scalar code with 32-thread wide groups runs faster on AMD hardware than vector code with 64-wide groups, but the vector code runs faster on NVidia hardware C2075 or C2050 than the scalar one. Clearly the programming guidelines provided by both manufacturers do not yield best performance)

By the way, the problem of de-interlacing is just the very first data intake operation of a quite heavy computing kernel, that is why I wanted to use vector variables.

I shall investigate whever local array should be scalar or vectors because my program uses a lot of transposition for switching between row-oriented and column oriented processing stages.

notzed · ‎07-17-2012

I'm just going on publicly available information - mostly from the amd programming guide - but I gotta say it's a bit hard keeping up with every bit of new hardware or the explicit details of how it is implemented. My basic understanding is that you can access 2x32-bit words per work item without bank conflicts (assuming the addresses interleave properly, e.g. are linear), and trying to read 4x32 bit words will cause bank conflicts.

e.g. The programming guide section 5.9 has a single paragraph about LDS and states that GCN still has 32 banks (and each bank is still 32-bits in size not 128 bits - this is the same on the older hardware too), and LDS is accessed in half-wavefront granularity, so i presume the same rule of 2x32 adjacent values holds. My results from sprofile seem to suggest it works the same.

As a rule of thumb I normally just always write all vectors as scalar quantities when they're headed to LDS, even though sometimes it runs a tiny bit faster with a few bank conflicts (I also initially thought even float2 caused bank conflicts).

e.g. if i have 64 work items i would write float4 as

local float ldata[64*4];

float4 val;

int lx = get_local_id(0);

ldata[lx + 64*0] = val.s0;

ldata[lx + 64*1] = val.s1;

ldata[lx + 64*2] = val.s2;

ldata[lx + 64*3] = val.s3;

barrier()

This is not always the fastest possible way, and is pretty ugly TBH, but it makes the code consistent with different processing sizes, and makes it easier to calculate bank-conflict avoiding offsets (see the programming guide about LDS optimisations). And it lets me re-use precious LDS for different purposes within a kernel without tying it to a specific vector size or layout.

BTW on your point about wavefront sizes - either the algorithm doesn't benefit from increased parallelism (e.g. it's a small problem, and using smaller wave-fronts causes it to run on more CU's at once), or there are something else going on. e.g. increased LDS usage and register usage reducing the number of concurrent wavefronts possible on a given CU.

GCN hardware doesn't have vector registers (at least, not in the sense of vectors per work item), so if you're using vector4 registers all you're doing is providing 4x the work to do - which can be effective for unrolling, but will also require 4x the register file space (and LDS). If you are doing a lot of ALU work this increased register load can have an impact on concurrency achievable. And if you're using a lot of LDS it could also limit the concurrency. sprofile shows this stuff.

Archives Discussions

local subvector write is in fact read-modify-write of the full vector.