
Archives Discussions

Adept II

Barrier to simulate SIMT on CPU


I have a small, very simple kernel. Given the following input:

"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1"

it should compute a simple scan:

1 2 3 4 5 6 7 8 ...

It works on the NVIDIA SDK but not on the AMD one! There I get wrong results.

__kernel void kernel__scanIntra(__global uint* input, uint size)
{
    size_t idx = get_global_id(0);
    const uint lane = get_local_id(0);
    const uint bid = get_group_id(0);

    if (lane >= 1 && idx < size)
        input[idx] = input[idx - 1] + input[idx];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lane >= 2 && idx < size)
        input[idx] = input[idx - 2] + input[idx];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lane >= 4 && idx < size)
        input[idx] = input[idx - 4] + input[idx];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lane >= 8 && idx < size)
        input[idx] = input[idx - 8] + input[idx];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lane >= 16 && idx < size)
        input[idx] = input[idx - 16] + input[idx];
    barrier(CLK_LOCAL_MEM_FENCE);
}

11 Replies

Hi viewon01,

I would expect the results to be wrong. There seems to be a read-before-write issue: the new value of input[idx - 1] is needed to calculate input[idx], while in most cases the kernel will use the old value.


Of course you're right. On the GPU I have real SIMT on every instruction, so here I have to do this in 2 steps... and one more synchro... not really efficient!

Journeyman III

Why do you use barrier(CLK_LOCAL_MEM_FENCE) when you need to synchronize access to a GLOBAL buffer?


So, I don't think he actually needs a fence here, just the thread barrier. As such, it doesn't really matter what kind of fence you use.

It actually is important. If the memory fence is not there for the correct memory type, then the compiler is free to move memory instructions across the barrier.

I thought that's what the volatile keyword was for, but since OpenCL doesn't have one, I guess it would make sense that you need the fence.


OpenCL inherits from C, so unless overridden by the OpenCL spec, anything that is in C99 is also in OpenCL.

I clearly have no idea what I'm talking about. Carry on. I thought volatile was reserved for future use, but it's in section 6.1.4.



Okay. Here is a somewhat better solution to the problem.

Let's say we have an array of 20 elements holding the numbers 1-20 (for convenience), and we need to execute arr[i] += arr[i - 1] for each element.

Again for convenience, we divide it into 2 parts of 10 elements each and assign the work to two separate threads. So thread0 ends up with the value 55 (sum of 1-10) and thread1 ends up with the value 155 (sum of 11-20). The values calculated by thread1 are not correct, but we can correct them by adding a constant offset of 55 (the total of the first block) to each of its items later in a separate kernel.

Well, I understand this looks easy, but the offset calculation would again be serialized. Still, the number of offsets to be calculated is reduced by a factor equal to the number of elements processed in the first step.

Hope this might be helpful.


Thanks Himanshu, that is the way most scan algorithms work.

Today we have 2 scan algorithms, a general one and a GPU-optimized one. I still have to use the SIMD capabilities of the GPU and minimize memory access serialization!


In the current AMD OpenCL implementation, rick, it's actually the barrier he doesn't need and the fence he does. On the GPU, at least. On the CPU the barriers would be necessary (and the passes split, of course, because of the RAW dependency; I'm surprised that code works on NVIDIA's implementation... maybe I misunderstand the post).

Of course, that's out of the OpenCL spec and non-portable (but the only way to get good performance on a vector scan).

On either the CPU or GPU those barriers are going to lead to a slow scan operator, unfortunately. On the GPU you'd find it more efficient to allocate a set of identity values before the set of actual values; then you can use all that code without the conditionals:


input[idx] = input[idx - 1] + input[idx];
input[idx] = input[idx - 2] + input[idx];
input[idx] = input[idx - 4] + input[idx];
input[idx] = input[idx - 8] + input[idx];
input[idx] = input[idx - 16] + input[idx];


you know you're reading 0s then, rather than off the beginning of the array. Much more efficient!

All of that depends on knowing that you're running on a vector architecture, of course, and that the mapping of work items to the vectors is predictable.