Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Journeyman III

OpenCL persistent thread

Hi all, 

I'm trying to make an ray-triangle accelerator on GPU and according to the article Understanding the Efficiency of Ray Traversal on GPUs 
 ( one of the best solution is to make persistent threads.

So I tried to port the below CUDA code to OpenCL. However, running the code under CPU with ati-stream shows, that everything goes wrong at the line "localPoolRayCount -= 32" (althought it is set to 32 after decrementing I'm receiving a negative number.) I'm aware there are some problems with local variables, so I changed __local int localPoolNextRay to __local int* localPoolNextRay[1] with no luck.


I'll be glad for any suggestion as I'm at my wit's end. 


// CUDA code from the mentioned article const int B = 3*32; // example batch size const int globalPoolRayCount; int globalPoolNextRay = 0; __global__ void kernel() // variables shared by entire warp, place to shared memory __shared__ volatile int nextRayArray[BLOCKDIM_Y]; __shared__ volatile int rayCountArray[BLOCKDIM_Y] = f0g; volatile int& localPoolNextRay = nextRayArray[threadIdx.y]; volatile int& localPoolRayCount = rayCountArray[threadIdx.y]; while (true) f // get rays from global to local pool if (localPoolRayCount==0 && threadIdx.x==0) { localPoolNextRay = atomicAdd(globalPoolNextRay, B); localPoolRayCount = B; } // get rays from local pool int myRayIndex = localPoolNextRay + threadIdx.x; if (myRayIndex >= globalPoolRayCount) return; if (threadIdx.x==0) { localPoolNextRay += 32; localPoolRayCount -= 32; } // init and execute, these must not exit the kernel fetchAndInitRay(myRayIndex); trace(); } //my OpenCL code #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable #define B 3*32; // example batch size // globalPoolNextRay is set to 0 in the cpp file __kernel void kernel(__global int* globalPoolNextRay, int globalPoolRayCount){ __local int localPoolNextRay; __local int localPoolRayCount; if ( get_local_id(0) == 0){ localPoolNextRay = localPoolRayCount = 0; } barrier(CLK_LOCAL_MEM_FENCE); while(true){ if ( localPoolRayCount == 0 && get_local_id(0) == 0){ localPoolNextRay = atom_add(globalPoolNextRay,B); localPoolRayCount = B; } //barrier(CLK_LOCAL_MEM_FENCE); //I suspect at least here should be a barrier, but some threads can be already missing... // get rays from local pool myRayIndex = localPoolNextRay + get_local_id(0); if ( myRayIndex > globalPoolRayCount) return; if ( get_local_id(0) == 0){ localPoolNextRay += 32; localPoolRayCount -= 32; } // init and execute, these must not exit the kernel fetchAndInitRay(myRayIndex); trace(); } }

25 Replies

Dear Karbous,

I have two candidates of solutionsto your problem. First I'd ask what is the workgroup size you use? Everything is incremented and decremeneted by 32, but if you use preferred work-group size, it might lead to problems. I'm not sure if it is strictly set to 32.

Second is, I think you are missing a barrier (above the one commented out), because

myRayIndex = localPoolNextRay + get_local_id(0);

depends on the varible localPoolNextRay which is later modified by the first thread of the workgroup. If the first thread takes over the others by enough to modify this variable, the others will read junky data. So this seems like a sync issue to me, which might be the cause of your problem.




Thank you for your swift answer!

I'm explicitly setting the local work-group size. However, it shouldn't be a problem in this case either, as there is the condition "myRayIndex > globalPoolRayCount". (eg. 64 thread block might do unnecessary work but sould never exceed the arrays.)

I think you are partially right with the barrier synchronization, however what will happen if some thread have already ended and never reach barrier call? Why equivalent CUDA code doesn't need any synchronization? (As far as I know CUDA shared memory is same as OpenCL local memory)

Thank you very much for trying to help me



I do not know how CUDA sync works, whether there are implicit sync commands anywhere or that sort of black magic. To answer yor question, what will happen if a thread ends without reaching a barrier: the world will explode. Okay, it's not that serious, but the program will definately not work.

Some function calls are explained: "all members of a work group must reach this function call. Failing to do so will result in undefined program behaviour." Barriers, mem_fences, work-group_copy, prefetch, local memory alloc and those sort of things should ALWAYS be done, so that every work-item makes the call to these functions. The latter 3 must be the very same function call (if I'm not mistaken), but barriers and mem_fences only have to be called the same number of times.

Let's say you have a switch statement, if you place a barrier inside each of the outcomes, everything is okay. If one thread makes it down a path in which there is no barrier, most probably the kernel will hang.

I do not know why there is no sync in the CUDA code, but the code you wrote relies heavily on luck, and one should always take great care with syncing before and after accessing shared variables. Even if something works on CPU-GPU (by luck), it can very easily crash on the other. The compiler can do major alterations to program flow. If you are worried about performance, think through if mem_fence is enough, which gives some more feeedom to the compiler, but 99% of the cases, mem_fences/barriers are not negligable when reading/writing shared variables.


If the workgroups are only 64 in size then branching around the barrier is safe. If the compiler knows the group is only 64 in size then the barrier is nothing more than a memory fence + compiler hint.

If the workgroup is 2D then multiple work items would try to do the write to LDS, can't be sure from your code. I doubt it though from what you've said.

You shouldn't need that barrier, but as your local address isn't a volatile pointer you will need a fence. The compiler may not actually update localPoolRayCount when you expect, keeping it in a register instead. It's possible that other work items will not get the updates to the shared variable because there's nothing to tell the compiler to push it out. CUDA has stricter rules about warp coherency than OpenCL so you have to be a bit more careful with fences in CL to make barrier-free code work correctly.

So, you do need a barrier or fence, I'd just use a fence (and write this sort of code regularly. Look at the SIMD optimised position solver in the Bullet tree for an example The barrier will be safe on AMD hardware, though out of CL spec, if your work groups are no bigger than 64. With code like this, don't run groups larger than 64, there's no  benefit.



.......With code like this, don't run groups larger than 64, there's no benefit.....

I would have to disagree here.... Having the cores oversubscribed is always a good thing(Latency Hiding)......

The mem fence should of course be added for correctness, but is not required, Thats coz cuda does an implicit warp(wavefront) wide sync after each divergent if statement.....


Another thing... Wavefronts are 64 wide on AMD, so you might want to tweak your batch sizes to match... Else there would be no real benefit to persistent threads....




Thank you all for your help and suggestions. Sorry, I didn't mentioned my kernel is only 1D (threads are assigned triangles and they inspect ray-hierarchy, so I'm trying to exploit possibilities in the opposite approach to standard ray-tracing scheme)

Thanks for suggesting volatile and mem_fence! 

I modified the OpenCL code and added some printf statements to debug it on CPU. So now it looks like the attachment. However, there is still behavior that I don't understand. I get output like this (first 2 iterations):



DECREMENTED localPoolRayCount -192

BEFORE DECREMENTED localPoolRayCount -192

DECREMENTED localPoolRayCount -448

Why 64 - 64 isn't 0 but -192? What am I missing? Or the code can't be correctly run on CPU?


//my OpenCL code #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable #pragma OPENCL EXTENSION cl_amd_printf: enable #define B 64; // example batch size // globalPoolNextRay is set to 0 in the cpp file __kernel void kernel(__global int* globalPoolNextRay, int globalPoolRayCount){ __local volatile int* localPoolNextRay[1]; //arrays only to bypass local variable problems __local volatile int* localPoolRayCount[1]; if ( get_local_id(0) != 0 ) return; //aded for debugging purposes if ( get_local_id(0) == 0){ localPoolNextRay[0] = localPoolRayCount[0] = 0; } mem_fence(CLK_LOCAL_MEM_FENCE); while(true){ if ( localPoolRayCount == 0 && get_local_id(0) == 0){ localPoolNextRay[0] = atom_add(globalPoolNextRay,B); localPoolRayCount[0] = B; } mem_fence(CLK_LOCAL_MEM_FENCE); // get rays from local pool myRayIndex = localPoolNextRay[0] + get_local_id(0); if ( myRayIndex > globalPoolRayCount) return; if ( get_local_id(0) == 0){ printf("BEFORE DECREMENTED localPoolRayCount %d\n", localPoolRayCount[0]); localPoolNextRay[0] += 64; localPoolRayCount[0] -= 64; mem_fence(CLK_LOCAL_MEM_FENCE); printf("DECREMENTED localPoolRayCount %d\n", localPoolRayCount[0]); } // init and execute, these must not exit the kernel fetchAndInitRay(myRayIndex); trace(); } }


I would have to disagree here.... Having the cores oversubscribed is always a good thing(Latency Hiding)......

The mem fence should of course be added for correctness, but is not required,

Yes. That's what groups are for. The fences may not be required for correctness in CUDA but our experiments around physics code have shown that recent versions of SC or or the opencl compiler require them to stop the *compiler* reordering the memory operations. It's not about issuing a hardware fence but a compiler hint. At least that's true for a local fence.

I can't see anything obvious wrong with the code now. Unfortunately I don't have time to run it myself as I have to hop on a plane soon. Hopefully someone can help.


Never mind, thank for your helpful insight view. Have a pleasant journey.


Anytime you switch from reading to writing or vice versa on a local variable, you need a barrier(CLK_LOCAL_MEM_FENCE), not just a mem_fence.

Thank you for joining this thread, Micah Villmow. I'm trying to bypass barrier call as barrier must be encountered by all work-items in a work-group but some work-items can be already missing between "persistent threads" due to "myRayIndex > globalPoolRayCount" condition.

So the question is, is it possible to have persistent threads in OpenCL? (threads that share global array and asking for work through an atomic counter) The original idea of persistent threads comes from NVidia's researchers (CUDA code) as far as I know and they are also cited by others. However other programmers are struggling with this concept as well (

With the printf and only 0. thread running on CPU I discovered odd behavior subtracting from a local variable. There isn't need for a barrier, I guess, as only one thread is running, so why the result isn't correct?

Thank you for bearing with me





What is your local size when you launch a kernel? We default to 256 threads, 64-256 = -192. -192 - 256 = -448, so the subtractions make perfect sense if you are using the default local size.

There is no way to safely/correctly bypass barrier calls at the OpenCL source level in a portable manner. You can do this with knowledge of the underlying hardware, but it is not portable.

The correct way to do this is to insert the barriers as you would expect them to be in order to be correct, but then use reqd_work_group_size as an attribute to the kernel so that the underlying compiler will optimize the barrier away.

One AMD GPU hardware, as long as one thread in a wavefront hits a barrier, it is the same as every thread in the wavefront hitting the barrier. If you limit your work-group size to a single wavefront, then all barriers become no-ops.

However, the wavefront size is different between chips and also the behavior on the CPU is different.


I partially get it. Thanks. 

I'm explicitly setting local size to 64. So how can be subtraction altered by thread-block size? According to the condition, only 0. thread of a block should decrement the local variable... How it can make sense 64 - 64 = -192 ? I didn't subtract -256, did I?


Indeed, setting block size to be equal to warp/wavefront size helped and now it is running on GPU as expected (without barriers and mem_fences). Thank you all for your help!


Originally posted by: karbous Indeed, setting block size to be equal to warp/wavefront size helped and now it is running on GPU as expected (without barriers and mem_fences). Thank you all for your help!

Do you mean "setting local workgroup size eq. to block size helped" ? Otherwise, I wouldn't understand how it started to work ...


Journeyman III

DTop, sorry for not beeing clear enough. I meant "local worgroup size". However, the persistent thread concept didn't help to run it faster 😞


This is, however, expose interesting thing that if you local wrk grp size == 64, and one is not limiting work group size to 64 by reqd_work_group_size (or purposely set it to 256) then __local variable becomes shared between 4 consequent waveforts run on the same simd. So __local becomes shared between wavefronts!


DTop, are you sure about this? Can anyone confirm? I highly doubt this, since this would go against API logic. If I start 4 groups of 64 threads, they will most likely end up on the same SIMD. However, they have nothing to do with each other when it comes to __local variables. HW should not let them touch each other's memory.

Or I can only imagine, that GPU does not check for overindexing (becuase it doesn't), and you can accidently access the other threads memory. But explicitly naming a __local variable of another wavefront is quite impossible, since it is referred to with the same name in the code.



Local memory is only shared between wavefronts when they are present inside a common work group. Accessing same LDS block from two different workgroups is not possible.

In cases where multi-workgroups run on same simd LDS is divided between them not shared between them. For more info see

NDRange and Execution Range Optimization section

of OpenCL Programming guide.



however, from webinar presentations it is known that gpu runs 4x 64 (4 wavefronts) one after another. when fully loaded. If 4 of them run on the same simd, what will not allow them to access the same __local variable and substruct 64 every time?


Also, LDS size is 32K. How it going to shared? Temporary saved somewhere?



LDS size is 32k indeed, but it's not saved anywhere. 64 is the default wavefront size, and another wavefront is co-issued to the same SIMD only in the case that it doesn't require more LDS memory than there is left free by the other wavefront already running.

The reason why I highly doubt your statement is that kernels work this way that I mentioned, and you cannot even name a memory space that belongs to a different workgroup. If you set LWS to 64, then if two groups are co-issued to the same SIMD, both of them will allocate all their shared variables accordingly and will not share them.

This is how the programming model works and I will eat my hat if what you say is true.


DTOP & Karbous,

Here is my analysis of the kernel you provided.(I had to build the host code for thorough analysis):

1. localPoolNextRay[1] and localPoolRayCount[0] are not a local array but  arrays of pointers to local memory  :

__local volatile int* localPoolNextRay[1];                  

__local volatile int* localPoolRayCount[1];

2. There is statement written which tries to copy the result of a global atomic addition to this local pointer. Definitely not possible. Undefined behaviour should occur. Although it gave an error with internal SDK.

3. According to pointer arithmetic adding N to a pointer shifts it N*sizeof(datatype of pointer). So a decrement of 64 should give an actual decrement of 64*4.Which might be the reason for the 256 effective subtraction.

Please correct me if i am wrong somewhere.



Himanshu & Meteorhead,

Thanks. Appreciate your efforts and explanaitons.





Don't want to spawn a speculations, but take a look at the code above. The substruction is happening only if get_local_id(0) == 0 (so only one thread of local group is allowed to pass), and it happen 4 times by 64, when local workgroup been set to 64, while default kernel group is == 256 (can be changed by reqd_work_group_size, I guess . Once kernel local group has been forced to 64, by reqd_work_grou_size, again, my guess, get_local_id(0) == 0 happen 1 time as it should.

Also the question, whether __local variables are taken from LDS storage as well, or it is some kind of shared registers (like temp?) ?


Alright... Declare the  nextrayarray  etc to be volatile...  should be much faster than doing frequent mem fences....

Secondly... i n oticed your next ray array has only   one element... This may  be the cause of your error...

What are your work group sizes? Ideally, the no of elements in the next ray array should be equal to the work group size in the y dimension....

If it is not, it is easy to see why that error is happening...


Debdatta Basu.




debdatta.basu, thank you for joining this thread. I have only 1D kernel of block size 64/32 (depends on which GPU I want it to run). I was just wondering what went wrong on CPU with local variables... (Note I run there only 1 thread, as the condition get_global_id(0)) != 0 ends the other threads.) However, I'm not targeting the CPU, so it's not that painful for me. Just I am curious.