karbous

OpenCL persistent thread

Discussion created by karbous on Dec 7, 2010
Latest reply on Dec 14, 2010 by DTop

Hi all, 

I'm trying to make an ray-triangle accelerator on GPU and according to the article Understanding the Efficiency of Ray Traversal on GPUs 
 (www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.pdf) one of the best solution is to make persistent threads.

So I tried to port the below CUDA code to OpenCL. However, running the code under CPU with ati-stream shows, that everything goes wrong at the line "localPoolRayCount -= 32" (althought it is set to 32 after decrementing I'm receiving a negative number.) I'm aware there are some problems with local variables, so I changed __local int localPoolNextRay to __local int* localPoolNextRay[1] with no luck.

 

I'll be glad for any suggestion as I'm at my wit's end. 

 

// CUDA code from the mentioned article const int B = 3*32; // example batch size const int globalPoolRayCount; int globalPoolNextRay = 0; __global__ void kernel() // variables shared by entire warp, place to shared memory __shared__ volatile int nextRayArray[BLOCKDIM_Y]; __shared__ volatile int rayCountArray[BLOCKDIM_Y] = f0g; volatile int& localPoolNextRay = nextRayArray[threadIdx.y]; volatile int& localPoolRayCount = rayCountArray[threadIdx.y]; while (true) f // get rays from global to local pool if (localPoolRayCount==0 && threadIdx.x==0) { localPoolNextRay = atomicAdd(globalPoolNextRay, B); localPoolRayCount = B; } // get rays from local pool int myRayIndex = localPoolNextRay + threadIdx.x; if (myRayIndex >= globalPoolRayCount) return; if (threadIdx.x==0) { localPoolNextRay += 32; localPoolRayCount -= 32; } // init and execute, these must not exit the kernel fetchAndInitRay(myRayIndex); trace(); } //my OpenCL code #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable #define B 3*32; // example batch size // globalPoolNextRay is set to 0 in the cpp file __kernel void kernel(__global int* globalPoolNextRay, int globalPoolRayCount){ __local int localPoolNextRay; __local int localPoolRayCount; if ( get_local_id(0) == 0){ localPoolNextRay = localPoolRayCount = 0; } barrier(CLK_LOCAL_MEM_FENCE); while(true){ if ( localPoolRayCount == 0 && get_local_id(0) == 0){ localPoolNextRay = atom_add(globalPoolNextRay,B); localPoolRayCount = B; } //barrier(CLK_LOCAL_MEM_FENCE); //I suspect at least here should be a barrier, but some threads can be already missing... // get rays from local pool myRayIndex = localPoolNextRay + get_local_id(0); if ( myRayIndex > globalPoolRayCount) return; if ( get_local_id(0) == 0){ localPoolNextRay += 32; localPoolRayCount -= 32; } // init and execute, these must not exit the kernel fetchAndInitRay(myRayIndex); trace(); } }

Outcomes