I wrote this prefix sum helper function. It computes an inclusive scan of a local array whose size equals the work-group size in dimension 0.
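// Lookup table for log2 of power-of-two work-group sizes up to 512.
// Any other value falls through to the default case and returns 0.
unsigned int _log2(unsigned int num)
{
    switch(num)
    {
    case 1:
        return 0;
    case 2:
        return 1;
    case 4:
        return 2;
    case 8:
        return 3;
    case 16:
        return 4;
    case 32:
        return 5;
    case 64:
        return 6;
    case 128:
        return 7;
    case 256:
        return 8;
    case 512:
        return 9;
    default:
        return 0;
    }
}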
void scanSum(__local unsigned int* array)
{
    //Linear version: work-item 0 scans the whole array serially
    if(get_local_id(0) == 0)
    {
        for(unsigned int i = 1; i < get_local_size(0); i++)
        {
            array[i] += array[i - 1];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
#if 0
    //Don't know why but this code crashes Radeon cards
    unsigned int logThreads = _log2(get_local_size(0));
    for(unsigned int i = 0; i < logThreads; i++)
    {
        unsigned int newVal = array[get_local_id(0)];
        //If threadNum >= 2^i
        if(get_local_id(0) >= (1 << i))
        {
            newVal += array[get_local_id(0) - (1 << i)];
        }
        //All reads must finish before any work-item writes
        barrier(CLK_LOCAL_MEM_FENCE);
        array[get_local_id(0)] = newVal;
        //All writes must finish before the next pass reads
        barrier(CLK_LOCAL_MEM_FENCE);
    }
#endif
}
If I use the logarithmic parallel version of the scan (disabled above with #if 0), it runs fine on the CPU but crashes on my Radeon 5870. The linear single-threaded version works fine on both CPU and GPU. Since the parallel version works on the CPU but not on the GPU, I'm curious how I can debug this code. Any thoughts?
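For reference, here is a minimal host-side sketch in plain C (not OpenCL) that emulates the read-then-write passes of the parallel version and compares the result against the linear scan. It only checks the algorithm's logic and cannot reproduce the GPU crash. The names scan_serial, scan_stepwise, scans_agree and the bound MAX_ITEMS are hypothetical and exist only for this illustration, and the test input 1..n is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound, chosen to match the largest case in _log2. */
#define MAX_ITEMS 512

/* Serial inclusive scan: the same loop work-item 0 runs in scanSum. */
void scan_serial(unsigned int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* Host emulation of the parallel passes. Within each pass, every
 * "work-item" first reads its inputs into next[], and only then are
 * the results written back. This mirrors the ordering that the two
 * barriers enforce in the kernel code. */
void scan_stepwise(unsigned int *a, size_t n)
{
    unsigned int next[MAX_ITEMS];
    assert(n <= MAX_ITEMS);

    /* stride takes the values 1, 2, 4, ... which equals (1 << i)
     * for i = 0 .. log2(n) - 1 when n is a power of two. */
    for (size_t stride = 1; stride < n; stride <<= 1) {
        for (size_t id = 0; id < n; id++) {
            next[id] = a[id];
            if (id >= stride)
                next[id] += a[id - stride];
        }
        for (size_t id = 0; id < n; id++)
            a[id] = next[id];
    }
}

/* Returns 1 if both scans give the same result on the input 1..n. */
int scans_agree(size_t n)
{
    unsigned int s[MAX_ITEMS], p[MAX_ITEMS];
    assert(n <= MAX_ITEMS);

    for (size_t i = 0; i < n; i++)
        s[i] = p[i] = (unsigned int)(i + 1);

    scan_serial(s, n);
    scan_stepwise(p, n);

    for (size_t i = 0; i < n; i++)
        if (s[i] != p[i])
            return 0;
    return 1;
}
```

Calling scans_agree with each work-group size handled by _log2 (1, 2, 4, ..., 512) should show whether the two versions agree on the host. If they do, that would suggest looking at the device side (compiler, driver, or how the kernel calls scanSum) rather than at the scan logic itself.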