teh_orph

Kernel repeatedly hard-locks machine when using global atomics

Discussion created by teh_orph on Jun 14, 2010
Latest reply on Jun 22, 2010 by MicahVillmow

Hi there,

I wrote some OpenCL code today at work in one of the NVIDIA OpenCL samples, then brought it home to see how well it fares on my 5770. After a dependent read is enqueued on my kernel, the machine will quickly lock up. Networking still works but all USB halts and the machine's display doesn't update. Sound continues though...

If I slightly modify my code to not do some simple atomic stuff (which works on my GTX 275) then it works okay (albeit the result is not correct). This failure occurs with both cat 10.4 and 10.5.

Is there anywhere I can send this so that an engineer can have a go and repeat it themselves?

Cheers!

Simon

PS: all the AMD Stream 2.1 examples run fine...it's just my program which locks it!

Btw, kernel's attached.

EDIT: it appears that it's the final barrier that's the problem. If I remove that then it's fine!

#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable #pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable __kernel void SSD (const __global unsigned char* pBig, const __global unsigned char* pSmall, __global int* pOut, int iWidth, int iHeight, __global int *pFinal) { // find position in global arrays int gx = get_global_id(0); int gy = get_global_id(1); int lx = get_local_id(0); int ly = get_local_id(1); __local int local_min; local_min = 99999; barrier(CLK_LOCAL_MEM_FENCE); int global_x = get_group_id(0) * localW; int global_y = get_group_id(1) * localH; //__local unsigned char plSmall[iSmallWidth * iSmallHeight]; //__local unsigned char plBig[(localW + iSmallWidth) * (localH + iSmallHeight)]; //event_t pref = 0; //pref = async_work_group_copy(&plSmall[0], &pSmall[0], (size_t)(iSmallWidth * iSmallHeight), pref); //for (int y = 0; y < localH + iSmallHeight; y++) // prefetch(&pBig[(y + global_y) * iWidth + global_x], (size_t)(localW + iSmallWidth)); // pref = async_work_group_copy(&plBig[y * (localW + iSmallWidth)], &pBig[(y + global_y) * iWidth + global_x], (size_t)(localW + iSmallWidth), pref); //wait_group_events(1, &pref); //barrier(CLK_LOCAL_MEM_FENCE); int sum = 0; for (int y = 0; y < iSmallHeight; y++) for (int x = 0; x < iSmallWidth; x++) { unsigned char big = pBig[(y + gy) * iWidth + (x + gx)]; //unsigned char big = plBig[(y + ly) * (localW + iSmallWidth) + (x + lx)]; unsigned char small = pSmall[y * iSmallWidth + x]; int diff = (int)big - (int)small; diff = diff * diff; sum += diff; } pOut[gy * iWidth + gx] = sum; /*atom_min(&local_min, sum); <-- unmark this section to force the crash! if (local_min == sum) { atom_min(&pFinal[0], sum); barrier(CLK_GLOBAL_MEM_FENCE); if (pFinal[0] == sum) { pFinal[1] = gx; pFinal[2] = gy; } }*/ }

Outcomes