Hi there,
I wrote some OpenCL code today at work in one of the NVIDIA OpenCL samples, then brought it home to see how well it fares on my 5770. After a dependent read is enqueued on my kernel, the machine will quickly lock up. Networking still works but all USB halts and the machine's display doesn't update. Sound continues though...
If I slightly modify my code to not do some simple atomic stuff (which works on my GTX 275) then it works okay (albeit the result is not correct). This failure occurs with both cat 10.4 and 10.5.
Is there anywhere I can send this so that an engineer can have a go and repeat it themselves?
Cheers!
Simon
PS: all the AMD Stream 2.1 examples run fine...it's just my program which locks it!
Btw, kernel's attached.
EDIT: it appears that it's the final barrier that's the problem. If I remove that then it's fine!
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable #pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable __kernel void SSD (const __global unsigned char* pBig, const __global unsigned char* pSmall, __global int* pOut, int iWidth, int iHeight, __global int *pFinal) { // find position in global arrays int gx = get_global_id(0); int gy = get_global_id(1); int lx = get_local_id(0); int ly = get_local_id(1); __local int local_min; local_min = 99999; barrier(CLK_LOCAL_MEM_FENCE); int global_x = get_group_id(0) * localW; int global_y = get_group_id(1) * localH; //__local unsigned char plSmall[iSmallWidth * iSmallHeight]; //__local unsigned char plBig[(localW + iSmallWidth) * (localH + iSmallHeight)]; //event_t pref = 0; //pref = async_work_group_copy(&plSmall[0], &pSmall[0], (size_t)(iSmallWidth * iSmallHeight), pref); //for (int y = 0; y < localH + iSmallHeight; y++) // prefetch(&pBig[(y + global_y) * iWidth + global_x], (size_t)(localW + iSmallWidth)); // pref = async_work_group_copy(&plBig[y * (localW + iSmallWidth)], &pBig[(y + global_y) * iWidth + global_x], (size_t)(localW + iSmallWidth), pref); //wait_group_events(1, &pref); //barrier(CLK_LOCAL_MEM_FENCE); int sum = 0; for (int y = 0; y < iSmallHeight; y++) for (int x = 0; x < iSmallWidth; x++) { unsigned char big = pBig[(y + gy) * iWidth + (x + gx)]; //unsigned char big = plBig[(y + ly) * (localW + iSmallWidth) + (x + lx)]; unsigned char small = pSmall[y * iSmallWidth + x]; int diff = (int)big - (int)small; diff = diff * diff; sum += diff; } pOut[gy * iWidth + gx] = sum; /*atom_min(&local_min, sum); <-- unmark this section to force the crash! if (local_min == sum) { atom_min(&pFinal[0], sum); barrier(CLK_GLOBAL_MEM_FENCE); if (pFinal[0] == sum) { pFinal[1] = gx; pFinal[2] = gy; } }*/ }
So...where should I be reporting these bugs? Is this the right forum?
Cool, cheers for the info! Although it'd give the same result in this case, I only actually want the ones which took part in the atom_min to wait (since they're the only ones which may have written data). Or is that not the intention of barriers?
Let's say that I wanted to surround the whole function with an if statement which only did processing assuming the global x,y were not near the border of my data - where would I put the barrier then?
eg,
if (get_global_id(0) < iWidth-iSmallWidth && get_global_id(1) < iHeight -iSmallHeight)
{
<do algorithm>
atom_min(ptr, result);
barrier(CLK_GLOBAL_MEM_FENCE); <--here?
<do more work>
atom_max(ptr, another result);
}
where should I put the barrier to ensure that they all take it?
And finally, surely a misplaced barrier shouldn't be locking up my PC?
So...any guidance on where to put barriers when a whole kernel is if'ed out?
And can I report the locking-up -the-computer issue in some kind of bug track database? Cheers! (surely the driver should time out some how?)
Hang on, that can't be right
- I'm only using one compute unit, since my work size is only made up of one work group
- the driver can obviously load balance several pieces of work at the same time
- I can't believe that there's not a watchdog timer of some sorts which can interrupt errant kernels
- it crashes over a period of ten seconds, first the mouse + keyb refuse the work, then sound stops, then the display stops working so it's not a simple dead-lock
- this is 2010: how are we supposed to develop software if we have to reboot if we make a mistake whilst coding??
EDIT: sorry this sounds a bit negative, I'm just a bit frustrated and I hope to make OpenCL development easier for people further along the line and don't want people to automatically blame "dodgy AMD drivers" when their mouse locks up!
Ah fair enough! Thanks for the help.
So - for future reference - what can hang a GPU? I've seen unending loops and this barrier issue, anything else?
And I still don't understand this barrier problem - why do I need all processors to take a barrier? Why should they hang if not all of them do it? And finally, what can I do if it only makes sense for a subset to take it (eg if the other ones leave via a surrounding if statement)?
Ah I see! Thanks! So it's really a way of preventing the program counter from advancing. Sorry, my mistake. (Although I thought that all the processors within a compute unit ran each instruction in lockstep?)
Anyway I guess what I'm really looking for is a mem_fence: these don't have to be run by all processing units, right? Can I have a kernel where a subset write data to a memory, and then have that subset do a mem_fence and then read back data from that memory space?
Cheers again for the help.
Ok, final question:
I guess I can't do this either, if it's a program counter blocker...
if (get_local_id(0) < 128) {
some work
barrier(CLK_GLOBAL_MEM_FENCE);
} else {
barrier(CLK_GLOBAL_MEM_FENCE);
some work
}
...assuming the program counter of the two barrier instructions differ? Or is it sufficient that both sets of work hit a barrier before allowing their pcs are allowed to advance? (I guess this would be undefined behaviour...)
Cool thanks.
Back on the "it's easy to hang your GPU thing", would it be possible to get the compiler to emit a message if a condition like this were detected during? Sure, you couldn't detect many conditions but there must be some basic uses of barriers that are incorrect and the compiler could then hint about? (even if the hint wasn't necessarily correct)
eg,
if (get_global_id(0) < 128)
return;
else
barrier(...);
The compiler won't know if there would ever be a spilt kernel - where half take the barrier but half don't - but a hint could be given saying that the barrier isn't a part of the main function expression, but a conditional expression? Just something to draw your eye, rather than binary chopping your program until it doesn't crash any more! Make it easier for people porting code to OpenCL.