Archives Discussions

teh_orph · ‎06-14-2010

Hi there,

I wrote some OpenCL code today at work in one of the NVIDIA OpenCL samples, then brought it home to see how well it fares on my 5770. After a dependent read is enqueued on my kernel, the machine will quickly lock up. Networking still works but all USB halts and the machine's display doesn't update. Sound continues though...

If I slightly modify my code to not do some simple atomic stuff (which works on my GTX 275) then it works okay (albeit the result is not correct). This failure occurs with both cat 10.4 and 10.5.

Is there anywhere I can send this so that an engineer can have a go and repeat it themselves?

Cheers!

Simon

PS: all the AMD Stream 2.1 examples run fine...it's just my program which locks it!

Btw, kernel's attached.

EDIT: it appears that it's the final barrier that's the problem. If I remove that then it's fine!

#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable #pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable __kernel void SSD (const __global unsigned char* pBig, const __global unsigned char* pSmall, __global int* pOut, int iWidth, int iHeight, __global int *pFinal) { // find position in global arrays int gx = get_global_id(0); int gy = get_global_id(1); int lx = get_local_id(0); int ly = get_local_id(1); __local int local_min; local_min = 99999; barrier(CLK_LOCAL_MEM_FENCE); int global_x = get_group_id(0) * localW; int global_y = get_group_id(1) * localH; //__local unsigned char plSmall[iSmallWidth * iSmallHeight]; //__local unsigned char plBig[(localW + iSmallWidth) * (localH + iSmallHeight)]; //event_t pref = 0; //pref = async_work_group_copy(&plSmall[0], &pSmall[0], (size_t)(iSmallWidth * iSmallHeight), pref); //for (int y = 0; y < localH + iSmallHeight; y++) // prefetch(&pBig[(y + global_y) * iWidth + global_x], (size_t)(localW + iSmallWidth)); // pref = async_work_group_copy(&plBig[y * (localW + iSmallWidth)], &pBig[(y + global_y) * iWidth + global_x], (size_t)(localW + iSmallWidth), pref); //wait_group_events(1, &pref); //barrier(CLK_LOCAL_MEM_FENCE); int sum = 0; for (int y = 0; y < iSmallHeight; y++) for (int x = 0; x < iSmallWidth; x++) { unsigned char big = pBig[(y + gy) * iWidth + (x + gx)]; //unsigned char big = plBig[(y + ly) * (localW + iSmallWidth) + (x + lx)]; unsigned char small = pSmall[y * iSmallWidth + x]; int diff = (int)big - (int)small; diff = diff * diff; sum += diff; } pOut[gy * iWidth + gx] = sum; /*atom_min(&local_min, sum); <-- unmark this section to force the crash! if (local_min == sum) { atom_min(&pFinal[0], sum); barrier(CLK_GLOBAL_MEM_FENCE); if (pFinal[0] == sum) { pFinal[1] = gx; pFinal[2] = gy; } }*/ }

teh_orph · ‎06-15-2010

So...where should I be reporting these bugs? Is this the right forum?

MicahVillmow · ‎06-15-2010

This is the correct location.

MicahVillmow · ‎06-15-2010

teh_orph,
All threads in a group must hit the barrier or the results are undefined. Please gaurantee that all threads hit the last barrier or that no threads hit the last barrier.

teh_orph · ‎06-16-2010

Cool, cheers for the info! Although it'd give the same result in this case, I only actually want the ones which took part in the atom_min to wait (since they're the only ones which may have written data). Or is that not the intention of barriers?

Let's say that I wanted to surround the whole function with an if statement which only did processing assuming the global x,y were not near the border of my data - where would I put the barrier then?

eg,

if (get_global_id(0) < iWidth-iSmallWidth && get_global_id(1) < iHeight -iSmallHeight)

{

atom_min(ptr, result);

barrier(CLK_GLOBAL_MEM_FENCE); <--here?

atom_max(ptr, another result);

}

where should I put the barrier to ensure that they all take it?

And finally, surely a misplaced barrier shouldn't be locking up my PC?

teh_orph · ‎06-20-2010

So...any guidance on where to put barriers when a whole kernel is if'ed out?

And can I report the locking-up -the-computer issue in some kind of bug track database? Cheers! (surely the driver should time out some how?)

MicahVillmow · ‎06-21-2010

teh_orph,
If you deadlock the card by incorrectly writing a kernel/app there is nothing the driver or the OS can do to recover. A graphics card is not a CPU and is not pre-emptible by the OS. Your PC is not locked up, as you should still be able to SSH into the machine, but the graphics card is, so your display won't update.

teh_orph · ‎06-21-2010

Hang on, that can't be right

- I'm only using one compute unit, since my work size is only made up of one work group

- the driver can obviously load balance several pieces of work at the same time

- I can't believe that there's not a watchdog timer of some sorts which can interrupt errant kernels

- it crashes over a period of ten seconds, first the mouse + keyb refuse the work, then sound stops, then the display stops working so it's not a simple dead-lock

- this is 2010: how are we supposed to develop software if we have to reboot if we make a mistake whilst coding??

EDIT: sorry this sounds a bit negative, I'm just a bit frustrated and I hope to make OpenCL development easier for people further along the line and don't want people to automatically blame "dodgy AMD drivers" when their mouse locks up!

MicahVillmow · ‎06-21-2010

teh_orph,
This is not a driver issue, but a hardware/kernel issue, as the driver has no control over the kernel that the user writes. This is a hardware issue because our hardware is not pre-emptible, so it doesn't act like a CPU. So if you do something to hang the hardware and cause the hardware to stop responding to reset commands from the OS, then it is a kernel issue. It only takes two wavefronts to hang the GPU, so unless your work-group size is equal to the wavefront size, then it is possible to hang on an illegal kernel.

teh_orph · ‎06-21-2010

Ah fair enough! Thanks for the help.

So - for future reference - what can hang a GPU? I've seen unending loops and this barrier issue, anything else?

And I still don't understand this barrier problem - why do I need all processors to take a barrier? Why should they hang if not all of them do it? And finally, what can I do if it only makes sense for a subset to take it (eg if the other ones leave via a surrounding if statement)?

MicahVillmow · ‎06-21-2010

teh_orph,
It is a simple live-lock problem.
Consider this simple code:
if (get_local_id(0) < 128) {
barrier(CLK_LOCAL_MEM_FENCE);
} else {
barrier(CLK_GLOBAL_MEM_FENCE);
}

If you run a work-group of 256 work-items, which is made up of between 4 and 16 wavefronts depending on the graphics card. The first half(A) of them will take the barrier with the local fence, the second half(B) will take the barrier with the global fence.
The A group then goes to sleep waiting for the B group to hit the barrier with the local fence.
The B group then goes to sleep waiting for the A group to hit the barrier with the global fence.

This is a typical locking problem. The big issue here is that the lock is in hardware, not in software, so you can't kill it.

teh_orph · ‎06-21-2010

Ah I see! Thanks! So it's really a way of preventing the program counter from advancing. Sorry, my mistake. (Although I thought that all the processors within a compute unit ran each instruction in lockstep?)

Anyway I guess what I'm really looking for is a mem_fence: these don't have to be run by all processing units, right? Can I have a kernel where a subset write data to a memory, and then have that subset do a mem_fence and then read back data from that memory space?

Cheers again for the help.

MicahVillmow · ‎06-21-2010

If there is any kind of data sharing, then you need a barrier and not just a memory fence. A memory fence is a compiler hint, a barrier is for execution synchronization.

On a side note, 1/4th of each wavefront execute in lock-step, and a work-group can be made up of multiple wavefronts. Also the barrier counter is at the wavefront level, so for our hardware, if one work-item from each wavefront hits the barrier then you can avoid a live-lock of the device.

teh_orph · ‎06-21-2010

Ok, final question:

I guess I can't do this either, if it's a program counter blocker...

if (get_local_id(0) < 128) {

some work
barrier(CLK_GLOBAL_MEM_FENCE);

} else {
barrier(CLK_GLOBAL_MEM_FENCE);
some work

}

...assuming the program counter of the two barrier instructions differ? Or is it sufficient that both sets of work hit a barrier before allowing their pcs are allowed to advance? (I guess this would be undefined behaviour...)

MicahVillmow · ‎06-21-2010

Each barrier at the source level is unique, so no two barriers will ever share the same program counter.

teh_orph · ‎06-22-2010

Cool thanks.

Back on the "it's easy to hang your GPU thing", would it be possible to get the compiler to emit a message if a condition like this were detected during? Sure, you couldn't detect many conditions but there must be some basic uses of barriers that are incorrect and the compiler could then hint about? (even if the hint wasn't necessarily correct)

eg,

if (get_global_id(0) < 128)

return;

else

barrier(...);

The compiler won't know if there would ever be a spilt kernel - where half take the barrier but half don't - but a hint could be given saying that the barrier isn't a part of the main function expression, but a conditional expression? Just something to draw your eye, rather than binary chopping your program until it doesn't crash any more! Make it easier for people porting code to OpenCL.

MicahVillmow · ‎06-22-2010

teh_orph,
The code generator already detects some cases, but most of the time the detection is only for cases where the optimizer illegally transforms your code, not when the user writes illegal code.

Archives Discussions

Kernel repeatedly hard-locks machine when using global atomics