Archives Discussions

corry · ‎03-14-2012

I found a copy of the 8.96 drivers and tried my program that just writes 2 values to a UAV, and again the kernel hard-locked the entire machine. What gives here? I haven't been able to run code on these cards with anything but the original driver that shipped with them!

MicahVillmow · ‎03-14-2012

What cards? I'm guessing HD7XXX. Also, what OS? If you can link to your original post, it would help in situations like this, where a new post references information from a previous one.

Thanks,

Micah

corry · ‎03-14-2012

The previous post had 2 topics in it. Yeah, it's HD7970s, 2 of them in the machine, currently under Windows 7 x64.

Old post was at http://devgurus.amd.com/message/1279449#1279449 but contained another issue as well.

Also mentioned in http://devgurus.amd.com/thread/158695 and I swear I had one more....

drallan · ‎03-14-2012

I have had the exact same kind of problem (edit: similar). I have tried all the drivers (listed below) since installing the cards, and as soon as I run certain (many) programs the system crashes (instant reboot) or I get the BSOD (Bill's screen of death). I'm running a similar system with two 7970s and one 6950 under Windows 7 SP1. The only clues are:

Many programs are older programs known to work.
Not all programs.
A program might run once then crash the second time.
One program always runs once then always crashes second time.
Crash occurs at program launch.
Occasionally, program runs a few loops prior to crash (thus not compile stage)
Same problem after a recent motherboard change and reinstall of Windows (this is virtually a new computer).

Looks more like a host-side problem, such as a memory/buffer access out of bounds. I noticed that my 3 cards have a total of 8GB memory, the same as the motherboard. Windows (which is mostly unintelligible) shows something about buffering memory for the cards??

Drivers that work: (slightly different variants of 8.93.xxxxx)

OEM driver v 8.93.

02/21/2012 12:00 AM 137,171,160 amd_radeon_hd7900_win7_64.exe

Drivers that don't:

12/05/2011 08:36 AM 155,455,736 11-11c_amd_catalyst_windows_vista_7.exe

12/14/2011 07:47 AM 114,931,120 11-12_vista64_win7_64_dd_ccc_ocl.exe

01/25/2012 07:46 AM 180,809,808 12-1_preview_amd_catalyst_windows_vista_7.exe

02/28/2012 05:35 PM 152,441,856 12-2_pre-certified_win7_64_feb_16.exe

03/08/2012 02:12 PM 165,923,488 12-2_vista_win7_64_dd_ccc_march7.exe

02/26/2012 04:13 PM 186,899,768 12-3_8.95_rc_amd_catalyst_feb17.exe

02/11/2012 09:45 AM 181,131,440 12.1a_preview_amd_catalyst_win7_32-64.exe

03/03/2012 10:54 PM 181,225,344 8.96-120228m-[Guru3D.com]x.exe

02/26/2012 04:40 PM 182,317,139 8_96-120214a-[Guru3D.com].exe

drallan

corry · ‎03-15-2012

I'm not sure if it is the exact same problem, but my kernels are all pretty narrow in scope, so it's possible that by hitting different portions of the ALU or memory system the lockup can sometimes be avoided. Mine is just a simple kernel. I think I was using uav3 (arbitrarily), a literal l0, and registers r0 and r1: move l0.xxxx to r0 and r1, then uav_raw_store_id(3) mem, l0.x, r0 and uav_raw_store_id(3) mem, l0.y, r1. Yeah, simple as that, and it locks up. Obviously l0.y was the next address to write to. The machine takes about 7 minutes or so from crash to crash, so I lost my patience after running the thing 4 times with the same results and gave up.

drallan · ‎03-15-2012

Yes, of course it may not be exactly the same.

What I meant was a problem that occurs with all newer drivers but is never seen with the OEM drivers, for seemingly good kernels. It's possible that something has changed that exposes a problem I have, but it seems unusual that so many kernels are affected. I'll have to go back and see if I can narrow down where this is happening, which is not so easy since the system reboots and I have to continually reinstall drivers.

Does your whole system hang or is it just the display that stops?

corry · ‎03-15-2012

It seems to hang completely. Eventually even the mouse cursor stops updating. I haven't tried having a command line window up with a reboot command prepped and just pressing enter after a while to see if it responds...

corry · ‎03-16-2012

I get the feeling we're going to get stonewalled here again and be told to use OpenCL. It's really not making a good case for us to continue using AMD GPUs. I'm not even certain at this point that those with decision-making power are going to let us. Instead we'll have to use the higher power consumption, higher cost of cooling, slower, but working other-brand GPUs. Maybe I'll get lucky and Intel's Knights Landing will come soon and to a broad audience...

drallan · ‎03-16-2012

I have not used CAL since installing the 7970s, though I would like to see how it works. If you post your short kernel I'll try that since our systems are similar.

Instead of CAL, I now implement an extra step to edit IL code before it is compiled to binary, which either substitutes an entire block of IL code or does line by line substitution of IL instructions. The latter is good for all the integer instructions like bit reverse, bit alignment, and reading the timers.
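For readers wondering what a line-by-line substitution pass like the one described above might look like, here is a minimal sketch in C. It is not the tool from the post: the names il_rule and il_substitute are made up for illustration, and the sketch assumes each IL line starts with its mnemonic (no leading whitespace) and that a rule only swaps the mnemonic, leaving the operands untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One rewrite rule: a line whose mnemonic equals `from` gets `to` instead.
   Hypothetical type; the mnemonics passed in are up to the caller. */
typedef struct {
    const char *from;
    const char *to;
} il_rule;

/* Copies `src` into `dst` (capacity `cap`, including the terminating NUL),
   line by line, replacing the leading mnemonic of every line that matches
   a rule. Returns the number of substituted lines, or -1 if `dst` is too
   small to hold the result. */
int il_substitute(const char *src, const il_rule *rules, size_t nrules,
                  char *dst, size_t cap)
{
    assert(src && rules && dst && cap > 0);
    size_t out = 0;
    int hits = 0;

    while (*src) {
        const char *eol = strchr(src, '\n');
        size_t len = eol ? (size_t)(eol - src) : strlen(src);
        const char *repl = NULL;
        size_t skip = 0;

        for (size_t i = 0; i < nrules; ++i) {
            size_t flen = strlen(rules[i].from);
            /* Match whole mnemonics only: next char is a space or end of line. */
            if (len >= flen && strncmp(src, rules[i].from, flen) == 0 &&
                (len == flen || src[flen] == ' ')) {
                repl = rules[i].to;
                skip = flen;
                ++hits;
                break;
            }
        }

        size_t rlen = repl ? strlen(repl) : 0;
        size_t need = rlen + (len - skip) + (eol ? 1 : 0);
        if (out + need + 1 > cap)
            return -1;                      /* output buffer too small */

        if (repl) {
            memcpy(dst + out, repl, rlen);  /* substituted mnemonic */
            out += rlen;
        }
        memcpy(dst + out, src + skip, len - skip);  /* rest of the line */
        out += len - skip;
        if (eol)
            dst[out++] = '\n';
        src += len + (eol ? 1 : 0);
    }
    dst[out] = '\0';
    return hits;
}
```

With a rule such as { "my_bitrev", "native_bitrev" } (both placeholder names, not real IL opcodes), the text "my_bitrev r2, r0" would come out as "native_bitrev r2, r0" while every other line is copied through unchanged. Substituting a whole block of IL would need a different matching strategy than this per-line pass.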

As for crashing, I took a better look at the most recent 8.96 (March 7) drivers to see what might be making them crash, and found two situations that account for most of it. (I also found that one ALU-intensive kernel runs as much as 25 percent faster under the new version, though most kernels run about the same speed.) The crashes occur in these cases:

1. The very first call to clGetPlatformIDs(0,NULL,&nplatforms) crashes if it is executed from a thread (which is what I normally do). I can fix this by making a dummy call to the same routine just before entering the thread. Surely that's just a patch, as the call should never crash, and it may have something to do with the multiple GPUs.

2. Several applications that run two devices in parallel crash on calling clEnqueueNDRangeKernel. If I call clFinish() between the calls, it does not crash. This also should not be the case: I have been routinely running kernels in parallel for a long time, and the crash appears only with the new drivers. Calling clFinish is not a real fix, though, because then there is no point in having multiple devices.

As for GPUs, 10 teraflops on two cards is hard to turn down, though I agree OCL was never designed for high performance.

corry · ‎03-16-2012

I don't think the problem is the kernel, I think it's CAL. I agree with them moving away from it, I just don't think their timeline was thought out at all...not...at...all... The kernel I was using was basically the following, but I hacked it into existing code that I generate from another program. It's been regenerated several times now.

il_cs_2_0
dcl_literal l0, 0, 16, 0xFFFFFFFF, 0
dcl_raw_uav_id(8)
uav_raw_store_id(8) mem, l0.x, l0.zzzz
uav_raw_store_id(8) mem, l0.y, l0.zzzz
endmain
end

Still, that will lock it up good and solid after the call to calCtxIsEventDone (which actually dispatches the work).

I don't really want to rant about this, so I'll just say this. I'm not far away from pulling my support for AMD hardware where I work. The hardware might be better, but without software that works, the best hardware in the world is completely useless.

8.96 still can't run any code on the GPU