9 Replies Latest reply on Mar 16, 2012 2:45 PM by corry

    8.96 still can't run any code on the GPU

    corry

      I found a copy of the 8.96 drivers and tried my program, which just writes two values to a UAV, and again the kernel hard-locked the entire machine.  What gives here?  I haven't been able to run code on these cards with anything but the original driver that shipped with them!

        • Re: 8.96 still can't run any code on the GPU
          MicahVillmow

          What cards? I'm guessing HD7XXX, also what OS? If you can link to your original post it would help in these situations where a new post is referencing information from a previous post.

           

          Thanks,

          Micah

          • Re: 8.96 still can't run any code on the GPU
            drallan

            I have had the exact same kind of problem (edit: similar). I have tried all the drivers (listed below) since installing the cards, and as soon as I run certain (many) programs the system crashes (instant reboot) or I get the BSOD (Bill's screen of death). I'm running a similar system with two 7970s and one 6950 under Windows 7 SP1.  The only clues are:

             

            1. Many programs are older programs known to work.
            2. Not all programs.
            3. A program might run once then crash the second time.
            4. One program always runs once then always crashes second time.
            5. Crash occurs at program launch.
            6. Occasionally, program runs a few loops prior to crash (thus not compile stage)
            7. Same problem after a recent motherboard change and reinstall of Windows (this is virtually a new computer).

             

            Looks more like a host side problem, memory/buffer out of bounds. I noticed that my 3 cards have a total of 8GB memory, same as the motherboard. Windows (which is mostly unintelligible) shows something about buffering memory for the cards??

             

            Drivers that work (slightly different variants of 8.93.xxxxx):

            OEM driver v 8.93.
            02/21/2012  12:00 AM       137,171,160 amd_radeon_hd7900_win7_64.exe

            Drivers that don't:

            12/05/2011  08:36 AM       155,455,736 11-11c_amd_catalyst_windows_vista_7.exe
            12/14/2011  07:47 AM       114,931,120 11-12_vista64_win7_64_dd_ccc_ocl.exe
            01/25/2012  07:46 AM       180,809,808 12-1_preview_amd_catalyst_windows_vista_7.exe
            02/28/2012  05:35 PM       152,441,856 12-2_pre-certified_win7_64_feb_16.exe
            03/08/2012  02:12 PM       165,923,488 12-2_vista_win7_64_dd_ccc_march7.exe
            02/26/2012  04:13 PM       186,899,768 12-3_8.95_rc_amd_catalyst_feb17.exe
            02/11/2012  09:45 AM       181,131,440 12.1a_preview_amd_catalyst_win7_32-64.exe
            03/03/2012  10:54 PM       181,225,344 8.96-120228m-[Guru3D.com]x.exe
            02/26/2012  04:40 PM       182,317,139 8_96-120214a-[Guru3D.com].exe

             

            drallan                                       

              • Re: 8.96 still can't run any code on the GPU
                corry

                I'm not sure if it is the exact same problem, but my kernels are all pretty narrow in scope, so it's possible that hitting different portions of the ALU or memory system avoids the lockup sometimes. Mine is just a simple kernel: I think I was using uav3 (arbitrarily), literal l0, r0, and r1, moving l0.xxxx to r0 and r1, then uav_raw_store_id(3) mem, l0.x, r0 followed by uav_raw_store_id(3) mem, l0.y, r1. Yeah, simple as that, and it locks up.  Obviously l0.y was the next address to write to.  From crash to crash the machine takes about 7 minutes, so I lost my patience after running the thing 4 times with the same results and gave up.

                  • Re: 8.96 still can't run any code on the GPU
                    drallan

                    Yes, of course it may not be exactly the same.  

                    What I meant was a problem that occurs with all newer drivers but is never seen with the OEM drivers, for seemingly good kernels. It's possible that something has changed that exposes a problem I have, but it seems unusual that so many kernels are affected.  I'll have to go back and see if I can narrow down where this is happening, which is not so easy since the system reboots and I have to continually reinstall drivers.

                     

                    Does your whole system hang or is it just the display that stops?

                      • Re: 8.96 still can't run any code on the GPU
                        corry

                        It seems to hang completely.  Eventually even the mouse cursor stops updating.  I haven't tried having a command line window up with a reboot command prepped, then just pressing enter after a while to see if it responds...

                        • Re: 8.96 still can't run any code on the GPU
                          corry

                          I get the feeling we're going to get stonewalled here again and be told to use OpenCL. It's really not making a good case for us to continue using AMD GPUs. I'm not even certain at this point that those with decision-making power are going to let us. Instead we'll have to use the other brand's GPUs: higher power consumption, higher cost of cooling, slower, but working. Maybe I'll get lucky and Intel's Knights Landing will come soon and to a broad audience...

                            • Re: 8.96 still can't run any code on the GPU
                              drallan

                              I have not used CAL since installing the 7970s, though I would like to see how it works. If you post your short kernel I'll try that since our systems are similar.

                               

                              Instead of CAL, I now implement an extra step to edit IL code before it is compiled to binary, which either substitutes an entire block of IL code or does line by line substitution of IL instructions. The latter is good for all the integer instructions like bit reverse, bit alignment, and reading the timers.
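
The IL editing step described above could look roughly like the following Python sketch. It is not drallan's actual tool: the function names `substitute_lines` and `substitute_block` and the marker arguments are hypothetical, and the sketch assumes the IL is available as plain text before it is compiled to binary.

```python
import re


def substitute_lines(il_source, rules):
    """Apply (regex, replacement) rules to every IL line independently.

    `rules` is a list of (pattern, replacement) pairs; each pattern is
    matched against one line at a time, so a rule never spans lines.
    """
    compiled = [(re.compile(pattern), repl) for pattern, repl in rules]
    out = []
    for line in il_source.splitlines():
        for pattern, repl in compiled:
            line = pattern.sub(repl, line)
        out.append(line)
    return "\n".join(out)


def substitute_block(il_source, begin_marker, end_marker, new_block):
    """Replace the text from begin_marker through end_marker (inclusive).

    Returns the source unchanged if either marker is missing.
    """
    start = il_source.find(begin_marker)
    if start < 0:
        return il_source
    end = il_source.find(end_marker, start)
    if end < 0:
        return il_source
    end += len(end_marker)
    return il_source[:start] + new_block + il_source[end:]
```

For example, `substitute_lines(il_text, [(r"^slow_op\b", "fast_op")])` rewrites only lines that begin with `slow_op`; both mnemonics are made-up placeholders, not real IL instructions.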

                               

                              As for crashing, I took a closer look at the most recent 8.96 (March 7) drivers to see what might be making them crash, and found two situations that account for most of it. (I also found that one ALU-intensive kernel runs as much as 25 percent faster under the new version, though most kernels run at about the same speed.) The crashes occur in these cases:

                               

                              1. The very first call to clGetPlatformIDs(0,NULL,&nplatforms), if the call is executed from a thread (which I normally do). I can fix this by making a dummy call to the same routine just before entering the thread. Surely that's just a patch, as the call should never crash, and it may have something to do with the multiple GPUs.

                               

                              2. Several applications that run two devices in parallel crash on calling clEnqueueNDRangeKernel. If I call clFinish() between the calls, it does not crash. This also should not be the case: I have been routinely running kernels in parallel for a long time, and it only happens since installing the new drivers. Calling clFinish() is not a fix, though, because then there is no point in having multiple devices.
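
The workaround in item 1 amounts to making one throwaway platform query on the launching thread before any worker thread makes its own. A minimal Python sketch of that ordering follows. The names `run_with_warmup`, `query_platforms`, and `worker` are hypothetical, and `query_platforms` stands in for whatever wraps clGetPlatformIDs in a real program; the sketch only illustrates the call ordering and does not touch OpenCL itself.

```python
import threading


def run_with_warmup(query_platforms, worker):
    """Query platforms once on the calling thread, then run worker in a thread.

    query_platforms: zero-argument callable (stand-in for the real query).
    worker: callable that receives query_platforms and returns a result.
    """
    # Dummy call made before the thread starts; its result is discarded.
    query_platforms()

    result = {}

    def target():
        # The worker performs its own query, now no longer the first one.
        result["value"] = worker(query_platforms)

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return result.get("value")
```

As the post says, this only masks the crash; it does not explain why the first call fails when made from a thread.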

                               

                              As for GPUs, 10 teraflops on two cards is hard to turn down, though I agree OCL was never designed for high performance.

                                • Re: 8.96 still can't run any code on the GPU
                                  corry

                                  I don't think the problem is the kernel; I think it's CAL.  I agree with them moving away from it, I just don't think their timeline was thought out at all...not...at...all...  The kernel I was using was basically the following, but I hacked it into existing code that I generate from another program.  It's been regenerated several times now.

                                   

                                  il_cs_2_0
                                  dcl_literal l0, 0, 16, 0xFFFFFFFF, 0
                                  dcl_raw_uav_id(8)
                                  uav_raw_store_id(8) mem, l0.x, l0.zzzz
                                  uav_raw_store_id(8) mem, l0.y, l0.zzzz
                                  endmain
                                  end

                                   

                                  Still, that will lock it up good and solid after the call to calCtxIsEventDone (which actually dispatches the work).
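
Since the kernel is generated by another program, a small generator would make it easy to try the same repro against different UAV IDs (the thread mentions both 3 and 8). The Python sketch below is an assumption about how that could be done, not corry's generator: `make_store_kernel` is a hypothetical name, and only the UAV ID is parameterized, so the rest of the text is the kernel exactly as posted.

```python
def make_store_kernel(uav_id=8):
    """Return the two-store IL repro kernel as text for the given UAV ID."""
    lines = [
        "il_cs_2_0",
        # l0.x and l0.y are used as the two store addresses,
        # l0.z (0xFFFFFFFF) is the value written by both stores.
        "dcl_literal l0, 0, 16, 0xFFFFFFFF, 0",
        f"dcl_raw_uav_id({uav_id})",
        f"uav_raw_store_id({uav_id}) mem, l0.x, l0.zzzz",
        f"uav_raw_store_id({uav_id}) mem, l0.y, l0.zzzz",
        "endmain",
        "end",
    ]
    return "\n".join(lines) + "\n"
```

Calling `make_store_kernel(3)` yields the same kernel with every UAV reference switched to ID 3, which keeps the declaration and both stores in step.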

                                   

                                  I don't really want to rant about this, so I'll just say this.  I'm not far away from pulling my support for AMD hardware where I work.  The hardware might be better, but without software that works, the best hardware in the world is completely useless.