cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

liwoog
Adept II

RADEON HD 7970 on linux

fglrxinfo will not recognize a RADEON HD 7970 under linux 64. Latest SDK and drivers. Is the card no yet supported?

0 Likes
39 Replies
Marix
Adept II

If you have a working x config it will work with Catalyst 11.12. I am seeing some bugs, but that might be errors in my code.

0 Likes

It sort of works, but I get the stupid "Unsupported hardware" watermark in the corner of the screens, and I've been getting a lot of kernel panics since I put it in.

0 Likes

you may need wait one or two catalyst versions.

0 Likes

I am also getting kernel panics at startup (it seems to happen with 2/3rds or so probability). Wouldn't this be a compatability issue with the kernel? Or is it rather that the kernel doesn't have the appropriate firmware yet to correctly recognize the radeon 7970?

I installed fglrx at first (drivers version 11.12-r1) and it seemed to work with the "unsupported hardware" watermark, but then my X wouldn't start anymore. I attached the error message, along with my xorg.conf file.

Let me know if you want me to provide additional information.

Thanks! (this is the first i've seen anyone else having this problem)

 

Backtrace: [ 22.922] 0: /usr/bin/X (xorg_backtrace+0x28) [0x568608] [ 22.922] 1: /usr/bin/X (0x400000+0x16c1e9) [0x56c1e9] [ 22.922] 2: /lib64/libpthread.so.0 (0x7f0a45547000+0x102e0) [0x7f0a455572e0] [ 22.923] 3: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PhwSIslands_FindVoltage+0x1e) [0x7f0a424284be] [ 22.923] 4: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PhwSIslands_ApplyVoltageDeltaRules+0x7c) [0x7f0a4242854c] [ 22.923] 5: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PhwSIslands_ApplyStateAdjustRules+0x385) [0x7f0a42428925] [ 22.923] 6: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PHM_ApplyStateAdjustRules+0x12) [0x7f0a423be602] [ 22.923] 7: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (0x7f0a41e5a000+0x587171) [0x7f0a423e1171] [ 22.924] 8: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PEM_Task_AdjustPowerState+0x3f) [0x7f0a423e508f] [ 22.924] 9: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PEM_ExcuteEventChain+0x64) [0x7f0a423e39a4] [ 22.924] 10: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PEM_HandleEvent_Unlocked+0x23) [0x7f0a423e2193] [ 22.924] 11: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PEM_HandleEvent+0x25) [0x7f0a423e2245] [ 22.924] 12: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PEM_Initialize+0x147) [0x7f0a423e24a7] [ 22.925] 13: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (0x7f0a41e5a000+0x560856) [0x7f0a423ba856] [ 22.925] 14: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (PP_Initialize+0x28) [0x7f0a423ba448] [ 22.925] 15: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (swlPPLibInitializePowerPlay+0x7c) [0x7f0a4237030c] [ 22.925] 16: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (swlPPLibInit+0x3f) [0x7f0a42370a1f] [ 22.925] 17: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (xilDisplayAdaptorCreate+0xbb) [0x7f0a4235deab] [ 22.925] 18: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (xdl_xs111_atiddxDisplayPreInit+0xab7) [0x7f0a4234ef67] [ 22.926] 19: /usr/lib64/xorg/modules/drivers/fglrx_drv.so (xdl_xs111_atiddxPreInit+0xe6f) [0x7f0a4232031f] [ 22.926] 20: /usr/bin/X (InitOutput+0x80c) [0x47238c] [ 22.926] 21: /usr/bin/X (0x400000+0x24783) [0x424783] [ 22.926] 22: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x7f0a4448309d] [ 22.926] 23: /usr/bin/X (0x400000+0x244e9) [0x4244e9] [ 22.926] Segmentation fault at address 0x800aa6454 [ 22.926] Fatal server error: [ 22.926] Caught signal 11 (Segmentation fault). Server aborting ============================================================ xorg.conf ============================================================ Section "ServerLayout" Identifier "Layout0" Screen 0 "Screen0" 0 0 InputDevice "Keyboard0" "CoreKeyboard" InputDevice "Mouse0" "CorePointer" EndSection Section "Files" EndSection Section "InputDevice" # generated from default Identifier "Mouse0" Driver "mouse" Option "Protocol" "auto" Option "Device" "/dev/input/mice" Option "Emulate3Buttons" "no" Option "ZAxisMapping" "4 5" EndSection Section "InputDevice" # generated from default Identifier "Keyboard0" Driver "kbd" EndSection Section "Monitor" Identifier "Monitor0" VendorName "Unknown" ModelName "Unknown" HorizSync 28.0 - 33.0 VertRefresh 43.0 - 72.0 Option "DPMS" "true" EndSection Section "Device" Identifier "Device0" Driver "fglrx" BusID "PCI:2:0:0" EndSection Section "Screen" Identifier "Screen0" Device "Device0" Monitor "Monitor0" DefaultDepth 24 SubSection "Display" Depth 24 EndSubSection EndSection Section "Module" Disable "dri" Disable "dri2" EndSection

0 Likes

We got some codes to run, but the most important one locks the machine.

kernel: [fglrx] ASIC hang happened


17.01.2012 - 16:06:03westmere10Warningkernelkernel: End of dump
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Dump the trace queue.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: last submit IB buffer -- MC :0xffd8a36000,phys:0x62191d000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: CP_IB1_BUFSZ:0x0, CP_IB1_BASE_HI:0xff, CP_IB1_BASE_LO:0xd8a36000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: CP_RB_BASE : 0xffd81000, CP_RB_RPTR : 0x27e8 , CP_RB_WPTR :0x29a0.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: GRBM : 0xa04c7028, SRBM : 0x20000fc0 .
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x0, size:0x200000, reference count:6, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x200000, size:0x900000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0xb00000, size:0x900000, reference count:4, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x1400000, size:0x900000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x1d00000, size:0x900000, reference count:1, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: MC start:0xffc0400000, Physical:0x0, size:0x17d00000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: mc_node :GART_CACHEABLE, total 3 zones
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x0, size:0x2000000, reference count:21, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x2000000, size:0x1800000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x3800000, size:0x1800000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x5000000, size:0x1800000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x6800000, size:0x1800000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x8000000, size:0x1800000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x9800000, size:0x1800000, reference count:2, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: MC start:0xffd8100000, Physical:0x0, size:0x27f00000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: mc_node :GART_USWC, total 2 zones
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0xb02f4000, size:0xc000, reference count:1, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x2f8000, size:0x8000, reference count:1, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: MC start:0xf40fd00000, Physical:0xdfd00000, size:0xb0300000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: mc_node :INV_FB, total 1 zones
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0xfaff000, size:0x201000, reference count:1, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Mapped heap -- Offset:0x0, size:0xfaff000, reference count:22, mapping count:0,
17.01.2012 - 16:06:03westmere10Warningkernelkernel: MC start:0xf400000000, Physical:0xd0000000, size:0xfd00000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: mc_node :FB, total 1 zones
17.01.2012 - 16:06:03westmere10Warningkernelkernel: gart table MC:0xf40faff000, Physical:0xdfaff000, size:0x200000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: FB phys addr: 0xd0000000, MC :0xf400000000, Total FB size :0xc0000000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Asic ID:0x6798, revision:0x5, MMIOReg:0xffffc90017680000.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: device 0 : 0xffff88062ba98000 .
17.01.2012 - 16:06:03westmere10Warningkernelkernel: pubdev:0xffffffffa03883e0, num of device:1 , name:fglrx, major 8, minor 93.
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffff81189941>] ? sys_ioctl+0x81/0xa0
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffff8109b753>] ? current_kernel_time+0x13/0x50
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffff811893c4>] ? do_vfs_ioctl+0x84/0x580
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffff81189222>] ? vfs_ioctl+0x22/0xa0
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa00fa93e>] ? ip_firegl_unlocked_ioctl+0xe/0x20 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffff81042ba4>] ? __do_page_fault+0x1e4/0x480
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa0104ded>] ? firegl_ioctl+0x1ed/0x250 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa0127330>] ? firegl_cmmqs_CWDDE32+0x0/0x100 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa01273a0>] ? firegl_cmmqs_CWDDE32+0x70/0x100 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa0128a72>] ? firegl_cmmqs_CWDDE_32+0x332/0x440 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffff8109681e>] ? down+0x2e/0x50
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa018804d>] ? _Z8uCWDDEQCmjjPvjS_+0x54d/0x10c0 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa018c1c4>] ? _Z19uQSTimeStampRetiredmjj14_LARGE_INTEGER+0x74/0x80 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa0193a23>] ? _ZN15QS_PRIVATE_CORE27multiVpuPM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x33/0x50 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa01264a2>] ? firegl_trace+0x72/0x1e0 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa01264a2>] ? firegl_trace+0x72/0x1e0 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa019cabf>] ? _ZN4Asic19PM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0xaf/0x170 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa01a227c>] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0x9c/0xf0 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa01a22d9>] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x9/0x10 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa010924c>] ? firegl_hardwareHangRecovery+0x1c/0x50 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: [<ffffffffa00fbc1e>] ? KCL_DEBUG_OsDump+0xe/0x10 [fglrx]
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Call Trace:
17.01.2012 - 16:06:03westmere10Warningkernelkernel: Pid: 4012, comm: ktmig3d_gpu Tainted: P ---------------- 2.6.32-220.2.1.el6.centos.plus.x86_64 #1
17.01.2012 - 16:06:03westmere10Informationalkernelkernel: [fglrx] ASIC hang happened


0 Likes

Hi

You should check if the fglrx installation succeeds to compile the kernel driver.

If you have this icon - it means thats installaion failed .

Please look at the popup after installing fglrx - it shows that path to the log file.

Look at the log file - and search for errors.

You can send me the file if you need help

0 Likes

liwoog,

If you can provide a test case that is causing the failure, we can investigate it to find out why it is causing the graphics card to hang.

0 Likes

Thank you Micah,

The crash is not immediate and the code rather involved and proprietary.

I will try to reduce the bug if I have the time (It took me a week to find and document a bug in NVIDIA's OpenCL LLVM compiler that was easier to isolate).

Are there any tools you can offer me to help me locate the problem?

Lionel

0 Likes

Its a graphics/kernel timeout. The most likely problems are the kernel is running to long(180 seconds is the cutoff I was told) or that all work-items in a work-group are not hitting a barrier.

0 Likes

Thank you,

I will check both possibilities.

0 Likes

I have lowered the amount of work I send a kernel, as otherwise, it could run longer than 180 seconds. The hang now takes more time to happen.

Also, processing is surprisingly slow compared to a GTX 580.

Freezing the hardware is not a very elegant response to this though, any thought of improving this? I also get:

kernel:Disabling IRQ #17

0 Likes

Hi Micah,

clEnqueueBarrier does not work works intermitently in your implementation.

Testing on another code that was failing on the 7970,  I got this part to work by replacing the clEnqueueBarrier with a clFinish (see the DEBUG comment).

The result of the second clEnqueueNDRangeKernel is wrong (NaNs) if I use a barrier, correct if I use a clFinish.

Added: Though the failure is not consistent with clEnqueueBarrier, it will sometimes provide proper results.

I am compiling with the CL_USE_DEPRECATED_OPENCL_1_1_APIS flag.

Lionel

                    cl_long offset0 = 0L;

  cl_long offset1 = rtm2DInfo.image_float_mem_size;

  cl_long offset2 = rtm2DInfo.image_float_mem_size * 2;


  // Set the arguments for the first processing step

                    err  = clSetKernelArg(rtm2DDevice->rtmForwardStep2D0, sizeof(cl_mem),          &rtm2DDevice->forward_mem[0]);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D1, sizeof(cl_mem),          &rtm2DDevice->forward_mem[0]);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D3, sizeof(cl_mem),          &rtm2DDevice->forward_mem[0]);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D4, sizeof(cl_long),          &offset0);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D5, sizeof(cl_long),          &offset1);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D6, sizeof(cl_long),          &offset0);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D7, sizeof(float),                    &zeroCoeff);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D18, sizeof(float),          &oneCoeff);

                    if (CL_SUCCESS != err)

                    {

  fprintf(stderr, "Failed to set rtmForwardStep2D kernel argument on device %d (%d)\n", device, err);

                              goto error;

                    }

  // Enqueue a processing step

                    if (CL_SUCCESS != (err = clEnqueueNDRangeKernel(rtm2DDevice->cmd_queue,

                                                                                rtm2DDevice->rtmForwardStep2D,

                                                                                3,

                                                                                NULL,

  rtm2DInfo.forwardGlobalWorkSize,

  rtm2DInfo.forwardLocalWorkSize,

                                                                                0, NULL, NULL)))

                    {

  fprintf(stderr, "clEnqueueNDRangeKernel rtmForwardStep2D failed on device %d (%d)\n", device, err);

                              goto error;

                    }

  // Enqueue a barrier

                    // if (CL_SUCCESS != (err = clEnqueueBarrier(rtm2DDevice->cmd_queue))) // <------ DEBUG ----------

                    if (CL_SUCCESS != (err = clFinish(rtm2DDevice->cmd_queue)))

                    {

  fprintf(stderr, "Failed to enqueue barrier on GPU %d (%d)\n", device, err);

                              goto error;

                    }

  // Set the arguments for the second processing step

                    err  = clSetKernelArg(rtm2DDevice->rtmForwardStep2D4, sizeof(cl_long),          &offset1);

                    err |= clSetKernelArg(rtm2DDevice->rtmForwardStep2D5, sizeof(cl_long),          &offset2);

                    if (CL_SUCCESS != err)

                    {

  fprintf(stderr, "Failed to set rtmForwardStep2D kernel argument on device %d (%d)\n", device, err);

                              goto error;

                    }

  // Enqueue a processing step

                    if (CL_SUCCESS != (err = clEnqueueNDRangeKernel(rtm2DDevice->cmd_queue,

                                                                                rtm2DDevice->rtmForwardStep2D,

                                                                                3,

                                                                                NULL,

  rtm2DInfo.forwardGlobalWorkSize,

  rtm2DInfo.forwardLocalWorkSize,

                                                                                0, NULL, NULL)))

                    {

  fprintf(stderr, "clEnqueueNDRangeKernel rtmForwardStep2D failed on device %d (%d)\n", device, err);

                              goto error;

                    }

0 Likes

I think I observed something similar but I never finished a reduced test case for it. I needed a clFinish between launching a kernel and setting arguments for the second launch. Otherwise the output of the first kernel would end up in the wrong buffer which is one of the arguments I set.

I also observed the same on Windows.

0 Likes

It looks like clEnqueueBarrier is a no-op, I'll file a bug against engineering team for this issue so that it will get implemented in the future.

0 Likes

Thank you Micah. A time frame would be helpful.

Also please make sure that other OpenCL 1.1 deprecated APIs are not no-ops (i.e. clEnqueueMarker).

0 Likes

Is there a time frame for fixing this bug? We will be upgrading 40 NVIDIA GPUs and adding at least 20 more when the NVIDIA GTX 680 comes out. If AMD wishes to be in the running to compete for those slots, a little better customer support would be a good idea.

We will be upgrading 40 NVIDIA GPUs and adding at least 20 more when the NVIDIA GTX 680 comes out. If AMD wishes to be in the running to compete for those slots, a little better customer support would be a good idea.

I doubt support for GTX 680 will be much better in the very beginning. 🙂

Especially since nVidia wants you to buy Tesla for computing.

And sure, with Tesla you'll get more and better support, but you'll also pay 3x the actual value of the card.

HD7970 beats Tesla by a factor 2 in speed, and still you only paid half the price. Is that worth the support?

That's up to you to decide. For some usages you need a blazing fast race car (HD7970), for others you need a rolls royce with paid driver (Tesla).

0 Likes

When I found a bug in NVIDIA's compiler I was able to file a bug report and track its progress. The compiler was fixed within two weeks. In comparison, AMD, champion of OpenCL, does not have a working OpenCL implementation on their first true HPC contender card and it is impossible to get an idea as to when the reported bugs will be taken care of.

I have been a top promoter of OpenCL, leaving the door open for AMD, but so far NVIDIA keeps my business.

This is the beauty of an open standard for heterogeneous computing; you can always change your device's vendor without much trouble.

0 Likes

pwvdendr wrote:

HD7970 beats Tesla by a factor 2 in speed, and still you only paid half the price. Is that worth the support?

Increase the price 45 bucks and give us smooth drivers and support.  We like that!

Even 100 bucks!  We want the thing to work!

EDIT:  I'm very sorry if you got multiple email notifications for my edits.  I just realized I could turn that off.

0 Likes

So, cEnqueueBuffer is supposed to be a no-op since we only support in-order queue's and thus everything is executed sequentially. From the OpenCL spec:

"The clEnqueueBarrier command ensures that all queued

commands in command_queue have finished execution before the next batch of

commands can begin execution. The clEnqueueBarrier command is a synchronization

point."

clEnqueueBarrier does not block the host, and since we don't support out of order queue's, a no-op follows the spec.

If you can get a test case of your issue, we can try to figure out why the behavior app is not working correctly.

0 Likes

I think arsenm is more likely to have found the root of the problem, it would explain the clFInish vs the clEnqueueBarrier:

arsenm wrote:

I think I observed something similar but I never finished a reduced test case for it. I needed a clFinish between launching a kernel and setting arguments for the second launch. Otherwise the output of the first kernel would end up in the wrong buffer which is one of the arguments I set.

I also observed the same on Windows.

I will try the 12.2 drivers first.

0 Likes

There is a known failure with this that we have fixed and it should be in the 12.3 preview driver(It might be in 12.2, but not sure).

0 Likes

Can one get access to the 12.3 preview for linux?

The 12.2 are still broken.

0 Likes

12.3 is still broken, though it seems to freeze a little less. I guess I will try the NVIDIA GTX 680 next, the Radon HD 7970 OpenCL drivers are clearly not usable for this round of GPU cards.

29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff814f692e>] ? do_device_not_available+0xe/0x10
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff811899c1>] ? sys_ioctl+0x81/0xa0
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff81189444>] ? do_vfs_ioctl+0x84/0x580
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff811892a2>] ? vfs_ioctl+0x22/0xa0
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02ae93e>] ? ip_firegl_unlocked_ioctl+0xe/0x20 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02b8e6d>] ? firegl_ioctl+0x1ed/0x250 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02dc120>] ? firegl_cmmqs_CWDDE32+0x0/0x100 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02dc190>] ? firegl_cmmqs_CWDDE32+0x70/0x100 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02dd5af>] firegl_cmmqs_CWDDE_32+0x14f/0x440 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02a8bae>] KCL_SEMAPHORE_DownUninterruptible+0xe/0x10 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff810968b1>] down+0x41/0x50
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff814f55a2>] __down+0x72/0xb0
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02daf22>] ? firegl_trace+0x72/0x1e0 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffffa02daf22>] ? firegl_trace+0x72/0x1e0 [fglrx]
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff814f4685>] schedule_timeout+0x215/0x2e0
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff8118ba50>] ? pollwake+0x0/0x60
29.03.2012 - 10:47:08hermesWarningkernelkernel: [<ffffffff8118ba50>] ? pollwake+0x0/0x60
29.03.2012 - 10:47:08hermesWarningkernelkernel: Call Trace:
29.03.2012 - 10:47:08hermesWarningkernelkernel: ffff880805328678 ffff880807721fd8 000000000000f4e8 ffff880805328678
29.03.2012 - 10:47:08hermesWarningkernelkernel: ffffc90000000000 ffff880807721ab8 ffffffff8118ba50 dead000000100100
29.03.2012 - 10:47:08hermesWarningkernelkernel: ffff880807721bd8 0000000000000082 0000000000000000 00000000000000db
29.03.2012 - 10:47:08hermesInformationalkernelkernel: Xorg D 0000000000000001 0 2903 2901 0x00400000
29.03.2012 - 10:47:08hermesErrorkernelkernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
29.03.2012 - 10:47:08hermesErrorkernelkernel: INFO: task Xorg:2903 blocked for more than 120 seconds.
0 Likes

I was able to get the proper computation by inserting a clFlush() between the clEnqueueNDRangeKernel() calls. 12.3 drivers. This should not be required, but fixes the problem. Would you have an explanation on the cause of the bug. This piece of code now provides the proper results (though quite slow compared to the NVIDIA). My other piece of code still ends up freezing the hardware but runs about 80% faster on the 7970 than on a GTX 580.

This works:

  // Enqueue a processing step

                    if (CL_SUCCESS != (err = clEnqueueNDRangeKernel(..)

 

  // Enqueue a barrier (For NVIDIA GPUs)

                    if (CL_SUCCESS != (err = clEnqueueBarrier(..)))

  // Make sure the previous kernel is loaded on the GPU before changing the arguments (For AMD GPUs)

                    if (CL_SUCCESS != (err = clFlush(..)))

 

  // Set the arguments for the second processing step

                    err  = clSetKernelArg(..);

                    err |= clSetKernelArg(..);

  // Enqueue a processing step

                    if (CL_SUCCESS != (err = clEnqueueNDRangeKernel(..)))

0 Likes

I seem to have a similar problem with 12.4 could you ever resolve the issue completely?

http://devgurus.amd.com/thread/159073

In my case clFlush() or clFinish() does not really help. After 8-10 enqueues of the same kernel (only with different offset). It crashes. The problem is not a 180 second time limit, I am able to run the kernel if I enqueue 3-4 times but with larger global sizes. I even tried to enqueue whole range at once and it ran in about 500 seconds with single enqueue and there were no problems.

Thanks!

0 Likes

I'm still seeing similar problems using 12.3, but they disappear using the 1.2 beta driver (1.4.1720)

0 Likes
arsenm
Adept III

I'm still getting the unsupported hardware thing with 12.1

0 Likes

I also have the same flag with 12.1 on HD 7970. Furthermore, I have no OpenCL support anymore and can't even initialze an OpenCL platform anymore (I recieve error code -1001). The code I've compile worked under Catalayst 11.12 on a HD 6970. Has there been any official statement on this issue?

0 Likes
arsenm
Adept III

0 Likes

The 12.1 are still buggy. Hangs the machine. (I have been running this code on NVIDIA's OpenCL implementation for months on 40 GPUs).

Feb  1 10:04:00 hermes kernel: [fglrx] ASIC hang happened

Feb  1 10:04:00 hermes kernel: Pid: 7550, comm: ktmig3d_gpu Tainted: P        W  ----------------   2.6.32-220.2.1.el6.centos.plus.x86_64 #1

Feb  1 10:04:00 hermes kernel: Call Trace:

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c14c1e>] ? KCL_DEBUG_OsDump+0xe/0x10 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c2224c>] ? firegl_hardwareHangRecovery+0x1c/0x50 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0cae444>] ? _ZN18mmEnginesContainer9timestampEP26_QS_MM_TIMESTAMP_PACKET_INP27_QS_MM_TIMESTAMP_PACKET_OUT+0x184/0x1c0 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c3ee62>] ? firegl_trace+0x72/0x1e0 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0cab763>] ? _ZN15QS_PRIVATE_CORE27multiVpuPM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x33/0x50 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0cabc43>] ? _ZN15QS_PRIVATE_CORE25escapeMultiMediaInterfaceEP21_QS_QUERY_API_CALL_INPvjS2_j+0xd3/0xe0 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0ca07bc>] ? _Z8uCWDDEQCmjjPvjS_+0xe7c/0x10c0 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffff8109681e>] ? down+0x2e/0x50

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c41432>] ? firegl_cmmqs_CWDDE_32+0x332/0x440 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c3fd60>] ? firegl_cmmqs_CWDDE32+0x70/0x100 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c3fcf0>] ? firegl_cmmqs_CWDDE32+0x0/0x100 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c1dded>] ? firegl_ioctl+0x1ed/0x250 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffff81042ba4>] ? __do_page_fault+0x1e4/0x480

Feb  1 10:04:00 hermes kernel: [<ffffffffa0c1393e>] ? ip_firegl_unlocked_ioctl+0xe/0x20 [fglrx]

Feb  1 10:04:00 hermes kernel: [<ffffffff81189222>] ? vfs_ioctl+0x22/0xa0

Feb  1 10:04:00 hermes kernel: [<ffffffff811893c4>] ? do_vfs_ioctl+0x84/0x580

Feb  1 10:04:00 hermes kernel: [<ffffffff814f3911>] ? thread_return+0x4e/0x79d

Feb  1 10:04:00 hermes kernel: [<ffffffff81189941>] ? sys_ioctl+0x81/0xa0

Feb  1 10:04:00 hermes kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b

Feb  1 10:04:00 hermes kernel: pubdev:0xffffffffa0ea2dc0, num of device:1 , name:fglrx, major 8, minor 92.

Feb  1 10:04:00 hermes kernel: device 0 : 0xffff88012f7d4000 .

Feb  1 10:04:00 hermes kernel: Asic ID:0x6798, revision:0x5, MMIOReg:0xffffc90007100000.

Feb  1 10:04:00 hermes kernel: FB phys addr: 0xd0000000, MC :0xf400000000, Total FB size :0xc0000000.

Feb  1 10:04:00 hermes kernel: gart table MC:0xf40faff000, Physical:0xdfaff000, size:0x200000.

Feb  1 10:04:00 hermes kernel: mc_node :FB, total 1 zones

Feb  1 10:04:00 hermes kernel:    MC start:0xf400000000, Physical:0xd0000000, size:0xfd00000.

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x0, size:0xfaff000, reference count:22, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0xfaff000, size:0x201000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel: mc_node :INV_FB, total 1 zones

Feb  1 10:04:00 hermes kernel:    MC start:0xf40fd00000, Physical:0xdfd00000, size:0xb0300000.

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x2f8000, size:0x8000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0xb02f4000, size:0xc000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel: mc_node :GART_USWC, total 2 zones

Feb  1 10:04:00 hermes kernel:    MC start:0xffd8100000, Physical:0x0, size:0x27f00000.

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x9800000, size:0x1800000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x8000000, size:0x1800000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x6800000, size:0x1800000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x5000000, size:0x1800000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x3800000, size:0x1800000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x2000000, size:0x1800000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x0, size:0x2000000, reference count:20, mapping count:0,

Feb  1 10:04:00 hermes kernel: mc_node :GART_CACHEABLE, total 3 zones

Feb  1 10:04:00 hermes kernel:    MC start:0xffc0400000, Physical:0x0, size:0x17d00000.

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1900000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1800000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1700000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1600000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1500000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1400000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1300000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1200000, size:0x100000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x1100000, size:0x100000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0xc00000, size:0x500000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x700000, size:0x500000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x200000, size:0x500000, reference count:2, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0x0, size:0x200000, reference count:8, mapping count:0,

Feb  1 10:04:00 hermes kernel:    Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,

Feb  1 10:04:00 hermes kernel: GRBM : 0xa0003028, SRBM : 0x200006c0 .

Feb  1 10:04:00 hermes kernel: CP_RB_BASE : 0xffd81000, CP_RB_RPTR : 0xcfd0 , CP_RB_WPTR :0xcfd0.

Feb  1 10:04:00 hermes kernel: CP_IB1_BUFSZ:0x0, CP_IB1_BASE_HI:0xff, CP_IB1_BASE_LO:0xd87c8000.

Feb  1 10:04:00 hermes kernel: last submit IB buffer -- MC :0xffd87c8000,phys:0x36504a000.

Feb  1 10:04:00 hermes kernel: Dump the trace queue.

Feb  1 10:04:00 hermes kernel: End of dump

0 Likes

I also see these ASIC hangs on my three 7970s running on an openSUSE 12.1 in all drivers up to the 12.4 Preview with OpenCL 1.2 support. The code causing this runs fine on an AMD FirePro V7800. At least in my case the issue seems to be somewhat related to problem size. On small datasets it works, on larger datasets and the resulting increase in memory consumption and runtime the hangs happen. Worst part is that only way to get the GPU back after that is to reset the whole machine.

0 Likes

Identical experience. After the clFlush fix for the kernel argument bug (see above), I can run small problems but not large ones.

0 Likes

liwoog wrote:

Identical experience. After the clFlush fix for the kernel argument bug (see above), I can run small problems but not large ones.

Hi liwoog,  is it possible for you to try the steps suggested by pwvdendr here:

http://devgurus.amd.com/message/1280510#1280510

0 Likes

I've tried it out, and the card is working quiet well. But nonetheless I have no chance getting an OpenCL program running. You also can find this driver here, which seems to be based on 11.12

http://support.amd.com/us/kbarticles/Pages/catalyst121linuxdriver.aspx

but for me it doesn't work as well. Does anybody have positive experiences with this?

Edit: 02/09

Problem fixed. There was some porblem with the insatllation of the driver and OpenCl. Some libraries have not been instzalled correctly. Just linking them to the correct ones helped. If anyone has similar problems, it helps to check the installation log of the driver in detail. Takes some time but really helps.

0 Likes

With this driver the unsupported hardware thing is gone, but my stuff is running about 25% slower than on Windows. With the older GPUs it's still pretty much the same like it usually is.

0 Likes
vic20
Journeyman III

I was not able to make the 8.921 opencl drivers work on Linux for my radeon 7970. I suspect a bad cleaning of the old amd drivers (for hd radeon 5850). I would like to avoid reinstalling ubuntu.  Is there a procedure to install the new drivers without a clean linux install ?

I am also interested in opencl-opengl interop. How to be sure that the compiler and linker will find the opengl headers and the good opengl drivers ?

This problem reappears each time I update the drivers or when synaptics touches the opengl configuration.

AMD GPU are nice, but the Nvidia linux drivers are easier to tame ! linux drivers are important for people interested in HPC...

thanks

0 Likes

I have 3 7970s running in Ubuntu and the only problems I've had with them is that mapping a buffer > 256MB causes a zombie process after it hangs.

In Ubuntu, try this https://wiki.ubuntu.com/X/Troubleshooting/FglrxInteferesWithRadeonDriver#Problem:__Need_to_fully_rem...

to fully cleanse your system of the drivers and then install the ones on AMD's site. Be warned that kernel header updates break the drivers, so you'll have to do this anytime you update them.

I'm running Ubuntu 11.10, APP 2.6, and the 8.921 drivers, have a monitor connected to each card, and did the dumb tricks needed to run OpenCL over ssh. Other than that, everything is vanilla.

0 Likes