Raistmer
Adept II

Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs

My app starts producing invalid results if the source CL file is compiled with Catalyst 12.10, and causes driver restarts if it is compiled with Cat 12.11 beta 8.

All this was observed on an HD7770 GPU. It looks like HD5xxx and HD6xxx are not affected.

Moreover, if the app uses cached binaries compiled with Catalyst 12.8, it works OK under Catalyst 12.10 and 12.11 beta 8. So it is some problem that the very new OpenCL runtime compiler introduces.

Any chance of getting this issue fixed in the Cat 12.11 release?

0 Likes
43 Replies
binying
Challenger

Could you provide more information about this issue such as a simple test code?

0 Likes

It was observed on the whole app (final results were invalid), and so far I have had no time to narrow this issue down to a particular kernel type.

The app is available for testing and I can provide a bench config, but so far each time I did this the AMD side fell permanently silent. I really don't want to spend time for nothing.

0 Likes

I have also seen it on my HD7750 running 12.11 beta 8; this was while testing some autocorrelation work units from SETI Beta using our bench program.

0 Likes

Well, unfortunately, this problem is not specific to HD7xxx.

I was able to reproduce it on my own HD6950.

Here is the test case you asked for: https://dl.dropbox.com/u/60381958/Bad_binaries_with_Cat12.11beta8_test_case.7z

How to use:

Extract the archive and run the application (executable). It will perform some computations over the dataset included in the archive.

The app is provided with a text-based CL file. It will compile that CL file and produce a few *.bin* files with binary kernels.

There are also 2 subdirectories: one with such binaries generated under Catalyst 12.11 beta 8 (for the HD6950 GPU), and another with binaries generated with some older Catalyst (I can't say exactly which, but most probably 12.6). When I use the older binaries (running under Catalyst 12.11 beta 8), the app does its computations and finishes OK.

But when I use no binaries (that is, compilation from scratch under Cat 12.11 beta 8) or binaries already compiled under Cat 12.11 beta 8, the app causes a driver restart.

Please confirm this and advise on a possible fix for this issue. The app is supposed to be installed automatically on a huge number of hosts, so driver restarts are not something we can live with...

0 Likes
Raistmer
Adept II

Same issue with Catalyst 12.11 beta 11, GPU is HD6950, OS: Vista x86

0 Likes
Raistmer
Adept II

I installed Catalyst 12.1

The test case with freshly generated kernel binaries works OK.

When I replace those binaries with ones generated under Cat 12.11, the driver restart issue returns.

I think all this is quite complete evidence that the problem is not in the runtime but in the new OpenCL->binary compiler that generates the GPU binaries. No matter which runtime they run under, old or new, binaries generated with an old Catalyst always work OK, while binaries generated with Cat 12.11 (beta 8 or beta 11) cause a driver restart.

One can find those binaries via the link above.

0 Likes

Had some time to test this app's issue on two 64-bit openSUSE Linux hosts (same sources as on Windows).

Sadly, I was able to reproduce the problem found on Windows on one of the hosts, which has two Radeon HD 7750 cards, with Cat 12.11 beta 8 and beta 11. It causes a complete system freeze (screen active, but the system no longer responds to any input such as mouse or keyboard) and makes a hard reboot necessary. This system has only PCIe 2.0 slots available, while the second host, which does NOT reproduce this problem, has a Radeon HD 7850 residing in a PCIe 3.0 slot. Maybe this is connected, maybe not; it is at least an observation.

So, if there is a fix for Windows, please make sure you have one for Linux as well.

0 Likes
Raistmer
Adept II

Any progress with the test case? Is the problem confirmed?

0 Likes

Yes, the test case causes a restart of the graphic card on Win7/SDK2.8/ Catalyst 12.11beta.

Is it possible that you narrowed it down a little bit?

0 Likes

Thanks! Hope it will help to fix this issue in next release.

EDIT: About narrowing - yes, I can provide build with very verbose output. So it will be seen what last API call was before restart.

0 Likes

yes, I can provide build with very verbose output. So it will be seen what last API call was before restart.

     --it's perhaps better than none.

0 Likes

Here it is: https://dl.dropbox.com/u/60381958/Bad_binaries_with_Cat12.11beta8_test_case.7z

I added verbose build to same archive as before.

It will write logs into stderr.txt in same dir.

0 Likes
Raistmer
Adept II

So, what is the current status of this issue?

0 Likes

It should be taken care of by the engineering team of AMD, although I haven't heard back from them.

0 Likes
Raistmer
Adept II

The problem is not fixed in Cat 13.1; I have reports that the driver restart occurs on 13.1 too.

0 Likes
Raistmer
Adept II

Looks like the problem is not fixed in Catalyst 13.2 beta 3 either.

Now the CL file can't be compiled at all:

OpenCL Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
  Max compute units: 10
  Max work group size: 256
  Max clock frequency: 1120Mhz
  Max memory allocation: 536870912
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 16384
  Global memory size: 1073741824
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Scratchpad
  Local memory size: 32768
  Queue properties:
    Out-of-Order: No
  Name: Capeverde
  Vendor: Advanced Micro Devices, Inc.
  Driver version: 1124.2 (VM)
  Version: OpenCL 1.2 AMD-APP (1124.2)
  Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_amd_c1x_atomics

INFO: can't open binary kernel file: .\\AstroPulse_Kernels_r1761.cl_Capeverde.bin_V6, continue with recompile...

Error : Building Program (source, clBuildProgram):main kernels: not OK code -11

Internal error: Compilation failed.

It works OK with Cat 12.1, with Cat 12.8 (for example).

EDIT: And I don't see how this issue received "Assumed Answered" status when it remains in all the new drivers AMD releases. It is absolutely not answered, and in fact it is a critical issue.

0 Likes

Hi raistmer,

Sorry for the delay. I am looking into this issue now. I will let you know the status.

I am not sure who marked this thread as answered. I hope you can again mark it as unanswered.

Raistmer,

It looks like the Dropbox link you gave is dead (maybe temporarily). Can you please share the test case again?

Message was edited by: Himanshu Gautam

0 Likes

1) I don't see how I can unmark that.

2) Just now I clicked on DropBox link above and got archive downloaded on this PC.

Link is: https://dl.dropbox.com/u/60381958/Bad_binaries_with_Cat12.11beta8_test_case.7z and it should not expire until I manually delete this archive from DropBox. So try again, maybe luck will be with you next time

0 Likes

Confirming that the dropbox link Raistmer has given still works.

I can also confirm that Linux shows an identical failure to compile the OpenCL kernels from source. Because the build now fails before anything runs, the crashes I reported earlier in this thread no longer occur.

Using precompiled (binary) kernels still works, but how are we supposed to create these in the future?

0 Likes

Hi,

Still not able to access this link as my company network does not allow it.

Please re-share the test case by attaching it here itself as a zip file.

EDIT: Use advanced text editor to attach the testcase.

Message was edited by: Himanshu Gautam

0 Likes

Here it is.

0 Likes

Thanks Raistmer.

I guess the problem is already reported by binying, but I was not able to find a tracking number for it. Will let you know the status now.

0 Likes

Hi Raistmer,

I tried the two executables you had shared (MB7_win_x86_SSE_OpenCL_ATi_r1726_verbose.exe and setiathome_6.99_windows_intelx86__opencl_ati_sah.exe) on Drivers 13.1, 12.8 and 12.3. Both the applications always resulted in driver crash.

My system details: HD 7970, Driver: 13.1,12.8,12.3, CPU: FX4100

Anyways I will report it to AMD Team Again. Sorry could not find the reference to the old bug.

0 Likes

Are you sure you were able to downgrade the recent drivers properly?

The inability of the AMD Catalyst installer to properly downgrade the OpenCL runtime is a known bug and was reported by Claggy on these forums too.

It is very possible that all the variants you tried were on the same recent 13.1 OpenCL runtime, which indeed fails to compile at all.

The real OpenCL runtime from Cat 12.8 has no issues with the app. And to check the initial problem with invalid computations you should try the Catalyst 12.10 drivers, not the ones you tried.

0 Likes

I have not seen any problems in downgrading to old drivers with a clean system. I could see proper driver versions in CCC.

Anyways will check with 12.10 too. (As per you, i should get invalid results with 12.10 driver, how to verify that?).

I will check once more with 12.8 with more rigorous cleanup. Thanks for your support.

0 Likes

Driver version and OpenCL runtime version are quite different things. Be careful to refer to the right one (the OpenCL runtime).

I have reports of success running the Catalyst 13.1 video (and perhaps sound and so on) driver but with the OpenCL runtime taken from Cat 12.8. Only the OpenCL compiler works incorrectly. BTW, did you remove the *.bin* files between runs? If you run with binaries compiled under the old driver on the new driver, you will get correct results too (because, again, it is the OpenCL compiler that is broken, not the OpenCL runtime per se). If one already has the right binary, it will be executed OK.
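Clearing the cached binaries between runs can be scripted. Below is a minimal sketch in C, assuming that cache file names contain ".bin" (as in AstroPulse_Kernels_r1761.cl_Capeverde.bin_V6 from the log earlier in this thread). The helper names is_cached_kernel_binary and purge_cached_binaries are hypothetical and not part of the app; the caller is assumed to supply the list of file names in the working directory.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: does this file name look like a cached kernel binary?
   Assumption: cache names contain ".bin", e.g.
   "AstroPulse_Kernels_r1761.cl_Capeverde.bin_V6". */
int is_cached_kernel_binary(const char *name)
{
    return strstr(name, ".bin") != NULL;
}

/* Delete every cached binary in the given list of file names.
   Returns the number of files actually removed. */
int purge_cached_binaries(const char *const *names, size_t count)
{
    int removed = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        /* remove() is standard C and returns 0 on success. */
        if (is_cached_kernel_binary(names[i]) && remove(names[i]) == 0)
            removed++;
    }
    return removed;
}
```

Running this before each test forces a compile from source, so the result reflects the compiler of the currently installed OpenCL runtime rather than a binary left over from another driver.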

For now, to check whether the app behaves differently, check the stderr.txt file for the number of found signals.

I will attach a validating tool later.

0 Likes
Raistmer
Adept II

To check the validity of a computation one can use the attached tool and the reference result (inside the archive).

usage:

rescmpv5.exe ref-setiathome_6.98_windows_intelx86.exe-PG0395_v7.wu.res result.sah

where result.sah is the file generated after a full run of the app.

The tool's output is self-explanatory. In case of big differences between results it will show a table with the quality of the found signals in the 2 files.

Examples of usage:

Cat 12.8 run:

E:\123>rescmpv5.exe ref-setiathome_6.98_windows_intelx86.exe-PG0395_v7.wu.res result.sah

Result      : Strongly similar,  Q= 99.41%

A similarity of less than 100% is almost inevitable between CPU and GPU runs of long floating-point computations, but this similarity is good enough.
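For anyone who wants to check the verdict automatically across many driver versions, here is a minimal sketch in C that parses the "Result" line shown above. The function name parse_rescmp_result is hypothetical, and the line format is assumed only from the two samples in this post ("Result : Strongly similar, Q= 99.41%" and "Result : Different."); other output variants of rescmpv5 are not covered.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse a "Result : <verdict>[, Q= <pct>%]" line.
   Returns 1 and fills verdict/q on success (q is -1.0 when the line
   carries no Q value), or 0 if this is not a result line. */
int parse_rescmp_result(const char *line, char *verdict, size_t vsize, double *q)
{
    const char *p;
    size_t len;

    if (vsize == 0 || strncmp(line, "Result", 6) != 0)
        return 0;
    p = strchr(line, ':');
    if (p == NULL)
        return 0;
    p++;
    while (*p == ' ')
        p++;

    /* The verdict text ends at ',' (a Q value follows), '.' or end of line. */
    len = strcspn(p, ",.\r\n");
    if (len >= vsize)
        len = vsize - 1;
    memcpy(verdict, p, len);
    verdict[len] = '\0';

    *q = -1.0;
    p = strstr(p, "Q=");
    if (p != NULL)
        sscanf(p + 2, "%lf", q);   /* leaves q at -1.0 if nothing parses */
    return 1;
}
```

A bench script could feed each line of the tool's output to this function and flag any run whose verdict is not "Strongly similar".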

Cat 12.8 but binaries taken from Cat 12.11 beta 8:

1) driver restart occurred (just as was reported in initial post).

2)

E:\123>setiathome_6.99_windows_intelx86__opencl_ati_sah.exe

E:\123>rescmpv5.exe ref-setiathome_6.98_windows_intelx86.exe-PG0395_v7.wu.res result.sah

                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      0      0      0      0        0      0      0      0      0
     Autocorr      0      0      0      0      1        0      0      0      0      0
     Gaussian      0      0      0      0      1        0      0      0      0      0
        Pulse      0      0      0      0      0        0      0      0      0      0
      Triplet      0      0      0      0      0        0      0      0      0      0
   Best Spike      0      0      0      0      1        0      0      0      0      0
Best Autocorr      0      0      0      0      1        0      0      0      0      0
Best Gaussian      0      0      0      0      1        0      0      0      0      0
   Best Pulse      0      0      0      0      1        0      0      0      0      0
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0      0      0      0      6        0      0      0      0      0

Unmatched signal(s) in R1 at line(s) 672 689 716 732 749 775

Result      : Different.

As one can see, the number of found results differs (of course: the app was terminated after the driver restart, so the computations did not finish).

One will see a similar table if the computation finishes OK but with wrong results.

The validation tool will show the differences as in this sample.

P.S.:

(As per you, i should get invalid results with 12.10 driver, how to verify that?).

Yes, expect a wrong result (but no driver restart) with Cat 12.10. Driver restarts appeared in later driver releases.

The verification tool and how to use it are described above in this post.


0 Likes

Hi Raistmer,

Probably you were right about driver downgrading issue. Here are my observations:

1. I had installed 12.10 without a proper system clean and saw the driver crash there. Ran rescmpv5.exe and the results were incorrect.

2. Then I cleaned the system using the AMD cleanup utility before installing any other driver.

3. Installed 12.8 driver: SETI.exe ran without a crash. Checked correctness with rescmpv5.exe and it gave 99.9% correctness.

4. Installed 12.10 again, and surprisingly seti.exe again ran without crash. rescmpv5 also passed correctly.

5. Now installed 13.1 driver, SETI.exe crashed. rescmpv5.exe confirms incorrect result.

Attached are the result.sah and stderr file for all cases.

So our observations are differing for 12.10 driver as of now. But anyways it is a bug. Please provide any feedback you have on the results.

himanshu.gautam wrote:

Hi Raistmer,

Probably you were right about driver downgrading issue. Here are my observations:

1. I had installed 12.10 without a proper system clean and saw the driver crash there. Ran rescmpv5.exe and the results were incorrect.

2. Then I cleaned the system using the AMD cleanup utility before installing any other driver.

3. Installed 12.8 driver: SETI.exe ran without a crash. Checked correctness with rescmpv5.exe and it gave 99.9% correctness.

4. Installed 12.10 again, and surprisingly seti.exe again ran without crash. rescmpv5 also passed correctly.

5. Now installed 13.1 driver, SETI.exe crashed. rescmpv5.exe confirms incorrect result.

Attached are the result.sah and stderr file for all cases.

So our observations are differing for 12.10 driver as of now. But anyways it is a bug. Please provide any feedback you have on the results.

Looking at the result from:

4. Installed 12.10 again, and surprisingly seti.exe again ran without crash. rescmpv5 also passed correctly.

It looks as if Raistmer has supplied a workunit that doesn't show a weakly similar result on Cat 12.10; it has:

'WU true angle range is :  0.394768'

The workunits that showed the weakly similar result were the PG0009_v7.wu and the refquick_v7.wu workunits,

which have 'WU true angle range is :  0.008955' and 'WU true angle range is :  0.775000' respectively.

Here's a full bench of five different workunits (with 3 different apps) where those two workunits are weakly similar.

Claggy

Edit: added PG0009_v7.wu and refquick_v7.wu workunits along with ref files for said workunits.

Himanshu, thanks for looking into this issue deeply.

Since Claggy was the first to report the Cat 12.10 issue to me, I think he is right in his explanation of why your observation differs from what I said about Cat 12.10.

Please replace work_unit.sah from my archive with the same file from the PG0009_v7.workunit.7z.zip archive that Claggy attached.

Also, another ref file, ref-setiathome_6.98_windows_intelx86.exe-PG0009_v7.wu.res (again, provided in that archive), is needed to check the fresh result.sah. The comparison utility remains the same.

P.S. So, for now we can summarize the issues as follows:

1) Incorrect computations with Cat 12.10 do not appear on all data sets. Moreover, the difference for the PG0009 task is in (as we call them) "best signals", that is, signals below the threshold for being marked as reportable. That means computations in kernels compiled under 12.10 differ from the correct ones not by much, but by enough for a precision issue to appear.

2) The Catalyst 13.1 compiler is broken for this kernels file. That is a separate issue, because the error appears even before computations begin.

0 Likes

Hi Raistmer,

I checked with the new data as you suggested. I had taken the result.sah file and the reference file from PG0009_v7.workunit.7z.zip attachment. Not sure what the other 2 attachments are intended for.


rescmpv5 utility gives weakly similar for 12.10 driver. So some corruption is happening.

rescmpv5 gives strongly similar for 12.8 driver. Expected.

But rescmpv5 gives strongly similar for 13.1 driver now with the new data. SURPRISE again.

So, as I understand it, there are two issues here:

1. Data corruption when the driver is updated from 12.8 to 12.10. But it is not reproduced with the 13.1 driver, so probably not an issue. Can you confirm?

2. Driver crash when the driver is updated from 12.10 to 13.1. This is valid for the old data itself.

I will try to do some debugging on codeXL too, and let you know.

Hi Raistmer,

Would it be possible to provide a test case with the host code? I tried working with the kernel file, but there are so many kernels (which are enabled/disabled using #defines). Also, RESULT_SIZE seems to be a macro defined in host code and used in the kernels. I could not compile the kernels in KernelAnalyzer because of this macro.

Message was edited by: Himanshu Gautam

0 Likes

Hi, Himanshu.

Thanks for continuing to look into this issue.

Regarding no crash under Cat 13.1: no idea for now; maybe Claggy or another alpha tester who follows this thread will have some idea. I can't maintain test configs myself right now because I need a stable environment, so I stick with Cat 12.1 on my main PC and an "unknown" (but also old) version of Catalyst on a C-60 netbook. Info about app behavior on the latest drivers comes from alpha testers.

And regarding host code: of course, no problem with this. It's a GPLed app with freely available sources.

So you can look directly into the repository (head, or the revision that I used for the test case binary). Suggestions and improvements are welcome!

Here is repository:

https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt

and for this particular app you need files in root + these dirs:

https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/AKv8

https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/bin

https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/lib

https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/src

P.S. The defines you are looking for are in the GPU_lock.cpp file:



strcpy(buildoptions,"-w -DRESULT_SIZE=32 -cl-unsafe-math-optimizations -fno-bin-llvmir -fno-bin-amdil");

if(swi.analysis_cfg.autocorr_fftlen) strcat(buildoptions," -DSETI7");//R: dynamically define if autocorr is needed
0 Likes

Thanks Raistmer for the update.

Are you sure -cl-unsafe-math-optimizations flag is not causing the data corruption issue?

I will try to look into the code base in some days. Meanwhile If you can arrange for more information, from claggy and team, it would be helpful.

0 Likes

It was not the case with older drivers. But maybe the new one has indeed enabled some more "unsafe" optimizations. Worth checking; I will, thanks.
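One way to run that check is to make the flag switchable, so the same data set can be run with and without it. The following is only a sketch in C: the function make_build_options and its parameters are hypothetical and are not taken from GPU_lock.cpp, although the option strings themselves are the ones quoted from that file above. It also uses snprintf instead of strcpy/strcat so that an undersized buffer is reported rather than overrun.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: assemble the clBuildProgram option string.
   seti7       - nonzero to append -DSETI7 (autocorrelation needed)
   unsafe_math - nonzero to keep -cl-unsafe-math-optimizations
   Returns 0 on success, -1 if the buffer is too small. */
int make_build_options(char *buf, size_t size, int seti7, int unsafe_math)
{
    int n = snprintf(buf, size, "-w -DRESULT_SIZE=32%s%s%s",
                     unsafe_math ? " -cl-unsafe-math-optimizations" : "",
                     " -fno-bin-llvmir -fno-bin-amdil",
                     seti7 ? " -DSETI7" : "");

    /* snprintf returns the length it wanted to write; a value >= size
       means the string was truncated. */
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With unsafe_math set to 1 the result matches the original option string; with it set to 0 the kernels are built without the relaxed math flag. Cached *.bin* files would need to be removed between the two runs so that both actually recompile.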

Regarding kernel file compilation issues: they were observed under Linux too. Not a crash but an "internal error" instead:

Error : Building Program (source, clBuildProgram):main kernels: not OK code -11

Internal error: Compilation failed.

(this is on Catalyst 13.2 beta 7)

It's the same app, just its Linux port. We will try to narrow down the location of the issue inside the CL file.

0 Likes

Error : Building Program (source, clBuildProgram):main kernels: not OK code -11

Internal error: Compilation failed.

-11 means the kernel compilation failed. Check the build log returned by the clGetProgramBuildInfo API.
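For reference, -11 is the value of CL_BUILD_PROGRAM_FAILURE in the OpenCL headers. Below is a minimal lookup sketch in C; the function name cl_build_error_name is hypothetical, and the numeric values are copied by hand from the Khronos cl.h header rather than included, so that the snippet builds without an OpenCL SDK. In real host code one would include CL/cl.h and, on this error, query clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG to get the compiler's message.

```c
/* Values copied from the Khronos cl.h header (assumption: redefined here
   under SKETCH_ names so the example compiles without an OpenCL SDK). */
enum {
    SKETCH_CL_SUCCESS                = 0,
    SKETCH_CL_COMPILER_NOT_AVAILABLE = -3,
    SKETCH_CL_BUILD_PROGRAM_FAILURE  = -11,
    SKETCH_CL_INVALID_BINARY         = -42,
    SKETCH_CL_INVALID_BUILD_OPTIONS  = -43
};

/* Hypothetical helper: map a clBuildProgram return code to its name. */
const char *cl_build_error_name(int code)
{
    switch (code) {
    case SKETCH_CL_SUCCESS:                return "CL_SUCCESS";
    case SKETCH_CL_COMPILER_NOT_AVAILABLE: return "CL_COMPILER_NOT_AVAILABLE";
    case SKETCH_CL_BUILD_PROGRAM_FAILURE:  return "CL_BUILD_PROGRAM_FAILURE";
    case SKETCH_CL_INVALID_BINARY:         return "CL_INVALID_BINARY";
    case SKETCH_CL_INVALID_BUILD_OPTIONS:  return "CL_INVALID_BUILD_OPTIONS";
    default:                               return "unknown";
    }
}
```

Printing the name next to the number in the app's error message would make reports like "not OK code -11" easier to read at a glance.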

0 Likes

himanshu.gautam wrote:

Error : Building Program (source, clBuildProgram):main kernels: not OK code -11

Internal error: Compilation failed.

-11 means the kernel compilation failed. Check the build log returned by the clGetProgramBuildInfo API.

That is the complete Buildlog:

"Internal error:Compilation failed."

Attached the output from AMD APP KernelAnalyzer2 for our MultiBeam_Kernels.cl

0 Likes

I am able to compile both the kernel files Multibeam_kernels_r1726.cl & Multibeam_kernel_r1643.cl with the above mentioned build options with 13.1 Driver. 13.2 is in beta, so I recommend to try 13.1 only. Kernel Analyzer with attached Info, built both kernels for all 18 OpenCL devices.

0 Likes

Did you try under Windows or under Linux ?

This subthread is about Linux, and your screenshot strongly resembles the Windows version's "About" dialog...

0 Likes

Hi Raistmer,

Yes, I tried on Windows, just like everything else. I will try on Linux today and let you know.

But I am afraid we have not been able to nail down the problem to something that I can forward to the relevant people for fixing.

As of now, here are the inferences:

1. Both kernels compile fine in Kernel Analyzer on Windows (so kernel compilation is probably not the issue). I still need to check on Linux, though.

2. Two test cases were given; the first test case produces a driver crash. Can you confirm it is not a very time-consuming kernel? The reason for the driver crash may just be VPU recover. The second test case, which differs only in some data files, passes properly (with strong correctness).

Do you have any ideas?

0 Likes