43 Replies Latest reply on Apr 12, 2013 4:29 PM by Claggy Branched to a new discussion.

    Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs

    Raistmer

      My app starts to produce invalid results if the source CL file is compiled with Catalyst 12.10, and causes driver restarts if it is compiled with Cat 12.11 beta 8.

      All of this was observed on an HD7770 GPU. It looks like HD5xxx and HD6xxx are not affected.

      Moreover, if the app uses cached binaries compiled with Catalyst 12.8, it works OK under Catalyst 12.10 and 12.11 beta 8. So the problem is introduced by the newest OpenCL runtime compiler.

      Any chance of getting this issue fixed in the Cat 12.11 release?

        • Re: Problems with Cat 12.10 and up and HD7xxx GPUs
          binying

          Could you provide more information about this issue, such as some simple test code?

            • Re: Problems with Cat 12.10 and up and HD7xxx GPUs
              Raistmer

              It was observed on the whole app (final results were invalid), and so far I have had no time to narrow the issue down to a particular kernel type.

              The app is available for testing and I can provide a bench config, but so far each time I did this the AMD side fell into permanent silence. I really don't want to spend time for nothing.

              • Re: Problems with Cat 12.10 and up and HD7xxx GPUs
                Raistmer

                Well, unfortunately, this problem is not specific to HD7xxx.

                I was able to reproduce it on my own HD6950.

                Here is the test case you asked for: https://dl.dropbox.com/u/60381958/Bad_binaries_with_Cat12.11beta8_test_case.7z

                 

                How to use:

                Extract the archive and run the application (executable). It will perform some computations over the dataset included in the archive.

                The app is provided with a text-based CL file. It will compile that CL file and produce a few *.bin* files with binary kernels.

                There are also 2 subdirectories: one with such binaries generated under Catalyst 12.11 beta 8 (for the HD6950 GPU), and another with binaries generated under some older Catalyst (I can't say exactly, but most probably 12.6). When I use the older binaries (running under Catalyst 12.11 beta 8), the app does its computations and finishes OK.

                But when I use no binaries (that is, compilation from scratch under Cat 12.11 beta 8) or binaries already compiled under Cat 12.11 beta 8, the app causes a driver restart.

                Please confirm this and advise on a possible fix for this issue. The app is supposed to be installed automatically on a huge number of hosts, so driver restarts are not something we can live with...

              • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                Raistmer

                Same issue with Catalyst 12.11 beta 11, GPU is HD6950, OS: Vista x86

                • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                  Raistmer

                  I installed Catalyst 12.1.

                  The test case with freshly generated kernel binaries works OK.

                  When I replace those binaries with ones generated under Cat 12.11, the driver restart issue returns.

                  I think all this is fairly complete evidence that the problem is not in the runtime but in the new OpenCL->binary compiler that generates GPU binaries. No matter which runtime they run under, old or new, binaries generated with an old Catalyst always work OK, and binaries generated with Cat 12.11 (beta 8 or beta 11) cause a driver restart.

                  Those binaries can be found at the link above.
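
                  The reasoning above can be checked mechanically. This is only a sketch: the rows are the compiler/runtime combinations reported so far in this thread, and the function names are illustrative, not part of the app.

```python
from collections import defaultdict

# (Catalyst that compiled the binaries, Catalyst runtime they ran under, outcome),
# transcribed from the posts above.
OBSERVATIONS = [
    ("12.8", "12.10", "ok"),
    ("12.8", "12.11 beta 8", "ok"),
    ("older (probably 12.6)", "12.11 beta 8", "ok"),
    ("12.11 beta 8", "12.11 beta 8", "driver restart"),
    ("12.1", "12.1", "ok"),
    ("12.11 (beta 8 or 11)", "12.1", "driver restart"),
]


def outcomes_by(index):
    """Group the observed outcomes by compiler (index 0) or by runtime (index 1)."""
    grouped = defaultdict(set)
    for row in OBSERVATIONS:
        grouped[row[index]].add(row[2])
    return dict(grouped)


def is_consistent(grouped):
    """True if every key maps to exactly one outcome."""
    return all(len(outcomes) == 1 for outcomes in grouped.values())


by_compiler = outcomes_by(0)
by_runtime = outcomes_by(1)
print("compiler version alone predicts the outcome:", is_consistent(by_compiler))
print("runtime version alone predicts the outcome:", is_consistent(by_runtime))
```

                  With these rows, grouping by compiler gives one outcome per version, while grouping by runtime mixes "ok" and "driver restart" for the same version.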

                    • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                      freighter

                      I had some time to test this app's issue on two 64-bit openSUSE Linux hosts (same sources as on Windows).

                      Sadly, I was able to reproduce the problem found on Windows on one of the hosts, which has two Radeon HD 7750 cards, with Cat 12.11 beta 8 and beta 11. It causes a complete system freeze (the screen stays active, but the system no longer responds to any input such as mouse or keyboard) and makes a hard reboot necessary. This system has only PCIe 2.0 slots available, while the second host, which does NOT reproduce the problem, has a Radeon HD 7850 in a PCIe 3.0 slot. Maybe this is connected, maybe not, but it is at least an observation.

                      So, if there is a fix for Windows, please make sure you have one for Linux as well.

                    • Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                      Raistmer

                      Any progress with the test case? Is the problem confirmed?

                      • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                        Raistmer

                        So, what is the current status of this issue?

                        • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                          Raistmer

                          The problem is not fixed in Cat 13.1; I have reports that the driver restart occurs on 13.1 too.

                          • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                            Raistmer

                            It looks like the problem is not fixed in Catalyst 13.2 beta 3 either.

                            Now the CL file can't be compiled at all:

                            OpenCL Platform Name: AMD Accelerated Parallel Processing
                            Number of devices: 1
                              Max compute units: 10
                              Max work group size: 256
                              Max clock frequency: 1120Mhz
                              Max memory allocation: 536870912
                              Cache type: Read/Write
                              Cache line size: 64
                              Cache size: 16384
                              Global memory size: 1073741824
                              Constant buffer size: 65536
                              Max number of constant args: 8
                              Local memory type: Scratchpad
                              Local memory size: 32768
                              Queue properties:
                                Out-of-Order: No
                              Name: Capeverde
                              Vendor: Advanced Micro Devices, Inc.
                              Driver version: 1124.2 (VM)
                              Version: OpenCL 1.2 AMD-APP (1124.2)
                              Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_amd_c1x_atomics

                             

                             

                            INFO: can't open binary kernel file: .\\AstroPulse_Kernels_r1761.cl_Capeverde.bin_V6, continue with recompile...

                            Error : Building Program (source, clBuildProgram):main kernels: not OK code -11

                            Internal error: Compilation failed.

                             

                            It works OK with Cat 12.1 and with Cat 12.8 (for example).
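
                            For reference, status -11 in the log above is CL_BUILD_PROGRAM_FAILURE in the Khronos cl.h header; the compiler's own message is normally retrieved with clGetProgramBuildInfo and CL_PROGRAM_BUILD_LOG. A small sketch for decoding such log lines follows. The table is only a subset of the header, and the helper names are made up for illustration.

```python
import re

# Subset of the status codes defined in the Khronos cl.h header.
CL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -42: "CL_INVALID_BINARY",
    -43: "CL_INVALID_BUILD_OPTIONS",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
}


def status_from_log(line):
    """Extract the numeric status from a line such as '... not OK code -11'."""
    match = re.search(r"code\s+(-?\d+)", line)
    return int(match.group(1)) if match else None


def describe(code):
    """Return the symbolic name for a status code, if it is in the subset above."""
    return CL_STATUS_NAMES.get(code, "not in this subset")


line = "Error : Building Program (source, clBuildProgram):main kernels: not OK code -11"
code = status_from_log(line)
print(code, describe(code))
```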

                             

                            EDIT: And I don't see how this issue received "Assumed Answered" status when it remains in every new driver AMD releases. It is absolutely not answered, and is in fact a critical issue.

                            • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                              Raistmer

                              To check the validity of the computation, one can use the attached tool and the reference result (inside the archive).

                              Usage:

                              rescmpv5.exe ref-setiathome_6.98_windows_intelx86.exe-PG0395_v7.wu.res result.sah

                              where result.sah is the file generated after a full run of the app.

                              The tool's output is self-explanatory. If the results differ significantly, it will show a table comparing the quality of the signals found in the 2 files.

                               

                              Examples of usage:

                              Cat 12.8 run:

                              E:\123>rescmpv5.exe ref-setiathome_6.98_windows_intelx86.exe-PG0395_v7.wu.res result.sah

                              Result      : Strongly similar,  Q= 99.41%

                               

                              Less than 100% similarity is almost inevitable between long CPU and GPU floating-point computations, but the similarity here is good enough.
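
                              A generic illustration of why bit-identical results should not be expected (this is not taken from the app): floating-point addition is not associative, so two correct implementations that merely add in a different order can disagree in the last bits, and long computations accumulate such differences.

```python
# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```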

                               

                              Cat 12.8 but binaries taken from Cat 12.11 beta 8:

                               

                              1) A driver restart occurred (just as reported in the initial post).

                              2)

                              E:\123>setiathome_6.99_windows_intelx86__opencl_ati_sah.exe

                               

                               

                              E:\123>rescmpv5.exe ref-setiathome_6.98_windows_intelx86.exe-PG0395_v7.wu.res result.sah

                                                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                                             Exact  Super  Tight   Good    Bad    Exact  Super  Tight   Good    Bad
                                      Spike      0      0      0      0      0        0      0      0      0      0
                                   Autocorr      0      0      0      0      1        0      0      0      0      0
                                   Gaussian      0      0      0      0      1        0      0      0      0      0
                                      Pulse      0      0      0      0      0        0      0      0      0      0
                                    Triplet      0      0      0      0      0        0      0      0      0      0
                                 Best Spike      0      0      0      0      1        0      0      0      0      0
                              Best Autocorr      0      0      0      0      1        0      0      0      0      0
                              Best Gaussian      0      0      0      0      1        0      0      0      0      0
                                 Best Pulse      0      0      0      0      1        0      0      0      0      0
                               Best Triplet      0      0      0      0      0        0      0      0      0      0
                                                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                                                   0      0      0      0      6        0      0      0      0      0

                               

                               

                              Unmatched signal(s) in R1 at line(s) 672 689 716 732 749 775

                              Result      : Different.

                              As one can see, the number of found results differs (of course: the app was terminated after the driver restart, so the computations did not finish).

                              One will see a similar table if the computation finishes OK but with wrong results.

                              The validation tool will show differences as in this sample.

                               

                              P.S.:

                               

                              (As per you, i should get invalid results with 12.10 driver, how to verify that?).

                              Yes, expect a wrong result (but no driver restart) with Cat 12.10. Driver restarts appeared in later driver releases.

                              The verification tool and how to use it are described above in this post.
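
                              For anyone who wants to script the check, here is a rough sketch that reads the final "Result" line from the tool's output. It assumes the output looks like the two samples above; the function names and the rule "only Strongly similar passes" are assumptions for illustration, not something the tool itself defines.

```python
import re

# Matches lines such as "Result      : Strongly similar,  Q= 99.41%"
# and "Result      : Different."
RESULT_RE = re.compile(r"^Result\s*:\s*(.+?)(?:,\s*Q=\s*([\d.]+)%)?\.?\s*$")


def parse_verdict(output):
    """Return (verdict, quality) from the tool's output, or (None, None)."""
    for line in output.splitlines():
        match = RESULT_RE.match(line.strip())
        if match:
            quality = float(match.group(2)) if match.group(2) else None
            return match.group(1), quality
    return None, None


def is_acceptable(output):
    """Assumed pass rule: only a 'Strongly similar' verdict counts as valid."""
    verdict, _ = parse_verdict(output)
    return verdict is not None and verdict.lower() == "strongly similar"


good = "Result      : Strongly similar,  Q= 99.41%"
bad = "Unmatched signal(s) in R1 at line(s) 672 689 716 732 749 775\nResult      : Different."
print(parse_verdict(good), is_acceptable(good))
print(parse_verdict(bad), is_acceptable(bad))
```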


                                • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                                  himanshu.gautam

                                  Hi Raistmer,

                                  You were probably right about the driver downgrading issue. Here are my observations:

                                  1. I installed 12.10 without a proper system clean and saw the driver crash there. I ran rescmpv5.exe and the results were incorrect.

                                  2. Then I cleaned the system using the AMD cleanup utility before installing any other driver:

                                  3. Installed the 12.8 driver: SETI.exe ran without a crash. I checked correctness with rescmpv5.exe and it gave 99.9% correctness.

                                  4. Installed 12.10 again, and surprisingly seti.exe again ran without crash. rescmpv5 also passed correctly.

                                  5. Then installed the 13.1 driver: SETI.exe crashed. rescmpv5.exe confirms an incorrect result.

                                  Attached are the result.sah and stderr files for all cases.

                                  So our observations differ for the 12.10 driver as of now. But either way it is a bug. Please provide any feedback you have on the results.

                                    • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                                      Claggy

                                      Looking at the result from:

                                       

                                      [quote]4. Installed 12.10 again, and surprisingly seti.exe again ran without crash. rescmpv5 also passed correctly.[/quote]

                                       

                                      It looks as if Raistmer has supplied a workunit that doesn't show a weakly similar result on Cat 12.10. It has:

                                      'WU true angle range is :  0.394768'

                                      The workunits that showed the weakly similar result were the PG0009_v7.wu and refquick_v7.wu workunits, which have 'WU true angle range is :  0.008955' and 'WU true angle range is :  0.775000' respectively.

                                      Here's a full bench of five different workunits (with 3 different apps) where those two workunits are weakly similar.

                                       

                                      Claggy

                                       

                                      Edit: added PG0009_v7.wu and refquick_v7.wu workunits along with ref files for said workunits.

                                      • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                                        Raistmer

                                        Himanshu, thanks for looking into this issue in depth.

                                        Since Claggy was the first to report the Cat 12.10 issue to me, I think he is right in his explanation of why your observation differs from what I said about Cat 12.10.

                                        Please replace work_unit.sah from my archive with the same file from the PG0009_v7.workunit.7z.zip archive that Claggy attached.

                                        Also, a different ref file, ref-setiathome_6.98_windows_intelx86.exe-PG0009_v7.wu.res (again, provided in that archive), is needed to check the fresh result.sah. The comparison utility remains the same.

                                        P.S. So, for now we can summarize the issues as follows:

                                        1) Incorrect computations with Cat 12.10 do not appear in all data sets. Moreover, the difference for the PG0009 task is in (as we call them) "best signals", that is, signals below the threshold for being marked as reportable. That means computations in kernels compiled under 12.10 do not differ much from the correct ones, but enough for a precision issue to appear.

                                        2) The Catalyst 13.1 compiler is broken for this kernels file. It's a separate issue, because the error appears even before computations begin.

                                          • Re: Problems with Cat 12.10 and up and HD7xxx (and not only) GPUs
                                            himanshu.gautam

                                            Hi Raistmer,

                                            I checked with the new data as you suggested. I took the result.sah file and the reference file from the PG0009_v7.workunit.7z.zip attachment. I am not sure what the other 2 attachments are intended for.

                                            rescmpv5 gives weakly similar for the 12.10 driver, so some corruption is happening.

                                            rescmpv5 gives strongly similar for the 12.8 driver. Expected.

                                            But rescmpv5 now gives strongly similar for the 13.1 driver with the new data. SURPRISE again.

                                            So as I understand it, there are two issues here:

                                            1. Data corruption when the driver is updated from 12.8 to 12.10. But it is not reproduced with the 13.1 driver, so probably not an issue. Can you confirm?

                                            2. Driver crash when the driver is updated from 12.10 to 13.1. This holds for the old data.

                                            I will try to do some debugging in CodeXL too, and let you know.

                                             

                                            Hi Raistmer,

                                            Would it be possible to provide a test case with the host code? I tried working with the kernel file, but there are so many kernels (which are enabled/disabled using #defines). Also, RESULT_SIZE seems to be a macro defined in the host code and used in the kernels. I could not compile the kernels in KernelAnalyzer because of this macro.
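
                                            One possible way around the missing macro, assuming the tool accepts OpenCL compiler options: clBuildProgram takes "-D name" and "-D name=definition" in its options string, so the host-side defines can be supplied from outside the source. A small sketch that assembles such a string follows. The value 1024 and the toggle name are placeholders; the real ones live in the host code, which is not shown in this thread.

```python
def build_options(defines):
    """Turn {name: value-or-None} into an OpenCL '-D' build options string."""
    parts = []
    for name, value in defines.items():
        if value is None:
            parts.append(f"-D {name}")          # bare define, e.g. a kernel toggle
        else:
            parts.append(f"-D {name}={value}")  # define with a value
    return " ".join(parts)


options = build_options({
    "RESULT_SIZE": 1024,            # placeholder value, NOT the app's real setting
    "EXAMPLE_KERNEL_TOGGLE": None,  # hypothetical name standing in for a real #define
})
print(options)
```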

                                             

                                            Message was edited by: Himanshu Gautam