8 Replies Latest reply on Dec 17, 2015 3:30 AM by dipak

    Soft Lockup in Linux 3.18.20 with Catalyst 14.12 (and 15.9)

    opello

      Hello,

       

      I'm seeing a soft lockup in Linux when running any OpenCL application.  My test case is clinfo, which when run using strace -Ffttt demonstrates the soft lockup happening after a certain ioctl:

      135   1448488023.137973 ioctl(5, 0x4004648c, 0x7ffc737e0190) = 0

      and the kernel log shows:

      [  108.563955] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [clinfo:135]

      (including a stack trace).

       

      I only see this on my custom platform using a GX-424CC.  I do not see it on my reference platform (Sapphire BP-FT3GS).  I would appreciate any pointers on what to investigate.  My platform is running coreboot while the Sapphire platform is running AMI BIOS, and from what I can tell the VGA BIOS is the same in both.  The one difference I have spotted is that the GPU device on the PCIe bus has a different subsystem device ID on the Sapphire board (1002:9851 for the PCIe device VID:DID and 1002:0123 for the subsystem VID:DID), but that appears to be Sapphire's own identifier based on this thread.

       

      It's also worth noting that clinfo returns the expected information for this platform, and other OpenCL applications seem to work.  But they all seem to incur this soft lockup on startup, which is unacceptable for my use case.

       

      I've attached the strace log as well as the relevant kernel log messages from the failure.

       

      Thanks for your time.
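
For anyone reading along, the ioctl request number from the strace output above can be unpacked into its fields. The sketch below assumes the generic Linux `_IOC` bit layout used on x86 (8-bit number, 8-bit type, 14-bit size, 2-bit direction). `decode_ioctl` is a name made up for this example, and whether fglrx follows the DRM numbering convention is an assumption, not something confirmed in this thread.

```python
# Generic Linux _IOC layout (asm-generic/ioctl.h, as used on x86):
#   bits  0-7  : command number (nr)
#   bits  8-15 : type ("magic") character
#   bits 16-29 : argument size in bytes
#   bits 30-31 : direction (1 = userspace writes, 2 = userspace reads)
DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}

# DRM reserves command numbers 0x40..0x9f for driver-private ioctls.
DRM_COMMAND_BASE = 0x40
DRM_COMMAND_END = 0xA0


def decode_ioctl(request):
    """Split an ioctl request number into its _IOC fields (hypothetical helper)."""
    nr = request & 0xFF
    fields = {
        "dir": DIRECTIONS[(request >> 30) & 0x3],
        "size": (request >> 16) & 0x3FFF,
        "type": chr((request >> 8) & 0xFF),
        "nr": nr,
    }
    # Assumption: only meaningful if the driver uses the DRM 'd' magic
    # and the DRM driver-private command range.
    if fields["type"] == "d" and DRM_COMMAND_BASE <= nr < DRM_COMMAND_END:
        fields["driver_private_nr"] = nr - DRM_COMMAND_BASE
    return fields


print(decode_ioctl(0x4004648C))
```

Under those assumptions, 0x4004648c decodes as a write-direction request with a 4-byte argument, type 'd', and command number 0x8c, which would fall in the driver-private range. That narrows the lockup to one driver-specific call rather than a generic one.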

        • Re: Soft Lockup in Linux 3.18.20 with Catalyst 14.12 (and 15.9)
          dipak

          Hi,

          Sorry for this delayed reply.

          I just wanted to check whether you are still facing the issue. If so, could you please try the latest 15.11 Crimson driver? [Please perform a fresh installation.]

          If the issue still persists, please share more details about your setup. We'll try to reproduce it here.

           

          Regards,

            • Re: Soft Lockup in Linux 3.18.20 with Catalyst 14.12 (and 15.9)
              opello

              Thanks for the follow up.

               

              Yes, I am still facing the issue with the mentioned versions.  With 15.11, clinfo just fails with:

              Number of platforms:                             1
                Platform Profile:                              FULL_PROFILE
                Platform Version:                              OpenCL 2.0 AMD-APP (1912.5)
                Platform Name:                                 AMD Accelerated Parallel Processing
                Platform Vendor:                               Advanced Micro Devices, Inc.
                Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


                Platform Name:                                 AMD Accelerated Parallel Processing
              ERROR: clGetDeviceIDs(-1)

              So while 15.11 (Crimson) removes the soft lockup, it fails in a worse way: there is no GPU OpenCL functionality at all.

              I also tested the same installation on my reference platform (Sapphire BP-FT3GS) and it works there.  So hopefully the 2 attached strace logs can provide a basis for comparison to help direct the investigation into my custom platform.  If I can collect any other information of value please let me know.  If I should start another thread to manage the 15.11 clGetDeviceIDs(-1) failure in clinfo please let me know.

              Thanks for your time.
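
The `-1` in `clGetDeviceIDs(-1)` is most likely the raw OpenCL status code, which the Khronos `CL/cl.h` header defines as `CL_DEVICE_NOT_FOUND`. A minimal lookup sketch follows. The table covers only a handful of codes, and `describe_cl_error` is a hypothetical helper, not part of clinfo or any OpenCL binding.

```python
# A few status codes as defined in the Khronos CL/cl.h header.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -30: "CL_INVALID_VALUE",
    -31: "CL_INVALID_DEVICE_TYPE",
    -32: "CL_INVALID_PLATFORM",
}


def describe_cl_error(code):
    """Map a numeric OpenCL status to its symbolic name (hypothetical helper)."""
    return CL_ERRORS.get(code, "unknown OpenCL status %d" % code)


print(describe_cl_error(-1))
```

Read this way, the platform is found but the runtime reports no device of the requested type on it, which matches the "no GPU OpenCL functionality" observation.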

                • Re: Soft Lockup in Linux 3.18.20 with Catalyst 14.12 (and 15.9)
                  dipak

                  It seems that clinfo is unable to find any OpenCL device on that setup. The Crimson driver requires 4GB of RAM or more to work properly. With less, neither the CPU nor the iGPU reports OpenCL support, and a dGPU is reported as only OpenCL 1.2 capable regardless of the ASIC family [OpenCL 2.0 requirements changed in Crimson driver?]. I suspect that a shortage of RAM is also the reason for the issue above; I remember a similar case when I tried to install the 15.11 driver on an APU with less RAM. So please check the RAM size. Sorry, I don't have this SoC, so I can't check it on my end.

                  [Note: If you see any segfault after adding the required RAM, please check my suggestions given here: Re: Error using OpenCL 2.0]

                   

                  BTW, here is another embedded driver for G-series SoCs: http://support.amd.com/en-us/download/embedded/previous/detail?os=Linux%20x86_64&rev=15.101.1007.

                   

                   

                  Regards,
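
One way to check the RAM size mentioned above is to read `MemTotal` from `/proc/meminfo`. The sketch below is only illustrative: `mem_total_gib` and `read_mem_total_gib` are hypothetical helper names, and the 4GB figure comes from this reply rather than from any published driver specification. Note that `MemTotal` excludes memory reserved by firmware, the kernel, and typically the APU's graphics carve-out, so a board with 4GB fitted will report somewhat less than 4 GiB.

```python
import re

KIB_PER_GIB = 1024 * 1024


def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / KIB_PER_GIB


def read_mem_total_gib(path="/proc/meminfo"):
    """Read the live value on a Linux system (not called in this sketch)."""
    with open(path) as f:
        return mem_total_gib(f.read())


# Made-up sample text in /proc/meminfo format, for illustration only.
sample = "MemTotal:        2097152 kB\nMemFree:          524288 kB\n"
print("MemTotal: %.2f GiB" % mem_total_gib(sample))
```

On the target board, `read_mem_total_gib()` would give the live figure to compare against the stated requirement.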

                    • Re: Soft Lockup in Linux 3.18.20 with Catalyst 14.12 (and 15.9)
                      opello

                      Thanks for the reply.  My custom board only has 2GB RAM.  Is there some reason for the additional RAM requirement?  I certainly don't need that much for *my* OpenCL program.

                       

                      I had not tried that exact version, but after doing so I found that it behaves like the 14.12 series I've tested, exhibiting the soft lockup when the same ioctl number (0x4004648c) is invoked.

                       

                      If the G-series SoCs get a Crimson driver release that doesn't require 4GB RAM I just might be set.  As it stands the latest embedded driver seems to only be for the R-series (the "AMDGPU" driver mentioned here).  How likely is it that a Crimson driver package will be released in the embedded series?

                      • Re: Soft Lockup in Linux 3.18.20 with Catalyst 14.12 (and 15.9)
                        opello

                        Part of your suggestion made me wonder whether I could use just the newer kernel module with the older user space, since that seems to happen a lot with people using the APP SDK.  It seemed to work all right for me in a test setup.

                         

                        I used the fglrx.ko from 15.30.1025 since it was not causing soft lockups in Linux for me.  I then used the user space pieces from 15.101.1007 since they seemed to be the most recent that functioned despite the soft lockups.

                         

                        What are the risks of this kind of mixing and matching?

                        Thanks for your time.

                          • Re: Soft Lockup in Linux 3.18.20 with Catalyst 14.12 (and 15.9)
                            dipak

                            Hi,

                            > Is there some reason for the additional RAM requirement?  I certainly don't need that much for *my* OpenCL program.

                            Starting with the Crimson driver, the RAM requirement has changed, and I think it will remain in place for upcoming drivers as well. As I understand it, the driver team faced many difficulties in properly supporting OpenCL on systems with very limited RAM, which is why they made that decision. So it is purely a driver requirement and has nothing to do with your own program. In any case, 4GB or more of RAM is very common nowadays in both personal and professional setups.

                             

                            > How likely is it that a Crimson driver package will be released in the embedded series?

                            Sorry, I can't say anything, as I'm not aware of any such plans.

                             

                            > What are the risks of this kind of mixing and matching?

                            To be honest, I don't know, as I'm not an expert on this topic. However, I don't think mixing driver components from different versions is a good idea unless you have the required knowledge or information. Someone from the driver team may be able to give you an answer.

                            Best of luck with your experiment. It's good as long as everything goes right.

                             

                            Regards,