6 Replies Latest reply on May 14, 2014 5:22 PM by aharbilx

    ACML with pwscf

    aharbilx

      Hi Everybody,

       

      I am a newbie in GPGPU and I would like to offload part of the calculation of a materials science application (pwscf) to the GPU. The code is written in Fortran and calls LAPACK, BLAS and FFT routines.

      How do I link against the ACML 6 beta through the configure wrapper? I tried:

       

      ./configure

      ./configure BLAS_LIBS="-L/home/youssef/Documents/Soft/acml6/acml-6.0.3.97-Beta-gfortran64/gfortran64_mp/lib -lacml_mp -fopenmp -m64"

       

      I compiled it with gfortran and mpif90, but I don't see anything being sent to the GPU. Maybe I missed something; I have already installed the AMD APP SDK.
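A quick first check at this point is whether the executable picked up ACML at link time at all. The sketch below filters `ldd` output for an ACML shared library. The helper name `links_against_acml` is made up for illustration, and `bin/pw.x` is only an assumed location of the pwscf executable. A statically linked ACML would not show up in `ldd`, so a negative result is a hint rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the ldd output read from stdin
# mentions an ACML shared library (libacml, libacml_mp, ...).
links_against_acml() {
    grep -q 'libacml'
}

# usage (path to the executable is an assumption):
#   ldd bin/pw.x | links_against_acml && echo "ACML is linked"
```

If nothing is reported, the configure script may have fallen back to another BLAS, and no ACML heuristic will ever be consulted.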

       

      My config

      Intel® Core™ i7-3630QM CPU @ 2.40GHz × 8

      AMD Radeon HD 7970M

      8 GB RAM

       

      Working under Ubuntu 14.04 LTS

        • Re: ACML with pwscf
          timmy.liu

          Hi aharbilx,

           

          There are a few potential reasons why the workload is not offloaded to the GPU:

           

          - The OpenCL library is not detected. The OpenCL library is required for ACML to run on the GPU. If it is not detected, all computation runs on the CPU only. It sounds like you have OpenCL installed. Can you check whether the "clinfo" command runs correctly?

           

          - ACML sends a computation to either the CPU or the GPU based on the heuristics defined in lua files. If you are testing an sgemm routine, for example, you can look at ${path-to-acml}/resources/Tahiti/gemm.lua. At this point, sgemm is offloaded to the GPU only if m*n*k is bigger than 400*400*400, so I wonder whether you are testing a smaller matrix. Of course you can change the heuristic in the lua files to anything you like; "return true" means offload to the GPU. (For more information, refer to Chap 7 of acml.pdf under /Doc.)

           

          - You can also set the environment variable "ACML_LOG_FILTER=1" to generate a log file in the working directory. If you see "usegpu( 1 )" in it, ACML should have offloaded that subroutine to the GPU.

           

          - Note that there are actually two FFT libraries shipped with this version of ACML. Only ACML_FFTW (a separate .so file) supports GPU computation.
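The size rule in the second point can be checked by hand before running anything. The sketch below only restates the 400*400*400 threshold quoted above for sgemm on Tahiti; the function name `gemm_would_offload` is made up for illustration, and the real decision is made by whatever gemm.lua contains on your system.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if m*n*k is bigger than 400*400*400,
# i.e. the threshold quoted in this thread for sgemm on Tahiti.
# $1 = m, $2 = n, $3 = k
gemm_would_offload() {
    m=$1; n=$2; k=$3
    [ $((m * n * k)) -gt $((400 * 400 * 400)) ]
}

# usage:
#   gemm_would_offload 500 500 500 && echo "large enough for the GPU"
```

Note that a 400x400x400 product sits exactly on the threshold and so would stay on the CPU under this rule.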

           

          I hope this helps,

          Timmy

            • Re: ACML with pwscf
              aharbilx

              Thanks Timmy,

               

              Could you please tell me whether this clinfo output looks OK?

               

              Number of platforms:     1
                Platform Profile:     FULL_PROFILE
                Platform Version:     OpenCL 1.2 AMD-APP (1214.3)
                Platform Name:     AMD Accelerated Parallel Processing
                Platform Vendor:     Advanced Micro Devices, Inc.
                Platform Extensions:     cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

               

               

                Platform Name:     AMD Accelerated Parallel Processing
              Number of devices:     2
                Device Type:     CL_DEVICE_TYPE_GPU
                Device ID:     4098
                Board name:     AMD Radeon HD 7970M
                Device Topology:     PCI[ B#1, D#0, F#0 ]
                Max compute units:     20
                Max work items dimensions:     3
              Max work items[0]:     256
              Max work items[1]:     256
              Max work items[2]:     256
                Max work group size:     256
                Preferred vector width char:     4
                Preferred vector width short:     2
                Preferred vector width int:     1
                Preferred vector width long:     1
                Preferred vector width float:     1
                Preferred vector width double:     1
                Native vector width char:     4
                Native vector width short:     2
                Native vector width int:     1
                Native vector width long:     1
                Native vector width float:     1
                Native vector width double:     1
                Max clock frequency:     850Mhz
                Address bits:     32
                Max memory allocation:     1073741824
                Image support:     Yes
                Max number of images read arguments:     128
                Max number of images write arguments:     8
                Max image 2D width:     16384
                Max image 2D height:     16384
                Max image 3D width:     2048
                Max image 3D height:     2048
                Max image 3D depth:     2048
                Max samplers within kernel:     16
                Max size of kernel argument:     1024
                Alignment (bits) of base address:     2048

                Minimum alignment (bytes) for any datatype:     128

                Single precision floating point capability

              Denorms:     No
              Quiet NaNs:     Yes
              Round to nearest even:     Yes
              Round to zero:     Yes
              Round to +ve and infinity:     Yes
              IEEE754-2008 fused multiply-add:     Yes
                Cache type:     Read/Write
                Cache line size:     64
                Cache size:     16384
                Global memory size:     1786773504
                Constant buffer size:     65536
                Max number of constant args:     8
                Local memory type:     Scratchpad
                Local memory size:     32768

                Kernel Preferred work group size multiple:     64

                Error correction support:     0
                Unified memory for Host and Device:     0
                Profiling timer resolution:     1
                Device endianess:     Little
                Available:     Yes
                Compiler available:     Yes
                Execution capabilities:    
              Execute OpenCL kernels:     Yes
              Execute native function:     No
                Queue properties:    
              Out-of-Order:     No
              Profiling :     Yes
                Platform ID:     0x00007feb93972fc0
                Name:     Pitcairn
                Vendor:     Advanced Micro Devices, Inc.
                Device OpenCL C version:     OpenCL C 1.2
                Driver version:     1214.3 (VM)
                Profile:     FULL_PROFILE
                Version:     OpenCL 1.2 AMD-APP (1214.3)
                Extensions:     cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer

               

               

                Device Type:     CL_DEVICE_TYPE_CPU
                Device ID:     4098
                Board name:    
                Max compute units:     8
                Max work items dimensions:     3
              Max work items[0]:     1024
              Max work items[1]:     1024
              Max work items[2]:     1024
                Max work group size:     1024
                Preferred vector width char:     16
                Preferred vector width short:     8
                Preferred vector width int:     4
                Preferred vector width long:     2
                Preferred vector width float:     8
                Preferred vector width double:     4
                Native vector width char:     16
                Native vector width short:     8
                Native vector width int:     4
                Native vector width long:     2
                Native vector width float:     8
                Native vector width double:     4
                Max clock frequency:     2401Mhz
                Address bits:     64
                Max memory allocation:     2147483648
                Image support:     Yes
                Max number of images read arguments:     128
                Max number of images write arguments:     8
                Max image 2D width:     8192
                Max image 2D height:     8192
                Max image 3D width:     2048
                Max image 3D height:     2048
                Max image 3D depth:     2048
                Max samplers within kernel:     16
                Max size of kernel argument:     4096
                Alignment (bits) of base address:     1024

                Minimum alignment (bytes) for any datatype:     128

                Single precision floating point capability

              Denorms:     Yes
              Quiet NaNs:     Yes
              Round to nearest even:     Yes
              Round to zero:     Yes
              Round to +ve and infinity:     Yes
              IEEE754-2008 fused multiply-add:     Yes
                Cache type:     Read/Write
                Cache line size:     64
                Cache size:     32768
                Global memory size:     8255393792
                Constant buffer size:     65536
                Max number of constant args:     8
                Local memory type:     Global
                Local memory size:     32768

                Kernel Preferred work group size multiple:     1

                Error correction support:     0
                Unified memory for Host and Device:     1
                Profiling timer resolution:     1
                Device endianess:     Little
                Available:     Yes
                Compiler available:     Yes
                Execution capabilities:    
              Execute OpenCL kernels:     Yes
              Execute native function:     Yes
                Queue properties:    
              Out-of-Order:     No
              Profiling :     Yes
                Platform ID:     0x00007feb93972fc0
                Name:     Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
                Vendor:     GenuineIntel
                Device OpenCL C version:     OpenCL C 1.2
                Driver version:     1214.3 (sse2,avx)
                Profile:     FULL_PROFILE
                Version:     OpenCL 1.2 AMD-APP (1214.3)
                Extensions:     cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt
                • Re: ACML with pwscf
                  timmy.liu

                  It looks like your GPU is recognized by the name "Pitcairn", so it makes sense that you don't see anything running on the GPU, assuming everything else runs correctly.
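The device name can be pulled out of the clinfo output rather than read by eye. This is a minimal sketch that assumes the layout shown in the listing above, where each device block starts with a "Device Type:" line and carries its own "Name:" line. The function name `gpu_device_names` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read clinfo output on stdin and print the "Name:"
# field of every device whose "Device Type:" is CL_DEVICE_TYPE_GPU.
gpu_device_names() {
    awk '
        /Device Type:/ { is_gpu = ($0 ~ /CL_DEVICE_TYPE_GPU/) }
        /^[ \t]*Name:/ && is_gpu {
            sub(/^[ \t]*Name:[ \t]*/, "")
            print
        }
    '
}

# usage:
#   clinfo | gpu_device_names
```

Lines such as "Platform Name:" and "Board name:" are skipped because only lines that begin with "Name:" are matched.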

                   

                  ACML uses Lua scripts to decide at run time whether to run certain computations (right now only BLAS level 3 and the FFTW API are enabled) on the CPU or the GPU. Since we only ran performance tests on systems with Kaveri (Spectre) and Tahiti in the lab, only the lua files with heuristics (when to offload) for Kaveri and Tahiti systems are shipped with the library. Computation on any other system automatically falls back to the CPU only.

                   

                  But you can enable GPU computation as follows:

                  1. Go to the directory ${PATH-TO-ACML}/lib/resources.

                  2. Create a folder named "Pitcairn" next to Spectre, Tahiti and Default.

                  3. Copy all the files from /Tahiti to /Pitcairn.

                  4. You may want to modify the individual lua files. Note that "return true" means the routine will run on the GPU.
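The steps above can be sketched as a small shell function. The function name `clone_heuristics` and the `ACML_ROOT` variable in the usage line are placeholders for illustration, not part of ACML; substitute your own install path.

```shell
#!/bin/sh
# Hypothetical helper: copy the heuristic files of an existing device
# folder into a new device folder inside the resources directory.
# $1 = resources directory, $2 = existing folder, $3 = new folder name
clone_heuristics() {
    res=$1; src=$2; dst=$3
    # Refuse to continue if the source folder is not there.
    [ -d "$res/$src" ] || { echo "missing $res/$src" >&2; return 1; }
    # Create the new folder and copy everything, hidden files included.
    mkdir -p "$res/$dst" && cp -R "$res/$src/." "$res/$dst/"
}

# usage (ACML_ROOT is a placeholder for your install path):
#   clone_heuristics "$ACML_ROOT/lib/resources" Tahiti Pitcairn
```

After copying, the lua files in the new folder still hold the Tahiti heuristics, so they are a starting point to tune rather than a final setting.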

                   

                  Ultimately you would come up with a set of lua files that best fit your system (i7/Pitcairn).

                   

                  Thanks,

                  Timmy

                    • Re: ACML with pwscf
                      aharbilx

                      Dear Timmy,

                       

                      Thank you for the good point. I copied the .lua files from the "Tahiti" folder to a new "Pitcairn" folder just inside the "resources" directory and modified them (return true), for both gfortran and gfortran_mp.

                      Unfortunately I still can't see anything sent to the GPU, even after spending 2 days testing and retesting.

                       

                      Just to make sure that my application is not the origin of the problem, I tested the examples given in the "Performance" folder and got the same result: all calculations go to the CPU.

                       

                      For more information:

                      Env Var:

                      export PATH=$PATH:/home/youssef/Documents/Soft/espresso/espresso-5.1rc1/bin

                      export LD_LIBRARY_PATH=/opt/acml6/gfortran64_mp/lib:/opt/acml/gfortran64_mp/lib  (I also tried export LD_LIBRARY_PATH=/opt/acml6/gfortran64_mp/lib)

                      export AMDAPPSDKROOT=/opt/AMDAPP/lib/x86_64

                      export ACML_LOG_FILTER=1 (I can't see any log created, why?)
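One way to narrow down the missing log is to list every file that appears in the working directory during a run and search those for the "usegpu( 1 )" marker mentioned earlier in the thread. This is only a sketch: the log file name is not stated in this thread, so the code scans all new files, and both function names are made up for illustration. The variable must be exported in the same shell that launches the program.

```shell
#!/bin/sh
# Hypothetical helper: list regular files directly inside directory $1
# that were modified after marker file $2.
files_written_since() {
    find "$1" -maxdepth 1 -type f -newer "$2"
}

# Hypothetical helper: print those files that contain "usegpu( 1 )".
files_reporting_gpu() {
    files_written_since "$1" "$2" | while IFS= read -r f; do
        grep -q -F 'usegpu( 1 )' "$f" && printf '%s\n' "$f"
    done
    return 0
}

# usage (your_program is a placeholder):
#   touch marker; export ACML_LOG_FILTER=1; ./your_program
#   files_written_since . marker    # was any file written at all?
#   files_reporting_gpu . marker    # did any of them report GPU use?
```

If the first command lists nothing, the library that reads ACML_LOG_FILTER was probably never loaded or never called, which points back to the link step rather than to the heuristics.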

                       

                      Compiling information:

                      ./configure -enable-openmp F77=gfortran F90=gfortran -disable-parallel \
                        BLAS_LIBS="-L/home/youssef/Documents/Soft/acml6/acml-6.0.3.97-Beta-gfortran64/gfortran64_mp/lib -lacml_mp -fopenmp -m64" \
                        FFT_LIBS="-L/home/youssef/Documents/Soft/acml6/acml-6.0.3.97-Beta-gfortran64/gfortran64_mp/lib -lacml_mp -fopenmp -m64"

                       

                      GPU information:

                      Radeon HD 7970M (mobile GPU) + Intel HD 4000 ==> I configured CCC to use the 7970M as the default graphics adapter

                       

                      Thanks in advance.