14 Replies Latest reply on Apr 27, 2014 4:46 PM by bsp2020

    ACML and iGPU of Kaveri

    huber

      Will a future release of ACML use the iGPU in Kaveri, for example to execute BLAS routines?

      I use ACML in place of the reference BLAS (package "libblas3") under Ubuntu, switched in via "update-alternatives" (as described, for example, in "Accelerate your (matrix) computations with ACML on (K)Ubuntu 11.10 | Luis E's thoughts..."). Once ACML is updated (I'm talking about future releases), can the computational power of the Kaveri iGPU be used in the same way?

        • Re: ACML and iGPU of Kaveri
          kknox

          Hi Huber~

          Check out the beta of ACML 6, which begins integration of heterogeneous computing through the clMath libraries: http://developer.amd.com/community/blog/2014/04/16/acml-beta-6-0-released/.  Kaveri & Tahiti are supported devices in the beta.

            • Re: ACML and iGPU of Kaveri
              bsp2020

              Hi,

              I'm trying to build GNU Octave 3.8.1 with ACML 6 (acml-6-0-3-92Beta-gfortran-64bit.tgz) and am getting the following error:

               

              configure: error: A BLAS library was detected but found incompatible with your Fortran 77 compiler settings.

               

              I followed the instructions in "Accelerate your (matrix) computations with ACML on (K)Ubuntu 11.10 | Luis E's thoughts..." to set up the alternative BLAS/LAPACK.

               

              My OS is Ubuntu 13.10, running the alpha2 kernel from HSAFoundation/Linux-HSA-Drivers-And-Images-AMD on GitHub. My hardware is a Kaveri A10-7850K on an ASUS A88x-PRO motherboard.

               

              Can you help me get it compiled? I'm interested in comparing GNU Octave performance between default/ACML CPU/ACML GPU.

               

              Thanks!

               

              Brian

                • Re: ACML and iGPU of Kaveri
                  kknox

                  Hi bsp2020~

                  configure: error: A BLAS library was detected but found incompatible with your Fortran 77 compiler settings.

                  This error is too generic for us to diagnose.  If you go into the examples folder of ACML, can you run the makefile and build all the examples on your machine?  ACML 6 is compiled with gfortran 4.7.1; is your compiler at least this new?  Also, this beta contains only shared libraries; the static libraries were not shipped, to simplify support.  Make sure that your configure script can find and use the .so files.

                   

                  An additional note: you don't need an HSA driver to use ACML 6; it runs on a vanilla OpenCL installation.  You will, however, need the Catalyst driver suite to install the OpenCL runtime on the machine.

                  Kent
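The compiler check Kent asks about can be scripted. The following is a minimal Python sketch, not something from the thread or from ACML: the helper names `parse_version` and `is_new_enough` are made up for illustration, and the sample banner line is a hypothetical example of the usual GNU `--version` format rather than output captured from Brian's machine. The 4.7.1 minimum is the version Kent states ACML 6 was compiled with.

```python
import re


def parse_version(version_line):
    """Extract the last dotted version number from a `--version` banner line."""
    matches = re.findall(r"\d+\.\d+(?:\.\d+)?", version_line)
    if not matches:
        raise ValueError("no version number found in: %r" % version_line)
    return tuple(int(part) for part in matches[-1].split("."))


def is_new_enough(version_line, minimum=(4, 7, 1)):
    """True if the compiler version is at least `minimum` (4.7.1 per the thread)."""
    return parse_version(version_line) >= minimum


# Hypothetical banner line in the usual GNU format (assumption, not thread data).
SAMPLE_BANNER = "GNU Fortran (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1"

if __name__ == "__main__":
    print(parse_version(SAMPLE_BANNER), is_new_enough(SAMPLE_BANNER))
```

In practice the first line of `gfortran --version` would be passed in place of `SAMPLE_BANNER`.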

                    • Re: ACML and iGPU of Kaveri
                      bsp2020

                      Hi Kent,

                      I have gfortran 4.8.1. All the packages I installed on this machine are from the default repositories, except the HSA libraries and kernels I got from bitbucket/github. I was going to try C++AMP with HSA first, but I changed my mind and am now trying to compare default/ACML CPU/ACML GPU performance in GNU Octave. So I'm trying to get Octave built with ACML without the AMD OpenCL driver first, to get ACML CPU performance. Then I'll install the Catalyst driver to enable OpenCL.

                       

                      I tried to build the examples and got:

                       

                      briansp@kaveri:~/acml/gfortran64_mp/examples$ make -f GNUmakefile

                       

                      Compiling program acmlinfo.f:

                      gfortran -c acmlinfo.f -o acmlinfo.o

                      Linking program acmlinfo.exe:

                      gfortran -fopenmp acmlinfo.o -L ../lib -lrt -ldl -lstdc++ -lacml_mp -o acmlinfo.exe

                      ../lib/libacml_mp.so: undefined reference to `__cxa_guard_release'

                      ../lib/libacml_mp.so: undefined reference to `dlsym'

                      ../lib/libacml_mp.so: undefined reference to `__gxx_personality_v0'

                      ../lib/libacml_mp.so: undefined reference to `__cxa_guard_abort'

                      ../lib/libacml_mp.so: undefined reference to `std::ios_base::Init::~Init()'

                      ../lib/libacml_mp.so: undefined reference to `__cxa_guard_acquire'

                      ../lib/libacml_mp.so: undefined reference to `std::ios_base::Init::Init()'

                      ../lib/libacml_mp.so: undefined reference to `dlopen'

                      ../lib/libacml_mp.so: undefined reference to `dlclose'

                      collect2: error: ld returned 1 exit status

                      make: *** [acmlinfo.res] Error 1

                       

                      Can you help me?

                       

                      Brian

                        • Re: ACML and iGPU of Kaveri
                          bsp2020

                          I tried ACML 5.3.1 and was able to build Octave with it. However, I noticed that ACML 5.3.1 came with an install script which seemed to fix up the makefile. Was 6.0.3 also supposed to come with an install script?

                           

                          Brian

                          • Re: ACML and iGPU of Kaveri
                            kknox

                            This looks like the correct compile line:

                            gfortran -fopenmp acmlinfo.o -L ../lib -lrt -ldl -lstdc++ -lacml_mp -o acmlinfo.exe

                             

                            These symbols should be defined in -ldl:

                            ../lib/libacml_mp.so: undefined reference to `dlsym'

                            ../lib/libacml_mp.so: undefined reference to `dlopen'

                            ../lib/libacml_mp.so: undefined reference to `dlclose'

                             

                            These symbols should be defined in -lstdc++:

                            ../lib/libacml_mp.so: undefined reference to `__cxa_guard_release'

                            ../lib/libacml_mp.so: undefined reference to `__gxx_personality_v0'

                            ../lib/libacml_mp.so: undefined reference to `__cxa_guard_abort'

                            ../lib/libacml_mp.so: undefined reference to `std::ios_base::Init::~Init()'

                            ../lib/libacml_mp.so: undefined reference to `__cxa_guard_acquire'

                            ../lib/libacml_mp.so: undefined reference to `std::ios_base::Init::Init()'

                             

                            These are library dependencies that have been added since ACML 5.3.1.  For whatever reason, it appears your ld is either not finding your libraries in your predefined paths, or the symbols are mangled in a different way.  What happens when you run ldd on libacml_mp.so?

                              • Re: ACML and iGPU of Kaveri
                                bsp2020

                                Output from ldd

                                ACML5.3.1

                                briansp@kaveri:~$ ldd /opt/acml5.3.1/gfortran64_mp/lib/libacml_mp.so

                                  linux-vdso.so.1 =>  (0x00007fffcf200000)

                                  librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f8126618000)

                                  libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f8126300000)

                                  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8125ff8000)

                                  libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f8125de8000)

                                  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8125bc8000)

                                  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8125800000)

                                  libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f81255c0000)

                                  libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f81253a8000)

                                  /lib64/ld-linux-x86-64.so.2 (0x00007f8128bf8000)

                                 

                                ACML6.0.3

                                briansp@kaveri:~$ ldd /opt/acml6.0.3/gfortran64_mp/lib/libacml_mp.so

                                  linux-vdso.so.1 =>  (0x00007fff68c00000)

                                  librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2f17568000)

                                  libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f2f17250000)

                                  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2f16f48000)

                                  libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f2f16d38000)

                                  libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2f16b20000)

                                  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2f16900000)

                                  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2f16538000)

                                  libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f2f162f8000)

                                  /lib64/ld-linux-x86-64.so.2 (0x00007f2f19b60000)

                                 

                                I'm not sure what is wrong... I think I'll try C++AMP again and come back to this later.

                                 

                                Brian
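The two ldd listings above can also be compared programmatically. Below is a minimal Python sketch, not part of ACML or the original posts, that reports which expected libraries do not appear in ldd output. The function names and the list of expected libraries are assumptions chosen for illustration; the sample text is the ACML 6.0.3 listing quoted above.

```python
def listed_libraries(ldd_output):
    """Return the set of library file names that appear in ldd output."""
    names = set()
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each entry starts with a soname or a path, for example
        # "librt.so.1 => /lib/... (0x...)" or "/lib64/ld-linux-x86-64.so.2 (0x...)".
        first = line.split()[0]
        names.add(first.rsplit("/", 1)[-1])
    return names


def missing_deps(ldd_output, expected_prefixes):
    """Return the expected library prefixes that ldd does not list."""
    names = listed_libraries(ldd_output)
    return [p for p in expected_prefixes
            if not any(n.startswith(p + ".so") for n in names)]


# ldd output for ACML 6.0.3 libacml_mp.so, copied from the post above.
ACML_603_LDD = """\
linux-vdso.so.1 =>  (0x00007fff68c00000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2f17568000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f2f17250000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2f16f48000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f2f16d38000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2f16b20000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2f16900000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2f16538000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f2f162f8000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2f19b60000)
"""

if __name__ == "__main__":
    print(missing_deps(ACML_603_LDD, ["librt", "libdl", "libstdc++"]))
    # prints ['libdl', 'libstdc++']
```

For the 6.0.3 listing, neither libdl nor libstdc++ is listed as a dependency, which lines up with the two families of undefined symbols in the link error (the dl* functions and the C++ runtime). Whether that is the actual root cause is not confirmed in the thread.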

                                  • Re: ACML and iGPU of Kaveri
                                    kknox

                                    Hi bsp2020~

                                     

                                    I think we found the root cause of this issue, and new packages are available from the download table (6.0.3.97 for gfortran):

                                    http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml/acml-downloads-resources/

                                     

                                    If you have time, give the new package a try.

                                     

                                    Kent

                                      • Re: ACML and iGPU of Kaveri
                                        bsp2020

                                        Kent,

                                        I was able to build Octave 3.8.1 using ACML 6.0.3.97. However, the performance was a bit disappointing. I did

                                         

                                        a = rand(5000, 5000)

                                        svd(a)

                                         

                                        as shown in the linked article and found that using the GPU is actually slower. I don't know enough about ACML/Octave/svd to figure out what's going on. With the precompiled Octave 3.6.4 package, svd(a) took 2 min 20 sec and used only 1 core. With the Octave 3.8.1 I built, svd(a) took 1 min 10 sec and used all 4 cores when the Catalyst driver was not installed. After installing Catalyst, svd(a) took 1 min 42 sec and still used 4 cores. I'm not sure whether the GPU was being used at all, since all 4 cores were loaded to 95%+.

                                         

                                        Thanks for helping me get it compiled. I wish I had more knowledge to figure out what is going on...

                                         

                                        Brian

                                          • Re: ACML and iGPU of Kaveri
                                            kknox

                                            Hmm, this probably deserves a new thread with an appropriate title; could you fork this?  If you do, could you clarify which BLAS is used in the 3.6.4 package?  For 3.8.1, which Catalyst drivers did you use?  Also, could you provide the output of clinfo on that system?  Did you build 3.8.1 from the Mercurial repo?

                                             

                                            If your cores were loaded 95%, then it sounds to me like you were using the GPU very little.  Remember that we only have GPU accelerated versions of individual BLAS routines, not traditional LAPACK routines like svd.  Does Octave default to single or double precision in your example above?

                                             

                                            If you try to do a matrix multiply ( BLAS L3 GEMM) on a big matrix, do you see a speedup?

                                              • Re: ACML and iGPU of Kaveri
                                                bsp2020

                                                Kent,

                                                Thanks for the prompt reply. I'm not too familiar with how this forum works. When you said "fork", did you mean just create a new post? Or is there a fork button somewhere that I'm missing?

                                                 

                                                Also, I'm not sure which routines use LAPACK, BLAS, etc. You will have to give me more detailed instructions, or pointers to an online post or page where I can get up to speed. I am interested in figuring out what is going on, so I'd appreciate your guidance.

                                                 

                                                What I know now.

                                                Catalyst version used: 14.4rc v1.0 apr17

                                                I changed the kernel back to the Ubuntu stock kernel in order to install the Catalyst driver. I'm no longer using the HSA image.

                                                I got the Octave source as a tarball from ftp://ftp.gnu.org/gnu/octave

                                                 

                                                When I get home, I'll gather the other information and post it. BTW, is there a way to force ACML to use the CPU even when Catalyst is installed?

                                                 

                                                Thanks

                                                 

                                                Brian

                                                  • Re: ACML and iGPU of Kaveri
                                                    kknox

                                                    Yes, I meant start a new thread.  Unfortunately, the ACML project does not have an externally visible bug tracking system, so this forum serves as such.  There is no fork button; just start a new thread.

                                                     

                                                    The easiest way to distinguish between LAPACK and BLAS routines is to look at the documentation at the source, http://www.netlib.org/.  Look at the LAPACK or BLAS documentation to see what each library supports. The Octave SVD computation probably resolves internally to one of these LAPACK routines: http://www.netlib.org/lapack/lug/node53.html

                                                     

                                                    The easiest way to force ACML to use only CPU compute is to rename the libacml_bridge.so file to anything else, so that libacml cannot find it.  If it does not find the bridge library, it will revert to CPU only.
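The rename trick is easy to get wrong when switching back and forth between benchmark runs, so a reversible helper can be handy. The sketch below is an illustration only, not an ACML tool: the function names and the ".disabled" suffix are made up, and it assumes the bridge file sits in a lib directory you pass in (the thread does not say where libacml_bridge.so is installed).

```python
from pathlib import Path


def disable_bridge(lib_dir, name="libacml_bridge.so", suffix=".disabled"):
    """Rename the bridge library so it can no longer be found under its own name.

    Returns the new path, or None if the file was not present.
    """
    src = Path(lib_dir) / name
    if not src.exists():
        return None
    dst = src.with_name(src.name + suffix)
    src.rename(dst)
    return dst


def enable_bridge(lib_dir, name="libacml_bridge.so", suffix=".disabled"):
    """Undo disable_bridge by restoring the original file name.

    Returns the restored path, or None if no disabled copy was found.
    """
    dst = Path(lib_dir) / (name + suffix)
    if not dst.exists():
        return None
    src = dst.with_name(name)
    dst.rename(src)
    return src
```

A CPU-only run would then look like `disable_bridge(lib_dir)`, run the benchmark, `enable_bridge(lib_dir)`, where `lib_dir` is whatever directory holds the bridge library on your system. Renaming files under a system directory normally needs root privileges.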

                                                      • Re: ACML and iGPU of Kaveri
                                                        bsp2020

                                                        Kent,

                                                        I was able to compile GNU Octave 3.8.1 using ACML and got some meaningful numbers. Unfortunately, I'm not sure about the exact cause of the trouble I initially had. Just before the weekend, my Ubuntu install stopped working and I deleted the partition to start from scratch. It turned out that the BIOS update I had applied (Version 1001, 2014/04/15) is broken and kept my Ubuntu setup from booting at all. I had already deleted the partition by the time I figured out the problem.

                                                         

                                                        After going back to BIOS 0802, I installed Ubuntu 13.10 again and was able to compile Octave using ACML. Below are some benchmark numbers I got, in case you are interested. In summary, compared with the default Linux BLAS, single-threaded ACML without GPU is about 20% faster for single-precision svd (about 60% faster in double precision) and about 2.5 times faster for matrix multiplication. ACML mp + GPU is roughly 2 times faster than single-threaded ACML without GPU. However, when I renamed libacml_bridge.so to something else, ACML mp performance did not change at all, so I'm not 100% sure whether the GPU is really being used. I did not write down the timings I got using ACML mp before I installed the Catalyst 14.4 driver, so I'm not sure whether the benchmark runs any faster with the Catalyst driver installed. The clinfo output is also pasted below.

                                                         

                                                        I'll do some more experiments and create a new post if I still have issues.

                                                         

                                                        Thanks

                                                         

                                                        Linux Default ->

                                                        octave:1> a = rand(5000,5000,"single");

                                                        octave:2> tic();svd(a);elapsed_time = toc()

                                                        elapsed_time =  59.904

                                                        octave:3> tic();a*a;elapsed_time = toc()

                                                        elapsed_time =  15.024

                                                        octave:4> a = rand(5000,5000);

                                                        octave:5> tic();svd(a);elapsed_time = toc()

                                                        elapsed_time =  132.46

                                                        octave:6> tic();a*a;elapsed_time = toc()

                                                        elapsed_time =  28.595

                                                         

                                                         

                                                        ACML:no GPU ->

                                                        octave:1> a = rand(5000,5000,"single");

                                                        octave:2> tic();svd(a);elapsed_time = toc()

                                                        elapsed_time =  49.774

                                                        octave:3> tic();a*a;elapsed_time = toc()

                                                        elapsed_time =  5.6783

                                                        octave:4> a = rand(5000,5000);

                                                        octave:5> tic();svd(a);elapsed_time = toc()

                                                        elapsed_time =  82.735

                                                        octave:6> tic();a*a;elapsed_time = toc()

                                                        elapsed_time =  11.423

                                                         

                                                         

                                                        ACML_mp:with GPU ->

                                                        octave:1> a = rand(5000,5000,"single");

                                                        octave:2> tic();svd(a);elapsed_time = toc()

                                                        elapsed_time =  27.879

                                                        octave:3> tic();a*a;elapsed_time = toc()

                                                        elapsed_time =  2.4581

                                                        octave:4> a = rand(5000,5000);

                                                        octave:5> tic();svd(a);elapsed_time = toc()

                                                        elapsed_time =  52.070

                                                        octave:6> tic();a*a;elapsed_time = toc()

                                                        elapsed_time =  5.0666
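To make the three transcripts above easier to compare, the timings can be turned into speedup ratios and approximate GEMM throughput. This is a small Python sketch for illustration only; the dictionary keys and function names are made up, the timings are copied from the Octave output above, and the throughput figure assumes the usual 2*n^3 operation count for an n x n matrix product with n = 5000.

```python
# Elapsed times in seconds, copied from the Octave transcripts above.
# Inner keys are (precision, operation); every run uses a 5000 x 5000 matrix.
TIMES = {
    "default": {("single", "svd"): 59.904, ("single", "gemm"): 15.024,
                ("double", "svd"): 132.46, ("double", "gemm"): 28.595},
    "acml_cpu": {("single", "svd"): 49.774, ("single", "gemm"): 5.6783,
                 ("double", "svd"): 82.735, ("double", "gemm"): 11.423},
    "acml_mp_gpu": {("single", "svd"): 27.879, ("single", "gemm"): 2.4581,
                    ("double", "svd"): 52.070, ("double", "gemm"): 5.0666},
}


def speedup(baseline, candidate, key):
    """How many times faster `candidate` is than `baseline` for one benchmark."""
    return TIMES[baseline][key] / TIMES[candidate][key]


def gemm_gflops(config, precision, n=5000):
    """Approximate GFLOP/s for an n x n matrix product, counting 2*n**3 flops."""
    seconds = TIMES[config][(precision, "gemm")]
    return 2 * n ** 3 / seconds / 1e9


if __name__ == "__main__":
    for key in TIMES["default"]:
        print("%-6s %-4s  ACML CPU vs default: %.2fx   ACML mp vs ACML CPU: %.2fx"
              % (key[0], key[1],
                 speedup("default", "acml_cpu", key),
                 speedup("acml_cpu", "acml_mp_gpu", key)))
    for config in TIMES:
        print("%-12s single GEMM: %.1f GFLOP/s" % (config, gemm_gflops(config, "single")))
```

The ratios only describe these particular runs. They cannot show whether the GPU contributed to the ACML mp numbers, since the mp build also uses all four CPU cores.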

                                                         

                                                        $ clinfo

                                                        Number of platforms: 1

                                                          Platform Profile: FULL_PROFILE

                                                          Platform Version: OpenCL 1.2 AMD-APP (1445.5)

                                                          Platform Name: AMD Accelerated Parallel Processing

                                                          Platform Vendor: Advanced Micro Devices, Inc.

                                                          Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_amd_hsa

                                                         

                                                         

                                                         

                                                         

                                                          Platform Name: AMD Accelerated Parallel Processing

                                                        Number of devices: 2

                                                          Device Type: CL_DEVICE_TYPE_GPU

                                                          Vendor ID: 1002h

                                                          Board name: AMD Radeon(TM) R7 Graphics

                                                          Device Topology: PCI[ B#0, D#1, F#0 ]

                                                          Max compute units: 8

                                                          Max work items dimensions: 3

                                                            Max work items[0]: 256

                                                            Max work items[1]: 256

                                                            Max work items[2]: 256

                                                          Max work group size: 256

                                                          Preferred vector width char: 4

                                                          Preferred vector width short: 2

                                                          Preferred vector width int: 1

                                                          Preferred vector width long: 1

                                                          Preferred vector width float: 1

                                                          Preferred vector width double: 1

                                                          Native vector width char: 4

                                                          Native vector width short: 2

                                                          Native vector width int: 1

                                                          Native vector width long: 1

                                                          Native vector width float: 1

                                                          Native vector width double: 1

                                                          Max clock frequency: 900Mhz

                                                          Address bits: 32

                                                          Max memory allocation: 289406976

                                                          Image support: Yes

                                                          Max number of images read arguments: 128

                                                          Max number of images write arguments: 8

                                                          Max image 2D width: 16384

                                                          Max image 2D height: 16384

                                                          Max image 3D width: 2048

                                                          Max image 3D height: 2048

                                                          Max image 3D depth: 2048

                                                          Max samplers within kernel: 16

                                                          Max size of kernel argument: 1024

                                                          Alignment (bits) of base address: 2048

                                                          Minimum alignment (bytes) for any datatype: 128

                                                          Single precision floating point capability

                                                            Denorms: No

                                                            Quiet NaNs: Yes

                                                            Round to nearest even: Yes

                                                            Round to zero: Yes

                                                            Round to +ve and infinity: Yes

                                                            IEEE754-2008 fused multiply-add: Yes

                                                          Cache type: Read/Write

                                                          Cache line size: 64

                                                          Cache size: 16384

                                                          Global memory size: 1157627904

                                                          Constant buffer size: 65536

                                                          Max number of constant args: 8

                                                          Local memory type: Scratchpad

                                                          Local memory size: 32768

                                                          Kernel Preferred work group size multiple: 64

                                                          Error correction support: 0

                                                          Unified memory for Host and Device: 1

                                                          Profiling timer resolution: 1

                                                          Device endianess: Little

                                                          Available: Yes

                                                          Compiler available: Yes

                                                          Execution capabilities: 

                                                            Execute OpenCL kernels: Yes

                                                            Execute native function: No

                                                          Queue properties: 

                                                            Out-of-Order: No

                                                            Profiling : Yes

                                                          Platform ID: 0x00007fcb9b1c8080

                                                          Name: Spectre

                                                          Vendor: Advanced Micro Devices, Inc.

                                                          Device OpenCL C version: OpenCL C 1.2

                                                          Driver version: 1445.5 (VM)

                                                          Profile: FULL_PROFILE

                                                          Version: OpenCL 1.2 AMD-APP (1445.5)

                                                          Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

                                                         

                                                         

                                                         

                                                         

                                                          Device Type: CL_DEVICE_TYPE_CPU

                                                          Vendor ID: 1002h

                                                          Board name: 

                                                          Max compute units: 4

                                                          Max work items dimensions: 3

                                                            Max work items[0]: 1024

                                                            Max work items[1]: 1024

                                                            Max work items[2]: 1024

                                                          Max work group size: 1024

                                                          Preferred vector width char: 16

                                                          Preferred vector width short: 8

                                                          Preferred vector width int: 4

                                                          Preferred vector width long: 2

                                                          Preferred vector width float: 8

                                                          Preferred vector width double: 4

                                                          Native vector width char: 16

                                                          Native vector width short: 8

                                                          Native vector width int: 4

                                                          Native vector width long: 2

                                                          Native vector width float: 8

                                                          Native vector width double: 4

                                                          Max clock frequency: 1700Mhz

                                                          Address bits: 64

                                                          Max memory allocation: 2147483648

                                                          Image support: Yes

                                                          Max number of images read arguments: 128

                                                          Max number of images write arguments: 8

                                                          Max image 2D width: 8192

                                                          Max image 2D height: 8192

                                                          Max image 3D width: 2048

                                                          Max image 3D height: 2048

                                                          Max image 3D depth: 2048

                                                          Max samplers within kernel: 16

                                                          Max size of kernel argument: 4096

                                                          Alignment (bits) of base address: 1024

                                                          Minimum alignment (bytes) for any datatype: 128

                                                          Single precision floating point capability

                                                            Denorms: Yes

                                                            Quiet NaNs: Yes

                                                            Round to nearest even: Yes

                                                            Round to zero: Yes

                                                            Round to +ve and infinity: Yes

                                                            IEEE754-2008 fused multiply-add: Yes

                                                          Cache type: Read/Write

                                                          Cache line size: 64

                                                          Cache size: 16384

                                                          Global memory size: 7184535552

                                                          Constant buffer size: 65536

                                                          Max number of constant args: 8

                                                          Local memory type: Global

                                                          Local memory size: 32768

                                                          Kernel Preferred work group size multiple: 1

                                                          Error correction support: 0

                                                          Unified memory for Host and Device: 1

                                                          Profiling timer resolution: 1

                                                          Device endianess: Little

                                                          Available: Yes

                                                          Compiler available: Yes

                                                          Execution capabilities: 

                                                            Execute OpenCL kernels: Yes

                                                            Execute native function: Yes

                                                          Queue properties: 

                                                            Out-of-Order: No

                                                            Profiling : Yes

                                                          Platform ID: 0x00007fcb9b1c8080

                                                          Name: AMD A10-7850K APU with Radeon(TM) R7 Graphics

                                                          Vendor: AuthenticAMD

                                                          Device OpenCL C version: OpenCL C 1.2

                                                          Driver version: 1445.5 (sse2,avx,fma4)

                                                          Profile: FULL_PROFILE

                                                          Version: OpenCL 1.2 AMD-APP (1445.5)

                                                          Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_amd_svm cl_khr_gl_event
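A long clinfo dump like the one above is easier to scan once reduced to a few fields per device. Here is a minimal Python sketch that does so; it is an illustration rather than an AMD tool, the function name and chosen fields are assumptions, and the sample text is a shortened excerpt of the listing above. It relies on this clinfo layout, where each device section begins with a "Device Type:" line.

```python
def summarize_devices(clinfo_text,
                      fields=("Name", "Max compute units", "Max clock frequency")):
    """Return one dict per device with its type and a few selected fields."""
    devices = []
    for line in clinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Device Type":
            devices.append({"Device Type": value})  # start of a new device section
        elif devices and key in fields:
            devices[-1][key] = value
    return devices


# Shortened excerpt of the clinfo output above.
SAMPLE = """\
Number of platforms: 1
  Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
  Device Type: CL_DEVICE_TYPE_GPU
  Board name: AMD Radeon(TM) R7 Graphics
  Max compute units: 8
  Max clock frequency: 900Mhz
  Name: Spectre
  Device Type: CL_DEVICE_TYPE_CPU
  Max compute units: 4
  Max clock frequency: 1700Mhz
  Name: AMD A10-7850K APU with Radeon(TM) R7 Graphics
"""

if __name__ == "__main__":
    for device in summarize_devices(SAMPLE):
        print(device)
```

Because keys are matched exactly, "Platform Name" and "Board name" lines are not mistaken for the device "Name" field. A GPU entry in the summary shows that the OpenCL runtime sees the iGPU; it does not by itself show that ACML dispatched work to it.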