7 Replies Latest reply on May 1, 2014 3:36 PM by kknox

    GNU Octave 3.8.1 with ACML 6 on Ubuntu 13.10 does not seem to use iGPU on Kaveri platform.

    bsp2020

      I'm trying to get GNU Octave 3.8.1 with ACML 6 on Ubuntu 13.10 to use the iGPU of Kaveri, but it does not seem to work. I was able to build Octave to use ACML mp (ACML and the iGPU of Kaveri), but its performance is the same whether or not the Catalyst driver is installed. Below are the clinfo output and the performance numbers I got. I'd really like to get the iGPU working so I can compare the performance, and I'd appreciate any help.

       

      clinfo output

      $ clinfo

      Number of platforms: 1

        Platform Profile: FULL_PROFILE

        Platform Version: OpenCL 1.2 AMD-APP (1445.5)

        Platform Name: AMD Accelerated Parallel Processing

        Platform Vendor: Advanced Micro Devices, Inc.

        Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_amd_hsa

         

        Platform Name: AMD Accelerated Parallel Processing

      Number of devices: 2

        Device Type: CL_DEVICE_TYPE_GPU

        Vendor ID: 1002h

        Board name: AMD Radeon(TM) R7 Graphics

        Device Topology: PCI[ B#0, D#1, F#0 ]

        Max compute units: 8

        Max work items dimensions: 3

          Max work items[0]: 256

          Max work items[1]: 256

          Max work items[2]: 256

        Max work group size: 256

        Preferred vector width char: 4

        Preferred vector width short: 2

        Preferred vector width int: 1

        Preferred vector width long: 1

        Preferred vector width float: 1

        Preferred vector width double: 1

        Native vector width char: 4

        Native vector width short: 2

        Native vector width int: 1

        Native vector width long: 1

        Native vector width float: 1

        Native vector width double: 1

        Max clock frequency: 900Mhz

        Address bits: 32

        Max memory allocation: 289406976

        Image support: Yes

        Max number of images read arguments: 128

        Max number of images write arguments: 8

        Max image 2D width: 16384

        Max image 2D height: 16384

        Max image 3D width: 2048

        Max image 3D height: 2048

        Max image 3D depth: 2048

        Max samplers within kernel: 16

        Max size of kernel argument: 1024

        Alignment (bits) of base address: 2048

        Minimum alignment (bytes) for any datatype: 128

        Single precision floating point capability

          Denorms: No

          Quiet NaNs: Yes

          Round to nearest even: Yes

          Round to zero: Yes

          Round to +ve and infinity: Yes

          IEEE754-2008 fused multiply-add: Yes

        Cache type: Read/Write

        Cache line size: 64

        Cache size: 16384

        Global memory size: 1157627904

        Constant buffer size: 65536

        Max number of constant args: 8

        Local memory type: Scratchpad

        Local memory size: 32768

        Kernel Preferred work group size multiple: 64

        Error correction support: 0

        Unified memory for Host and Device: 1

        Profiling timer resolution: 1

        Device endianess: Little

        Available: Yes

        Compiler available: Yes

        Execution capabilities:

          Execute OpenCL kernels: Yes

          Execute native function: No

        Queue properties:

          Out-of-Order: No

          Profiling : Yes

        Platform ID: 0x00007fcb9b1c8080

        Name: Spectre

        Vendor: Advanced Micro Devices, Inc.

        Device OpenCL C version: OpenCL C 1.2

        Driver version: 1445.5 (VM)

        Profile: FULL_PROFILE

        Version: OpenCL 1.2 AMD-APP (1445.5)

        Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

         

        Device Type: CL_DEVICE_TYPE_CPU

        Vendor ID: 1002h

        Board name:

        Max compute units: 4

        Max work items dimensions: 3

          Max work items[0]: 1024

          Max work items[1]: 1024

          Max work items[2]: 1024

        Max work group size: 1024

        Preferred vector width char: 16

        Preferred vector width short: 8

        Preferred vector width int: 4

        Preferred vector width long: 2

        Preferred vector width float: 8

        Preferred vector width double: 4

        Native vector width char: 16

        Native vector width short: 8

        Native vector width int: 4

        Native vector width long: 2

        Native vector width float: 8

        Native vector width double: 4

        Max clock frequency: 1700Mhz

        Address bits: 64

        Max memory allocation: 2147483648

        Image support: Yes

        Max number of images read arguments: 128

        Max number of images write arguments: 8

        Max image 2D width: 8192

        Max image 2D height: 8192

        Max image 3D width: 2048

        Max image 3D height: 2048

        Max image 3D depth: 2048

        Max samplers within kernel: 16

        Max size of kernel argument: 4096

        Alignment (bits) of base address: 1024

        Minimum alignment (bytes) for any datatype: 128

        Single precision floating point capability

          Denorms: Yes

          Quiet NaNs: Yes

          Round to nearest even: Yes

          Round to zero: Yes

          Round to +ve and infinity: Yes

          IEEE754-2008 fused multiply-add: Yes

        Cache type: Read/Write

        Cache line size: 64

        Cache size: 16384

        Global memory size: 7184535552

        Constant buffer size: 65536

        Max number of constant args: 8

        Local memory type: Global

        Local memory size: 32768

        Kernel Preferred work group size multiple: 1

        Error correction support: 0

        Unified memory for Host and Device: 1

        Profiling timer resolution: 1

        Device endianess: Little

        Available: Yes

        Compiler available: Yes

        Execution capabilities:

          Execute OpenCL kernels: Yes

          Execute native function: Yes

        Queue properties:

          Out-of-Order: No

          Profiling : Yes

        Platform ID: 0x00007fcb9b1c8080

        Name: AMD A10-7850K APU with Radeon(TM) R7 Graphics

        Vendor: AuthenticAMD

        Device OpenCL C version: OpenCL C 1.2

        Driver version: 1445.5 (sse2,avx,fma4)

        Profile: FULL_PROFILE

        Version: OpenCL 1.2 AMD-APP (1445.5)

        Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_amd_svm cl_khr_gl_event

       

      Some performance numbers I got:

      Linux Default BLAS/LAPACK ->

      octave:1> a = rand(5000,5000,"single");

      octave:2> tic();svd(a);elapsed_time = toc()

      elapsed_time =  59.904

      octave:3> tic();a*a;elapsed_time = toc()

      elapsed_time =  15.024

      octave:4> a = rand(5000,5000);

      octave:5> tic();svd(a);elapsed_time = toc()

      elapsed_time =  132.46

      octave:6> tic();a*a;elapsed_time = toc()

      elapsed_time =  28.595

       

       

      ACML:no GPU ->

      octave:1> a = rand(5000,5000,"single");

      octave:2> tic();svd(a);elapsed_time = toc()

      elapsed_time =  49.774

      octave:3> tic();a*a;elapsed_time = toc()

      elapsed_time =  5.6783

      octave:4> a = rand(5000,5000);

      octave:5> tic();svd(a);elapsed_time = toc()

      elapsed_time =  82.735

      octave:6> tic();a*a;elapsed_time = toc()

      elapsed_time =  11.423

       

      ACML_mp:with and without GPU (with and without Catalyst 14.04 installed) ->

      octave:1> a = rand(5000,5000,"single");

      octave:2> tic();svd(a);elapsed_time = toc()

      elapsed_time =  27.879

      octave:3> tic();a*a;elapsed_time = toc()

      elapsed_time =  2.4581

      octave:4> a = rand(5000,5000);

      octave:5> tic();svd(a);elapsed_time = toc()

      elapsed_time =  52.070

      octave:6> tic();a*a;elapsed_time = toc()

      elapsed_time =  5.0666

        • Re: GNU Octave 3.8.1 with ACML 6 on Ubuntu 13.10 does not seem to use iGPU on Kaveri platform.
          kknox

          Hi bsp2020~

           

          One thing you can do to determine whether work is being assigned to the GPU is to enable the ACML logger by setting the environment variable ACML_LOG_FILTER=1.  This generates a text file in your pwd with a timestamp in the filename.  The logger is still under development; apologies if it looks erratic right now.  Look for lines that contain Threshold, for instance GemmThreshold.  The last flag on such a line is usegpu( x ): if x is 1, the function was offloaded to OpenCL; if x is 0, it was computed on the host.

           

          You can use this log to identify which L3 BLAS routines (if any) are called by svd in Octave.  Once you know that, you can force computation onto the GPU by editing the ACMLScript files.  For instance, to force all GEMM calls to compute on the GPU, open the "./resources/Spectre/gemm.lua" file and change it to unconditionally return true:

           

          function heuristic( transa, transb, m, n, k, alpha_real, alpha_imag, lda, ldb, beta_real, beta_imag, ldc, precision )
          return true
          end
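A minimal Python sketch of the log check described above. The thread only describes the log loosely, so the sample lines and the `count_offloads` helper are hypothetical; the only details taken from the reply are the `Threshold` marker and the trailing `usegpu( x )` flag.

```python
import re

# Matches the flag described in the reply: usegpu( 0 ) or usegpu( 1 ).
USEGPU_RE = re.compile(r"usegpu\(\s*([01])\s*\)")


def count_offloads(lines):
    """Count Threshold lines by where the call ran (hypothetical helper)."""
    counts = {"gpu": 0, "host": 0}
    for line in lines:
        if "Threshold" not in line:
            continue  # only Threshold lines carry the usegpu flag
        flags = USEGPU_RE.findall(line)
        if not flags:
            continue
        # The reply says usegpu is the last flag on the line.
        counts["gpu" if flags[-1] == "1" else "host"] += 1
    return counts


# Made-up sample lines; the real ACML log layout may differ.
sample = [
    "GemmThreshold: m( 5000 ) n( 5000 ) k( 5000 ) usegpu( 1 )",
    "GemmThreshold: m( 32 ) n( 32 ) k( 32 ) usegpu( 0 )",
    "some unrelated log line",
]
print(count_offloads(sample))
```

To use it on a real run, read the lines of the timestamped log file from the working directory and pass them to `count_offloads`.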
          
            • Re: GNU Octave 3.8.1 with ACML 6 on Ubuntu 13.10 does not seem to use iGPU on Kaveri platform.
              bsp2020

              Hi Kent,

              Thanks for the helpful guide. However, it does not seem to be working. I ran export ACML_LOG_FILTER=1, but I do not see any log file created. To keep things simple, I'm now using the examples that came with ACML.

               

              My current setup :

              AMD A10 7850K + ASUS A88X-PRO BIOS 0802, Ubuntu 14.04 + Catalyst 14.04, OpenCL Driver Version 1445.5

              I extracted acml 6.0.3.97 to ~/acml and I am using example files under ~/acml/gfortran64_mp/examples/performance

               

              I verified with ldd that time_dgemm.exe is using acml_mp:

              $ ldd time_dgemm.exe

                linux-vdso.so.1 =>  (0x00007fff203f0000)

                libacml_mp.so => /home/briansp/acml/gfortran64_mp/lib/libacml_mp.so (0x00007fa5343c8000)

                libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007fa534088000)

                libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa533cc0000)

                libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa5339b8000)

                librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa5337b0000)

                libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa5335a8000)

                libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa5332a0000)

                libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fa533090000)

                libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa532e78000)

                libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa532c58000)

                libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007fa532a18000)

                /lib64/ld-linux-x86-64.so.2 (0x00007fa536798000)

              I modified gemm.lua under ~/acml/gfortran64_mp/lib/resources/Spectre

              $ cat ~/acml/gfortran64_mp/lib/resources/Spectre/gemm.lua

              -- Versioning for the Lua script; incremented when the API changes

              VERSION = {

                MAJOR = 0,

                MINOR = 0,

                PATCH = 3,

                TWEAK = 0,

              }

               

               

              --------------------------------------------------------------------

              -- Code above here will probably not need to be modified by users --

              --------------------------------------------------------------------

               

               

              -- Constants used as thresholds to determine loading boundaries

              -- s == single precision constants

              -- d == double precision constants

              -- c == single precision complex constants

              -- z == double precision complex constants

              local tableOfThresholds = {

                s = { m = 64, n = 64, k = 64, psize = 64000000 },

                d = { m = 64, n = 64, k = 64, psize = 64000000 },

                c = { m = 64, n = 64, k = 64, psize = 8000000 },

                z = { m = 64, n = 64, k = 64, psize = 27000000 },

              }

               

               

              -- The heuristic function analyses input parameters, and determines where a given problem

              -- should be computed, on host or device.  The signature is similar in nature to the corresponding

              -- blas API.  Documentation for the individual parameters can be found online in the netlib website.

              -- type( transa ) == string; either 'n' or 't' or 'c'

              -- type( transb ) == string; either 'n' or 't' or 'c'

              -- type( m ) == number

              -- type( n ) == number

              -- type( k ) == number

              -- type( alpha_real ) == number; real portion of a complex number, set for 'real' functions

              -- type( alpha_imag ) == number; imaginary portion of a complex number, 0 for 'real' types

              -- type( lda ) == number

              -- type( ldb ) == number

              -- type( beta_real ) == number; real portion of a complex number, set for 'real' functions

              -- type( beta_imag ) == number; imaginary portion of a complex number, 0 for 'real' types

              -- type( ldc ) == number

              -- type( precision ) == string; either 's' or 'd' or 'c' or 'z'

              -- return boolean expression as integer; true means to offload problem to device, false to compute on host

              function heuristic( transa, transb, m, n, k, alpha_real, alpha_imag, lda, ldb, beta_real, beta_imag, ldc, precision )

                print("gemm.lua is running")

                return true

              end

               

               

              function memalloc( )

                -- supported choices of memory allocation for GEMM are

                -- 1, default flags

                -- 2, zero_copy_at_host

                -- 3, copy rectangular data (good performance if lda, ldb or ldc is big)

               

               

                local memalloc_choice = 1

                --print("memalloc_choice is ",memalloc_choice)

                return memalloc_choice

              end

              It still does not seem to use the GPU.

               

              What am I doing wrong?

               

              Brian

                • Re: GNU Octave 3.8.1 with ACML 6 on Ubuntu 13.10 does not seem to use iGPU on Kaveri platform.
                  kknox

                  Hi bsp2020~

                   

                  We have a theory: we load libOpenCL dynamically with dlopen(), and we might not be finding it correctly on Ubuntu.  If OpenCL is not loaded, the logger is not loaded either (we want to fix this eventually).  Where on your system is libOpenCL.so located?

                   

                  We defined a helper environment variable that we read for the path to the OpenCL shared library.  Try setting OPENCL_LIB_FILE to the full path of the OpenCL shared library, such as /usr/lib64/libOpenCL.so.1, and then run your tests again.
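One way to answer the "where is libOpenCL.so" question is to search the usual library directories. The sketch below is only an illustration: `find_opencl_libs` is a hypothetical helper, and the directory list is an assumption about common Linux layouts, not something ACML itself consults.

```python
import glob
import os


def find_opencl_libs(search_dirs):
    """Return libOpenCL.so* paths found in the given directories, sorted."""
    found = []
    for directory in search_dirs:
        found.extend(glob.glob(os.path.join(directory, "libOpenCL.so*")))
    return sorted(found)


# Assumed candidate directories; extend the list for your distribution.
CANDIDATE_DIRS = ["/usr/lib", "/usr/lib64", "/usr/lib/x86_64-linux-gnu"]

libs = find_opencl_libs(CANDIDATE_DIRS)
if libs:
    # Print a line that can be pasted into the shell before starting Octave.
    print("export OPENCL_LIB_FILE=" + libs[0])
else:
    print("libOpenCL.so not found in", CANDIDATE_DIRS)
```

The variable has to be exported in the same shell that later launches Octave or the ACML example program, since the library reads it from the process environment.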

                    • Re: GNU Octave 3.8.1 with ACML 6 on Ubuntu 13.10 does not seem to use iGPU on Kaveri platform.
                      bsp2020

                      Kent,

                      I think it's now running on the GPU. I set OPENCL_LIB_FILE=/usr/lib/libOpenCL.so.1 and it now seems to work. However, a large double-precision matrix multiplication failed:

                       

                      octave:1> a = rand(10000,10000,"single");

                      octave:2>

                      octave:2> tic();a*a;elapsed_time = toc()

                      elapsed_time =  8.4935

                      octave:3>

                      octave:3> a = rand(10000,10000);

                      octave:4>

                      octave:4> tic();a*a;elapsed_time = toc()

                      V_OpenCL< -61, 268 >: clCreateBuffer( ) failed to allocate new device memory

                      V_OpenCL< -61, 271 >: clCreateBuffer( ) failed to allocate new device memory

                      V_OpenCL< -61, 274 >: clCreateBuffer( ) failed to allocate new device memory

                      V_OpenCL< -61, 282 >: clEnqueueWriteBuffer( ) failed

                      V_OpenCL< -61, 285 >: clEnqueueWriteBuffer( ) failed

                      V_OpenCL< -61, 288 >: clEnqueueWriteBuffer( ) failed

                      V_OpenCL< -1022, 292 >: clBlasGemm( ) failed

                      V_OpenCL< -1022, 295 >: clEnqueueReadBuffer( ) failed

                      elapsed_time =  0.0018559

                      Single precision matrix multiplication was a bit more than 2 times faster running on the GPU. Double precision matrix multiplication on the GPU performed about the same as on the CPU. I expected (hoped?) that single precision performance would be about 4 times that of the CPU, since the theoretical peak performance of the GPU is more than 4 times that of the CPU (AnandTech Portal | Floating point peak performance of Kaveri and other recent AMD and Intel chips). Maybe Kaveri is memory bandwidth limited when performing matrix multiplication?
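The "a bit more than 2 times" figure can be checked from the timings posted in this thread. The sketch below assumes the usual 2*n^3 operation count for an n-by-n matrix product, and assumes the earlier ACML_mp timings reflect CPU-only execution (they were identical with and without the driver). The CPU and GPU runs used different matrix sizes, so the ratio is only a rough comparison.

```python
def gemm_gflops(n, seconds):
    """GFLOP/s for an n-by-n matrix product, counting 2*n^3 operations."""
    return 2.0 * n**3 / seconds / 1e9


# Timings copied from the Octave sessions in this thread.
cpu_single = gemm_gflops(5000, 2.4581)   # ACML_mp, single precision, GPU not in use
gpu_single = gemm_gflops(10000, 8.4935)  # ACML_mp with OPENCL_LIB_FILE set
cpu_double = gemm_gflops(5000, 5.0666)   # ACML_mp, double precision, GPU not in use

print(f"CPU single: {cpu_single:.1f} GFLOP/s")
print(f"GPU single: {gpu_single:.1f} GFLOP/s")
print(f"CPU double: {cpu_double:.1f} GFLOP/s")
print(f"GPU/CPU single-precision ratio: {gpu_single / cpu_single:.2f}")
```

The single-precision ratio comes out a little above 2.3, which matches the observation in the post.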

                        • Re: GNU Octave 3.8.1 with ACML 6 on Ubuntu 13.10 does not seem to use iGPU on Kaveri platform.
                          kknox

                          Unfortunately, the size of the matrix that we can send to the GPU is limited by the OpenCL maximum buffer allocation size.  You can find this limit on your particular system by running clinfo and looking at the property 'Max memory allocation'.  In your example above, a 10000 x 10000 double precision matrix is 763 MB, and the single precision one is 381 MB.  On Kaveri, I assume that the maximum buffer size is 512MB. 
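The size arithmetic above can be reproduced with a few lines of Python. This is a sketch only: the 512 MiB limit is the value assumed in the reply, not a measured one, and the helper names are made up for illustration. The real limit is whatever clinfo reports as 'Max memory allocation' on the machine in question.

```python
MIB = 1024 * 1024


def matrix_bytes(rows, cols, bytes_per_element):
    """Size in bytes of a dense rows-by-cols matrix."""
    return rows * cols * bytes_per_element


def fits_in_buffer(rows, cols, bytes_per_element, max_alloc_bytes):
    """True if one matrix fits in a single OpenCL buffer of the given limit."""
    return matrix_bytes(rows, cols, bytes_per_element) <= max_alloc_bytes


# Assumed limit from the reply; replace with the clinfo value for your device.
assumed_limit = 512 * MIB

for name, width in (("single", 4), ("double", 8)):
    size = matrix_bytes(10000, 10000, width)
    ok = fits_in_buffer(10000, 10000, width, assumed_limit)
    print(f"{name}: {size / MIB:.0f} MiB, fits: {ok}")
```

Under that assumed limit the single precision matrix fits and the double precision one does not, which is consistent with the clCreateBuffer failures reported for the double precision run.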

                          We should do a better job of handling matrices that are too big, and I will file a bug for this.  The easiest solution would be to run the entire problem on the CPU; a more savvy solution would be to tile the matrix into smaller sub-matrices, send them over to the GPU in chunks, and stitch the big matrix back together again. 
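The tiling idea can be sketched in plain Python. This is not how ACML or clBLAS does it; it is a small illustration in which `matmul` stands in for one GPU GEMM call on a tile, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop product; stands in for one GPU GEMM call on a tile."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def tile(m, r0, r1, c0, c1):
    """Copy the sub-matrix m[r0:r1, c0:c1]; slices clamp at the matrix edge."""
    return [row[c0:c1] for row in m[r0:r1]]


def tiled_matmul(a, b, t):
    """Compute a*b by multiplying tiles of side at most t and accumulating."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, t):
        for j0 in range(0, n, t):
            for p0 in range(0, k, t):
                # Each partial product only needs small operand buffers.
                part = matmul(tile(a, i0, i0 + t, p0, p0 + t),
                              tile(b, p0, p0 + t, j0, j0 + t))
                # Stitch the partial result back into the full matrix.
                for di, row in enumerate(part):
                    for dj, value in enumerate(row):
                        c[i0 + di][j0 + dj] += value
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
print(tiled_matmul(a, b, 2) == matmul(a, b))
```

In a real implementation the tile size would be chosen so that the tiles of A, B and C each stay below the device's maximum buffer allocation.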

                          As for performance, we are using the clMathLibraries clBLAS project for GPU acceleration.  If I recall correctly, the GEMM implementation in there is tuned more for VLIW architectures than for the newer GCN architectures, so I do believe that there is room for kernel improvement.  However, because we are doing our best to maintain API compatibility with traditional BLAS, we lose efficiency: each call to BLAS has to copy the data to the OpenCL device and back to maintain memory consistency from the API/user's perspective.  That is a round-trip memcpy() for each buffer on every BLAS call.  Even so, it's great to hear that you are seeing a 2x performance boost on Kaveri (for single precision) in Octave, and all that was required was to rebuild with ACML 6, i.e. no source changes.  This is the goal of ACML 6.