10 Replies Latest reply on Sep 6, 2015 7:11 PM by nirv_knox

    GPU_MAX_ALLOC_PERCENT and 13.1 drivers failure

    liwoog

      My code ran well in production with GPU_MAX_ALLOC_PERCENT set as high as 100 under the 12.4 drivers, but it fails with CL_OUT_OF_RESOURCES under the 13.1 drivers (the code allocates up to 90% of device memory). Lowering the setting from 100 to 80 did not help.

       

      Only being able to use 2GB of the 3GB on the card would render it useless for my next project. I need every bit of memory I can use.

       

      Is there a workaround?

       

      Machine:

      4x HD 7970

      Catalyst 13.1 driver on CentOS 6.3


      Operating System Version (name), Linux version 2.6.32-279.19.1.el6.centos.plus.x86_64 (mockbuild@c6b7.bsys.dev.centos.org) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Dec 19 06:20:23 UTC 2012

       

      Operating System Version (number), 2.6.32

      Number Of Processors, 32

      System Type, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

      Total Physical Memory, 64392 MB

      Available Physical Memory, 62184 MB

      Total Virtual Memory, 33554431 MB

      Available Virtual Memory, 33519322 MB

      Total Page Files, 8191 MB

      Available Page Files, 8191 MB

       

      Platform ID, 1, 1, 1, 1, 1

      Device Type, GPU, GPU, GPU, GPU, CPU

      Device Name, Tahiti, Tahiti, Tahiti, Tahiti, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

      Vendor, Advanced Micro Devices, Inc., Advanced Micro Devices, Inc., Advanced Micro Devices, Inc., Advanced Micro Devices, Inc., GenuineIntel

      Command Queue Properties, Queue profiling, Queue profiling, Queue profiling, Queue profiling, Queue profiling

      Is Available, Yes, Yes, Yes, Yes, Yes

      Is Compiler Available, Yes, Yes, Yes, Yes, Yes

      Is Little Endian, Yes, Yes, Yes, Yes, Yes

      Error Correction Support, No, No, No, No, No

      Execution Capabilities, Kernel Execution, Kernel Execution, Kernel Execution, Kernel Execution, Kernel Execution, Native Kernel Execution

      Global Memory Cache Size, 16 KB, 16 KB, 16 KB, 16 KB, 32 KB

      Memory Cache Type, Read Write, Read Write, Read Write, Read Write, Read Write

      Global Memory Cache Line Size, 64 bytes, 64 bytes, 64 bytes, 64 bytes, 64 bytes

      Global Memory Size, 2,048 MB, 2,048 MB, 2,048 MB, 2,048 MB, 64,393 MB

      Host Unified Memory, No, No, No, No, Yes

      Are Images Supported, Yes, Yes, Yes, Yes, Yes

      Max Image 2D Dimensions, (256w, 256h), (256w, 256h), (256w, 256h), (256w, 256h), (1024w, 1024h)

      Max Image 3D Dimensions, (256w, 256h, 256d), (256w, 256h, 256d), (256w, 256h, 256d), (256w, 256h, 256d), (1024w, 1024h, 1024d)

      Local Memory Size, 32 KB, 32 KB, 32 KB, 32 KB, 32 KB

      Local Memory Type, Local, Local, Local, Local, Global

      Max Clock Frequency, 1050, 1050, 1050, 1050, 1200

      Max Compute Units, 32, 32, 32, 32, 32

      Max Constant Arguments, 8, 8, 8, 8, 8

      Max Constant Buffer Size, 64 KB, 64 KB, 64 KB, 64 KB, 64 KB

      Max Memory Allocation Size, 512 MB, 512 MB, 512 MB, 512 MB, 16,099 MB

      Max Parameter Size, 1,024 bytes, 1,024 bytes, 1,024 bytes, 1,024 bytes, 4 KB

      Read Image Arguments, 128, 128, 128, 128, 128

      Max Samplers, 16, 16, 16, 16, 16

      Max Workgroup Size, 256, 256, 256, 256, 1024

      Max Work Item Dimensions, 3, 3, 3, 3, 3

      Max Work Item Sizes, (256,256,256), (256,256,256), (256,256,256), (256,256,256), (1024,1024,1024)

      Max Write Image Arguments, 8, 8, 8, 8, 8

      Memory Base Address Alignment, 2048, 2048, 2048, 2048, 1024

      Minimal Data Type Alignment Size, 128 bytes, 128 bytes, 128 bytes, 128 bytes, 128 bytes

      OpenCL C Version, OpenCL C 1.2 , OpenCL C 1.2 , OpenCL C 1.2 , OpenCL C 1.2 , OpenCL C 1.2

      Native Char Vector Width, 4, 4, 4, 4, 16

      Native Short Vector Width, 2, 2, 2, 2, 8

      Native Int Vector Width, 1, 1, 1, 1, 4

      Native Long Vector Width, 1, 1, 1, 1, 2

      Native Float Vector Width, 1, 1, 1, 1, 8

      Native Double Vector Width, 1, 1, 1, 1, 4

      Native Half Vector Width, 1, 1, 1, 1, 4

      Preferred Char Vector Width, 4, 4, 4, 4, 16

      Preferred Short Vector Width, 2, 2, 2, 2, 8

      Preferred Int Vector Width, 1, 1, 1, 1, 4

      Preferred Long Vector Width, 1, 1, 1, 1, 2

      Preferred Float Vector Width, 1, 1, 1, 1, 8

      Preferred Double Vector Width, 1, 1, 1, 1, 4

      Preferred Half Vector Width, 1, 1, 1, 1, 4

      Profile, FULL_PROFILE, FULL_PROFILE, FULL_PROFILE, FULL_PROFILE, FULL_PROFILE

      Profiling Timer Resolution, 1, 1, 1, 1, 1

      Vendor ID, OpenCL 1.2 AMD-APP (1113.2), OpenCL 1.2 AMD-APP (1113.2), OpenCL 1.2 AMD-APP (1113.2), OpenCL 1.2 AMD-APP (1113.2), OpenCL 1.2 AMD-APP (1113.2)

        • Re: GPU_MAX_ALLOC_PERCENT and 13.1 drivers failure
          himanshu.gautam

          The output posted above looks like a modified version of clinfo output. Can you share the source? It may help others, as clinfo has an issue when some platforms are OpenCL 1.1 compliant and others are OpenCL 1.2 compliant.

          I will ask the runtime team and let you know if there is a way to enable the full memory. Can you also check with the 12.10 driver (and the 13.2 beta)? Do you still get only 2GB of the 3GB of memory on your Tahiti cards? Thanks for reporting it.

            • Re: GPU_MAX_ALLOC_PERCENT and 13.1 drivers failure
              liwoog

              This was copied and pasted from CodeXL's system info.

               

              Here is the clinfo output:

               

              Number of platforms: 1
                Platform Profile: FULL_PROFILE
                Platform Version: OpenCL 1.2 AMD-APP (1113.2)
                Platform Name: AMD Accelerated Parallel Processing
                Platform Vendor: Advanced Micro Devices, Inc.
                Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

                Platform Name: AMD Accelerated Parallel Processing
              Number of devices: 5
                Device Type: CL_DEVICE_TYPE_GPU
                Device ID: 4098
                Board name: AMD Radeon HD 7900 Series
                Device Topology: PCI[ B#2, D#0, F#0 ]
                Max compute units: 32
                Max work items dimensions: 3
                  Max work items[0]: 256
                  Max work items[1]: 256
                  Max work items[2]: 256
                Max work group size: 256
                Preferred vector width char: 4
                Preferred vector width short: 2
                Preferred vector width int: 1
                Preferred vector width long: 1
                Preferred vector width float: 1
                Preferred vector width double: 1
                Native vector width char: 4
                Native vector width short: 2
                Native vector width int: 1
                Native vector width long: 1
                Native vector width float: 1
                Native vector width double: 1
                Max clock frequency: 1050Mhz
                Address bits: 32
                Max memory allocation: 536870912
                Image support: Yes
                Max number of images read arguments: 128
                Max number of images write arguments: 8
                Max image 2D width: 16384
                Max image 2D height: 16384
                Max image 3D width: 2048
                Max image 3D height: 2048
                Max image 3D depth: 2048
                Max samplers within kernel: 16
                Max size of kernel argument: 1024
                Alignment (bits) of base address: 2048
                Minimum alignment (bytes) for any datatype: 128
                Single precision floating point capability
                  Denorms: No
                  Quiet NaNs: Yes
                  Round to nearest even: Yes
                  Round to zero: Yes
                  Round to +ve and infinity: Yes
                  IEEE754-2008 fused multiply-add: Yes
                Cache type: Read/Write
                Cache line size: 64
                Cache size: 16384
                Global memory size: 2147483648
                Constant buffer size: 65536
                Max number of constant args: 8
                Local memory type: Scratchpad
                Local memory size: 32768
                Kernel Preferred work group size multiple: 64
                Error correction support: 0
                Unified memory for Host and Device: 0
                Profiling timer resolution: 1
                Device endianess: Little
                Available: Yes
                Compiler available: Yes
                Execution capabilities:
                  Execute OpenCL kernels: Yes
                  Execute native function: No
                Queue properties:
                  Out-of-Order: No
                  Profiling : Yes
                Platform ID: 0x00007ffab08f64e0
                Name: Tahiti
                Vendor: Advanced Micro Devices, Inc.
                Device OpenCL C version: OpenCL C 1.2
                Driver version: 1113.2 (VM)
                Profile: FULL_PROFILE
                Version: OpenCL 1.2 AMD-APP (1113.2)
                Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_amd_c1x_atomics

               

               

               

               

                Device Type: CL_DEVICE_TYPE_GPU
                Device ID: 4098
                Board name: AMD Radeon HD 7900 Series
                Device Topology: PCI[ B#3, D#0, F#0 ]
                Max compute units: 32
                Max work items dimensions: 3
                  Max work items[0]: 256
                  Max work items[1]: 256
                  Max work items[2]: 256
                Max work group size: 256
                Preferred vector width char: 4
                Preferred vector width short: 2
                Preferred vector width int: 1
                Preferred vector width long: 1
                Preferred vector width float: 1
                Preferred vector width double: 1
                Native vector width char: 4
                Native vector width short: 2
                Native vector width int: 1
                Native vector width long: 1
                Native vector width float: 1
                Native vector width double: 1
                Max clock frequency: 1050Mhz
                Address bits: 32
                Max memory allocation: 536870912
                Image support: Yes
                Max number of images read arguments: 128
                Max number of images write arguments: 8
                Max image 2D width: 16384
                Max image 2D height: 16384
                Max image 3D width: 2048
                Max image 3D height: 2048
                Max image 3D depth: 2048
                Max samplers within kernel: 16
                Max size of kernel argument: 1024
                Alignment (bits) of base address: 2048
                Minimum alignment (bytes) for any datatype: 128
                Single precision floating point capability
                  Denorms: No
                  Quiet NaNs: Yes
                  Round to nearest even: Yes
                  Round to zero: Yes
                  Round to +ve and infinity: Yes
                  IEEE754-2008 fused multiply-add: Yes
                Cache type: Read/Write
                Cache line size: 64
                Cache size: 16384
                Global memory size: 2147483648
                Constant buffer size: 65536
                Max number of constant args: 8
                Local memory type: Scratchpad
                Local memory size: 32768
                Kernel Preferred work group size multiple: 64
                Error correction support: 0
                Unified memory for Host and Device: 0
                Profiling timer resolution: 1
                Device endianess: Little
                Available: Yes
                Compiler available: Yes
                Execution capabilities:
                  Execute OpenCL kernels: Yes
                  Execute native function: No
                Queue properties:
                  Out-of-Order: No
                  Profiling : Yes
                Platform ID: 0x00007ffab08f64e0
                Name: Tahiti
                Vendor: Advanced Micro Devices, Inc.
                Device OpenCL C version: OpenCL C 1.2
                Driver version: 1113.2 (VM)
                Profile: FULL_PROFILE
                Version: OpenCL 1.2 AMD-APP (1113.2)
                Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_amd_c1x_atomics

               

               

               

               

                Device Type: CL_DEVICE_TYPE_GPU
                Device ID: 4098
                Board name: AMD Radeon HD 7900 Series
                Device Topology: PCI[ B#-125, D#0, F#0 ]
                Max compute units: 32
                Max work items dimensions: 3
                  Max work items[0]: 256
                  Max work items[1]: 256
                  Max work items[2]: 256
                Max work group size: 256
                Preferred vector width char: 4
                Preferred vector width short: 2
                Preferred vector width int: 1
                Preferred vector width long: 1
                Preferred vector width float: 1
                Preferred vector width double: 1
                Native vector width char: 4
                Native vector width short: 2
                Native vector width int: 1
                Native vector width long: 1
                Native vector width float: 1
                Native vector width double: 1
                Max clock frequency: 1050Mhz
                Address bits: 32
                Max memory allocation: 536870912
                Image support: Yes
                Max number of images read arguments: 128
                Max number of images write arguments: 8
                Max image 2D width: 16384
                Max image 2D height: 16384
                Max image 3D width: 2048
                Max image 3D height: 2048
                Max image 3D depth: 2048
                Max samplers within kernel: 16
                Max size of kernel argument: 1024
                Alignment (bits) of base address: 2048
                Minimum alignment (bytes) for any datatype: 128
                Single precision floating point capability
                  Denorms: No
                  Quiet NaNs: Yes
                  Round to nearest even: Yes
                  Round to zero: Yes
                  Round to +ve and infinity: Yes
                  IEEE754-2008 fused multiply-add: Yes
                Cache type: Read/Write
                Cache line size: 64
                Cache size: 16384
                Global memory size: 2147483648
                Constant buffer size: 65536
                Max number of constant args: 8
                Local memory type: Scratchpad
                Local memory size: 32768
                Kernel Preferred work group size multiple: 64
                Error correction support: 0
                Unified memory for Host and Device: 0
                Profiling timer resolution: 1
                Device endianess: Little
                Available: Yes
                Compiler available: Yes
                Execution capabilities:
                  Execute OpenCL kernels: Yes
                  Execute native function: No
                Queue properties:
                  Out-of-Order: No
                  Profiling : Yes
                Platform ID: 0x00007ffab08f64e0
                Name: Tahiti
                Vendor: Advanced Micro Devices, Inc.
                Device OpenCL C version: OpenCL C 1.2
                Driver version: 1113.2 (VM)
                Profile: FULL_PROFILE
                Version: OpenCL 1.2 AMD-APP (1113.2)
                Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_amd_c1x_atomics

               

               

               

               

                Device Type: CL_DEVICE_TYPE_GPU
                Device ID: 4098
                Board name: AMD Radeon HD 7900 Series
                Device Topology: PCI[ B#-124, D#0, F#0 ]
                Max compute units: 32
                Max work items dimensions: 3
                  Max work items[0]: 256
                  Max work items[1]: 256
                  Max work items[2]: 256
                Max work group size: 256
                Preferred vector width char: 4
                Preferred vector width short: 2
                Preferred vector width int: 1
                Preferred vector width long: 1
                Preferred vector width float: 1
                Preferred vector width double: 1
                Native vector width char: 4
                Native vector width short: 2
                Native vector width int: 1
                Native vector width long: 1
                Native vector width float: 1
                Native vector width double: 1
                Max clock frequency: 1050Mhz
                Address bits: 32
                Max memory allocation: 536870912
                Image support: Yes
                Max number of images read arguments: 128
                Max number of images write arguments: 8
                Max image 2D width: 16384
                Max image 2D height: 16384
                Max image 3D width: 2048
                Max image 3D height: 2048
                Max image 3D depth: 2048
                Max samplers within kernel: 16
                Max size of kernel argument: 1024
                Alignment (bits) of base address: 2048
                Minimum alignment (bytes) for any datatype: 128
                Single precision floating point capability
                  Denorms: No
                  Quiet NaNs: Yes
                  Round to nearest even: Yes
                  Round to zero: Yes
                  Round to +ve and infinity: Yes
                  IEEE754-2008 fused multiply-add: Yes
                Cache type: Read/Write
                Cache line size: 64
                Cache size: 16384
                Global memory size: 2147483648
                Constant buffer size: 65536
                Max number of constant args: 8
                Local memory type: Scratchpad
                Local memory size: 32768
                Kernel Preferred work group size multiple: 64
                Error correction support: 0
                Unified memory for Host and Device: 0
                Profiling timer resolution: 1
                Device endianess: Little
                Available: Yes
                Compiler available: Yes
                Execution capabilities:
                  Execute OpenCL kernels: Yes
                  Execute native function: No
                Queue properties:
                  Out-of-Order: No
                  Profiling : Yes
                Platform ID: 0x00007ffab08f64e0
                Name: Tahiti
                Vendor: Advanced Micro Devices, Inc.
                Device OpenCL C version: OpenCL C 1.2
                Driver version: 1113.2 (VM)
                Profile: FULL_PROFILE
                Version: OpenCL 1.2 AMD-APP (1113.2)
                Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_amd_c1x_atomics

               

               

               

               

                Device Type: CL_DEVICE_TYPE_CPU
                Device ID: 4098
                Board name:
                Max compute units: 32
                Max work items dimensions: 3
                  Max work items[0]: 1024
                  Max work items[1]: 1024
                  Max work items[2]: 1024
                Max work group size: 1024
                Preferred vector width char: 16
                Preferred vector width short: 8
                Preferred vector width int: 4
                Preferred vector width long: 2
                Preferred vector width float: 8
                Preferred vector width double: 4
                Native vector width char: 16
                Native vector width short: 8
                Native vector width int: 4
                Native vector width long: 2
                Native vector width float: 8
                Native vector width double: 4
                Max clock frequency: 2601Mhz
                Address bits: 64
                Max memory allocation: 16880146432
                Image support: Yes
                Max number of images read arguments: 128
                Max number of images write arguments: 8
                Max image 2D width: 8192
                Max image 2D height: 8192
                Max image 3D width: 2048
                Max image 3D height: 2048
                Max image 3D depth: 2048
                Max samplers within kernel: 16
                Max size of kernel argument: 4096
                Alignment (bits) of base address: 1024
                Minimum alignment (bytes) for any datatype: 128
                Single precision floating point capability
                  Denorms: Yes
                  Quiet NaNs: Yes
                  Round to nearest even: Yes
                  Round to zero: Yes
                  Round to +ve and infinity: Yes
                  IEEE754-2008 fused multiply-add: Yes
                Cache type: Read/Write
                Cache line size: 64
                Cache size: 32768
                Global memory size: 67520585728
                Constant buffer size: 65536
                Max number of constant args: 8
                Local memory type: Global
                Local memory size: 32768
                Kernel Preferred work group size multiple: 1
                Error correction support: 0
                Unified memory for Host and Device: 1
                Profiling timer resolution: 1
                Device endianess: Little
                Available: Yes
                Compiler available: Yes
                Execution capabilities:
                  Execute OpenCL kernels: Yes
                  Execute native function: Yes
                Queue properties:
                  Out-of-Order: No
                  Profiling : Yes
                Platform ID: 0x00007ffab08f64e0
                Name: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
                Vendor: GenuineIntel
                Device OpenCL C version: OpenCL C 1.2
                Driver version: 1113.2 (sse2,avx)
                Profile: FULL_PROFILE
                Version: OpenCL 1.2 AMD-APP (1113.2)
                Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt
              • Re: GPU_MAX_ALLOC_PERCENT and 13.1 drivers failure
                liwoog

                With 12.8 and up, I can run with GPU_MAX_ALLOC_PERCENT set as high as 45. Thankfully, that also raises the reported global memory size to 3074424832, which is the important factor.
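                To put the figures quoted in this thread side by side, here is a rough arithmetic sketch. It assumes that GPU_MAX_ALLOC_PERCENT caps a single allocation at that percentage of the reported global memory size; the thread does not confirm this reading, so treat the computed cap as an estimate. The helper name `percent_of` is made up for illustration.

```python
# Figures quoted in this thread, in bytes.
GLOBAL_MEM_DEFAULT = 2147483648  # clinfo "Global memory size" for each Tahiti device
MAX_ALLOC_DEFAULT = 536870912    # clinfo "Max memory allocation" for each Tahiti device
GLOBAL_MEM_RAISED = 3074424832   # global memory size reported with the setting at 45

MIB = 1024 * 1024


def percent_of(total_bytes, percent):
    """Return `percent` percent of `total_bytes`, rounded down to whole bytes."""
    return total_bytes * percent // 100


# Ratio of the default per-buffer limit to the default global memory size.
default_ratio = MAX_ALLOC_DEFAULT * 100 // GLOBAL_MEM_DEFAULT

# ASSUMPTION: the percentage applies to the reported global memory size.
assumed_cap = percent_of(GLOBAL_MEM_RAISED, 45)

print("default max alloc is", default_ratio, "% of default global memory")
print("raised global memory:", GLOBAL_MEM_RAISED // MIB, "MiB")
print("assumed per-buffer cap at 45%:", assumed_cap // MIB, "MiB")
```

                If that assumption holds, the values actually in effect can be confirmed by rerunning clinfo with the variable set and reading "Max memory allocation" and "Global memory size" again.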

              • Re: GPU_MAX_ALLOC_PERCENT and 13.1 drivers failure
                dipak

                Hi,

                I'm reviving this thread.

                In the thread "Large buffers", drallan described a workaround for allocating large amounts of memory. You could try it and check whether it works for you.

                 

                Regards,

                • Re: GPU_MAX_ALLOC_PERCENT and 13.1 drivers failure
                  nirv_knox

                  Use the following command to list the environment variables supported by your GPU devices.

                  :~$ strings /usr/lib/libamdocl64.so | grep GPU

                  From that list you can set the percentages accordingly. You can also check whether variables such as GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_PERCENT are available, and which other variables can be tuned to make better use of your hardware. For example, if your card has a global memory size of 1024 MB, you might use two buffers of 512 MB each, or a single buffer with the full 1024 MB allotted. The choice is yours.
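                  The buffer-splitting choice above can be sketched as a small planning step done before any OpenCL call. This is only an illustration: `plan_buffers` is a hypothetical helper, not part of any OpenCL binding, and the 90% target and the device limits are the figures quoted earlier in this thread.

```python
MIB = 1024 * 1024


def plan_buffers(total_bytes, max_alloc_bytes):
    """Split total_bytes into buffer sizes that each fit within max_alloc_bytes."""
    if total_bytes < 0 or max_alloc_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_alloc_bytes must be > 0")
    full, rest = divmod(total_bytes, max_alloc_bytes)
    sizes = [max_alloc_bytes] * full
    if rest:
        sizes.append(rest)  # last buffer holds whatever is left over
    return sizes


# 1024 MB card with a 512 MB per-buffer limit: two buffers of 512 MB each.
two_buffers = plan_buffers(1024 * MIB, 512 * MIB)

# Same card when the per-buffer limit allows the full 1024 MB: one buffer.
one_buffer = plan_buffers(1024 * MIB, 1024 * MIB)

# Figures from the clinfo output in this thread: 90% of 2147483648 bytes
# under a 536870912-byte per-buffer limit.
target = 2147483648 * 90 // 100
thread_plan = plan_buffers(target, 536870912)

print(len(two_buffers), len(one_buffer), len(thread_plan))
```

                  Planning the sizes up front keeps each request within the limit the device reports for a single buffer. Whether the total can actually be allocated still depends on the driver, as the original report shows.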