7 Replies Latest reply on Feb 24, 2015 3:41 AM by dipak

    BufferBandwidth results on Kaveri

    yurtesen

      Hello,

      I was wondering why the CPU reads/writes are so slow in the BufferBandwidth example, regardless of whether the memory is allocated on the host or not. Also, why are the GPU writes slow when the kernel is writing to host memory?

      Device  0        Spectre
      Build:           release
      GPU work items:  8192
      Buffer size:     33554432
      CPU workers:     1
      Timing loops:    20
      Repeats:         1
      Kernel loops:    20
      inputBuffer:     CL_MEM_READ_ONLY
      outputBuffer:    CL_MEM_WRITE_ONLY

       

      Host baseline (naive):

       

      Timer resolution 256.22  ns
      Page fault       942.38  ns
      CPU read         6.28 GB/s
      memcpy()         8.81 GB/s
      memset(,1,)      6.87 GB/s
      memset(,0,)      6.87 GB/s

       

       

      AVERAGES (over loops 2 - 19, use -l for complete log)

      --------

       

       

      1. Host mapped write to inputBuffer

      ---------------------------------------|---------------

      clEnqueueMapBuffer -- WRITE (GBPS) | 2331.320

      ---------------------------------------|---------------

      memset() (GBPS)                    | 6.717

      ---------------------------------------|---------------

      clEnqueueUnmapMemObject() (GBPS)   | 10.404

       

       

      2. GPU kernel read of inputBuffer

      ---------------------------------------|---------------

      clEnqueueNDRangeKernel() (GBPS)    | 29.747

       

      Verification Passed!

       

       

      3. GPU kernel write to outputBuffer

      ---------------------------------------|---------------

      clEnqueueNDRangeKernel() (GBPS)    | 23.172

       

       

      4. Host mapped read of outputBuffer

      ---------------------------------------|---------------

      clEnqueueMapBuffer -- READ (GBPS)  | 10.927

      ---------------------------------------|---------------

      CPU read (GBPS)                    | 6.228

      ---------------------------------------|---------------

      clEnqueueUnmapMemObject() (GBPS)   | 645.145

       

       

       

       

      Device  0        Spectre
      Build:           release
      GPU work items:  8192
      Buffer size:     33554432
      CPU workers:     1
      Timing loops:    20
      Repeats:         1
      Kernel loops:    20
      inputBuffer:     CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
      outputBuffer:    CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

       

      Host baseline (naive):

       

      Timer resolution 256.48  ns
      Page fault       974.34  ns
      CPU read         6.15 GB/s
      memcpy()         8.82 GB/s
      memset(,1,)      6.73 GB/s
      memset(,0,)      6.72 GB/s

       

       

      AVERAGES (over loops 2 - 19, use -l for complete log)

      --------

       

       

      1. Host mapped write to inputBuffer

      ---------------------------------------|---------------

      clEnqueueMapBuffer -- WRITE (GBPS) | 2880.703

      ---------------------------------------|---------------

      memset() (GBPS)                    | 9.079

      ---------------------------------------|---------------

      clEnqueueUnmapMemObject() (GBPS)   | 917.657

       

       

      2. GPU kernel read of inputBuffer

      ---------------------------------------|---------------

      clEnqueueNDRangeKernel() (GBPS)    | 28.579

       

      Verification Passed!

       

       

      3. GPU kernel write to outputBuffer

      ---------------------------------------|---------------

      clEnqueueNDRangeKernel() (GBPS)    | 8.098

       

       

      4. Host mapped read of outputBuffer

      ---------------------------------------|---------------

      clEnqueueMapBuffer -- READ (GBPS)  | 3166.840

      ---------------------------------------|---------------

      CPU read (GBPS)                    | 6.195

      ---------------------------------------|---------------

      clEnqueueUnmapMemObject() (GBPS)   | 794.376

       

       

       

      Thanks,

      Evren

        • Re: BufferBandwidth results on Kaveri
          dipak

          Hi Evren,

          I guess this is somewhat expected. Most numbers are similar in both scenarios, except for the following cases. Please find my comments on those cases below.

           

          Case | GPU memory (GBPS) | ALLOC_HOST memory (GBPS) | Comments
          Host mapped write to inputBuffer - clEnqueueUnmapMemObject() | 10.404 | 917.657 | The GPU-memory case is lower because, during the unmap, the data has to be transferred to GPU memory, and writes to GPU memory from the host are slower than writes to host memory.
          GPU kernel write to outputBuffer - clEnqueueNDRangeKernel() | 23.172 | 8.098 | The ALLOC_HOST case is lower because the write from the GPU to host memory goes through a slower memory bus (Onion) than the GPU memory bus (Garlic).
          Host mapped read of outputBuffer - clEnqueueMapBuffer -- READ | 10.927 | 3166.84 | The GPU-memory case is slower because the host has to read from GPU memory, which is much slower than reading from host memory.
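One way to read these three cases is to convert each GBPS figure back into elapsed time for the 33554432-byte buffer. The sketch below assumes the sample reports GBPS as bytes / seconds / 1e9 (an assumption, not checked against the sample source); the helper name `elapsed_us` is made up for illustration.

```python
BUFFER_BYTES = 33554432  # "Buffer size" from the logs above


def elapsed_us(gbps, nbytes=BUFFER_BYTES):
    """Microseconds implied by a reported GBPS figure.

    Assumes GBPS = bytes / seconds / 1e9 (not verified against the sample).
    """
    return nbytes / (gbps * 1e9) * 1e6


# (label, GPU-memory GBPS, ALLOC_HOST GBPS), copied from the two runs above
cases = [
    ("unmap after host write", 10.404, 917.657),
    ("GPU kernel write", 23.172, 8.098),
    ("map for host read", 10.927, 3166.84),
]

for label, gpu, host in cases:
    print(f"{label:24s} GPU memory: {elapsed_us(gpu):9.1f} us   "
          f"ALLOC_HOST: {elapsed_us(host):9.1f} us")
```

Under that assumption, the calls reported at hundreds or thousands of GBPS complete in tens of microseconds, which is too short to have moved 32 MiB and so probably reflects a map/unmap without a copy. The entries around 10 GBPS take a few milliseconds, which is consistent with an actual transfer.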

           

          Regards,

            • Re: Re: BufferBandwidth results on Kaveri
              yurtesen

              Hmm, but why are ALLOC_Host memory reads fast while writes are slow?

               

              Also, I modified BufferBandwidth to see how the CPU would perform (here the "GPU kernel" reads/writes are made by a kernel running on the CPU device). The kernel read/write speeds are super low. Is this normal? Why?
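To put "super low" into numbers, the sketch below compares the kernel figures on the CPU device with the naive host baseline from the same run. All values are copied from the log that follows in this post; the variable names are only for illustration, and the comparison is rough because the host baseline and the OpenCL kernel do not run the same code.

```python
# Figures from the CPU-device run in this post (GB/s).
host_cpu_read = 6.16   # "CPU read" in the naive host baseline
host_memset = 6.69     # "memset(,1,)" in the naive host baseline
kernel_read = 0.709    # "GPU kernel read of inputBuffer" on the CPU device
kernel_write = 0.347   # "GPU kernel write to outputBuffer" on the CPU device

# How many times slower the OpenCL kernel is than the plain host loop.
read_gap = host_cpu_read / kernel_read
write_gap = host_memset / kernel_write

print(f"kernel read is about {read_gap:.1f}x slower than the host read loop")
print(f"kernel write is about {write_gap:.1f}x slower than host memset")
```

With these figures the gap is roughly 9x for reads and 19x for writes, even though the host baseline uses a single CPU worker.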

               

              Device  0        AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
              Build:           release
              GPU work items:  4096
              Buffer size:     33554432
              CPU workers:     1
              Timing loops:    20
              Repeats:         1
              Kernel loops:    20
              inputBuffer:     CL_MEM_READ_ONLY
              outputBuffer:    CL_MEM_WRITE_ONLY

               

              Host baseline (naive):

               

              Timer resolution 256.64  ns
              Page fault       971.97  ns
              CPU read         6.16 GB/s
              memcpy()         4.08 GB/s
              memset(,1,)      6.69 GB/s
              memset(,0,)      6.71 GB/s

               

               

              AVERAGES (over loops 2 - 19, use -l for complete log)

              --------

               

               

              1. Host mapped write to inputBuffer

              ---------------------------------------|---------------

              clEnqueueMapBuffer -- WRITE (GBPS) | 4060.750

              ---------------------------------------|---------------

              memset() (GBPS)                    | 6.667

              ---------------------------------------|---------------

              clEnqueueUnmapMemObject() (GBPS)   | 726.832

               

               

              2. GPU kernel read of inputBuffer

              ---------------------------------------|---------------

              clEnqueueNDRangeKernel() (GBPS)    | 0.709

               

              Verification Passed!

               

               

              3. GPU kernel write to outputBuffer

              ---------------------------------------|---------------

              clEnqueueNDRangeKernel() (GBPS)    | 0.347

               

               

              4. Host mapped read of outputBuffer

              ---------------------------------------|---------------

              clEnqueueMapBuffer -- READ (GBPS)  | 1201.271

              ---------------------------------------|---------------

              CPU read (GBPS)                    | 6.247

              ---------------------------------------|---------------

              clEnqueueUnmapMemObject() (GBPS)   | 706.588

               

              Verification Passed!

               

               

              • Re: Re: BufferBandwidth results on Kaveri
                yurtesen

                Hello Dipak,

                 

                I also tried the new version of BufferBandwidth from the new SDK. Kernel reads are still super slow. Shouldn't the bandwidth be higher?

                $ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu

                Platform 0 : Advanced Micro Devices, Inc.

                Platform 1 : Intel(R) Corporation

                Platform found : Advanced Micro Devices, Inc.

                 

                Selected Platform Vendor : Advanced Micro Devices, Inc.

                Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x262bea0

                Build:               release

                GPU work items:      4096

                Buffer size:         33554432

                CPU workers:         1

                Timing loops:        20

                Repeats:             1

                Kernel loops:        20

                inputBuffer:         CL_MEM_READ_ONLY

                outputBuffer:        CL_MEM_WRITE_ONLY

                 

                Host baseline (naive):

                 

                Timer resolution     1000.52 ns

                Page fault           836.79  ns

                CPU read             6.38 GB/s

                memcpy()             8.79 GB/s

                memset(,1,)          6.70 GB/s

                memset(,0,)          6.70 GB/s

                 

                 

                AVERAGES (over loops 2 - 19, use -l for complete log)

                --------

                 

                 

                1. Host mapped write to inputBuffer

                ---------------------------------------|---------------

                clEnqueueMapBuffer -- WRITE (GBPS)     | 4712.193

                ---------------------------------------|---------------

                memset() (GBPS)                        | 6.675

                ---------------------------------------|---------------

                clEnqueueUnmapMemObject() (GBPS)       | 555.184

                 

                 

                2. GPU kernel read of inputBuffer

                ---------------------------------------|---------------

                clEnqueueNDRangeKernel() (GBPS)        | 0.709

                 

                Verification Passed!

                 

                 

                3. GPU kernel write to outputBuffer

                ---------------------------------------|---------------

                clEnqueueNDRangeKernel() (GBPS)        | 0.349

                 

                 

                4. Host mapped read of outputBuffer

                ---------------------------------------|---------------

                clEnqueueMapBuffer -- READ (GBPS)      | 1100.001

                ---------------------------------------|---------------

                CPU read (GBPS)                        | 6.270

                ---------------------------------------|---------------

                clEnqueueUnmapMemObject() (GBPS)       | 659.707

                 

                Verification Passed!

                 

                 

                Passed!

                 

                 

                also

                 

                $ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu -if 5 -of 5 -cf 5

                Platform 0 : Advanced Micro Devices, Inc.

                Platform 1 : Intel(R) Corporation

                Platform found : Advanced Micro Devices, Inc.

                 

                Selected Platform Vendor : Advanced Micro Devices, Inc.

                Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x27e0c60

                Build:               release

                GPU work items:      4096

                Buffer size:         33554432

                CPU workers:         1

                Timing loops:        20

                Repeats:             1

                Kernel loops:        20

                inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR

                outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

                 

                Host baseline (naive):

                 

                Timer resolution     1000.65 ns

                Page fault           875.38  ns

                CPU read             6.38 GB/s

                memcpy()             8.87 GB/s

                memset(,1,)          6.93 GB/s

                memset(,0,)          6.92 GB/s

                 

                 

                AVERAGES (over loops 2 - 19, use -l for complete log)

                --------

                 

                 

                1. Host mapped write to inputBuffer

                ---------------------------------------|---------------

                clEnqueueMapBuffer -- WRITE (GBPS)     | 3847.436

                ---------------------------------------|---------------

                memset() (GBPS)                        | 6.853

                ---------------------------------------|---------------

                clEnqueueUnmapMemObject() (GBPS)       | 588.324

                 

                 

                2. GPU kernel read of inputBuffer

                ---------------------------------------|---------------

                clEnqueueNDRangeKernel() (GBPS)        | 0.720

                 

                Verification Passed!

                 

                 

                3. GPU kernel write to outputBuffer

                ---------------------------------------|---------------

                clEnqueueNDRangeKernel() (GBPS)        | 0.352

                 

                 

                4. Host mapped read of outputBuffer

                ---------------------------------------|---------------

                clEnqueueMapBuffer -- READ (GBPS)      | 1152.796

                ---------------------------------------|---------------

                CPU read (GBPS)                        | 6.320

                ---------------------------------------|---------------

                clEnqueueUnmapMemObject() (GBPS)       | 707.233

                 

                Verification Passed!

                 

                 

                Passed!

                 

                  • Re: BufferBandwidth results on Kaveri
                    dipak

                    Could you please share your setup details, such as the OS, Catalyst driver version, etc.? Please also share your clinfo output.

                     

                    Regards,

                      • Re: Re: BufferBandwidth results on Kaveri
                        yurtesen

                        Dipak, it is a normal Kaveri system with an ASRock motherboard. The clinfo and dmidecode outputs are below. I am using the Omega drivers with the newest 3.0 Beta SDK (but older SDKs give the same result). The OS is Ubuntu 14.04.

                         

                         

                        Number of platforms:    2
                          Platform Profile:    FULL_PROFILE
                          Platform Version:    OpenCL 1.2 LINUX
                          Platform Name:    Intel(R) OpenCL
                          Platform Vendor:    Intel(R) Corporation
                          Platform Extensions:    cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread
                          Platform Profile:    FULL_PROFILE
                          Platform Version:    OpenCL 2.0 AMD-APP (1642.5)
                          Platform Name:    AMD Accelerated Parallel Processing
                          Platform Vendor:    Advanced Micro Devices, Inc.
                          Platform Extensions:    cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

                         

                         

                          Platform Name:    Intel(R) OpenCL
                        Number of devices:    1
                          Device Type:    CL_DEVICE_TYPE_CPU
                          Vendor ID:    8086h
                          Max compute units:    4
                          Max work items dimensions:    3
                        Max work items[0]:    1024
                        Max work items[1]:    1024
                        Max work items[2]:    1024
                          Max work group size:    1024
                          Preferred vector width char:    16
                          Preferred vector width short:    8
                          Preferred vector width int:    4
                          Preferred vector width long:    2
                          Preferred vector width float:    4
                          Preferred vector width double:    2
                          Native vector width char:    16
                          Native vector width short:    8
                          Native vector width int:    4
                          Native vector width long:    2
                          Native vector width float:    4
                          Native vector width double:    2
                          Max clock frequency:    0Mhz
                          Address bits:    64
                          Max memory allocation:    3641622528
                          Image support:    Yes
                          Max number of images read arguments:    480
                          Max number of images write arguments:    480
                          Max image 2D width:    16384
                          Max image 2D height:    16384
                          Max image 3D width:    2048
                          Max image 3D height:    2048
                          Max image 3D depth:    2048
                          Max samplers within kernel:    480
                          Max size of kernel argument:    3840
                          Alignment (bits) of base address:    1024

                          Minimum alignment (bytes) for any datatype:     128

                          Single precision floating point capability

                        Denorms:    Yes
                        Quiet NaNs:    Yes
                        Round to nearest even:    Yes
                        Round to zero:    No
                        Round to +ve and infinity:    No
                        IEEE754-2008 fused multiply-add:    No
                          Cache type:    Read/Write
                          Cache line size:    64
                          Cache size:    2097152
                          Global memory size:    14566490112
                          Constant buffer size:    131072
                          Max number of constant args:    480
                          Local memory type:    Global
                          Local memory size:    32768

                          Kernel Preferred work group size multiple:     128

                          Error correction support:    0
                          Unified memory for Host and Device:    1
                          Profiling timer resolution:    1
                          Device endianess:    Little
                          Available:    Yes
                          Compiler available:    Yes
                          Execution capabilities: 
                        Execute OpenCL kernels:    Yes
                        Execute native function:    Yes
                          Queue on Host properties: 
                        Out-of-Order:    Yes
                        Profiling :    Yes
                          Platform ID:    0x256e700
                          Name:    AMD A10-7850K APU with Radeon(TM) R7 Graphics
                          Vendor:    Intel(R) Corporation
                          Device OpenCL C version:    OpenCL C 1.2
                          Driver version:    1.2
                          Profile:    FULL_PROFILE
                          Version:    OpenCL 1.2 (Build 56860)
                          Extensions:    cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread

                         

                         

                          Platform Name:    AMD Accelerated Parallel Processing
                        Number of devices:    2
                          Device Type:    CL_DEVICE_TYPE_GPU
                          Vendor ID:    1002h
                          Board name:    AMD Radeon(TM) R7 Graphics
                          Device Topology:    PCI[ B#0, D#1, F#0 ]
                          Max compute units:    8
                          Max work items dimensions:    3
                        Max work items[0]:    256
                        Max work items[1]:    256
                        Max work items[2]:    256
                          Max work group size:    256
                          Preferred vector width char:    4
                          Preferred vector width short:    2
                          Preferred vector width int:    1
                          Preferred vector width long:    1
                          Preferred vector width float:    1
                          Preferred vector width double:    1
                          Native vector width char:    4
                          Native vector width short:    2
                          Native vector width int:    1
                          Native vector width long:    1
                          Native vector width float:    1
                          Native vector width double:    1
                          Max clock frequency:    900Mhz
                          Address bits:    64
                          Max memory allocation:    1206806118
                          Image support:    Yes
                          Max number of images read arguments:    128
                          Max number of images write arguments:    64
                          Max image 2D width:    16384
                          Max image 2D height:    16384
                          Max image 3D width:    2048
                          Max image 3D height:    2048
                          Max image 3D depth:    2048
                          Max samplers within kernel:    16
                          Max size of kernel argument:    1024
                          Alignment (bits) of base address:    2048

                          Minimum alignment (bytes) for any datatype:     128

                          Single precision floating point capability

                        Denorms:    No
                        Quiet NaNs:    Yes
                        Round to nearest even:    Yes
                        Round to zero:    Yes
                        Round to +ve and infinity:    Yes
                        IEEE754-2008 fused multiply-add:    Yes
                          Cache type:    Read/Write
                          Cache line size:    64
                          Cache size:    16384
                          Global memory size:    2569011200
                          Constant buffer size:    65536
                          Max number of constant args:    8
                          Local memory type:    Scratchpad
                          Local memory size:    32768
                          Max pipe arguments:    16
                          Max pipe active reservations:    16
                          Max pipe packet size:    1206806118
                          Max global variable size:    1086125312

                          Max global variable preferred total size:     2569011200

                          Max read/write image args:    64
                          Max on device events:    1024
                          Queue on device max size:    524288
                          Max on device queues:    1
                          Queue on device preferred size:    16384
                          SVM capabilities: 
                        Coarse grain buffer:    Yes
                        Fine grain buffer:    Yes
                        Fine grain system:    No
                        Atomics:    No
                          Preferred platform atomic alignment:    0
                          Preferred global atomic alignment:    0
                          Preferred local atomic alignment:    0

                          Kernel Preferred work group size multiple:     64

                          Error correction support:    0
                          Unified memory for Host and Device:    1
                          Profiling timer resolution:    1
                          Device endianess:    Little
                          Available:    Yes
                          Compiler available:    Yes
                          Execution capabilities: 
                        Execute OpenCL kernels:    Yes
                        Execute native function:    No
                          Queue on Host properties: 
                        Out-of-Order:    No
                        Profiling :    Yes
                          Queue on Device properties: 
                        Out-of-Order:    Yes
                        Profiling :    Yes
                          Platform ID:    0x7f61e4e1cfd0
                          Name:    Spectre
                          Vendor:    Advanced Micro Devices, Inc.
                          Device OpenCL C version:    OpenCL C 2.0
                          Driver version:    1642.5 (VM)
                          Profile:    FULL_PROFILE
                          Version:    OpenCL 2.0 AMD-APP (1642.5)
                          Extensions:    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images

                         

                         

                          Device Type:    CL_DEVICE_TYPE_CPU
                          Vendor ID:    1002h
                          Board name: 
                          Max compute units:    4
                          Max work items dimensions:    3
                        Max work items[0]:    1024
                        Max work items[1]:    1024
                        Max work items[2]:    1024
                          Max work group size:    1024
                          Preferred vector width char:    16
                          Preferred vector width short:    8
                          Preferred vector width int:    4
                          Preferred vector width long:    2
                          Preferred vector width float:    8
                          Preferred vector width double:    4
                          Native vector width char:    16
                          Native vector width short:    8
                          Native vector width int:    4
                          Native vector width long:    2
                          Native vector width float:    8
                          Native vector width double:    4
                          Max clock frequency:    4200Mhz
                          Address bits:    64
                          Max memory allocation:    3641622528
                          Image support:    Yes
                          Max number of images read arguments:    128
                          Max number of images write arguments:    64
                          Max image 2D width:    8192
                          Max image 2D height:    8192
                          Max image 3D width:    2048
                          Max image 3D height:    2048
                          Max image 3D depth:    2048
                          Max samplers within kernel:    16
                          Max size of kernel argument:    4096
                          Alignment (bits) of base address:    1024

                          Minimum alignment (bytes) for any datatype:     128

                          Single precision floating point capability

                        Denorms:    Yes
                        Quiet NaNs:    Yes
                        Round to nearest even:    Yes
                        Round to zero:    Yes
                        Round to +ve and infinity:    Yes
                        IEEE754-2008 fused multiply-add:    Yes
                          Cache type:    Read/Write
                          Cache line size:    64
                          Cache size:    16384
                          Global memory size:    14566490112
                          Constant buffer size:    65536
                          Max number of constant args:    8
                          Local memory type:    Global
                          Local memory size:    32768
                          Max pipe arguments:    16
                          Max pipe active reservations:    16
                          Max pipe packet size:    3641622528
                          Max global variable size:    1879048192

                          Max global variable preferred total size:     1879048192

                          Max read/write image args:    64
                          Max on device events:    0
                          Queue on device max size:    0
                          Max on device queues:    0
                          Queue on device preferred size:    0
                          SVM capabilities: 
                        Coarse grain buffer:    Yes
                        Fine grain buffer:    Yes
                        Fine grain system:    Yes
                        Atomics:    Yes
                          Preferred platform atomic alignment:    0
                          Preferred global atomic alignment:    0
                          Preferred local atomic alignment:    0

                          Kernel Preferred work group size multiple:     1

                          Error correction support:    0
                          Unified memory for Host and Device:    1
                          Profiling timer resolution:    1
                          Device endianess:    Little
                          Available:    Yes
                          Compiler available:    Yes
                          Execution capabilities: 
                        Execute OpenCL kernels:    Yes
                        Execute native function:    Yes
                          Queue on Host properties: 
                        Out-of-Order:    No
                        Profiling :    Yes
                          Queue on Device properties: 
                        Out-of-Order:    No
                        Profiling :    No
                          Platform ID:    0x7f61e4e1cfd0
                          Name:    AMD A10-7850K APU with Radeon(TM) R7 Graphics
                          Vendor:    AuthenticAMD
                          Device OpenCL C version:    OpenCL C 1.2
                          Driver version:    1642.5 (sse2,avx,fma4)
                          Profile:    FULL_PROFILE
                          Version:    OpenCL 1.2 AMD-APP (1642.5)
                          Extensions:    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event

                         

                         

                         

                        # dmidecode 2.12

                        SMBIOS 2.7 present.

                        22 structures occupying 1358 bytes.

                        Table at 0x000EBF50.

                         

                        Handle 0x0000, DMI type 0, 24 bytes

                        BIOS Information

                            Vendor: American Megatrends Inc.

                            Version: P2.10

                            Release Date: 02/20/2014

                            Address: 0xF0000

                            Runtime Size: 64 kB

                            ROM Size: 8192 kB

                            Characteristics:

                                PCI is supported

                                BIOS is upgradeable

                                BIOS shadowing is allowed

                                Boot from CD is supported

                                Selectable boot is supported

                                BIOS ROM is socketed

                                EDD is supported

                                5.25"/1.2 MB floppy services are supported (int 13h)

                                3.5"/720 kB floppy services are supported (int 13h)

                                3.5"/2.88 MB floppy services are supported (int 13h)

                                Print screen service is supported (int 5h)

                                8042 keyboard services are supported (int 9h)

                                Serial services are supported (int 14h)

                                Printer services are supported (int 17h)

                                ACPI is supported

                                USB legacy is supported

                                BIOS boot specification is supported

                                Targeted content distribution is supported

                                UEFI is supported

                            BIOS Revision: 4.6

                         

                        Handle 0x0001, DMI type 1, 27 bytes

                        System Information

                            Manufacturer: To Be Filled By O.E.M.

                            Product Name: To Be Filled By O.E.M.

                            Version: To Be Filled By O.E.M.

                            Serial Number: To Be Filled By O.E.M.

                            UUID: 03000200-0400-0500-0006-000700080009

                            Wake-up Type: Power Switch

                            SKU Number: To Be Filled By O.E.M.

                            Family: To Be Filled By O.E.M.

                         

                        Handle 0x0002, DMI type 2, 15 bytes

                        Base Board Information

                            Manufacturer: ASRock

                            Product Name: FM2A88M Extreme4+

                            Version:                     

                            Serial Number: E80-3A010000081

                            Asset Tag:                     

                            Features:

                                Board is a hosting board

                                Board is replaceable

                            Location In Chassis:                     

                            Chassis Handle: 0x0003

                            Type: Motherboard

                            Contained Object Handles: 0

                         

                        Handle 0x0003, DMI type 3, 22 bytes

                        Chassis Information

                            Manufacturer: To Be Filled By O.E.M.

                            Type: Desktop

                            Lock: Not Present

                            Version: To Be Filled By O.E.M.

                            Serial Number: To Be Filled By O.E.M.

                            Asset Tag: To Be Filled By O.E.M.

                            Boot-up State: Safe

                            Power Supply State: Safe

                            Thermal State: Safe

                            Security Status: None

                            OEM Information: 0x00000000

                            Height: Unspecified

                            Number Of Power Cords: 1

                            Contained Elements: 0

                            SKU Number: To be filled by O.E.M.

                         

                        Handle 0x0004, DMI type 9, 17 bytes

                        System Slot Information

                            Designation: PCI1

                            Type: 32-bit PCI

                            Current Usage: In Use

                            Length: Short

                            ID: 1

                            Characteristics:

                                3.3 V is provided

                                Opening is shared

                                PME signal is supported

                         

                        Handle 0x0005, DMI type 9, 17 bytes

                        System Slot Information

                            Designation: PCIE1

                            Type: x16 PCI Express

                            Current Usage: In Use

                            Length: Short

                            ID: 17

                            Characteristics:

                                3.3 V is provided

                                Opening is shared

                                PME signal is supported

                            Bus Address: 0000:00:15.0

                         

                        Handle 0x0006, DMI type 9, 17 bytes

                        System Slot Information

                            Designation: PCIE2

                            Type: x1 PCI Express

                            Current Usage: In Use

                            Length: Short

                            ID: 18

                            Characteristics:

                                3.3 V is provided

                                Opening is shared

                                PME signal is supported

                            Bus Address: 0000:00:02.0

                         

                        Handle 0x0007, DMI type 9, 17 bytes

                        System Slot Information

                            Designation: PCIE3

                            Type: x4 PCI Express

                            Current Usage: In Use

                            Length: Short

                            ID: 19

                            Characteristics:

                                3.3 V is provided

                                Opening is shared

                                PME signal is supported

                            Bus Address: 0000:00:15.1

                         

                        Handle 0x0008, DMI type 11, 5 bytes

                        OEM Strings

                            String 1: To Be Filled By O.E.M.

                         

                        Handle 0x0009, DMI type 7, 19 bytes

                        Cache Information

                            Socket Designation: L1 CACHE

                            Configuration: Enabled, Not Socketed, Level 1

                            Operational Mode: Write Back

                            Location: Internal

                            Installed Size: 256 kB

                            Maximum Size: 256 kB

                            Supported SRAM Types:

                                Pipeline Burst

                            Installed SRAM Type: Pipeline Burst

                            Speed: 1 ns

                            Error Correction Type: Multi-bit ECC

                            System Type: Unified

                            Associativity: 2-way Set-associative

                         

                        Handle 0x000A, DMI type 7, 19 bytes

                        Cache Information

                            Socket Designation: L2 CACHE

                            Configuration: Enabled, Not Socketed, Level 2

                            Operational Mode: Write Back

                            Location: Internal

                            Installed Size: 4096 kB

                            Maximum Size: 4096 kB

                            Supported SRAM Types:

                                Pipeline Burst

                            Installed SRAM Type: Pipeline Burst

                            Speed: 1 ns

                            Error Correction Type: Multi-bit ECC

                            System Type: Unified

                            Associativity: 16-way Set-associative

                         

                        Handle 0x0013, DMI type 32, 20 bytes

                        System Boot Information

                            Status: No errors detected

                         

                        Handle 0x0015, DMI type 16, 23 bytes

                        Physical Memory Array

                            Location: System Board Or Motherboard

                            Use: System Memory

                            Error Correction Type: None

                            Maximum Capacity: 16 GB

                            Error Information Handle: Not Provided

                            Number Of Devices: 4

                         

                        Handle 0x0016, DMI type 19, 31 bytes

                        Memory Array Mapped Address

                            Starting Address: 0x00000000000

                            Ending Address: 0x003FFFFFFFF

                            Range Size: 16 GB

                            Physical Array Handle: 0x0015

                            Partition Width: 255

                         

                        Handle 0x0017, DMI type 17, 34 bytes

                        Memory Device

                            Array Handle: 0x0015

                            Error Information Handle: Not Provided

                            Total Width: 64 bits

                            Data Width: 64 bits

                            Size: 8192 MB

                            Form Factor: DIMM

                            Set: None

                            Locator: DIMM 0

                            Bank Locator: CHANNEL A

                            Type: DDR3

                            Type Detail: Synchronous Unbuffered (Unregistered)

                            Speed: 2400 MHz

                            Manufacturer: <BAD INDEX>

                            Serial Number: 00000000

                            Asset Tag: A1_AssetTagNum0

                            Part Number: Xtreem-LV-2400  

                            Rank: 2

                            Configured Clock Speed: 2400 MHz

                         

                        Handle 0x0018, DMI type 20, 35 bytes

                        Memory Device Mapped Address

                            Starting Address: 0x00000000000

                            Ending Address: 0x001FFFFFFFF

                            Range Size: 8 GB

                            Physical Device Handle: 0x0017

                            Memory Array Mapped Address Handle: 0x0016

                            Partition Row Position: 1

                         

                        Handle 0x0019, DMI type 17, 34 bytes

                        Memory Device

                            Array Handle: 0x0015

                            Error Information Handle: Not Provided

                            Total Width: 64 bits

                            Data Width: 64 bits

                            Size: No Module Installed

                            Form Factor: SODIMM

                            Set: None

                            Locator: DIMM 1

                            Bank Locator: CHANNEL A

                            Type: DDR3

                            Type Detail: None

                            Speed: Unknown

                            Manufacturer: A1_Manufacturer1

                            Serial Number: A1_SerialNum1

                            Asset Tag: A1_AssetTagNum1

                            Part Number: A1_PartNum1

                            Rank: Unknown

                            Configured Clock Speed: Unknown

                         

                        Handle 0x001A, DMI type 17, 34 bytes

                        Memory Device

                            Array Handle: 0x0015

                            Error Information Handle: Not Provided

                            Total Width: 64 bits

                            Data Width: 64 bits

                            Size: 8192 MB

                            Form Factor: DIMM

                            Set: None

                            Locator: DIMM 0

                            Bank Locator: CHANNEL B

                            Type: DDR3

                            Type Detail: Synchronous Unbuffered (Unregistered)

                            Speed: 2400 MHz

                            Manufacturer: <BAD INDEX>

                            Serial Number: 00000000

                            Asset Tag: A1_AssetTagNum2

                            Part Number: Xtreem-LV-2400  

                            Rank: 2

                            Configured Clock Speed: 2400 MHz

                         

                        Handle 0x001B, DMI type 20, 35 bytes

                        Memory Device Mapped Address

                            Starting Address: 0x00200000000

                            Ending Address: 0x003FFFFFFFF

                            Range Size: 8 GB

                            Physical Device Handle: 0x0019

                            Memory Array Mapped Address Handle: 0x0016

                            Partition Row Position: 1

                         

                        Handle 0x001C, DMI type 17, 34 bytes

                        Memory Device

                            Array Handle: 0x0015

                            Error Information Handle: Not Provided

                            Total Width: 64 bits

                            Data Width: 64 bits

                            Size: No Module Installed

                            Form Factor: SODIMM

                            Set: None

                            Locator: DIMM 1

                            Bank Locator: CHANNEL B

                            Type: DDR3

                            Type Detail: None

                            Speed: Unknown

                            Manufacturer: A1_Manufacturer3

                            Serial Number: A1_SerialNum3

                            Asset Tag: A1_AssetTagNum3

                            Part Number: A1_PartNum3

                            Rank: Unknown

                            Configured Clock Speed: Unknown

                         

                        Handle 0x001F, DMI type 4, 42 bytes

                        Processor Information

                            Socket Designation: CPUSocket

                            Type: Central Processor

                            Family: A-Series

                            Manufacturer: AMD

                            ID: 01 0F 63 00 FF FB 8B 17

                            Signature: Family 21, Model 48, Stepping 1

                            Flags:

                                FPU (Floating-point unit on-chip)

                                VME (Virtual mode extension)

                                DE (Debugging extension)

                                PSE (Page size extension)

                                TSC (Time stamp counter)

                                MSR (Model specific registers)

                                PAE (Physical address extension)

                                MCE (Machine check exception)

                                CX8 (CMPXCHG8 instruction supported)

                                APIC (On-chip APIC hardware supported)

                                SEP (Fast system call)

                                MTRR (Memory type range registers)

                                PGE (Page global enable)

                                MCA (Machine check architecture)

                                CMOV (Conditional move instruction supported)

                                PAT (Page attribute table)

                                PSE-36 (36-bit page size extension)

                                CLFSH (CLFLUSH instruction supported)

                                MMX (MMX technology supported)

                                FXSR (FXSAVE and FXSTOR instructions supported)

                                SSE (Streaming SIMD extensions)

                                SSE2 (Streaming SIMD extensions 2)

                                HTT (Multi-threading)

                            Version: AMD A10-7850K APU with Radeon(TM) R7 Graphics

                            Voltage: 1.3 V

                            External Clock: 100 MHz

                            Max Speed: 4200 MHz

                            Current Speed: 4200 MHz

                            Status: Populated, Enabled

                            Upgrade: Socket FM2

                            L1 Cache Handle: 0x0009

                            L2 Cache Handle: 0x000A

                            L3 Cache Handle: Not Provided

                            Serial Number: Not Specified

                            Asset Tag: Not Specified

                            Part Number: Not Specified

                            Core Count: 4

                            Core Enabled: 4

                            Thread Count: 4

                            Characteristics:

                                64-bit capable

                         

                        Handle 0x0020, DMI type 127, 4 bytes

                        End Of Table
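For context on the figures in this thread, the memory entries in the dump above allow a rough estimate of the theoretical peak DRAM bandwidth. This is only a sketch under stated assumptions: the "2400 MHz" reported by dmidecode is treated as 2400 MT/s (DDR3-2400), the two populated 64-bit DIMMs are assumed to run in dual-channel mode, and the helper name `ddr_peak_gbps` is made up for illustration. The measured values are copied from the benchmark output at the top of the thread.

```python
def ddr_peak_gbps(transfers_per_sec, bus_width_bits, channels):
    """Theoretical peak DRAM bandwidth in decimal GB/s.

    Hypothetical helper: peak = transfers/s * bytes per transfer * channels.
    """
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_sec * bytes_per_transfer * channels / 1e9


# Assumptions: DDR3-2400 (2400 MT/s), 64-bit data width, two channels.
peak = ddr_peak_gbps(2400e6, 64, 2)
print(f"Theoretical peak: {peak:.1f} GB/s")

# Measured values as reported by BufferBandwidth earlier in this thread.
measured = {
    "CPU read": 6.28,
    "memcpy()": 8.81,
    "GPU kernel read": 29.747,
    "GPU kernel write": 23.172,
}
for label, gbps in measured.items():
    print(f"{label:18s} {gbps:7.3f} GB/s  ({100.0 * gbps / peak:.0f}% of peak)")
```

Under these assumptions the GPU kernel numbers sit much closer to the estimated peak than the single-threaded CPU numbers do, which is the gap the original question is about.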

                         

                          • Re: Re: BufferBandwidth results on Kaveri
                            dipak

                            Hi Evren,

                            I saw similar results when running the sample with the Omega driver on Kaveri under Red Hat 7 (64-bit). I've reported the issue to the dev team, and they are working on it. I'll get back to you as soon as I have an update. Thanks for pointing out the issue.

                             

                            Regards,

                            • Re: BufferBandwidth results on Kaveri
                              dipak

                              Hi Evren,

                              My apologies for this delayed reply.

                              We ran a few experiments on our end. The BufferBandwidth sample was actually intended to measure memory bandwidth during map/unmap operations, not to benchmark read/write bandwidth from kernels.

                              Read/write bandwidth from kernels is covered by the GlobalMemoryBandwidth benchmark sample, which was written for that purpose. It reports global memory bandwidth under various access patterns, such as coalesced/uncoalesced, strided, and random.

                              Based on your feedback, we will modify the BufferBandwidth sample so that it shows only the relevant information about map/unmap memory bandwidth.
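To make the map/unmap figures easier to interpret, here is a small sketch that converts between a reported GB/s value and the elapsed time it implies. It assumes the sample reports bandwidth as buffer size divided by elapsed time in decimal gigabytes, which is not confirmed in this thread; the function names `gbps` and `implied_seconds` are hypothetical. The buffer size and GB/s values are taken from the run posted at the top of the thread.

```python
def gbps(num_bytes, seconds):
    """Bandwidth in decimal GB/s for num_bytes moved in the given time."""
    return num_bytes / seconds / 1e9


def implied_seconds(num_bytes, reported_gbps):
    """Inverse of gbps(): elapsed time implied by a reported GB/s figure."""
    return num_bytes / (reported_gbps * 1e9)


BUFFER_SIZE = 33554432  # "Buffer size" from the posted run (32 MiB)

# Reported values from section 1 of the posted output.
reported = {
    "clEnqueueMapBuffer -- WRITE": 2331.320,
    "memset()": 6.717,
    "clEnqueueUnmapMemObject()": 10.404,
}
for label, value in reported.items():
    micros = implied_seconds(BUFFER_SIZE, value) * 1e6
    print(f"{label:30s} {value:9.3f} GB/s  -> {micros:10.2f} us")
```

Read this way, a very large figure such as the one for clEnqueueMapBuffer corresponds to a call that returned in a few microseconds, which suggests that little or no data was copied during the map, whereas the memset() and unmap figures correspond to milliseconds of actual work.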


                              Thanks,