yurtesen
Miniboss

BufferBandwidth results on Kaveri

Hello,

I was wondering why the CPU reads/writes are so slow in the BufferBandwidth example, regardless of whether the memory is allocated on the host or not. Also, why are the GPU writes slow when the kernel is writing to host memory?













































Device  0        Spectre
Build:           release
GPU work items:  8192
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY
outputBuffer:    CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution 256.22  ns
Page fault       942.38  ns
CPU read         6.28 GB/s
memcpy()         8.81 GB/s
memset(,1,)      6.87 GB/s
memset(,0,)      6.87 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS)     | 2331.320
---------------------------------------|---------------
memset() (GBPS)                        | 6.717
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 10.404

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 29.747

Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 23.172

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)      | 10.927
---------------------------------------|---------------
CPU read (GBPS)                        | 6.228
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 645.145















































Device  0        Spectre
Build:           release
GPU work items:  8192
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer:    CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

Host baseline (naive):

Timer resolution 256.48  ns
Page fault       974.34  ns
CPU read         6.15 GB/s
memcpy()         8.82 GB/s
memset(,1,)      6.73 GB/s
memset(,0,)      6.72 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS)     | 2880.703
---------------------------------------|---------------
memset() (GBPS)                        | 9.079
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 917.657

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 28.579

Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 8.098

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)      | 3166.840
---------------------------------------|---------------
CPU read (GBPS)                        | 6.195
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 794.376



Thanks,

Evren

dipak
Big Boss

Hi Evren,

I guess this is somewhat expected. Most numbers are similar in both scenarios, except in the following cases. Please find my comments on those cases below.

Measurement (GBPS) | GPU memory | ALLOC_Host memory | Comments
-------------------|------------|-------------------|---------
Host mapped write to inputBuffer, clEnqueueUnmapMemObject() | 10.404 | 917.657 | The GPU case is lower because the unmap has to transfer the data to GPU memory, and host writes to GPU memory are slower than writes to host memory.
GPU kernel write to outputBuffer, clEnqueueNDRangeKernel() | 23.172 | 8.098 | The ALLOC_Host case is lower because writes from the GPU to host memory go through a slower memory bus (Onion) than the GPU memory bus (Garlic).
Host mapped read of outputBuffer, clEnqueueMapBuffer -- READ | 10.927 | 3166.84 | The GPU case is slower because the host has to read from GPU memory, which is much slower than reading from host memory.
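To put the three rows above in perspective, it can help to turn each reported rate back into the time one pass over the 32 MiB buffer would take. The following is only a rough sketch: it assumes the sample's "GBPS" means 1e9 bytes per second, and the helper names (`implied_microseconds`, `summarize`) are made up for this illustration and are not part of the SDK sample.

```python
# Assumption: "GBPS" in the sample output means 1e9 bytes per second.
BUFFER_BYTES = 33554432  # "Buffer size" from the posted runs

# (GPU memory GBPS, ALLOC_HOST_PTR GBPS), copied from the posted results
RESULTS = {
    "1. unmap after host write": (10.404, 917.657),
    "3. GPU kernel write": (23.172, 8.098),
    "4. map for host read": (10.927, 3166.840),
}


def implied_microseconds(gbps, nbytes=BUFFER_BYTES):
    """Time one pass over nbytes would take at the reported rate."""
    return nbytes / (gbps * 1e9) * 1e6


def summarize(results):
    """Return (name, GPU-memory time, ALLOC_HOST_PTR time) in microseconds."""
    return [
        (name, implied_microseconds(gpu), implied_microseconds(host))
        for name, (gpu, host) in results.items()
    ]


for name, gpu_us, host_us in summarize(RESULTS):
    print(f"{name:28s} GPU memory: {gpu_us:9.1f} us   "
          f"ALLOC_HOST_PTR: {host_us:9.1f} us")
```

Rates in the hundreds or thousands of GB/s correspond to only tens of microseconds for the whole buffer, which looks more like a bookkeeping operation than a real 32 MiB copy. For comparison, dual-channel DDR3-2400 (as listed in the dmidecode output in this thread) peaks at 38.4 GB/s in theory.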

Regards,


Hmm, but why are reads from ALLOC_Host memory fast while writes are slow?

Also, I modified BufferBandwidth to see how the CPU would perform as the OpenCL device (here the "GPU kernel" reads/writes are made by a kernel running on the CPU). The kernel read/write speeds are super low. Is this normal? Why?













































Device  0        AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
Build:           release
GPU work items:  4096
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY
outputBuffer:    CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution 256.64  ns
Page fault       971.97  ns
CPU read         6.16 GB/s
memcpy()         4.08 GB/s
memset(,1,)      6.69 GB/s
memset(,0,)      6.71 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS)     | 4060.750
---------------------------------------|---------------
memset() (GBPS)                        | 6.667
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 726.832

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 0.709

Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 0.347

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)      | 1201.271
---------------------------------------|---------------
CPU read (GBPS)                        | 6.247
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 706.588

Verification Passed!




0 Likes

Hello Dipak,

I also tried the new version of BufferBandwidth from the new SDK. Kernel reads are still super slow. Shouldn't they be faster?


$ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu
Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x262bea0
Build:               release
GPU work items:      4096
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY
outputBuffer:        CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution     1000.52 ns
Page fault           836.79  ns
CPU read             6.38 GB/s
memcpy()             8.79 GB/s
memset(,1,)          6.70 GB/s
memset(,0,)          6.70 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS)     | 4712.193
---------------------------------------|---------------
memset() (GBPS)                        | 6.675
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 555.184

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 0.709

Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 0.349

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)      | 1100.001
---------------------------------------|---------------
CPU read (GBPS)                        | 6.270
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 659.707

Verification Passed!

Passed!



Also, with the input and output buffers allocated with CL_MEM_ALLOC_HOST_PTR:


$ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu -if 5 -of 5 -cf 5
Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x27e0c60
Build:               release
GPU work items:      4096
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        20
inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

Host baseline (naive):

Timer resolution     1000.65 ns
Page fault           875.38  ns
CPU read             6.38 GB/s
memcpy()             8.87 GB/s
memset(,1,)          6.93 GB/s
memset(,0,)          6.92 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS)     | 3847.436
---------------------------------------|---------------
memset() (GBPS)                        | 6.853
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 588.324

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 0.720

Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)        | 0.352

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)      | 1152.796
---------------------------------------|---------------
CPU read (GBPS)                        | 6.320
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)       | 707.233

Verification Passed!

Passed!



0 Likes

Could you please share your setup details, such as the OS and Catalyst driver version? Please also share your clinfo output.

Regards,


Dipak, it is a normal Kaveri system with an ASRock motherboard. The clinfo and dmidecode output is below. I am using the Omega drivers with the newest 3.0 beta SDK (older SDKs give the same result). The OS is Ubuntu 14.04.










































































Number of platforms:    2
  Platform Profile:    FULL_PROFILE
  Platform Version:    OpenCL 1.2 LINUX
  Platform Name:    Intel(R) OpenCL
  Platform Vendor:    Intel(R) Corporation
  Platform Extensions:    cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread
  Platform Profile:    FULL_PROFILE
  Platform Version:    OpenCL 2.0 AMD-APP (1642.5)
  Platform Name:    AMD Accelerated Parallel Processing
  Platform Vendor:    Advanced Micro Devices, Inc.
  Platform Extensions:    cl_khr_icd cl_amd_event_callback cl_amd_offline_devices













































































































































































































  Platform Name:    Intel(R) OpenCL
Number of devices:    1
  Device Type:    CL_DEVICE_TYPE_CPU
  Vendor ID:    8086h
  Max compute units:    4
  Max work items dimensions:    3
Max work items[0]:    1024
Max work items[1]:    1024
Max work items[2]:    1024
  Max work group size:    1024
  Preferred vector width char:    16
  Preferred vector width short:    8
  Preferred vector width int:    4
  Preferred vector width long:    2
  Preferred vector width float:    4
  Preferred vector width double:    2
  Native vector width char:    16
  Native vector width short:    8
  Native vector width int:    4
  Native vector width long:    2
  Native vector width float:    4
  Native vector width double:    2
  Max clock frequency:    0Mhz
  Address bits:    64
  Max memory allocation:    3641622528
  Image support:    Yes
  Max number of images read arguments:    480
  Max number of images write arguments:    480
  Max image 2D width:    16384
  Max image 2D height:    16384
  Max image 3D width:    2048
  Max image 3D height:    2048
  Max image 3D depth:    2048
  Max samplers within kernel:    480
  Max size of kernel argument:    3840
  Alignment (bits) of base address:    1024

  Minimum alignment (bytes) for any datatype:     128


  Single precision floating point capability






























































































Denorms:    Yes
Quiet NaNs:    Yes
Round to nearest even:    Yes
Round to zero:    No
Round to +ve and infinity:    No
IEEE754-2008 fused multiply-add:    No
  Cache type:    Read/Write
  Cache line size:    64
  Cache size:    2097152
  Global memory size:    14566490112
  Constant buffer size:    131072
  Max number of constant args:    480
  Local memory type:    Global
  Local memory size:    32768

  Kernel Preferred work group size multiple:     128



































































































































  Error correction support:    0
  Unified memory for Host and Device:    1
  Profiling timer resolution:    1
  Device endianess:    Little
  Available:    Yes
  Compiler available:    Yes
  Execution capabilities: 
Execute OpenCL kernels:    Yes
Execute native function:    Yes
  Queue on Host properties: 
Out-of-Order:    Yes
Profiling :    Yes
  Platform ID:    0x256e700
  Name:    AMD A10-7850K APU with Radeon(TM) R7 Graphics
  Vendor:    Intel(R) Corporation
  Device OpenCL C version:    OpenCL C 1.2
  Driver version:    1.2
  Profile:    FULL_PROFILE
  Version:    OpenCL 1.2 (Build 56860)
  Extensions:    cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread


























































































































































































































  Platform Name:    AMD Accelerated Parallel Processing
Number of devices:    2
  Device Type:    CL_DEVICE_TYPE_GPU
  Vendor ID:    1002h
  Board name:    AMD Radeon(TM) R7 Graphics
  Device Topology:    PCI[ B#0, D#1, F#0 ]
  Max compute units:    8
  Max work items dimensions:    3
Max work items[0]:    256
Max work items[1]:    256
Max work items[2]:    256
  Max work group size:    256
  Preferred vector width char:    4
  Preferred vector width short:    2
  Preferred vector width int:    1
  Preferred vector width long:    1
  Preferred vector width float:    1
  Preferred vector width double:    1
  Native vector width char:    4
  Native vector width short:    2
  Native vector width int:    1
  Native vector width long:    1
  Native vector width float:    1
  Native vector width double:    1
  Max clock frequency:    900Mhz
  Address bits:    64
  Max memory allocation:    1206806118
  Image support:    Yes
  Max number of images read arguments:    128
  Max number of images write arguments:    64
  Max image 2D width:    16384
  Max image 2D height:    16384
  Max image 3D width:    2048
  Max image 3D height:    2048
  Max image 3D depth:    2048
  Max samplers within kernel:    16
  Max size of kernel argument:    1024
  Alignment (bits) of base address:    2048

  Minimum alignment (bytes) for any datatype:     128


  Single precision floating point capability




















































































































Denorms:    No
Quiet NaNs:    Yes
Round to nearest even:    Yes
Round to zero:    Yes
Round to +ve and infinity:    Yes
IEEE754-2008 fused multiply-add:    Yes
  Cache type:    Read/Write
  Cache line size:    64
  Cache size:    16384
  Global memory size:    2569011200
  Constant buffer size:    65536
  Max number of constant args:    8
  Local memory type:    Scratchpad
  Local memory size:    32768
  Max pipe arguments:    16
  Max pipe active reservations:    16
  Max pipe packet size:    1206806118
  Max global variable size:    1086125312

  Max global variable preferred total size:     2569011200













































































  Max read/write image args:    64
  Max on device events:    1024
  Queue on device max size:    524288
  Max on device queues:    1
  Queue on device preferred size:    16384
  SVM capabilities: 
Coarse grain buffer:    Yes
Fine grain buffer:    Yes
Fine grain system:    No
Atomics:    No
  Preferred platform atomic alignment:    0
  Preferred global atomic alignment:    0
  Preferred local atomic alignment:    0

  Kernel Preferred work group size multiple:     64
























































































































































  Error correction support:    0
  Unified memory for Host and Device:    1
  Profiling timer resolution:    1
  Device endianess:    Little
  Available:    Yes
  Compiler available:    Yes
  Execution capabilities: 
Execute OpenCL kernels:    Yes
Execute native function:    No
  Queue on Host properties: 
Out-of-Order:    No
Profiling :    Yes
  Queue on Device properties: 
Out-of-Order:    Yes
Profiling :    Yes
  Platform ID:    0x7f61e4e1cfd0
  Name:    Spectre
  Vendor:    Advanced Micro Devices, Inc.
  Device OpenCL C version:    OpenCL C 2.0
  Driver version:    1642.5 (VM)
  Profile:    FULL_PROFILE
  Version:    OpenCL 2.0 AMD-APP (1642.5)
  Extensions:    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images








































































































































































































  Device Type:    CL_DEVICE_TYPE_CPU
  Vendor ID:    1002h
  Board name: 
  Max compute units:    4
  Max work items dimensions:    3
Max work items[0]:    1024
Max work items[1]:    1024
Max work items[2]:    1024
  Max work group size:    1024
  Preferred vector width char:    16
  Preferred vector width short:    8
  Preferred vector width int:    4
  Preferred vector width long:    2
  Preferred vector width float:    8
  Preferred vector width double:    4
  Native vector width char:    16
  Native vector width short:    8
  Native vector width int:    4
  Native vector width long:    2
  Native vector width float:    8
  Native vector width double:    4
  Max clock frequency:    4200Mhz
  Address bits:    64
  Max memory allocation:    3641622528
  Image support:    Yes
  Max number of images read arguments:    128
  Max number of images write arguments:    64
  Max image 2D width:    8192
  Max image 2D height:    8192
  Max image 3D width:    2048
  Max image 3D height:    2048
  Max image 3D depth:    2048
  Max samplers within kernel:    16
  Max size of kernel argument:    4096
  Alignment (bits) of base address:    1024

  Minimum alignment (bytes) for any datatype:     128


  Single precision floating point capability




















































































































Denorms:    Yes
Quiet NaNs:    Yes
Round to nearest even:    Yes
Round to zero:    Yes
Round to +ve and infinity:    Yes
IEEE754-2008 fused multiply-add:    Yes
  Cache type:    Read/Write
  Cache line size:    64
  Cache size:    16384
  Global memory size:    14566490112
  Constant buffer size:    65536
  Max number of constant args:    8
  Local memory type:    Global
  Local memory size:    32768
  Max pipe arguments:    16
  Max pipe active reservations:    16
  Max pipe packet size:    3641622528
  Max global variable size:    1879048192

  Max global variable preferred total size:     1879048192













































































  Max read/write image args:    64
  Max on device events:    0
  Queue on device max size:    0
  Max on device queues:    0
  Queue on device preferred size:    0
  SVM capabilities: 
Coarse grain buffer:    Yes
Fine grain buffer:    Yes
Fine grain system:    Yes
Atomics:    Yes
  Preferred platform atomic alignment:    0
  Preferred global atomic alignment:    0
  Preferred local atomic alignment:    0

  Kernel Preferred work group size multiple:     1
























































































































































  Error correction support:    0
  Unified memory for Host and Device:    1
  Profiling timer resolution:    1
  Device endianess:    Little
  Available:    Yes
  Compiler available:    Yes
  Execution capabilities: 
Execute OpenCL kernels:    Yes
Execute native function:    Yes
  Queue on Host properties: 
Out-of-Order:    No
Profiling :    Yes
  Queue on Device properties: 
Out-of-Order:    No
Profiling :    No
  Platform ID:    0x7f61e4e1cfd0
  Name:    AMD A10-7850K APU with Radeon(TM) R7 Graphics
  Vendor:    AuthenticAMD
  Device OpenCL C version:    OpenCL C 1.2
  Driver version:    1642.5 (sse2,avx,fma4)
  Profile:    FULL_PROFILE
  Version:    OpenCL 1.2 AMD-APP (1642.5)
  Extensions:    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event





# dmidecode 2.12


SMBIOS 2.7 present.


22 structures occupying 1358 bytes.


Table at 0x000EBF50.



Handle 0x0000, DMI type 0, 24 bytes


BIOS Information


    Vendor: American Megatrends Inc.


    Version: P2.10


    Release Date: 02/20/2014


    Address: 0xF0000


    Runtime Size: 64 kB


    ROM Size: 8192 kB


    Characteristics:


        PCI is supported


        BIOS is upgradeable


        BIOS shadowing is allowed


        Boot from CD is supported


        Selectable boot is supported


        BIOS ROM is socketed


        EDD is supported


        5.25"/1.2 MB floppy services are supported (int 13h)


        3.5"/720 kB floppy services are supported (int 13h)


        3.5"/2.88 MB floppy services are supported (int 13h)


        Print screen service is supported (int 5h)


        8042 keyboard services are supported (int 9h)


        Serial services are supported (int 14h)


        Printer services are supported (int 17h)


        ACPI is supported


        USB legacy is supported


        BIOS boot specification is supported


        Targeted content distribution is supported


        UEFI is supported


    BIOS Revision: 4.6



Handle 0x0001, DMI type 1, 27 bytes


System Information


    Manufacturer: To Be Filled By O.E.M.


    Product Name: To Be Filled By O.E.M.


    Version: To Be Filled By O.E.M.


    Serial Number: To Be Filled By O.E.M.


    UUID: 03000200-0400-0500-0006-000700080009


    Wake-up Type: Power Switch


    SKU Number: To Be Filled By O.E.M.


    Family: To Be Filled By O.E.M.



Handle 0x0002, DMI type 2, 15 bytes


Base Board Information


    Manufacturer: ASRock


    Product Name: FM2A88M Extreme4+


    Version:                     


    Serial Number: E80-3A010000081


    Asset Tag:                     


    Features:


        Board is a hosting board


        Board is replaceable


    Location In Chassis:                     


    Chassis Handle: 0x0003


    Type: Motherboard


    Contained Object Handles: 0



Handle 0x0003, DMI type 3, 22 bytes


Chassis Information


    Manufacturer: To Be Filled By O.E.M.


    Type: Desktop


    Lock: Not Present


    Version: To Be Filled By O.E.M.


    Serial Number: To Be Filled By O.E.M.


    Asset Tag: To Be Filled By O.E.M.


    Boot-up State: Safe


    Power Supply State: Safe


    Thermal State: Safe


    Security Status: None


    OEM Information: 0x00000000


    Height: Unspecified


    Number Of Power Cords: 1


    Contained Elements: 0


    SKU Number: To be filled by O.E.M.



Handle 0x0004, DMI type 9, 17 bytes


System Slot Information


    Designation: PCI1


    Type: 32-bit PCI


    Current Usage: In Use


    Length: Short


    ID: 1


    Characteristics:


        3.3 V is provided


        Opening is shared


        PME signal is supported



Handle 0x0005, DMI type 9, 17 bytes


System Slot Information


    Designation: PCIE1


    Type: x16 PCI Express


    Current Usage: In Use


    Length: Short


    ID: 17


    Characteristics:


        3.3 V is provided


        Opening is shared


        PME signal is supported


    Bus Address: 0000:00:15.0



Handle 0x0006, DMI type 9, 17 bytes


System Slot Information


    Designation: PCIE2


    Type: x1 PCI Express


    Current Usage: In Use


    Length: Short


    ID: 18


    Characteristics:


        3.3 V is provided


        Opening is shared


        PME signal is supported


    Bus Address: 0000:00:02.0



Handle 0x0007, DMI type 9, 17 bytes


System Slot Information


    Designation: PCIE3


    Type: x4 PCI Express


    Current Usage: In Use


    Length: Short


    ID: 19


    Characteristics:


        3.3 V is provided


        Opening is shared


        PME signal is supported


    Bus Address: 0000:00:15.1



Handle 0x0008, DMI type 11, 5 bytes


OEM Strings


    String 1: To Be Filled By O.E.M.



Handle 0x0009, DMI type 7, 19 bytes


Cache Information


    Socket Designation: L1 CACHE


    Configuration: Enabled, Not Socketed, Level 1


    Operational Mode: Write Back


    Location: Internal


    Installed Size: 256 kB


    Maximum Size: 256 kB


    Supported SRAM Types:


        Pipeline Burst


    Installed SRAM Type: Pipeline Burst


    Speed: 1 ns


    Error Correction Type: Multi-bit ECC


    System Type: Unified


    Associativity: 2-way Set-associative



Handle 0x000A, DMI type 7, 19 bytes


Cache Information


    Socket Designation: L2 CACHE


    Configuration: Enabled, Not Socketed, Level 2


    Operational Mode: Write Back


    Location: Internal


    Installed Size: 4096 kB


    Maximum Size: 4096 kB


    Supported SRAM Types:


        Pipeline Burst


    Installed SRAM Type: Pipeline Burst


    Speed: 1 ns


    Error Correction Type: Multi-bit ECC


    System Type: Unified


    Associativity: 16-way Set-associative



Handle 0x0013, DMI type 32, 20 bytes


System Boot Information


    Status: No errors detected



Handle 0x0015, DMI type 16, 23 bytes


Physical Memory Array


    Location: System Board Or Motherboard


    Use: System Memory


    Error Correction Type: None


    Maximum Capacity: 16 GB


    Error Information Handle: Not Provided


    Number Of Devices: 4



Handle 0x0016, DMI type 19, 31 bytes


Memory Array Mapped Address


    Starting Address: 0x00000000000


    Ending Address: 0x003FFFFFFFF


    Range Size: 16 GB


    Physical Array Handle: 0x0015


    Partition Width: 255



Handle 0x0017, DMI type 17, 34 bytes


Memory Device


    Array Handle: 0x0015


    Error Information Handle: Not Provided


    Total Width: 64 bits


    Data Width: 64 bits


    Size: 8192 MB


    Form Factor: DIMM


    Set: None


    Locator: DIMM 0


    Bank Locator: CHANNEL A


    Type: DDR3


    Type Detail: Synchronous Unbuffered (Unregistered)


    Speed: 2400 MHz


    Manufacturer: <BAD INDEX>


    Serial Number: 00000000


    Asset Tag: A1_AssetTagNum0


    Part Number: Xtreem-LV-2400  


    Rank: 2


    Configured Clock Speed: 2400 MHz



Handle 0x0018, DMI type 20, 35 bytes


Memory Device Mapped Address


    Starting Address: 0x00000000000


    Ending Address: 0x001FFFFFFFF


    Range Size: 8 GB


    Physical Device Handle: 0x0017


    Memory Array Mapped Address Handle: 0x0016


    Partition Row Position: 1



Handle 0x0019, DMI type 17, 34 bytes


Memory Device


    Array Handle: 0x0015


    Error Information Handle: Not Provided


    Total Width: 64 bits


    Data Width: 64 bits


    Size: No Module Installed


    Form Factor: SODIMM


    Set: None


    Locator: DIMM 1


    Bank Locator: CHANNEL A


    Type: DDR3


    Type Detail: None


    Speed: Unknown


    Manufacturer: A1_Manufacturer1


    Serial Number: A1_SerialNum1


    Asset Tag: A1_AssetTagNum1


    Part Number: A1_PartNum1


    Rank: Unknown


    Configured Clock Speed: Unknown



Handle 0x001A, DMI type 17, 34 bytes


Memory Device


    Array Handle: 0x0015


    Error Information Handle: Not Provided


    Total Width: 64 bits


    Data Width: 64 bits


    Size: 8192 MB


    Form Factor: DIMM


    Set: None


    Locator: DIMM 0


    Bank Locator: CHANNEL B


    Type: DDR3


    Type Detail: Synchronous Unbuffered (Unregistered)


    Speed: 2400 MHz


    Manufacturer: <BAD INDEX>


    Serial Number: 00000000


    Asset Tag: A1_AssetTagNum2


    Part Number: Xtreem-LV-2400  


    Rank: 2


    Configured Clock Speed: 2400 MHz



Handle 0x001B, DMI type 20, 35 bytes


Memory Device Mapped Address


    Starting Address: 0x00200000000


    Ending Address: 0x003FFFFFFFF


    Range Size: 8 GB


    Physical Device Handle: 0x0019


    Memory Array Mapped Address Handle: 0x0016


    Partition Row Position: 1



Handle 0x001C, DMI type 17, 34 bytes


Memory Device


    Array Handle: 0x0015


    Error Information Handle: Not Provided


    Total Width: 64 bits


    Data Width: 64 bits


    Size: No Module Installed


    Form Factor: SODIMM


    Set: None


    Locator: DIMM 1


    Bank Locator: CHANNEL B


    Type: DDR3


    Type Detail: None


    Speed: Unknown


    Manufacturer: A1_Manufacturer3


    Serial Number: A1_SerialNum3


    Asset Tag: A1_AssetTagNum3


    Part Number: A1_PartNum3


    Rank: Unknown


    Configured Clock Speed: Unknown



Handle 0x001F, DMI type 4, 42 bytes


Processor Information


    Socket Designation: CPUSocket


    Type: Central Processor


    Family: A-Series


    Manufacturer: AMD


    ID: 01 0F 63 00 FF FB 8B 17


    Signature: Family 21, Model 48, Stepping 1


    Flags:


        FPU (Floating-point unit on-chip)


        VME (Virtual mode extension)


        DE (Debugging extension)


        PSE (Page size extension)


        TSC (Time stamp counter)


        MSR (Model specific registers)


        PAE (Physical address extension)


        MCE (Machine check exception)


        CX8 (CMPXCHG8 instruction supported)


        APIC (On-chip APIC hardware supported)


        SEP (Fast system call)


        MTRR (Memory type range registers)


        PGE (Page global enable)


        MCA (Machine check architecture)


        CMOV (Conditional move instruction supported)


        PAT (Page attribute table)


        PSE-36 (36-bit page size extension)


        CLFSH (CLFLUSH instruction supported)


        MMX (MMX technology supported)


        FXSR (FXSAVE and FXSTOR instructions supported)


        SSE (Streaming SIMD extensions)


        SSE2 (Streaming SIMD extensions 2)


        HTT (Multi-threading)


    Version: AMD A10-7850K APU with Radeon(TM) R7 Graphics


    Voltage: 1.3 V


    External Clock: 100 MHz


    Max Speed: 4200 MHz


    Current Speed: 4200 MHz


    Status: Populated, Enabled


    Upgrade: Socket FM2


    L1 Cache Handle: 0x0009


    L2 Cache Handle: 0x000A


    L3 Cache Handle: Not Provided


    Serial Number: Not Specified


    Asset Tag: Not Specified


    Part Number: Not Specified


    Core Count: 4


    Core Enabled: 4


    Thread Count: 4


    Characteristics:


        64-bit capable



Handle 0x0020, DMI type 127, 4 bytes


End Of Table




0 Likes

Hi Evren,

I got similar results after running the sample with the Omega driver on Kaveri under Red Hat 7 (64-bit). I've reported the issue to the dev team and they are working on it. I'll get back to you once I have an update. Thanks for pointing out the issue.

Regards,


Hi Evren,

My apologies for this delayed reply.

We ran a few experiments on our end. The BufferBandwidth sample was actually intended to measure memory bandwidth during map/unmap operations, not to benchmark read/write bandwidth from kernels.

Information about read/write bandwidth from kernels is available in the GlobalMemoryBandwidth benchmark sample, which is written to showcase exactly this. It reports global memory bandwidth for various access patterns, such as coalesced/uncoalesced, strided, and random.
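As a rough illustration of why the access pattern matters for kernel bandwidth, here is a toy model that counts how many cache lines one group of work items touches under each pattern. This is not how the GlobalMemoryBandwidth sample measures anything. The 64-byte line size comes from the clinfo output in this thread; the 4-byte element size, the group of 64 work items, the buffer length, and all function names are assumptions made for the example.

```python
import random

LINE_BYTES = 64   # "Cache line size: 64" in the clinfo output
ELEM_BYTES = 4    # assumption: 4-byte elements
GROUP = 64        # assumption: 64 work items issue their loads together
N = 1 << 20       # hypothetical number of elements in the buffer


def contiguous(group=GROUP):
    """Work item i reads element i (coalesced)."""
    return list(range(group))


def strided(stride, group=GROUP, n=N):
    """Work item i reads element i * stride, wrapping at the buffer end."""
    return [(i * stride) % n for i in range(group)]


def scattered(group=GROUP, n=N, seed=0):
    """Each work item reads a distinct pseudo-random element."""
    return random.Random(seed).sample(range(n), group)


def lines_touched(indices):
    """Number of distinct cache lines covered by the given element indices."""
    return len({(i * ELEM_BYTES) // LINE_BYTES for i in indices})


for name, idx in [
    ("contiguous", contiguous()),
    ("stride 16", strided(16)),
    ("random", scattered()),
]:
    print(f"{name:10s} -> {lines_touched(idx)} cache lines")
```

In this model the contiguous pattern needs only 4 cache lines for 64 loads, while a stride of 16 elements needs a separate line for every load, so the same amount of useful data costs many more memory transactions.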

Per your feedback, we will be modifying the BufferBandwidth sample to show only relevant information about map/unmap memory bandwidth.


Thanks,
