Hello,
I was wondering why the CPU reads/writes are so slow in the BufferBandwidth example, regardless of whether the memory is allocated on the host. Also, why are the GPU writes slow when the kernel is writing to host memory?
Device 0 Spectre
Build: release
GPU work items: 8192
Buffer size: 33554432
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY
outputBuffer: CL_MEM_WRITE_ONLY
Host baseline (naive):
Timer resolution 256.22 ns
Page fault 942.38 ns
CPU read 6.28 GB/s
memcpy() 8.81 GB/s
memset(,1,) 6.87 GB/s
memset(,0,) 6.87 GB/s
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 2331.320
---------------------------------------|---------------
memset() (GBPS) | 6.717
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 10.404
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 29.747
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 23.172
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 10.927
---------------------------------------|---------------
CPU read (GBPS) | 6.228
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 645.145
Device 0 Spectre
Build: release
GPU work items: 8192
Buffer size: 33554432
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
Host baseline (naive):
Timer resolution 256.48 ns
Page fault 974.34 ns
CPU read 6.15 GB/s
memcpy() 8.82 GB/s
memset(,1,) 6.73 GB/s
memset(,0,) 6.72 GB/s
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 2880.703
---------------------------------------|---------------
memset() (GBPS) | 9.079
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 917.657
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 28.579
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 8.098
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 3166.840
---------------------------------------|---------------
CPU read (GBPS) | 6.195
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 794.376
Thanks,
Evren
Hi Evren,
I guess this is somewhat expected. Most numbers are similar in both scenarios, except in the following cases. Please find my comments on those cases below.
Test | GPU memory | ALLOC_HOST_PTR memory | Comments
Host mapped write to inputBuffer - clEnqueueUnmapMemObject() (GBPS) | 10.404 | 917.657 | The GPU memory case is slower because the unmap has to transfer the data to GPU memory, and host writes to GPU memory are slower than host writes to host memory.
GPU kernel write to outputBuffer - clEnqueueNDRangeKernel() (GBPS) | 23.172 | 8.098 | The ALLOC_HOST_PTR case is slower because GPU writes to host memory go through the slower memory bus (Onion) rather than the GPU memory bus (Garlic).
Host mapped read of outputBuffer - clEnqueueMapBuffer -- READ (GBPS) | 10.927 | 3166.84 | The GPU memory case is slower because the host has to read from GPU memory, which is much slower than reading from host memory.
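To make the gap in the table more concrete, here is a small sketch that converts the reported rates back into the time each call would have taken for the 33554432-byte buffer. It assumes that GBPS means 1e9 bytes per second and that each figure covers one pass over the whole buffer; neither assumption is confirmed by the sample, and `seconds_for` is just an illustrative helper name.

```python
BUFFER_BYTES = 33554432  # "Buffer size" from the log header (32 MiB)

def seconds_for(gbps, nbytes=BUFFER_BYTES):
    """Time implied by a reported rate, assuming GBPS = 1e9 bytes/s."""
    return nbytes / (gbps * 1e9)

# (GPU memory, ALLOC_HOST_PTR) rates in GBPS, copied from the table above
rows = {
    "unmap after host write": (10.404, 917.657),
    "kernel write to outputBuffer": (23.172, 8.098),
    "map for host read": (10.927, 3166.84),
}

for name, (gpu_mem, host_mem) in rows.items():
    t_gpu = seconds_for(gpu_mem) * 1e6   # microseconds
    t_host = seconds_for(host_mem) * 1e6
    print(f"{name}: {t_gpu:.1f} us (GPU memory) vs {t_host:.1f} us (ALLOC_HOST_PTR)")
```

Under those assumptions the unmap takes roughly 3.2 ms with GPU memory but only tens of microseconds with ALLOC_HOST_PTR memory. That is far too short for 32 MiB to have been copied, which fits the explanation that no transfer is needed in that case. The kernel row is best read as a ratio only, since the sample may account for its kernel loops differently.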
Regards,
Hmm, but why are ALLOC_HOST_PTR memory reads fast while writes are slow?
Also, I modified BufferBandwidth to see how the CPU would perform (here the "GPU kernel" reads/writes are done by a kernel running on the CPU). The kernel read/write speeds are extremely low. Is this normal? Why?
Device 0 AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
Build: release
GPU work items: 4096
Buffer size: 33554432
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY
outputBuffer: CL_MEM_WRITE_ONLY
Host baseline (naive):
Timer resolution 256.64 ns
Page fault 971.97 ns
CPU read 6.16 GB/s
memcpy() 4.08 GB/s
memset(,1,) 6.69 GB/s
memset(,0,) 6.71 GB/s
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 4060.750
---------------------------------------|---------------
memset() (GBPS) | 6.667
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 726.832
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.709
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.347
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 1201.271
---------------------------------------|---------------
CPU read (GBPS) | 6.247
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 706.588
Verification Passed!
Hello Dipak,
I also tried the new version of BufferBandwidth from the new SDK. Kernel reads are super slow. Shouldn't they be higher?
$ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu
Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform found : Advanced Micro Devices, Inc.
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x262bea0
Build: release
GPU work items: 4096
Buffer size: 33554432
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY
outputBuffer: CL_MEM_WRITE_ONLY
Host baseline (naive):
Timer resolution 1000.52 ns
Page fault 836.79 ns
CPU read 6.38 GB/s
memcpy() 8.79 GB/s
memset(,1,) 6.70 GB/s
memset(,0,) 6.70 GB/s
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 4712.193
---------------------------------------|---------------
memset() (GBPS) | 6.675
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 555.184
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.709
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.349
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 1100.001
---------------------------------------|---------------
CPU read (GBPS) | 6.270
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 659.707
Verification Passed!
Passed!
Also, with the -if 5 -of 5 -cf 5 options:
$ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu -if 5 -of 5 -cf 5
Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform found : Advanced Micro Devices, Inc.
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x27e0c60
Build: release
GPU work items: 4096
Buffer size: 33554432
CPU workers: 1
Timing loops: 20
Repeats: 1
Kernel loops: 20
inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR
Host baseline (naive):
Timer resolution 1000.65 ns
Page fault 875.38 ns
CPU read 6.38 GB/s
memcpy() 8.87 GB/s
memset(,1,) 6.93 GB/s
memset(,0,) 6.92 GB/s
AVERAGES (over loops 2 - 19, use -l for complete log)
--------
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 3847.436
---------------------------------------|---------------
memset() (GBPS) | 6.853
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 588.324
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.720
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.352
4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 1152.796
---------------------------------------|---------------
CPU read (GBPS) | 6.320
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 707.233
Verification Passed!
Passed!
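For comparing two runs like the ones above without reading them side by side, a small parsing sketch may help. It assumes result lines have the form `<call> (GBPS) | <value>` under numbered section headings, as in the logs here; `parse_results` and `compare` are hypothetical helper names, not part of the SDK, and the embedded text is only an excerpt of the two CPU-device runs.

```python
import re

SECTION = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")
RESULT = re.compile(r"^\s*(.+?)\s*\(GBPS\)\s*\|\s*([0-9.]+)\s*$")

def parse_results(log_text):
    """Return {(section title, call label): GB/s} for one run."""
    results = {}
    section = None
    for line in log_text.splitlines():
        m = SECTION.match(line)
        if m:
            section = m.group(2)  # e.g. "GPU kernel read of inputBuffer"
            continue
        m = RESULT.match(line)
        if m and section is not None:
            results[(section, m.group(1))] = float(m.group(2))
    return results

def compare(run_a, run_b):
    """Ratio run_b / run_a for every result present in both runs."""
    return {k: run_b[k] / run_a[k] for k in run_a if k in run_b}

DEFAULT_RUN = """\
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 555.184
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.709
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.349
"""

ALLOC_HOST_RUN = """\
1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 588.324
2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.720
Verification Passed!
3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 0.352
"""

ratios = compare(parse_results(DEFAULT_RUN), parse_results(ALLOC_HOST_RUN))
for (section, call), ratio in ratios.items():
    print(f"{section} / {call}: x{ratio:.2f}")
```

Keying on the section title as well as the call name matters because clEnqueueNDRangeKernel() and clEnqueueUnmapMemObject() each appear in more than one section.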
Could you please share your setup details, such as the OS and Catalyst driver version? Please also share your clinfo output.
Regards,
Dipak, it is a standard Kaveri system with an ASRock motherboard. The clinfo and dmidecode outputs are below. I am using the Omega drivers with the newest 3.0 beta SDK (older SDKs give the same result). The OS is Ubuntu 14.04.
Number of platforms: 2
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 LINUX
Platform Name: Intel(R) OpenCL
Platform Vendor: Intel(R) Corporation
Platform Extensions: cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (1642.5)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: Intel(R) OpenCL Number of devices: 1 Device Type: CL_DEVICE_TYPE_CPU Vendor ID: 8086h Max compute units: 4 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 1024 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 2 Native vector width char: 16 Native vector width short: 8 Native vector width int: 4 Native vector width long: 2 Native vector width float: 4 Native vector width double: 2 Max clock frequency: 0Mhz Address bits: 64 Max memory allocation: 3641622528 Image support: Yes Max number of images read arguments: 480 Max number of images write arguments: 480 Max image 2D width: 16384 Max image 2D height: 16384 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 480 Max size of kernel argument: 3840 Alignment (bits) of base address: 1024 Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes Quiet NaNs: Yes Round to nearest even: Yes Round to zero: No Round to +ve and infinity: No IEEE754-2008 fused multiply-add: No Cache type: Read/Write Cache line size: 64 Cache size: 2097152 Global memory size: 14566490112 Constant buffer size: 131072 Max number of constant args: 480 Local memory type: Global Local memory size: 32768 Kernel Preferred work group size multiple: 128
Error correction support: 0 Unified memory for Host and Device: 1 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: Yes Queue on Host properties: Out-of-Order: Yes Profiling : Yes Platform ID: 0x256e700 Name: AMD A10-7850K APU with Radeon(TM) R7 Graphics Vendor: Intel(R) Corporation Device OpenCL C version: OpenCL C 1.2 Driver version: 1.2 Profile: FULL_PROFILE Version: OpenCL 1.2 (Build 56860) Extensions: cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread
Platform Name: AMD Accelerated Parallel Processing Number of devices: 2 Device Type: CL_DEVICE_TYPE_GPU Vendor ID: 1002h Board name: AMD Radeon(TM) R7 Graphics Device Topology: PCI[ B#0, D#1, F#0 ] Max compute units: 8 Max work items dimensions: 3 Max work items[0]: 256 Max work items[1]: 256 Max work items[2]: 256 Max work group size: 256 Preferred vector width char: 4 Preferred vector width short: 2 Preferred vector width int: 1 Preferred vector width long: 1 Preferred vector width float: 1 Preferred vector width double: 1 Native vector width char: 4 Native vector width short: 2 Native vector width int: 1 Native vector width long: 1 Native vector width float: 1 Native vector width double: 1 Max clock frequency: 900Mhz Address bits: 64 Max memory allocation: 1206806118 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 64 Max image 2D width: 16384 Max image 2D height: 16384 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 2048 Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: Read/Write Cache line size: 64 Cache size: 16384 Global memory size: 2569011200 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Max pipe arguments: 16 Max pipe active reservations: 16 Max pipe packet size: 1206806118 Max global variable size: 1086125312 Max global variable preferred total size: 2569011200
Max read/write image args: 64 Max on device events: 1024 Queue on device max size: 524288 Max on device queues: 1 Queue on device preferred size: 16384 SVM capabilities: Coarse grain buffer: Yes Fine grain buffer: Yes Fine grain system: No Atomics: No Preferred platform atomic alignment: 0 Preferred global atomic alignment: 0 Preferred local atomic alignment: 0 Kernel Preferred work group size multiple: 64
Error correction support: 0 Unified memory for Host and Device: 1 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue on Host properties: Out-of-Order: No Profiling : Yes Queue on Device properties: Out-of-Order: Yes Profiling : Yes Platform ID: 0x7f61e4e1cfd0 Name: Spectre Vendor: Advanced Micro Devices, Inc. Device OpenCL C version: OpenCL C 2.0 Driver version: 1642.5 (VM) Profile: FULL_PROFILE Version: OpenCL 2.0 AMD-APP (1642.5) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images
Device Type: CL_DEVICE_TYPE_CPU Vendor ID: 1002h Board name: Max compute units: 4 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 1024 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 8 Preferred vector width double: 4 Native vector width char: 16 Native vector width short: 8 Native vector width int: 4 Native vector width long: 2 Native vector width float: 8 Native vector width double: 4 Max clock frequency: 4200Mhz Address bits: 64 Max memory allocation: 3641622528 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 64 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 4096 Alignment (bits) of base address: 1024 Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: Read/Write Cache line size: 64 Cache size: 16384 Global memory size: 14566490112 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 32768 Max pipe arguments: 16 Max pipe active reservations: 16 Max pipe packet size: 3641622528 Max global variable size: 1879048192 Max global variable preferred total size: 1879048192
Max read/write image args: 64 Max on device events: 0 Queue on device max size: 0 Max on device queues: 0 Queue on device preferred size: 0 SVM capabilities: Coarse grain buffer: Yes Fine grain buffer: Yes Fine grain system: Yes Atomics: Yes Preferred platform atomic alignment: 0 Preferred global atomic alignment: 0 Preferred local atomic alignment: 0 Kernel Preferred work group size multiple: 1
Error correction support: 0 Unified memory for Host and Device: 1 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: Yes Queue on Host properties: Out-of-Order: No Profiling : Yes Queue on Device properties: Out-of-Order: No Profiling : No Platform ID: 0x7f61e4e1cfd0 Name: AMD A10-7850K APU with Radeon(TM) R7 Graphics Vendor: AuthenticAMD Device OpenCL C version: OpenCL C 1.2 Driver version: 1642.5 (sse2,avx,fma4) Profile: FULL_PROFILE Version: OpenCL 1.2 AMD-APP (1642.5) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event
# dmidecode 2.12
SMBIOS 2.7 present.
22 structures occupying 1358 bytes.
Table at 0x000EBF50.
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: P2.10
Release Date: 02/20/2014
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 8192 kB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 4.6
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: To Be Filled By O.E.M.
Product Name: To Be Filled By O.E.M.
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
UUID: 03000200-0400-0500-0006-000700080009
Wake-up Type: Power Switch
SKU Number: To Be Filled By O.E.M.
Family: To Be Filled By O.E.M.
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASRock
Product Name: FM2A88M Extreme4+
Version:
Serial Number: E80-3A010000081
Asset Tag:
Features:
Board is a hosting board
Board is replaceable
Location In Chassis:
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0
Handle 0x0003, DMI type 3, 22 bytes
Chassis Information
Manufacturer: To Be Filled By O.E.M.
Type: Desktop
Lock: Not Present
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
Asset Tag: To Be Filled By O.E.M.
Boot-up State: Safe
Power Supply State: Safe
Thermal State: Safe
Security Status: None
OEM Information: 0x00000000
Height: Unspecified
Number Of Power Cords: 1
Contained Elements: 0
SKU Number: To be filled by O.E.M.
Handle 0x0004, DMI type 9, 17 bytes
System Slot Information
Designation: PCI1
Type: 32-bit PCI
Current Usage: In Use
Length: Short
ID: 1
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Handle 0x0005, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE1
Type: x16 PCI Express
Current Usage: In Use
Length: Short
ID: 17
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:15.0
Handle 0x0006, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE2
Type: x1 PCI Express
Current Usage: In Use
Length: Short
ID: 18
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:02.0
Handle 0x0007, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE3
Type: x4 PCI Express
Current Usage: In Use
Length: Short
ID: 19
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:15.1
Handle 0x0008, DMI type 11, 5 bytes
OEM Strings
String 1: To Be Filled By O.E.M.
Handle 0x0009, DMI type 7, 19 bytes
Cache Information
Socket Designation: L1 CACHE
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Location: Internal
Installed Size: 256 kB
Maximum Size: 256 kB
Supported SRAM Types:
Pipeline Burst
Installed SRAM Type: Pipeline Burst
Speed: 1 ns
Error Correction Type: Multi-bit ECC
System Type: Unified
Associativity: 2-way Set-associative
Handle 0x000A, DMI type 7, 19 bytes
Cache Information
Socket Designation: L2 CACHE
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Back
Location: Internal
Installed Size: 4096 kB
Maximum Size: 4096 kB
Supported SRAM Types:
Pipeline Burst
Installed SRAM Type: Pipeline Burst
Speed: 1 ns
Error Correction Type: Multi-bit ECC
System Type: Unified
Associativity: 16-way Set-associative
Handle 0x0013, DMI type 32, 20 bytes
System Boot Information
Status: No errors detected
Handle 0x0015, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 16 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x0016, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x003FFFFFFFF
Range Size: 16 GB
Physical Array Handle: 0x0015
Partition Width: 255
Handle 0x0017, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: CHANNEL A
Type: DDR3
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2400 MHz
Manufacturer: <BAD INDEX>
Serial Number: 00000000
Asset Tag: A1_AssetTagNum0
Part Number: Xtreem-LV-2400
Rank: 2
Configured Clock Speed: 2400 MHz
Handle 0x0018, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x001FFFFFFFF
Range Size: 8 GB
Physical Device Handle: 0x0017
Memory Array Mapped Address Handle: 0x0016
Partition Row Position: 1
Handle 0x0019, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: No Module Installed
Form Factor: SODIMM
Set: None
Locator: DIMM 1
Bank Locator: CHANNEL A
Type: DDR3
Type Detail: None
Speed: Unknown
Manufacturer: A1_Manufacturer1
Serial Number: A1_SerialNum1
Asset Tag: A1_AssetTagNum1
Part Number: A1_PartNum1
Rank: Unknown
Configured Clock Speed: Unknown
Handle 0x001A, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: CHANNEL B
Type: DDR3
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2400 MHz
Manufacturer: <BAD INDEX>
Serial Number: 00000000
Asset Tag: A1_AssetTagNum2
Part Number: Xtreem-LV-2400
Rank: 2
Configured Clock Speed: 2400 MHz
Handle 0x001B, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00200000000
Ending Address: 0x003FFFFFFFF
Range Size: 8 GB
Physical Device Handle: 0x0019
Memory Array Mapped Address Handle: 0x0016
Partition Row Position: 1
Handle 0x001C, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: No Module Installed
Form Factor: SODIMM
Set: None
Locator: DIMM 1
Bank Locator: CHANNEL B
Type: DDR3
Type Detail: None
Speed: Unknown
Manufacturer: A1_Manufacturer3
Serial Number: A1_SerialNum3
Asset Tag: A1_AssetTagNum3
Part Number: A1_PartNum3
Rank: Unknown
Configured Clock Speed: Unknown
Handle 0x001F, DMI type 4, 42 bytes
Processor Information
Socket Designation: CPUSocket
Type: Central Processor
Family: A-Series
Manufacturer: AMD
ID: 01 0F 63 00 FF FB 8B 17
Signature: Family 21, Model 48, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
HTT (Multi-threading)
Version: AMD A10-7850K APU with Radeon(TM) R7 Graphics
Voltage: 1.3 V
External Clock: 100 MHz
Max Speed: 4200 MHz
Current Speed: 4200 MHz
Status: Populated, Enabled
Upgrade: Socket FM2
L1 Cache Handle: 0x0009
L2 Cache Handle: 0x000A
L3 Cache Handle: Not Provided
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 4
Core Enabled: 4
Thread Count: 4
Characteristics:
64-bit capable
Handle 0x0020, DMI type 127, 4 bytes
End Of Table
Hi Evren,
I got similar results after running the sample with the Omega driver on Kaveri with Red Hat 7 (64-bit). I've reported the issue to the dev team and they are working on it. I'll get back to you once I have an update. Thanks for pointing out the issue.
Regards,
Hi Evren,
My apologies for this delayed reply.
We ran a few experiments at our end. The BufferBandwidth sample was actually intended for measuring memory bandwidth during map/unmap operations, not for benchmarking read/write bandwidth from kernels.
Read/write bandwidth from kernels is covered by the GlobalMemoryBandwidth benchmark sample, which is written to showcase exactly that. It reports global memory bandwidth for various access scenarios, such as coalesced/uncoalesced, strided, and random access.
Per your feedback, we will be modifying the BufferBandwidth sample to show only the relevant information about map/unmap memory bandwidth.
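As a rough illustration of the access scenarios mentioned above, the sketch below generates the element index each work-item would touch under a coalesced, strided, or random pattern, and counts how many 64-byte cache lines one group of 64 work-items touches. This is not the GlobalMemoryBandwidth code: the function names are made up for illustration, and counting cache lines is only a simplified proxy for memory traffic. The 64-byte line size matches the clinfo output posted earlier; the 4-byte element size is an assumption.

```python
import random

def coalesced(n_items):
    # Work-item i touches element i, so neighbours touch neighbouring addresses.
    return list(range(n_items))

def strided(n_items, stride, n_elems):
    # Work-item i touches element i * stride, wrapping around the buffer.
    return [(i * stride) % n_elems for i in range(n_items)]

def random_access(n_items, n_elems, seed=0):
    # Each work-item touches an arbitrary element (seeded for repeatability).
    rng = random.Random(seed)
    return [rng.randrange(n_elems) for _ in range(n_items)]

def cache_lines_touched(indices, elem_bytes=4, line_bytes=64):
    # Number of distinct cache lines covered by the accessed elements.
    return len({(i * elem_bytes) // line_bytes for i in indices})

N_ELEMS = 33554432 // 4  # 4-byte elements in the 33554432-byte buffer
GROUP = 64               # one group of work-items

for name, idx in [
    ("coalesced", coalesced(GROUP)),
    ("stride 16", strided(GROUP, 16, N_ELEMS)),
    ("random", random_access(GROUP, N_ELEMS)),
]:
    print(f"{name}: {cache_lines_touched(idx)} cache lines for {GROUP} work-items")
```

The same 64 loads cover 4 cache lines when coalesced but 64 lines with a stride of 16 elements, which is why kernel bandwidth depends so strongly on the access pattern.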
Thanks,