
yurtesen · ‎01-05-2015

Hello,

I was wondering why CPU reads and writes are so slow in the BufferBandwidth example, regardless of whether the memory is allocated on the host. Also, why are GPU writes slow when the kernel writes to host memory?

Device 0        Spectre
Build:           release
GPU work items: 8192
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY
outputBuffer:    CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution 256.22 ns
Page fault       942.38 ns
CPU read         6.28 GB/s
memcpy()         8.81 GB/s
memset(,1,)      6.87 GB/s
memset(,0,)      6.87 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS) | 2331.320

---------------------------------------|---------------

memset() (GBPS)                    | 6.717

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)   | 10.404

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)    | 29.747

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)    | 23.172

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS) | 10.927

---------------------------------------|---------------

CPU read (GBPS)                    | 6.228

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)   | 645.145

Device 0        Spectre
Build:           release
GPU work items: 8192
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer:    CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

Host baseline (naive):

Timer resolution 256.48 ns
Page fault       974.34 ns
CPU read         6.15 GB/s
memcpy()         8.82 GB/s
memset(,1,)      6.73 GB/s
memset(,0,)      6.72 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS) | 2880.703

---------------------------------------|---------------

memset() (GBPS)                    | 9.079

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)   | 917.657

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)    | 28.579

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)    | 8.098

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS) | 3166.840

---------------------------------------|---------------

CPU read (GBPS)                    | 6.195

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)   | 794.376

Thanks,

Evren

dipak · ‎01-06-2015

Hi Evren,

I think this is somewhat expected. Most numbers are similar in both scenarios, except for the following cases. Please find my comments on those cases below.

Measurement (GBPS) | GPU memory | ALLOC_HOST_PTR memory | Comments
Host mapped write to inputBuffer, clEnqueueUnmapMemObject() | 10.404 | 917.657 | The GPU-memory case is lower because the unmap has to transfer the data to GPU memory, and host writes to GPU memory are slower than writes to host memory.
GPU kernel write to outputBuffer, clEnqueueNDRangeKernel() | 23.172 | 8.098 | The ALLOC_HOST_PTR case is lower because GPU writes to host memory go through the slower memory bus (Onion) rather than the GPU memory bus (Garlic).
Host mapped read of outputBuffer, clEnqueueMapBuffer -- READ | 10.927 | 3166.84 | The GPU-memory case is slower because the host has to read from GPU memory, which is much slower than reading from host memory.

Regards,
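
To put the map/unmap rows above in perspective, a bandwidth figure can be converted back into elapsed time for the 33554432-byte buffer. This is a minimal Python sketch, assuming the sample reports GBPS as 10^9 bytes per second (an assumption; the sample's source is not quoted in this thread). The helper name `elapsed_us` is made up for illustration. The kernel row is left out because the kernel runs 20 loops and its timing may not map to a single pass over the buffer.

```python
BUFFER_BYTES = 33554432  # "Buffer size" from the logs above

def elapsed_us(gbps, nbytes=BUFFER_BYTES):
    """Microseconds needed to move nbytes at gbps (assumes 1 GB = 1e9 bytes)."""
    return nbytes / (gbps * 1e9) * 1e6

# (GPU memory, ALLOC_HOST_PTR) figures copied from the table above
rows = {
    "unmap after host write": (10.404, 917.657),
    "map for host read": (10.927, 3166.84),
}

# Host memcpy() baseline from the first log, for comparison
MEMCPY_GBPS = 8.81

print(f"plain memcpy of the buffer: {elapsed_us(MEMCPY_GBPS):.0f} us")
for name, (gpu_mem, alloc_host) in rows.items():
    print(f"{name}: {elapsed_us(gpu_mem):.0f} us (GPU memory) "
          f"vs {elapsed_us(alloc_host):.0f} us (ALLOC_HOST_PTR)")
```

Under that assumption, the GPU-memory unmap takes roughly 3.2 ms, about the time a full copy of the buffer would need, while the ALLOC_HOST_PTR unmap finishes in well under 0.1 ms. That is consistent with the comments above: one path moves the data, the other does not.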

yurtesen · ‎01-07-2015

Hmm, but why are reads from ALLOC_HOST_PTR memory fast while writes are slow?

Also, I modified BufferBandwidth to see how the CPU would perform (here the "GPU kernel" reads and writes are made by a kernel running on the CPU device). The kernel read/write speeds are extremely low. Is this normal? If so, why?

Device 0        AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
Build:           release
GPU work items: 4096
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY
outputBuffer:    CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution 256.64 ns
Page fault       971.97 ns
CPU read         6.16 GB/s
memcpy()         4.08 GB/s
memset(,1,)      6.69 GB/s
memset(,0,)      6.71 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS) | 4060.750

---------------------------------------|---------------

memset() (GBPS)                    | 6.667

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)   | 726.832

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)    | 0.709

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)    | 0.347

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS) | 1201.271

---------------------------------------|---------------

CPU read (GBPS)                    | 6.247

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)   | 706.588

Verification Passed!

yurtesen · ‎01-15-2015

Hello Dipak,

I also tried the new version of BufferBandwidth from the new SDK. Kernel reads are still extremely slow. Shouldn't they be faster?

$ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu

Platform 0 : Advanced Micro Devices, Inc.

Platform 1 : Intel(R) Corporation

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x262bea0

Build:               release

GPU work items:      4096

Buffer size:         33554432

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        20

inputBuffer:         CL_MEM_READ_ONLY

outputBuffer:        CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution     1000.52 ns

Page fault           836.79 ns

CPU read             6.38 GB/s

memcpy()             8.79 GB/s

memset(,1,)          6.70 GB/s

memset(,0,)          6.70 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS)     | 4712.193

---------------------------------------|---------------

memset() (GBPS)                        | 6.675

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)       | 555.184

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)        | 0.709

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)        | 0.349

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS)      | 1100.001

---------------------------------------|---------------

CPU read (GBPS)                        | 6.270

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)       | 659.707

Verification Passed!

Passed!

Also, with the -if 5 -of 5 -cf 5 flags:

$ /opt/AMDAPPSDK-3.0-0-Beta/samples/opencl/bin/x86_64/BufferBandwidth --device cpu -if 5 -of 5 -cf 5

Platform 0 : Advanced Micro Devices, Inc.

Platform 1 : Intel(R) Corporation

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G Device ID is 0x27e0c60

Build:               release

GPU work items:      4096

Buffer size:         33554432

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        20

inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR

outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

Host baseline (naive):

Timer resolution     1000.65 ns

Page fault           875.38 ns

CPU read             6.38 GB/s

memcpy()             8.87 GB/s

memset(,1,)          6.93 GB/s

memset(,0,)          6.92 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- WRITE (GBPS)     | 3847.436

---------------------------------------|---------------

memset() (GBPS)                        | 6.853

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)       | 588.324

2. GPU kernel read of inputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)        | 0.720

Verification Passed!

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)        | 0.352

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS)      | 1152.796

---------------------------------------|---------------

CPU read (GBPS)                        | 6.320

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)       | 707.233

Verification Passed!

Passed!
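
One thing to keep in mind when reading the very large map figures in these two runs: at thousands of GB/s, the interval being timed for a 33554432-byte buffer is only a few microseconds, which is not far above the "Timer resolution" the sample itself reports. A rough Python sketch of that comparison, again assuming 1 GB = 10^9 bytes; `interval_ns` is a hypothetical helper, not part of the sample.

```python
BUFFER_BYTES = 33554432  # "Buffer size" from the logs above

def interval_ns(gbps, nbytes=BUFFER_BYTES):
    """Timed interval in ns implied by a GBPS figure (assumes 1 GB = 1e9 bytes)."""
    # bytes / (gbps * 1e9 bytes/s) seconds, times 1e9 ns/s, simplifies to:
    return nbytes / gbps

# (reported map WRITE GBPS, reported timer resolution in ns) from the two runs
samples = {
    "default flags": (4712.193, 1000.52),
    "-if 5 -of 5 -cf 5": (3847.436, 1000.65),
}

for name, (gbps, resolution_ns) in samples.items():
    t = interval_ns(gbps)
    print(f"{name}: {t / 1000:.1f} us, about {t / resolution_ns:.0f} timer ticks")
```

With only a handful of timer ticks per measurement, differences between map figures in the thousands of GB/s are probably not meaningful on their own. The kernel figures below 1 GB/s correspond to much longer intervals and are not affected in the same way.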

dipak · ‎01-16-2015

Could you please share your setup details, such as the OS and Catalyst driver version? Please also share your clinfo output.

Regards,

yurtesen · ‎01-19-2015

Dipak, it is a standard Kaveri system with an ASRock motherboard. The clinfo and dmidecode output is below. I am using the Omega drivers with the newest 3.0 beta SDK (older SDKs give the same result). The OS is Ubuntu 14.04.

Number of platforms: 2
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 LINUX
Platform Name: Intel(R) OpenCL
Platform Vendor: Intel(R) Corporation
Platform Extensions: cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (1642.5)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: Intel(R) OpenCL
Number of devices: 1
Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 8086h
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 0Mhz
Address bits: 64
Max memory allocation: 3641622528
Image support: Yes
Max number of images read arguments: 480
Max number of images write arguments: 480
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 480
Max size of kernel argument: 3840
Alignment (bits) of base address: 1024

Minimum alignment (bytes) for any datatype:     128

Single precision floating point capability

Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 2097152
Global memory size: 14566490112
Constant buffer size: 131072
Max number of constant args: 480
Local memory type: Global
Local memory size: 32768

Kernel Preferred work group size multiple:     128

Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x256e700
Name: AMD A10-7850K APU with Radeon(TM) R7 Graphics
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.2
Profile: FULL_PROFILE
Version: OpenCL 1.2 (Build 56860)
Extensions: cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon(TM) R7 Graphics
Device Topology: PCI[ B#0, D#1, F#0 ]
Max compute units: 8
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 900Mhz
Address bits: 64
Max memory allocation: 1206806118
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048

Minimum alignment (bytes) for any datatype:     128

Single precision floating point capability

Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 2569011200
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 1206806118
Max global variable size: 1086125312

Max global variable preferred total size:     2569011200

Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 524288
Max on device queues: 1
Queue on device preferred size: 16384
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0

Kernel Preferred work group size multiple:     64

Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7f61e4e1cfd0
Name: Spectre
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 1642.5 (VM)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (1642.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images

Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 1002h
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 8
Preferred vector width double: 4
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 4200Mhz
Address bits: 64
Max memory allocation: 3641622528
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024

Minimum alignment (bytes) for any datatype:     128

Single precision floating point capability

Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 14566490112
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 3641622528
Max global variable size: 1879048192

Max global variable preferred total size:     1879048192

Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: Yes
Atomics: Yes
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0

Kernel Preferred work group size multiple:     1

Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7f61e4e1cfd0
Name: AMD A10-7850K APU with Radeon(TM) R7 Graphics
Vendor: AuthenticAMD
Device OpenCL C version: OpenCL C 1.2
Driver version: 1642.5 (sse2,avx,fma4)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1642.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event

# dmidecode 2.12

SMBIOS 2.7 present.

22 structures occupying 1358 bytes.

Table at 0x000EBF50.

Handle 0x0000, DMI type 0, 24 bytes

BIOS Information

    Vendor: American Megatrends Inc.

    Version: P2.10

    Release Date: 02/20/2014

    Address: 0xF0000

    Runtime Size: 64 kB

    ROM Size: 8192 kB

    Characteristics:

        PCI is supported

        BIOS is upgradeable

        BIOS shadowing is allowed

        Boot from CD is supported

        Selectable boot is supported

        BIOS ROM is socketed

        EDD is supported

        5.25"/1.2 MB floppy services are supported (int 13h)

        3.5"/720 kB floppy services are supported (int 13h)

        3.5"/2.88 MB floppy services are supported (int 13h)

        Print screen service is supported (int 5h)

        8042 keyboard services are supported (int 9h)

        Serial services are supported (int 14h)

        Printer services are supported (int 17h)

        ACPI is supported

        USB legacy is supported

        BIOS boot specification is supported

        Targeted content distribution is supported

        UEFI is supported

    BIOS Revision: 4.6

Handle 0x0001, DMI type 1, 27 bytes

System Information

    Manufacturer: To Be Filled By O.E.M.

    Product Name: To Be Filled By O.E.M.

    Version: To Be Filled By O.E.M.

    Serial Number: To Be Filled By O.E.M.

    UUID: 03000200-0400-0500-0006-000700080009

    Wake-up Type: Power Switch

    SKU Number: To Be Filled By O.E.M.

    Family: To Be Filled By O.E.M.

Handle 0x0002, DMI type 2, 15 bytes

Base Board Information

    Manufacturer: ASRock

    Product Name: FM2A88M Extreme4+

    Version:

    Serial Number: E80-3A010000081

    Asset Tag:

    Features:

        Board is a hosting board

        Board is replaceable

    Location In Chassis:

    Chassis Handle: 0x0003

    Type: Motherboard

    Contained Object Handles: 0

Handle 0x0003, DMI type 3, 22 bytes

Chassis Information

    Manufacturer: To Be Filled By O.E.M.

    Type: Desktop

    Lock: Not Present

    Version: To Be Filled By O.E.M.

    Serial Number: To Be Filled By O.E.M.

    Asset Tag: To Be Filled By O.E.M.

    Boot-up State: Safe

    Power Supply State: Safe

    Thermal State: Safe

    Security Status: None

    OEM Information: 0x00000000

    Height: Unspecified

    Number Of Power Cords: 1

    Contained Elements: 0

    SKU Number: To be filled by O.E.M.

Handle 0x0004, DMI type 9, 17 bytes

System Slot Information

    Designation: PCI1

    Type: 32-bit PCI

    Current Usage: In Use

    Length: Short

    ID: 1

    Characteristics:

        3.3 V is provided

        Opening is shared

        PME signal is supported

Handle 0x0005, DMI type 9, 17 bytes

System Slot Information

    Designation: PCIE1

    Type: x16 PCI Express

    Current Usage: In Use

    Length: Short

    ID: 17

    Characteristics:

        3.3 V is provided

        Opening is shared

        PME signal is supported

    Bus Address: 0000:00:15.0

Handle 0x0006, DMI type 9, 17 bytes

System Slot Information

    Designation: PCIE2

    Type: x1 PCI Express

    Current Usage: In Use

    Length: Short

    ID: 18

    Characteristics:

        3.3 V is provided

        Opening is shared

        PME signal is supported

    Bus Address: 0000:00:02.0

Handle 0x0007, DMI type 9, 17 bytes

System Slot Information

    Designation: PCIE3

    Type: x4 PCI Express

    Current Usage: In Use

    Length: Short

    ID: 19

    Characteristics:

        3.3 V is provided

        Opening is shared

        PME signal is supported

    Bus Address: 0000:00:15.1

Handle 0x0008, DMI type 11, 5 bytes

OEM Strings

    String 1: To Be Filled By O.E.M.

Handle 0x0009, DMI type 7, 19 bytes

Cache Information

    Socket Designation: L1 CACHE

    Configuration: Enabled, Not Socketed, Level 1

    Operational Mode: Write Back

    Location: Internal

    Installed Size: 256 kB

    Maximum Size: 256 kB

    Supported SRAM Types:

        Pipeline Burst

    Installed SRAM Type: Pipeline Burst

    Speed: 1 ns

    Error Correction Type: Multi-bit ECC

    System Type: Unified

    Associativity: 2-way Set-associative

Handle 0x000A, DMI type 7, 19 bytes

Cache Information

    Socket Designation: L2 CACHE

    Configuration: Enabled, Not Socketed, Level 2

    Operational Mode: Write Back

    Location: Internal

    Installed Size: 4096 kB

    Maximum Size: 4096 kB

    Supported SRAM Types:

        Pipeline Burst

    Installed SRAM Type: Pipeline Burst

    Speed: 1 ns

    Error Correction Type: Multi-bit ECC

    System Type: Unified

    Associativity: 16-way Set-associative

Handle 0x0013, DMI type 32, 20 bytes

System Boot Information

    Status: No errors detected

Handle 0x0015, DMI type 16, 23 bytes

Physical Memory Array

    Location: System Board Or Motherboard

    Use: System Memory

    Error Correction Type: None

    Maximum Capacity: 16 GB

    Error Information Handle: Not Provided

    Number Of Devices: 4

Handle 0x0016, DMI type 19, 31 bytes

Memory Array Mapped Address

    Starting Address: 0x00000000000

    Ending Address: 0x003FFFFFFFF

    Range Size: 16 GB

    Physical Array Handle: 0x0015

    Partition Width: 255

Handle 0x0017, DMI type 17, 34 bytes

Memory Device

    Array Handle: 0x0015

    Error Information Handle: Not Provided

    Total Width: 64 bits

    Data Width: 64 bits

    Size: 8192 MB

    Form Factor: DIMM

    Set: None

    Locator: DIMM 0

    Bank Locator: CHANNEL A

    Type: DDR3

    Type Detail: Synchronous Unbuffered (Unregistered)

    Speed: 2400 MHz

    Manufacturer: <BAD INDEX>

    Serial Number: 00000000

    Asset Tag: A1_AssetTagNum0

    Part Number: Xtreem-LV-2400

    Rank: 2

    Configured Clock Speed: 2400 MHz

Handle 0x0018, DMI type 20, 35 bytes

Memory Device Mapped Address

    Starting Address: 0x00000000000

    Ending Address: 0x001FFFFFFFF

    Range Size: 8 GB

    Physical Device Handle: 0x0017

    Memory Array Mapped Address Handle: 0x0016

    Partition Row Position: 1

Handle 0x0019, DMI type 17, 34 bytes

Memory Device

    Array Handle: 0x0015

    Error Information Handle: Not Provided

    Total Width: 64 bits

    Data Width: 64 bits

    Size: No Module Installed

    Form Factor: SODIMM

    Set: None

    Locator: DIMM 1

    Bank Locator: CHANNEL A

    Type: DDR3

    Type Detail: None

    Speed: Unknown

    Manufacturer: A1_Manufacturer1

    Serial Number: A1_SerialNum1

    Asset Tag: A1_AssetTagNum1

    Part Number: A1_PartNum1

    Rank: Unknown

    Configured Clock Speed: Unknown

Handle 0x001A, DMI type 17, 34 bytes

Memory Device

    Array Handle: 0x0015

    Error Information Handle: Not Provided

    Total Width: 64 bits

    Data Width: 64 bits

    Size: 8192 MB

    Form Factor: DIMM

    Set: None

    Locator: DIMM 0

    Bank Locator: CHANNEL B

    Type: DDR3

    Type Detail: Synchronous Unbuffered (Unregistered)

    Speed: 2400 MHz

    Manufacturer: <BAD INDEX>

    Serial Number: 00000000

    Asset Tag: A1_AssetTagNum2

    Part Number: Xtreem-LV-2400

    Rank: 2

    Configured Clock Speed: 2400 MHz

Handle 0x001B, DMI type 20, 35 bytes

Memory Device Mapped Address

    Starting Address: 0x00200000000

    Ending Address: 0x003FFFFFFFF

    Range Size: 8 GB

    Physical Device Handle: 0x0019

    Memory Array Mapped Address Handle: 0x0016

    Partition Row Position: 1

Handle 0x001C, DMI type 17, 34 bytes

Memory Device

    Array Handle: 0x0015

    Error Information Handle: Not Provided

    Total Width: 64 bits

    Data Width: 64 bits

    Size: No Module Installed

    Form Factor: SODIMM

    Set: None

    Locator: DIMM 1

    Bank Locator: CHANNEL B

    Type: DDR3

    Type Detail: None

    Speed: Unknown

    Manufacturer: A1_Manufacturer3

    Serial Number: A1_SerialNum3

    Asset Tag: A1_AssetTagNum3

    Part Number: A1_PartNum3

    Rank: Unknown

    Configured Clock Speed: Unknown

Handle 0x001F, DMI type 4, 42 bytes

Processor Information

    Socket Designation: CPUSocket

    Type: Central Processor

    Family: A-Series

    Manufacturer: AMD

    ID: 01 0F 63 00 FF FB 8B 17

    Signature: Family 21, Model 48, Stepping 1

    Flags:

        FPU (Floating-point unit on-chip)

        VME (Virtual mode extension)

        DE (Debugging extension)

        PSE (Page size extension)

        TSC (Time stamp counter)

        MSR (Model specific registers)

        PAE (Physical address extension)

        MCE (Machine check exception)

        CX8 (CMPXCHG8 instruction supported)

        APIC (On-chip APIC hardware supported)

        SEP (Fast system call)

        MTRR (Memory type range registers)

        PGE (Page global enable)

        MCA (Machine check architecture)

        CMOV (Conditional move instruction supported)

        PAT (Page attribute table)

        PSE-36 (36-bit page size extension)

        CLFSH (CLFLUSH instruction supported)

        MMX (MMX technology supported)

        FXSR (FXSAVE and FXSTOR instructions supported)

        SSE (Streaming SIMD extensions)

        SSE2 (Streaming SIMD extensions 2)

        HTT (Multi-threading)

    Version: AMD A10-7850K APU with Radeon(TM) R7 Graphics

    Voltage: 1.3 V

    External Clock: 100 MHz

    Max Speed: 4200 MHz

    Current Speed: 4200 MHz

    Status: Populated, Enabled

    Upgrade: Socket FM2

    L1 Cache Handle: 0x0009

    L2 Cache Handle: 0x000A

    L3 Cache Handle: Not Provided

    Serial Number: Not Specified

    Asset Tag: Not Specified

    Part Number: Not Specified

    Core Count: 4

    Core Enabled: 4

    Thread Count: 4

    Characteristics:

        64-bit capable

Handle 0x0020, DMI type 127, 4 bytes

End Of Table
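
For reference, the dmidecode output above is enough to estimate a theoretical peak DRAM bandwidth to compare the measured figures against. A back-of-the-envelope Python sketch, assuming that the reported "2400 MHz" is the DDR3 transfer rate (2400 MT/s), that the two populated DIMMs (one on CHANNEL A, one on CHANNEL B) run in dual-channel mode, and that 1 GB = 10^9 bytes. The variable names are illustrative.

```python
TRANSFERS_PER_SEC = 2400e6     # "Speed: 2400 MHz", read as 2400 MT/s (assumption)
BYTES_PER_TRANSFER = 64 // 8   # "Data Width: 64 bits"
CHANNELS = 2                   # one populated DIMM each on CHANNEL A and CHANNEL B

peak_gbps = TRANSFERS_PER_SEC * BYTES_PER_TRANSFER * CHANNELS / 1e9

# Measured figures copied from the first GPU run and its host baseline
measured = {
    "GPU kernel read": 29.747,
    "GPU kernel write": 23.172,
    "host memcpy()": 8.81,
    "host CPU read": 6.28,
}

print(f"theoretical peak: {peak_gbps:.1f} GB/s")
for name, gbps in measured.items():
    print(f"{name}: {gbps:.2f} GB/s ({100 * gbps / peak_gbps:.0f}% of peak)")
```

Under these assumptions the peak comes out at 38.4 GB/s, so the GPU kernel read gets fairly close to it, while the single-worker host figures ("CPU workers: 1") stay well below.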

dipak · ‎01-20-2015

Hi Evren,

I got similar results after running the sample with the Omega driver on Kaveri under Red Hat 7 (64-bit). I've reported the issue to the dev team and they are working on it. I'll get back to you as soon as I have an update. Thanks for pointing out the issue.

Regards,

dipak · ‎02-24-2015

Hi Evren,

My apologies for this delayed reply.

We ran a few experiments at our end. The BufferBandwidth sample was actually intended for measuring the memory bandwidth during the map/unmap operation, not for benchmarking read/write bandwidth from kernels.

Information about read/write bandwidth from kernels is available in the GlobalMemoryBandwidth benchmark sample, which is written specifically to showcase it. That sample reports global memory access bandwidth under various access patterns, such as coalesced/uncoalesced, strided, and random.

Per your feedback, we will be modifying the BufferBandwidth sample to show only relevant information about map/unmap memory bandwidth.

Thanks,
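
The access scenarios named above differ only in which element each work item touches. A small Python sketch of the index patterns, purely illustrative: it is not taken from the GlobalMemoryBandwidth sample, and the function names and the stride value are assumptions.

```python
import random

def coalesced(n_items):
    """Work item i reads element i, so neighbours touch adjacent addresses."""
    return list(range(n_items))

def strided(n_items, stride):
    """Work item i reads element i * stride, wrapping within the buffer."""
    return [(i * stride) % n_items for i in range(n_items)]

def randomized(n_items, seed=0):
    """Work item i reads a shuffled position; neighbours are unrelated."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices

def max_neighbour_gap(indices):
    """Largest distance between elements read by adjacent work items."""
    return max(abs(b - a) for a, b in zip(indices, indices[1:]))

n = 16
for name, idx in [("coalesced", coalesced(n)),
                  ("strided", strided(n, 4)),
                  ("random", randomized(n))]:
    print(name, idx, "max gap:", max_neighbour_gap(idx))
```

In the coalesced pattern adjacent work items always read adjacent elements, which lets the hardware combine their requests. The strided and random patterns break that adjacency, which is why a benchmark reports them separately.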

BufferBandwidth results on Kaveri