BufferBandwidth results on Kaveri

Question asked by yurtesen on Jan 5, 2015
Latest reply on Feb 24, 2015 by dipak

Hello,

I was wondering why CPU reads and writes are so slow in the BufferBandwidth example, regardless of whether the memory is allocated in host memory or not. Also, why are GPU writes slow when the kernel is writing to host memory?

Device  0        Spectre
Build:           release
GPU work items:  8192
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY
outputBuffer:    CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution 256.22  ns
Page fault       942.38  ns
CPU read         6.28 GB/s
memcpy()         8.81 GB/s
memset(,1,)      6.87 GB/s
memset(,0,)      6.87 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 2331.320
---------------------------------------|---------------
memset() (GBPS)                    | 6.717
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)   | 10.404

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)    | 29.747

Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)    | 23.172

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)  | 10.927
---------------------------------------|---------------
CPU read (GBPS)                    | 6.228
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)   | 645.145

Device  0        Spectre
Build:           release
GPU work items:  8192
Buffer size:     33554432
CPU workers:     1
Timing loops:    20
Repeats:         1
Kernel loops:    20
inputBuffer:     CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR
outputBuffer:    CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

Host baseline (naive):

Timer resolution 256.48  ns
Page fault       974.34  ns
CPU read         6.15 GB/s
memcpy()         8.82 GB/s
memset(,1,)      6.73 GB/s
memset(,0,)      6.72 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 2880.703
---------------------------------------|---------------
memset() (GBPS)                    | 9.079
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)   | 917.657

2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)    | 28.579

Verification Passed!

3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS)    | 8.098

4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS)  | 3166.840
---------------------------------------|---------------
CPU read (GBPS)                    | 6.195
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS)   | 794.376

Thanks,

Evren
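
As a rough aid for reading the map/unmap figures in the two logs, the reported GB/s values can be converted back into elapsed time per call. This is only a sketch: it assumes the sample reports bandwidth as the buffer size divided by the elapsed time, with 1 GB taken as 1e9 bytes, which the post does not confirm. The helper name `elapsed_us` and the dictionary labels are made up for illustration.

```python
BUFFER_BYTES = 33554432  # "Buffer size" reported in both logs

def elapsed_us(gbps, nbytes=BUFFER_BYTES):
    """Elapsed microseconds implied by a GB/s figure (assumes 1 GB = 1e9 bytes)."""
    return nbytes / (gbps * 1e9) * 1e6

# Figures copied from the two logs; the labels exist only for this sketch.
reported_gbps = {
    "default: map WRITE": 2331.320,
    "default: memset()": 6.717,
    "default: unmap after write": 10.404,
    "default: map READ": 10.927,
    "ALLOC_HOST_PTR: map WRITE": 2880.703,
    "ALLOC_HOST_PTR: memset()": 9.079,
    "ALLOC_HOST_PTR: unmap after write": 917.657,
    "ALLOC_HOST_PTR: map READ": 3166.840,
}

for label, gbps in reported_gbps.items():
    print(f"{label:36s} {gbps:10.3f} GB/s  ~{elapsed_us(gbps):10.1f} us")
```

Under that assumption, the map calls reported at thousands of GB/s correspond to roughly ten to fifteen microseconds for the whole buffer, which is far too short for a 32 MiB copy at the memcpy() baseline rate. The map and unmap entries near 10 GB/s in the first run correspond to about three milliseconds, which would be consistent with a data transfer happening at that point.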
