I've spent some time testing the same memory bandwidth program with the same GPU (HD 5870) on a different board and processor (Intel i7). Unfortunately, I get the same bandwidth results.
Any news? Maybe the high bandwidth of the discrete card is due to some caching inside the device or in the command queue. In this case, how can I avoid it?
The access pattern (neighboring work items reading from overlapping memory regions) indicates that most of the reads are probably hitting the data cache. That could explain the high bandwidth: the test is measuring cache throughput rather than PCIe bandwidth.
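To put a rough number on the caching effect, here is a minimal sketch in C. It assumes a hypothetical access pattern (the actual kernel is not shown in this thread) in which work item i reads `window` consecutive elements starting at index i. The function names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical model, not the poster's kernel: work item i reads the
 * elements [i, i + window). Neighboring work items therefore overlap
 * in all but one element. */

/* Total number of element reads issued by all work items. */
size_t requested_elements(size_t n_items, size_t window)
{
    return n_items * window;
}

/* Number of distinct elements that actually have to come from memory. */
size_t unique_elements(size_t n_items, size_t window)
{
    if (n_items == 0 || window == 0)
        return 0;
    return n_items + window - 1;
}

/* Ratio of issued reads to distinct elements. A value well above 1
 * means most reads can be served from a cache, so bandwidth computed
 * from the issued reads overstates the traffic to memory. */
double reuse_factor(size_t n_items, size_t window)
{
    size_t unique = unique_elements(n_items, window);
    if (unique == 0)
        return 0.0;
    return (double)requested_elements(n_items, window) / (double)unique;
}
```

Under this assumed pattern, 1024 work items with a window of 16 issue 16384 reads but touch only 1039 distinct elements, so a bandwidth figure based on the issued reads could be inflated by more than an order of magnitude.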
Have you looked at the BufferBandwidth sample in the SDK? If you run that sample program with the input buffer set to ALLOC_HOST_PTR | READ_ONLY, the bandwidth of "GPU kernel read" is probably similar to what you are trying to achieve with your test program.
Hi. The problem is that I get very similar results when accessing host or device memory "randomly". Unfortunately, "randomly" here means using a static offset, so a smart compiler could optimize prefetching in this case as well. Is there any trick to measure the real bandwidth of transferring data from the host to the HD 5870?
To measure the PCIe bandwidth, you can simply time clEnqueueReadBuffer and clEnqueueWriteBuffer.
Try running the BufferBandwidth sample with the -pcie flag and it will show the PCIe bandwidth for each direction. If you refer to the source code, you'll see that it's actually timing the read/write buffer calls.
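For anyone timing the transfers by hand, a minimal sketch of the arithmetic in C follows. The helper name is hypothetical and is not taken from the SDK sample. It assumes the two timestamps are in nanoseconds, which is the unit clGetEventProfilingInfo reports for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END when the command queue was created with CL_QUEUE_PROFILING_ENABLE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts a transfer size in bytes and two
 * timestamps in nanoseconds into GB/s (1 GB = 1e9 bytes).
 * The timestamps would typically be read from the event returned by
 * clEnqueueReadBuffer or clEnqueueWriteBuffer. */
double transfer_gb_per_s(size_t bytes, uint64_t start_ns, uint64_t end_ns)
{
    /* A non-positive duration means the timestamps are unusable. */
    assert(end_ns > start_ns);

    /* One byte per nanosecond equals 1e9 bytes per second, i.e. 1 GB/s,
     * so no further unit conversion is needed. */
    return (double)bytes / (double)(end_ns - start_ns);
}
```

For example, 8,000,000 bytes transferred in 1,000,000 ns (1 ms) works out to 8 GB/s. Averaging over several repetitions is probably a good idea, since a single transfer can be dominated by one-time setup costs.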
If you use clEnqueueReadBuffer and clEnqueueWriteBuffer, you will pay the price for pinning on each transfer unless you use the prepinned path as documented in the APP SDK documentation. This is also demonstrated in the BufferBandwidth sample code.
Hey, thank you for the answers!
I tested the BufferBandwidth sample.
With the arguments below I obtain 28 GB/s for the HD 5870 and 17 GB/s for the integrated GPU. The bandwidth of the discrete card is lower than that of the integrated GPU only when very few GPU wavefronts (nw) are used.
C:\Users\gabriele\Downloads\BufferBandwidth\BufferBandwidth\samples\opencl\bin\x86> .\BufferBandwidth.exe -d 0 -if 5 -of 5 -nwk 1 -nr 5 -nl 5 -nw 8192
Probably I'm wrong in thinking that it should be straightforward to verify the higher bandwidth of an integrated GPU...