Strange memory measurements: GPU timer vs CPU timer

I get wildly different results when I measure bandwidth and completion time with a CPU timer versus a GPU timer.



I'm running some memory tests on a PC (CPU + discrete GPU) and on an APU.

In particular, my test consists of writing Y bytes X times to find the completion time and the average bandwidth. I do this for every possible allocation strategy for the source and destination buffers.

I tried to compare the results from CPU timers (on Windows, QueryPerformanceCounter) with those from GPU timers, so far only on the CPU + discrete GPU machine.

The difference between those two measurements is so large that I'm sure I made a mistake somewhere.

Here is an example:

Testing transfer of 1024 bytes 16 times...

Unpinned -> Unpinned

CPU timer: 6259.02 Mbytes/s (total time: 0.02 ms)

GPU timer: 6368.19 Mbytes/s (total time: 0.02 ms)

Unpinned -> Device

CPU timer: 4.10 Mbytes/s (total time: 4.65 ms)

GPU timer: 13611.35 Mbytes/s (total time: 0.03 ms)

Pinned   -> Unpinned

CPU timer: 3.94 Mbytes/s (total time: 4.33 ms)

GPU timer: 7492.48 Mbytes/s (total time: 0.03 ms)

Pinned   -> Pinned

CPU timer: 3.73 Mbytes/s (total time: 5.23 ms)

GPU timer: 9359.60 Mbytes/s (total time: 0.04 ms)

Pinned   -> Device

CPU timer: 3.30 Mbytes/s (total time: 5.64 ms)

GPU timer: 12743.39 Mbytes/s (total time: 0.03 ms)

Device   -> Unpinned

CPU timer: 4.70 Mbytes/s (total time: 3.69 ms)

GPU timer: 11231.09 Mbytes/s (total time: 0.03 ms)

Device   -> Pinned

CPU timer: 4.37 Mbytes/s (total time: 3.79 ms)

GPU timer: 8819.22 Mbytes/s (total time: 0.04 ms)

Device   -> Device

CPU timer: 7.78 Mbytes/s (total time: 2.15 ms)

GPU timer: 8876.45 Mbytes/s (total time: 0.04 ms)



For 16 transfers of 16 Mbytes each, I get:

Pinned   -> Device

CPU timer: 3317.04 Mbytes/s (total time: 81.06 ms)

GPU timer: 3837185.08 Mbytes/s (total time: 0.10 ms)

i.e. more than 3 Tbytes/s of bandwidth, which is practically impossible, especially for a host->device transfer, which should be limited by the PCIe bandwidth.



I really need help finding the mistake, or an explanation of why I get such different results.

Below is the code where I compute the completion time for a transfer between a pinned source buffer and a destination buffer allocated on the device. The other cases are very similar.

Some notes on the code:

1) DATATYPE is a macro currently set to "int"

2) The Timer struct comes from a utility library. I include its code at the end of the post, in case the mistake is in there

Thank you very much!


// profile with CPU timer
if(!gpu_timer) {
    timer.start();
    src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, NULL, NULL);
    for(int i = 0; i < NUM_TRANSF; i++)
        clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, NULL);
    clFinish(queue);
    time = timer.get();
}
// profile with GPU timer (event profiling)
else {
    src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, &transfer_event, NULL);
    clWaitForEvents(1, &transfer_event);
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &start, 0);
    for(int i = 0; i < NUM_TRANSF; i++)
        clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, &transfer_event);
    clFinish(queue);
    clWaitForEvents(1, &transfer_event);
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, 0);
    time = (double)1.0e-9 * (end - start);
}
double bandwidth = ((double)(NUM_TRANSF * size * sizeof(DATATYPE)) / (double)time) * 1000.0 / 1000000.0;
result.total_time = time + alloc_time;
result.bandwidth = bandwidth;

// Code of struct Timer
typedef struct Timer {
    LARGE_INTEGER frequency;
    LARGE_INTEGER start_time;

    void start() {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start_time);
    }

    double get() {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        double elapsedTime = (end.QuadPart - start_time.QuadPart) * 1000.0 / frequency.QuadPart;
        return elapsedTime;
    }
} Timer;

