I get wildly different results when I measure bandwidth and completion time with a CPU timer vs. a GPU timer
Hi,
I'm running some memory tests on a PC (CPU + discrete GPU) and on an APU.
In particular, my test writes Y bytes X times to measure the completion time and the average bandwidth. I do this for every combination of allocation strategies for the source and destination buffers.
I tried to compare the results obtained with CPU timers (on Windows, QueryPerformanceCounter) against those obtained with GPU timers (OpenCL event profiling), so far only on the CPU + discrete GPU machine.
The difference between those two measurements is so huge that I'm sure I made a mistake somewhere.
Here is an example:
Testing transfer of 1024 bytes 16 times...
Unpinned -> Unpinned
CPU timer: 6259.02 Mbytes/s (total time: 0.02 ms)
GPU timer: 6368.19 Mbytes/s (total time: 0.02 ms)
Unpinned -> Device
CPU timer: 4.10 Mbytes/s (total time: 4.65 ms)
GPU timer: 13611.35 Mbytes/s (total time: 0.03 ms)
Pinned -> Unpinned
CPU timer: 3.94 Mbytes/s (total time: 4.33 ms)
GPU timer: 7492.48 Mbytes/s (total time: 0.03 ms)
Pinned -> Pinned
CPU timer: 3.73 Mbytes/s (total time: 5.23 ms)
GPU timer: 9359.60 Mbytes/s (total time: 0.04 ms)
Pinned -> Device
CPU timer: 3.30 Mbytes/s (total time: 5.64 ms)
GPU timer: 12743.39 Mbytes/s (total time: 0.03 ms)
Device -> Unpinned
CPU timer: 4.70 Mbytes/s (total time: 3.69 ms)
GPU timer: 11231.09 Mbytes/s (total time: 0.03 ms)
Device -> Pinned
CPU timer: 4.37 Mbytes/s (total time: 3.79 ms)
GPU timer: 8819.22 Mbytes/s (total time: 0.04 ms)
Device -> Device
CPU timer: 7.78 Mbytes/s (total time: 2.15 ms)
GPU timer: 8876.45 Mbytes/s (total time: 0.04 ms)
For 16 transfers of 16 Mbytes each, I get:
Pinned -> Device
CPU timer: 3317.04 Mbytes/s (total time: 81.06 ms)
GPU timer: 3837185.08 Mbytes/s (total time: 0.10 ms)
i.e. about 3.8 Tbytes/s of bandwidth, which is practically impossible, especially for a host->device transfer, which should be limited by the PCIe bandwidth.
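To make the sanity check concrete, here is a minimal sketch that back-computes the elapsed time implied by each reported bandwidth. The helper name `implied_time_ms` is hypothetical (it is not part of my test program), and it assumes that "16 Mbytes" means 16 * 1024 * 1024 bytes per transfer and that 1 Mbyte/s means 1e6 bytes/s, as in my bandwidth formula.

```cpp
#include <cassert>

// Hypothetical helper: elapsed time in milliseconds implied by a byte count
// and a bandwidth expressed in Mbytes/s (1 Mbyte = 1e6 bytes).
double implied_time_ms(double bytes, double mbytes_per_s) {
    assert(mbytes_per_s > 0.0);                    // avoid division by zero
    return bytes / (mbytes_per_s * 1.0e6) * 1.0e3; // seconds -> milliseconds
}

// Assumption: 16 transfers of 16 * 1024 * 1024 bytes each.
const double kTotalBytes = 16.0 * 16.0 * 1024.0 * 1024.0;
```

Calling `implied_time_ms(kTotalBytes, 3317.04)` gives roughly 81 ms, while `implied_time_ms(kTotalBytes, 3837185.08)` gives roughly 0.07 ms, so the two timers disagree by about three orders of magnitude on the same transfers.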
I really need help finding the mistake, or an explanation of why I get such different results.
Below is the piece of code where I compute the completion time for a transfer from a pinned source buffer to a destination buffer allocated on the device. The other cases are very similar.
Some notes on the code:
1) DATATYPE is a macro currently set to "int"
2) The Timer struct lives in a utility library. I include its code at the end of the post, just in case the mistake is in there
Thank you very much!
// profile with cpu timer
if(!gpu_timer) {
    timer.start();
    src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, NULL, NULL);
    for(int i = 0; i < NUM_TRANSF; i++)
        clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, NULL);
    clFinish(queue);
    time = timer.get();
}
// profile with gpu timer
else {
    src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, &transfer_event, NULL);
    clWaitForEvents(1, &transfer_event);
    // start = time at which the map command was queued (profiling timestamps are in nanoseconds)
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &start, 0);
    for(int i = 0; i < NUM_TRANSF; i++)
        clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, &transfer_event);
    clFinish(queue);
    clWaitForEvents(1, &transfer_event);
    // end = time at which the last write command finished
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, 0);
    time = (double)1.0e-9 * (end - start);
}
double bandwidth = ((double)(NUM_TRANSF * size * sizeof(DATATYPE)) / (double)time) * 1000.0 / 1000000.0;
result.total_time = time + alloc_time;
result.bandwidth = bandwidth;

// Code of struct Timer
typedef struct Timer {
    LARGE_INTEGER frequency;
    LARGE_INTEGER start_time;

    void start() {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start_time);
    }

    // elapsed time since start(), in milliseconds
    double get() {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        double elapsedTime = (end.QuadPart - start_time.QuadPart) * 1000.0 / frequency.QuadPart;
        return elapsedTime;
    }
} Timer;
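
One thing that may be worth checking in the code above is whether both branches hand `time` to the bandwidth formula in the same unit. OpenCL profiling timestamps are expressed in nanoseconds, while Timer::get() returns milliseconds. The following is only a sketch of a unit-consistent conversion, not a confirmed fix: the helper names `profiling_ns_to_ms` and `bandwidth_mbytes_per_s` are hypothetical, and `std::uint64_t` stands in for `cl_ulong` so that the snippet compiles without the OpenCL headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert a pair of profiling timestamps (nanoseconds)
// into an interval in milliseconds, the same unit Timer::get() returns.
double profiling_ns_to_ms(std::uint64_t start_ns, std::uint64_t end_ns) {
    assert(end_ns >= start_ns);  // unsigned subtraction would wrap otherwise
    return static_cast<double>(end_ns - start_ns) * 1.0e-6;
}

// Hypothetical helper: same formula as in the post, which expects the
// elapsed time in milliseconds and yields Mbytes/s (1 Mbyte = 1e6 bytes).
double bandwidth_mbytes_per_s(double bytes, double time_ms) {
    assert(time_ms > 0.0);
    return bytes / time_ms * 1000.0 / 1000000.0;
}
```

As an illustration with a made-up interval of 70,000,000 ns, `profiling_ns_to_ms` returns about 70 ms, and 16 transfers of 16 * 1024 * 1024 bytes then come out near 3835 Mbytes/s. Scaling the same interval by 1.0e-9 would instead feed 0.07 into the formula and report a bandwidth a thousand times larger. Whether this accounts for the whole discrepancy is an assumption that still needs to be verified on the actual hardware.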