
cadorino
Journeyman III

What happens when the input size is bigger than 32MB?

Hi everybody.
I'm performing some benchmarks on discrete and integrated GPUs, measuring completion time, energy consumption and collecting GPU counters.

For nearly all the algorithms that I'm executing, ranging from Saxpy through Reduction to Convolution, I'm obtaining results that are difficult to interpret when the input data gets bigger than 32MB.

For example, in Convolution the completion time grows roughly linearly with the matrix size (9ms, 16ms, 30ms, ...) for matrix sizes smaller than 8M elements (32MB). For 64MB the completion time jumps to 130ms, i.e. 4 times the completion time for 32MB.
In Saxpy I found the same situation, with the completion time going 25ms, 50ms, 100ms, and then jumping to 300ms for 64MB data, which is 3x the completion time for half the input size.

A huge increase in completion time for input data bigger than 32MB seems to affect all my benchmarks.

Moreover, looking at GPU counters such as GPUBusy, it seems that for such input sizes the GPU resources are underused.

Is the increase in completion time due to memory pinning cost?

Can you help me explain the decrease in most of the GPU counter values?

I show you two tables: the first is for Saxpy, executed on the discrete GPU with no-flag buffers (device allocation), and the second is for Convolution (3x3 filter, single precision), executed on the A8 integrated GPU with the ALLOC_HOST | READ_ONLY flags (host-visible pre-pinned allocation).

These are only two examples, but the jump in completion time and the drop in GPU counters actually show up across all the buffer allocation strategies and devices used.

Saxpy (vector size expressed in bytes): http://www.gabrielecocco.it/Workbook2.htm

Convolution (matrix size expressed in total elements): http://www.gabrielecocco.it/Workbook3.htm

Thank you very much for your help!

0 Likes
8 Replies
yurtesen
Miniboss

Did you try 128MB and beyond to see how that scales? You say "input size bigger than 32MB" but you have only one sample above 32MB.

I would confirm the values by also checking the wall time, rather than relying on OpenCL event times. In the past I have seen map/unmap actually taking longer than what their event counters show. A clFinish before and after the kernel enqueue, measuring the elapsed time in between, should do the trick.
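Something along these lines (a minimal sketch, assuming a Windows host and that queue, kernel and global_size are already set up; the function name is illustrative):

#include <CL/cl.h>
#include <windows.h>

/* Wall-clock timing of a single kernel launch, bracketed by clFinish so the
   measurement is not fooled by asynchronous enqueues or optimistic event times. */
double run_kernel_wall_ms(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    clFinish(queue);                     /* drain anything already queued */
    QueryPerformanceCounter(&start);

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clFinish(queue);                     /* wait until the kernel has really completed */

    QueryPerformanceCounter(&stop);
    return 1000.0 * (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}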

If I understand correctly, CL_MEM_ALLOC_HOST_PTR should not cause data to be transferred, so there shouldn't be extra pinning overhead (only at allocation time). But then, it probably shouldn't appear in your kernel timings...

Did you also try running different workgroup sizes?

The documentation says:

4.5.3.2 Using Both CPU and GPU Devices, or using an APU Device

When creating memory objects, create them with CL_MEM_USE_PERSISTENT_MEM_AMD. This enables the zero copy feature, as explained in Section 4.5.3.1, "Using the CPU."

Did you try using CL_MEM_USE_PERSISTENT_MEM_AMD? Documentation also says:

The CL_MEM_USE_PERSISTENT_MEM_AMD buffer is
– a zero copy buffer that resides on the GPU device.
– directly accessible by the GPU device at GPU memory bandwidth.
– directly accessible by the host across the interconnect (typically with high streamed write bandwidth, but low read and potentially low write scatter bandwidth, due to the uncached WC path).
– copyable to, and from, the device at peak interconnect bandwidth using CL read, write, and copy commands.

There is a limit on the maximum size per buffer, as well as on the total size of all buffers. This is platform-dependent, limited in size for each buffer, and also for the total size of all buffers of that type (a good working assumption is 64 MB for the per-buffer limit, and 128 MB for the total).
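For reference, creating such a buffer looks roughly like this (a minimal sketch; CL_MEM_USE_PERSISTENT_MEM_AMD is an AMD-specific flag declared in the APP SDK's CL/cl_ext.h, and context and size_in_bytes are assumed to already exist):

#include <CL/cl.h>
#include <CL/cl_ext.h>   /* AMD-specific flags such as CL_MEM_USE_PERSISTENT_MEM_AMD */

cl_int err;
/* Device-resident, host-visible (zero copy) buffer as described in the guide */
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_USE_PERSISTENT_MEM_AMD | CL_MEM_READ_WRITE,
                            size_in_bytes, NULL, &err);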

0 Likes

You are right, I should add another sample bigger than 32MB, even though with 64MB per vector (2 vectors = 128MB) I'm really close to the allocation limits.

I didn't use GPU timers (sorry, I forgot to mention it). To measure time I use CPU timers (QueryPerformanceCounter, Win32) and I include everything: allocation, initialization, execution, and reading back the result.

Since I'm including allocation time, I thought the increase in completion time was due to extra pinning cost.

Yes, I use PERSISTENT_MEM_AMD, but the results are the same regardless of the allocation strategy (i.e. regardless of the strategy I get a huge increase in completion time when the input size is > 32MB).

Another reason why I thought this penalty was due to allocation is that it doesn't happen when I run the computation heterogeneously on CPU and GPU, or on two GPUs, splitting the data. If I do that for 64MB vectors, I start one thread per device, and each thread allocates and initializes 32MB vectors. Keeping each single buffer allocation at or below 32MB seems to avoid this huge increase in completion time.

Anyway, I find it quite incredible that pinning (or whatever happens) causes such a huge performance penalty.

0 Likes
nyanthiss
Adept I

In the AMD APP guide, page 4-17, Table 4.2 (OpenCL Memory Object Properties), the first row says:

clCreateBuffer(no_flags) + clEnqueueMapBuffer =>

Mapped data size:
• <=32MiB: pinned host memory
• >32MiB: host memory (a different memory area can be used on each map)

I believe it means that if you try to create a buffer with >32M of memory, it's transferred like this: the CPU copies the data to "staging buffers" (driver-preallocated pinned memory), then the GPU DMA engine copies them over; this happens in (probably) 32M chunks.

This means that for >32M your data is copied twice (cpu -> pinned, pinned -> gpu). A pinned buffer, on the other hand, is copied just once (pinned -> gpu).

Out of curiosity, which flags do you pass to clCreateBuffer, and do you use MapBuffer or Read/WriteBuffer?

0 Likes

My benchmarks are highly parametrized. They run with different input sizes, buffer flags, and numbers of devices. In fact, one of my goals is to find out the cases for which one buffer flag is better than another, or for which one device is more performant than another, given a particular algorithm.

This is to say that the increase in completion time for buffers > 32MB is something that seems to be constantly present regardless of the buffer allocation strategy.

Anyway, the data attached to the first message refer to the execution of Saxpy using CL_MEM_ALLOC_HOST_PTR.

Since the vector x is written by the host and read by the kernel, while y is read and written by both the host and the kernel, I created
x using CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY and
y using CL_MEM_ALLOC_HOST_PTR.

After allocating the buffers, the host does a clEnqueueMapBuffer(x) and initializes it (with direct writes, no memcpy or memset). The same for y.
The buffers are set as kernel args and the host enqueues an NDRange. Finally, the host re-maps y and reads it.
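In code, the sequence is roughly the following (a sketch only; error checks omitted, and context, queue, kernel and the vector length n are assumed to already exist):

cl_mem x = clCreateBuffer(context,
                          CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY,
                          n * sizeof(float), NULL, NULL);
cl_mem y = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                          n * sizeof(float), NULL, NULL);

/* Map x and initialize it with direct writes (same pattern for y) */
float *px = (float *)clEnqueueMapBuffer(queue, x, CL_TRUE, CL_MAP_WRITE, 0,
                                        n * sizeof(float), 0, NULL, NULL, NULL);
for (size_t i = 0; i < n; ++i) px[i] = (float)i;
clEnqueueUnmapMemObject(queue, x, px, 0, NULL, NULL);

/* Set the buffers as kernel args (plus the scalar a, etc.) and enqueue the NDRange */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &x);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &y);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

/* Re-map y on the host to read the result back */
float *py = (float *)clEnqueueMapBuffer(queue, y, CL_TRUE, CL_MAP_READ, 0,
                                        n * sizeof(float), 0, NULL, NULL, NULL);
/* ... consume py ... */
clEnqueueUnmapMemObject(queue, y, py, 0, NULL, NULL);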

So, with regard to the AMD APP guide, page 4-17, Table 4.2, this is not the case: since I use CL_MEM_ALLOC_HOST_PTR, the buffers are pre-pinned. So I wonder whether pre-pinning is responsible for such a huge increase in completion time (I will try to isolate the completion time of allocation/initialization).

Any other suggestions?

0 Likes

cadorino wrote:

So, with regard to the AMD APP guide, page 4-17, Table 4.2, this is not the case: since I use CL_MEM_ALLOC_HOST_PTR, the buffers are pre-pinned. So I wonder whether pre-pinning is responsible for such a huge increase in completion time (I will try to isolate the completion time of allocation/initialization).

Any other suggestions?

Are you using Linux or Windows? I ask because pinning does not work that well on Linux.

Also, using CL_MEM_ALLOC_HOST_PTR should keep the data on the host, which you probably don't want when you run on a discrete device; you should copy the data over with clEnqueueWriteBuffer. As far as I understand, the best approach for a discrete device is to create a pinned host buffer using CL_MEM_ALLOC_HOST_PTR and a second buffer that will reside on the device (without CL_MEM_ALLOC_HOST_PTR), copy the data across using clEnqueueWriteBuffer, and use the device buffer with the kernel.

This way, you can also put a clFinish before running the kernel and rule memory transfers out of the kernel timing.
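As a rough sketch of what I mean (illustrative only; no error checks, and context, queue, kernel and bytes are assumed to already exist):

/* Pre-pinned host-side staging buffer plus a separate device-resident buffer */
cl_mem pinned     = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, NULL);
cl_mem device_buf = clCreateBuffer(context, CL_MEM_READ_WRITE,     bytes, NULL, NULL);

/* Fill the pinned buffer through a mapping ... */
void *p = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                             0, NULL, NULL, NULL);
/* ... initialize the data pointed to by p here ... */

/* ... copy it to the device buffer, then unmap and hand the device buffer to the kernel */
clEnqueueWriteBuffer(queue, device_buf, CL_TRUE, 0, bytes, p, 0, NULL, NULL);
clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_buf);

clFinish(queue);   /* all transfers complete before the kernel is enqueued and timed */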

0 Likes

Windows 32 bit. I know it is often suggested to transfer the data to device memory, but my benchmarks (presented at AFDS 12, hopefully slides and recording available online soon) show that for such relatively small amounts of data, direct reads over PCIe are faster for all vector sizes smaller than or equal to 32MB. The best choice is CL_MEM_ALLOC_HOST_PTR.

Here is the data:

SAXPY - No flags

Vector size (bytes) | Completion time (ms) | Power (W) | Energy (J)
64K   |   2.81 |  85.00 |   239.19
128K  |   2.96 |  99.53 |   294.92
256K  |   3.57 | 100.28 |   358.51
512K  |   3.85 | 130.32 |   502.00
1M    |   5.42 | 131.90 |   715.50
2M    |   7.99 | 131.12 |  1047.29
4M    |  13.41 | 130.48 |  1749.35
8M    |  42.03 | 108.32 |  4552.59
16M   |  45.65 | 147.85 |  6750.14
32M   |  88.55 | 148.27 | 13129.63
64M   | 180.96 | 145.54 | 26338.21

SAXPY - CL_MEM_ALLOC_HOST_PTR (both x and y)

Vector size (bytes) | Completion time (ms) | Power (W) | Energy (J)
64K   |   1.53 |  96.76 |   148.36
128K  |   1.72 |  97.72 |   167.75
256K  |   2.12 |  77.70 |   164.48
512K  |   2.45 | 119.53 |   292.88
1M    |   3.69 | 150.65 |   555.26
2M    |   5.95 | 148.09 |   881.61
4M    |  10.12 | 151.37 |  1531.38
8M    |  18.45 | 150.37 |  2774.60
16M   |  35.08 | 148.73 |  5217.60
32M   |  68.04 | 150.72 | 10254.53
64M   | 201.80 | 149.92 | 30253.78

For 32MB it's 68ms vs 88ms.
There's another case of strange behavior at 64MB. If I use ALLOC_HOST_PTR the buffer is pre-pinned; if I use no flags the buffer is pinned when the data transfer happens. Nevertheless, even though in both cases the buffer has to be pinned, at 64MB the transfer to device memory becomes the convenient choice.

0 Likes

My point was to detect whether the issue has something to do with the kernel run or with the memory transfers, by running the kernel on different sizes and measuring only the kernel run time. If you only want to check memory access, you can just check how long it takes to map/unmap the memory objects (and I would run them at least twice to see that the results are the same). Running a kernel sounds redundant (unless zero copy is involved, but I would still check the times of the map/unmap operations to verify that they are actually able to do zero copy).

But in my opinion, reading through PCIe from host memory can never be faster as long as you access all the elements. The advantage of keeping data in host memory is when you access only a small part of the data array but don't know which part beforehand, so you don't need to transfer a pile of unnecessary data (or that is what I understand from this concept).

Manual says:

Zero copy host resident memory objects can boost performance when host memory is accessed by the device in a sparse manner or when a large host memory buffer is shared between multiple devices and the copies are too expensive. When choosing this, the cost of the transfer must be greater than the extra cost of the slower accesses.

Also, you mentioned that you are getting near the buffer limits... But there is no clear limit mentioned in the documentation for CL_MEM_ALLOC_HOST_PTR; there is some limit for CL_MEM_USE_PERSISTENT_MEM_AMD (since the limit is mentioned under that paragraph).

You should be able to allocate a much larger amount of memory using CL_MEM_ALLOC_HOST_PTR (and probably even more than the GPU memory, if that works on Windows; it does not work on Linux).
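If you want to know the generic per-buffer limit on your platform rather than guessing (note this is the standard OpenCL limit, which may differ from the AMD-specific limits for persistent buffers quoted above), you can query it; a small sketch, assuming device is an already-selected cl_device_id:

/* requires <CL/cl.h> and <stdio.h> */
cl_ulong max_alloc = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                sizeof(max_alloc), &max_alloc, NULL);
printf("Largest single allocation: %llu MB\n",
       (unsigned long long)(max_alloc >> 20));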

0 Likes

yurtesen wrote:

My point was to detect whether the issue has something to do with the kernel run or with the memory transfers, by running the kernel on different sizes and measuring only the kernel run time. If you only want to check memory access, you can just check how long it takes to map/unmap the memory objects (and I would run them at least twice to see that the results are the same). Running a kernel sounds redundant (unless zero copy is involved, but I would still check the times of the map/unmap operations to verify that they are actually able to do zero copy).

But in my opinion, reading through PCIe from host memory can never be faster as long as you access all the elements. The advantage of keeping data in host memory is when you access only a small part of the data array but don't know which part beforehand, so you don't need to transfer a pile of unnecessary data (or that is what I understand from this concept).

Here is (attached) an Excel file with in-depth timings, including allocation, initialization, execution and result retrieval (by the host) under different buffer placements.

As you can see, both the init and exec times hugely increase when passing from 32MB to 64MB, breaking the "2x rule" (if you change the y scale to logarithmic it is probably clearer).

I wonder whether you have ever faced a similar situation.

Thank you!

0 Likes