
cadorino
Journeyman III

"Strange" completion time of Saxpy and Reduce for 64MB vectors

Hi to everybody!

I'm working on some simple OpenCL benchmarks to test heterogeneous computing on an APU + GPU system. The tests are a vector addition (saxpy) and a reduce, executed many times while varying the amount of data (vector sizes). The results produced by the algorithms have been thoroughly validated, so I'm confident the OpenCL kernels are correct.

The completion time for a given data size is obtained by averaging 10000-100000 samples.
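
For reference, here is a simplified sketch of what one completion-time sample looks like (illustrative only, not the exact benchmark code; now_ms() stands in for whatever high-resolution timer is used):

#include <CL/cl.h>

double now_ms(void);   /* placeholder for a high-resolution timer */

/* saxpy kernel (OpenCL C): y[i] = a * x[i] + y[i] */
const char *saxpy_src =
    "__kernel void saxpy(__global const float *x, "
    "                    __global float *y, "
    "                    float a) { "
    "    size_t i = get_global_id(0); "
    "    y[i] = a * x[i] + y[i]; "
    "}";

/* One completion-time sample: write the inputs, run the kernel, read the
 * result back; the benchmark averages many such samples.
 * Kernel arguments are assumed to be set elsewhere. */
double sample_saxpy(cl_command_queue q, cl_kernel k, cl_mem bx, cl_mem by,
                    const float *hx, float *hy, size_t n)
{
    double t0 = now_ms();
    clEnqueueWriteBuffer(q, bx, CL_FALSE, 0, n * sizeof(float), hx, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, by, CL_FALSE, 0, n * sizeof(float), hy, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, n * sizeof(float), hy, 0, NULL, NULL);
    return now_ms() - t0;                /* elapsed time of this sample */
}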

While running the tests I noticed a "strange" behaviour of the completion time in both algorithms and on both the integrated and the discrete GPU.

For data sizes ranging from 64KB to 32MB (per vector; 2 vectors in saxpy, 1 in reduce) the completion time is approximately proportional to the data size. For example, the completion time of summing 32MB vectors is about twice the completion time of summing 16MB vectors.

Instead, when the algorithms are run with 64MB vectors the completion time is much higher: 4 or 5 times the completion time with 32MB vectors.

I didn't expect the completion time on the GPU to vary exactly linearly with the amount of data, but I can't understand why such a big jump happens when moving from 32MB to 64MB of data.
Any idea?

Thank you very much

P.S. Below I report the completion times of Saxpy and Reduce on both the discrete and the integrated GPUs, for 8-64MB vectors and with different buffer allocation strategies:

--------- Reduce ---------

- Testing OpenCL

  - Testing devices (Cypress - discrete GPU)

    - Testing memory modes (0, 0)

      - Testing with 4194304 bytes... 7.119792 ms (61 counters)
      - Testing with 8388608 bytes... 11.758621 ms (61 counters)
      - Testing with 16777216 bytes... 20.622047 ms (61 counters)
      - Testing with 33554432 bytes... 39.022222 ms (61 counters)
      - Testing with 67108864 bytes... 80.660377 ms (61 counters)

    - Testing memory modes (1, 1)

      - Testing with 4194304 bytes... 4.352941 ms (61 counters)
      - Testing with 8388608 bytes... 6.902174 ms (61 counters)
      - Testing with 16777216 bytes... 10.953642 ms (61 counters)
      - Testing with 33554432 bytes... 19.405405 ms (61 counters)
      - Testing with 67108864 bytes... 71.408163 ms (61 counters)

    - Testing memory modes (2, 2)

      - Testing with 4194304 bytes... 3.559242 ms (61 counters)
      - Testing with 8388608 bytes... 5.107692 ms (61 counters)
      - Testing with 16777216 bytes... 8.264151 ms (61 counters)
      - Testing with 33554432 bytes... 14.475410 ms (61 counters)
      - Testing with 67108864 bytes... 65.489796 ms (61 counters)

    - Testing memory modes (1, 2)

      - Testing with 4194304 bytes... 4.182692 ms (61 counters)
      - Testing with 8388608 bytes... 6.416667 ms (61 counters)
      - Testing with 16777216 bytes... 10.666667 ms (61 counters)
      - Testing with 33554432 bytes... 19.285714 ms (61 counters)
      - Testing with 67108864 bytes... 65.127660 ms (61 counters)

  - Testing devices (Beavercreek - integrated GPU)

    - Testing memory modes (0, 0)

      - Testing with 4194304 bytes... 8.260870 ms (61 counters)
      - Testing with 8388608 bytes... 11.795181 ms (61 counters)
      - Testing with 16777216 bytes... 20.511628 ms (61 counters)
      - Testing with 33554432 bytes... 37.989011 ms (61 counters)
      - Testing with 67108864 bytes... 78.018519 ms (61 counters)

    - Testing memory modes (1, 1)

      - Testing with 4194304 bytes... 3.918269 ms (61 counters)
      - Testing with 8388608 bytes... 5.228723 ms (61 counters)
      - Testing with 16777216 bytes... 7.319527 ms (61 counters)
      - Testing with 33554432 bytes... 11.503597 ms (61 counters)
      - Testing with 67108864 bytes... 49.800000 ms (61 counters)

    - Testing memory modes (2, 2)

      - Testing with 4194304 bytes... 3.154930 ms (61 counters)
      - Testing with 8388608 bytes... 4.273171 ms (61 counters)
      - Testing with 16777216 bytes... 6.450292 ms (61 counters)
      - Testing with 33554432 bytes... 10.588235 ms (61 counters)
      - Testing with 67108864 bytes... 58.763636 ms (61 counters)

    - Testing memory modes (1, 2)

      - Testing with 4194304 bytes... 3.732394 ms (61 counters)
      - Testing with 8388608 bytes... 4.875000 ms (61 counters)
      - Testing with 16777216 bytes... 7.081871 ms (61 counters)
      - Testing with 33554432 bytes... 11.239437 ms (61 counters)
      - Testing with 67108864 bytes... 48.228070 ms (61 counters)

-------- Saxpy ---------

  - Testing devices (Cypress - discrete GPU)

    - Testing memory modes (0, 0)

      - Testing with 4194304 bytes... 15.266254 ms (61 counters)
      - Testing with 8388608 bytes... 28.105556 ms (61 counters)
      - Testing with 16777216 bytes... 53.105263 ms (61 counters)
      - Testing with 33554432 bytes... 104.645833 ms (61 counters)
      - Testing with 67108864 bytes... 212.041667 ms (61 counters)

    - Testing memory modes (1, 1)

      - Testing with 4194304 bytes... 11.068396 ms (61 counters)
      - Testing with 8388608 bytes... 20.729730 ms (61 counters)
      - Testing with 16777216 bytes... 39.848739 ms (61 counters)
      - Testing with 33554432 bytes... 81.152542 ms (61 counters)
      - Testing with 67108864 bytes... 212.363636 ms (61 counters)

    - Testing memory modes (1, 2)

      - Testing with 4194304 bytes... 8.676174 ms (61 counters)
      - Testing with 8388608 bytes... 16.105802 ms (61 counters)
      - Testing with 16777216 bytes... 31.021898 ms (61 counters)
      - Testing with 33554432 bytes... 64.149254 ms (61 counters)
      - Testing with 67108864 bytes... 181.208333 ms (61 counters)

  - Testing devices (Beavercreek - integrated GPU)

    - Testing memory modes (0, 0)

      - Testing with 4194304 bytes... 16.267606 ms (61 counters)
      - Testing with 8388608 bytes... 27.192308 ms (61 counters)
      - Testing with 16777216 bytes... 51.655914 ms (61 counters)
      - Testing with 33554432 bytes... 100.480000 ms (61 counters)
      - Testing with 67108864 bytes... 206.520000 ms (61 counters)

    - Testing memory modes (1, 1)

      - Testing with 4194304 bytes... 9.558000 ms (61 counters)
      - Testing with 8388608 bytes... 17.188679 ms (61 counters)
      - Testing with 16777216 bytes... 32.297872 ms (61 counters)
      - Testing with 33554432 bytes... 62.369863 ms (61 counters)
      - Testing with 67108864 bytes... 185.500000 ms (61 counters)

    - Testing memory modes (1, 2)

      - Testing with 4194304 bytes... 6.277571 ms (61 counters)
      - Testing with 8388608 bytes... 11.224057 ms (61 counters)
      - Testing with 16777216 bytes... 21.339713 ms (61 counters)
      - Testing with 33554432 bytes... 41.730769 ms (61 counters)
      - Testing with 67108864 bytes... 148.032258 ms (61 counters)

0 Likes
5 Replies
settle
Challenger

I'll go out on a limb and guess this has to do with the cl_mem_flags you pass to clCreateBuffer.  For example, the AMD Accelerated Parallel Processing OpenCL Programming Guide, section 4.4.2 Placement, states that mapped data of size <= 32 MiB uses pinned host memory.  Saxpy is O(N) in computation and O(N) in memory accesses, so if you include the memory transfer times, slight changes in the memory paths will greatly affect your completion times.  BTW, I'm assuming your times include the memory transfers in addition to the kernel executions.
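
For instance, the "memory modes" probably boil down to something like the following choices of flags when the buffers are created (just a sketch of the idea; I'm only guessing at what your modes 0/1/2 actually map to):

#include <CL/cl.h>

/* Guess at how different "memory modes" might map to cl_mem_flags.
 * The mode numbers are hypothetical, not the benchmark's actual mapping. */
cl_mem create_input_buffer(cl_context ctx, size_t bytes, int mode,
                           void *host_ptr, cl_int *err)
{
    switch (mode) {
    case 0: /* plain device buffer; data moved with explicit write/read copies */
        return clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, err);
    case 1: /* runtime-allocated, pinned host memory (zero-copy candidate) */
        return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                              bytes, NULL, err);
    case 2: /* wrap an existing host allocation */
        return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                              bytes, host_ptr, err);
    default:
        if (err) *err = CL_INVALID_VALUE;
        return NULL;
    }
}

Which of these paths is taken (and whether the host side ends up pinned) changes the transfer cost far more than the kernel time for a memory-bound kernel like saxpy.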

0 Likes

Thank you for the answer!
You are right: I'm measuring completion time from the initialization of the input data, through the actual execution of the kernel, to reading back the result.

I'm aware of the pinning cost, but I didn't expect such a high impact on completion time, especially on the integrated GPU, where with the host_alloc or persistent_mem_amd flags the data should not be copied.

0 Likes

I'm not exactly sure about your last statement about host_alloc on the igpu not being copied.  The igpu and cpu share a unified memory space, but at the moment they are disjoint regions in host memory, so host memory accessible from the cpu must be explicitly or implicitly copied back and forth to the host memory accessible from the igpu.

One last thing to consider is that the allocations are lazy, so data is not transferred until it is first accessed.  That suggests splitting a large buffer into smaller buffers and smaller kernel runs to hide that latency, along the lines of the sketch below.
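
Rough sketch of what I mean, untested (the chunking scheme, buffer flags, and a saxpy-style kernel with arguments (x, y, a) are assumptions):

#include <CL/cl.h>

/* Process a large array in smaller chunks: one pair of small buffers, one
 * write, one kernel run and one read per chunk, so the runtime has a chance
 * to overlap the transfer of one chunk with the execution of another. */
void run_chunked(cl_context ctx, cl_command_queue q, cl_kernel k,
                 const float *host_x, float *host_y,
                 size_t total_elems, size_t chunk_elems)
{
    for (size_t off = 0; off < total_elems; off += chunk_elems) {
        size_t n = (total_elems - off < chunk_elems) ? (total_elems - off)
                                                     : chunk_elems;
        size_t bytes = n * sizeof(float);

        cl_mem x = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
        cl_mem y = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

        clEnqueueWriteBuffer(q, x, CL_FALSE, 0, bytes, host_x + off, 0, NULL, NULL);
        clEnqueueWriteBuffer(q, y, CL_FALSE, 0, bytes, host_y + off, 0, NULL, NULL);

        /* kernel arguments are captured at enqueue time, so re-setting them
         * per chunk is safe; the scalar argument is assumed set elsewhere */
        clSetKernelArg(k, 0, sizeof(cl_mem), &x);
        clSetKernelArg(k, 1, sizeof(cl_mem), &y);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

        clEnqueueReadBuffer(q, y, CL_FALSE, 0, bytes, host_y + off, 0, NULL, NULL);

        /* release is deferred by the runtime until the queued commands finish */
        clReleaseMemObject(x);
        clReleaseMemObject(y);
    }
    clFinish(q);   /* wait for all chunks to complete */
}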

0 Likes

Thank you for your suggestion!

Anyway, if the data always has to be copied, how can some buffers (with an appropriate choice of flags) be called ZERO COPY buffers?

0 Likes

I misspoke a bit in my description of the unified host memory, so I'll cite the OpenCL Programming Guide, 4.5.1.4 Device Memory:

On an APU, the system memory is shared between the GPU and the CPU; it is visible by either the CPU or the GPU at any given time. A significant benefit of this is that buffers can be zero copied between the devices by using map/unmap operations to logically move the buffer between the CPU and the GPU address space. See Section 4.5.4, “Mapping,” page 4-18, for more information on zero copy.

As you pointed out about zero copy buffers, the host memory is not copied; instead, ownership of that region of host memory is exchanged between the cpu and igpu using map/unmap (some other exchange of unallocated memory may or may not also occur to zero out the net exchange).  In code, the pattern looks roughly like the sketch below.
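
A sketch of the zero copy map/unmap pattern, assuming the buffer was created with a zero-copy-eligible flag such as CL_MEM_ALLOC_HOST_PTR:

#include <CL/cl.h>

/* The buffer contents are not copied: the CPU gains access through map,
 * works on the memory directly, then hands ownership back with unmap. */
cl_int fill_buffer_zero_copy(cl_command_queue q, cl_mem buf, size_t bytes)
{
    cl_int err;

    /* CPU side takes ownership of the buffer */
    float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, bytes, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return err;

    for (size_t i = 0; i < bytes / sizeof(float); ++i)
        p[i] = (float)i;                 /* CPU writes in place, no copy */

    /* hand ownership back to the GPU before enqueueing kernels that use buf */
    return clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
}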

0 Likes