OpenCL

___ · ‎12-07-2020

Hi. I tried to port this CUDA example (https://developer.nvidia.com/blog/introduction-cuda-dynamic-parallelism/) to OpenCL.
My configuration:
Windows 10 2004.
Nvidia GPU: GTX 1070 8GB Driver 457.51
AMD GPU: RX Vega 64 8GB Driver 20.11.3

Original repository: https://github.com/canonizer/mandelbrot-dyn
My fork: https://github.com/tupieurods/mandelbrot-dyn
I tested my code with Visual Studio 2019.
I used vcpkg to link libpng, just to verify that my image is correct. It is easy to remove from the project; tell me if I should remove the vcpkg and libpng dependency from my repository.

Required environment variables:

VCPKG_ROOT - root directory of vcpkg. Example: C:\src\vcpkg
AMDAPPSDKROOT - path to the AMD APP SDK. Example: C:\Program Files (x86)\AMD APP SDK\3.0\

CUDA performance:
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotCuda/main.cpp#L7-L20
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotCuda/mandelbrot.cu#L...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotCuda/mandelbrotDevic...
Nvidia CUDA. Mandelbrot set(host enqueue) computed in 0.120 s, at 2231.883 Mpix/s

https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotCuda/main.cpp#L22-L3...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotCuda/mandelbrot.cu#L...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotCuda/mandelbrotDevic...
Nvidia CUDA. Mandelbrot set(device enqueue) computed in 0.031 s, at 8668.464 Mpix/s

AMD OpenCL performance:
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/main.cpp#L10-...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/kernels/mande...
AMD OPENCL. Mandelbrot set(host enqueue) computed in 0.118 s, at 2272.444 Mpix/s

https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/main.cpp#L25-...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/kernels/mande...
AMD OPENCL. Mandelbrot set(device enqueue) RUN #0 computed in 2.822 s, at 95.122 Mpix/s
AMD OPENCL. Mandelbrot set(device enqueue) RUN #1 computed in 2.602 s, at 103.152 Mpix/s

As you can see, a straight port doesn't work well on AMD.
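To make the figures above easier to compare, here is a minimal sketch of how the Mpix/s values relate to the run times. The function names are hypothetical, and the 16384 x 16384 image size in the usage note is an assumption: it is not stated in this post, it is only roughly consistent with the reported numbers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Throughput in megapixels per second for a width x height image
// rendered in `seconds`.
double mpixPerSecond(std::int64_t width, std::int64_t height, double seconds) {
  assert(width > 0 && height > 0 && seconds > 0.0);
  return static_cast<double>(width * height) / seconds / 1.0e6;
}

// Ratio of two throughputs, e.g. CUDA vs. OpenCL device enqueue.
double speedup(double fastMpix, double slowMpix) {
  assert(slowMpix > 0.0);
  return fastMpix / slowMpix;
}
```

With the assumed image size, mpixPerSecond(16384, 16384, 0.120) gives about 2237 Mpix/s, close to the reported 2231.883 (the printed time is rounded). speedup(8668.464, 103.152) is about 84, which is the gap between CUDA device enqueue and the second run of the straight OpenCL port.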

I decided to store the parameters for subtasks in global memory and run them from the host in a hybrid way: launch kernels that enqueue only one kernel for each subtask.

https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/main.cpp#L46-...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/kernels/mande...
AMD OPENCL. Mandelbrot set(device enqueue with host) RUN #0 computed in 0.277 s, at 970.313 Mpix/s
AMD OPENCL. Mandelbrot set(device enqueue with host) RUN #1 computed in 0.037 s, at 7200.696 Mpix/s

It is faster, but for some reason only on the second pass. In other words, I execute the same sequence of kernels twice: https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...
and the second time it runs faster. Maybe the driver is doing some extra work on the first run? That sounds pretty bad: does it mean I have to warm up the GPU first to make it run fast?

I also tried a very strange idea: enqueue a kernel with work size (1, 1, 1) from the host and have it call the same kernel as in the first, naive implementation. It suffers from the same warm-up problem. But surprisingly it works faster on the second pass. Am I missing something? Or doing something wrong?
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/main.cpp#L67-...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...
https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/kernels/mande...
AMD OPENCL. Worksize (1, 1, 1) test. Mandelbrot set(device enqueue) RUN #0 computed in 0.355 s, at 756.069 Mpix/s
AMD OPENCL. Worksize (1, 1, 1) test. Mandelbrot set(device enqueue) RUN #1 computed in 0.053 s, at 5020.394 Mpix/s

I tried to profile the code with "Radeon Developer Panel v2.2.0.15", because CodeXL doesn't work well with new drivers. But when I try to capture a profile, I see this error in the logs:
[RGP] Failed to finish executing profile with code: 0
and, from a different source in the debug log:
[RGP] Counters not supported on current device (asic_device_id=[26751]. asic_family=[141])

dipak · ‎12-08-2020

"...surprisingly it works faster on the second pass. Am I missing something? Or doing something wrong?"

The AMD OpenCL runtime uses a deferred allocation policy, which might be the reason behind this observation.

As the AMD OpenCL Optimization Guide says about the "deferred allocation" policy:

"The CL runtime attempts to minimize resource consumption by delaying buffer allocation until first use. As a side effect, the first accesses to a buffer may be more expensive than subsequent accesses."

"The OpenCL runtime uses deferred allocation to maximize memory resources. This means that a complete roundtrip chain, including data transfer and kernel compute, might take one or two iterations to reach peak performance."
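One way to keep this first-use cost out of a benchmark is to time every run separately and report the cold run apart from the warm ones. The sketch below is only an illustration: `timeRuns` and `bestWarm` are hypothetical helper names, and the `work` callback stands in for whatever enqueues the kernels and waits for them to finish.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `work` `runs` times and returns the wall-clock time of each run in
// seconds. Index 0 is the cold run; later entries are warm runs.
std::vector<double> timeRuns(const std::function<void()>& work, int runs) {
  assert(runs > 0);
  std::vector<double> seconds;
  seconds.reserve(static_cast<std::size_t>(runs));
  for (int i = 0; i < runs; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();  // in real code: enqueue the kernels, then wait with clFinish
    const auto stop = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return seconds;
}

// Best time among the warm runs (everything after the first run).
double bestWarm(const std::vector<double>& seconds) {
  assert(seconds.size() >= 2);
  double best = seconds[1];
  for (std::size_t i = 2; i < seconds.size(); ++i) {
    if (seconds[i] < best) best = seconds[i];
  }
  return best;
}
```

Reporting `seconds[0]` and `bestWarm(seconds)` side by side shows how much of the total is one-time setup and how much is steady-state kernel time.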

Thanks.

___ · ‎12-08-2020

Thanks, I forgot about this. Deferred allocation explains why two runs are required for better performance in some scenarios. But why is this kernel so slow on AMD hardware: https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/kernels/mande... ?

AMD OPENCL. Mandelbrot set(device enqueue) RUN #0 computed in 2.822 s, at 95.122 Mpix/s
AMD OPENCL. Mandelbrot set(device enqueue) RUN #1 computed in 2.602 s, at 103.152 Mpix/s

Compared to the similar kernel on NVIDIA:

https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotCuda/mandelbrotDevic...
Nvidia CUDA. Mandelbrot set(device enqueue) computed in 0.031 s, at 8668.464 Mpix/s

dipak · ‎12-09-2020

Is the function below where the kernel execution time is measured?

mandelbrotDeviceEnqueueOpencl in https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...

If yes, it seems that the kernel is enqueued "maxDepth" times with a blocking clFinish call, and the total time is taken as "gpuTime". Am I correct?

If you are interested in the kernel execution time on the device, I think event-based profiling information might be more useful for this purpose.
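As a rough sketch of what that involves: the command queue has to be created with CL_QUEUE_PROFILING_ENABLE, and the start and end timestamps of the kernel's event are then read with clGetEventProfilingInfo using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. Both values are device timestamps in nanoseconds. The helper below is hypothetical and only shows the conversion step, so it does not need the OpenCL headers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Converts a pair of OpenCL event timestamps to a duration in seconds.
// In real host code the two values would be cl_ulong results of
// clGetEventProfilingInfo for CL_PROFILING_COMMAND_START and
// CL_PROFILING_COMMAND_END; here they are plain integers (an assumption
// made to keep the sketch free of OpenCL dependencies).
double eventSeconds(std::uint64_t startNs, std::uint64_t endNs) {
  assert(endNs >= startNs);
  return static_cast<double>(endNs - startNs) * 1.0e-9;
}
```

Unlike a host-side timer around clFinish, this interval excludes the time the command spent waiting in the queue before it started executing.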

Thanks.

___ · ‎12-10-2020

You are right. Sorry, my bad. I forgot to remove this loop after some experiments. Of course it is not needed, and we shouldn't call this kernel maxDepth times; once is enough. I fixed the code on GitHub.

I also added event-based timing, but it is not very relevant anymore: once the useless loop was removed, the host-side time measurement became accurate enough.

I tested with different initSubdiv values: https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...

Performance is still not very impressive:

initSubdiv	Run	Time	Time (event measure)
1	#0	0.293 s	0.283 s
1	#1	0.058 s	0.058 s
2	#0	0.369 s	0.358 s
2	#1	0.132 s	0.132 s
4	#0	0.507 s	0.497 s
4	#1	0.273 s	0.271 s
8	#0	0.887 s	0.872 s
8	#1	0.654 s	0.653 s
16	#0	1.073 s	1.064 s
16	#1	0.799 s	0.799 s
32	#0	0.807 s	0.797 s
32	#1	0.572 s	0.572 s
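As a side check on the warm-up cost, the difference between Run #0 and Run #1 can be computed for each initSubdiv value. The sketch below uses the host-side times from the table above; the `Row` struct and function name are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One table row: host-side times in seconds for the first and second run.
struct Row {
  int initSubdiv;
  double run0;
  double run1;
};

// Host-side timings copied from the table above.
const std::vector<Row> kRows = {
    {1, 0.293, 0.058},  {2, 0.369, 0.132},  {4, 0.507, 0.273},
    {8, 0.887, 0.654},  {16, 1.073, 0.799}, {32, 0.807, 0.572},
};

// Extra time the first run takes compared with the second.
double firstRunOverhead(const Row& r) { return r.run0 - r.run1; }
```

The differences come out at 0.235, 0.237, 0.234, 0.233, 0.274 and 0.235 s. The first-run penalty is roughly constant across initSubdiv values, which would be consistent with a one-time cost such as deferred allocation. The growth in the Run #1 times is a separate effect.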

The best performance is with initSubdiv equal to 1. But with that configuration it is very close to code that doesn't use a device queue at all. And this result is far from the version that uses more memory and copies data to the host (https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/mandelbrotOpe...).

Two possible reasons for that:

- The AMD OpenCL implementation doesn't like the device queue and branching.

- Or it doesn't like it when too many items are scheduled to the device queue.

dipak · ‎12-14-2020

Without seeing the profiling data, it would be difficult to point out any particular reason behind this observation. It seems like increasing the "initSubdiv" value causes some other bottleneck which might be affecting device enqueue and/or overall kernel performance.

From a quick look at the "mandelbrot" kernel, it seems like the "getBorderDwell" function calculates a work-group-wise decision value, so it could be called once per work-group. In the code, however, the function is called per work-item. Please check in the related code whether the per-work-item call can be avoided to improve the overall kernel execution time.

[P.S. I didn't check the full code-flow, so please correct me if I misunderstood any point]

Thanks.

___ · ‎12-22-2020

@dipak wrote:
From a quick look at the "mandelbrot" kernel, it seems like the "getBorderDwell" function calculates a work-group-wise decision value, so it could be called once per work-group. In the code, however, the function is called per work-item. Please check in the related code whether the per-work-item call can be avoided to improve the overall kernel execution time.

Sorry for the delayed response.

Yes, getBorderDwell computes a per-work-group value. We execute it for a square with a side equal to `d`.
It computes the pixelDwell value only for the square's border. If all border pixels of the square have the same dwell value, then all inner points of the square with side `d` have that same pixelDwell, so we can simply set it.
Every work-item of the work-group processes its own pixels on the square's border (https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/kernels/mande...). Note this part of the outer loop: `for(int r = tid; r < d; r += groupSize)` - we never check the same pixel twice. After that we reduce the results to a single value here: https://github.com/tupieurods/mandelbrot-dyn/blob/master/MandelbrotVS/MandelbrotOpencl/kernels/mande...

In other words, it is not a bottleneck: we don't compute extra data, and we use all work-items of the work-group effectively.
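To illustrate the idea without the GPU specifics, here is a CPU-only sketch of the border check. It is not the kernel from the repository: the names `combineDwell`, `borderDwell`, `kNeutralDwell` and `kDiffDwell` are hypothetical, the work-items are emulated by an ordinary loop over `tid`, and real dwell values are assumed to be non-negative.

```cpp
#include <cassert>
#include <functional>

constexpr int kNeutralDwell = -2;  // "nothing seen yet" marker (assumed)
constexpr int kDiffDwell = -1;     // "border values differ" marker (assumed)

// Merges two partial results: equal values stay, neutral is ignored,
// anything else means the border is not uniform.
int combineDwell(int a, int b) {
  if (a == b) return a;
  if (a == kNeutralDwell) return b;
  if (b == kNeutralDwell) return a;
  return kDiffDwell;
}

// Returns the common dwell of the border of the d x d square at (x0, y0),
// or kDiffDwell if the border pixels disagree.
int borderDwell(int x0, int y0, int d, int groupSize,
                const std::function<int(int, int)>& pixelDwell) {
  assert(d > 0 && groupSize > 0);
  int common = kNeutralDwell;
  for (int tid = 0; tid < groupSize; ++tid) {  // emulates the work-items
    int local = kNeutralDwell;
    // Strided loop: work-item `tid` handles offsets tid, tid + groupSize, ...
    for (int r = tid; r < d; r += groupSize) {
      local = combineDwell(local, pixelDwell(x0 + r, y0));          // top
      local = combineDwell(local, pixelDwell(x0 + r, y0 + d - 1));  // bottom
      local = combineDwell(local, pixelDwell(x0, y0 + r));          // left
      local = combineDwell(local, pixelDwell(x0 + d - 1, y0 + r));  // right
    }
    common = combineDwell(common, local);  // stands in for the reduction
  }
  return common;
}
```

A uniform border returns its dwell value and the square can be filled, while any mismatch on the border returns kDiffDwell and the square has to be subdivided or computed per pixel. Interior pixels are never evaluated by this function.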

About collecting data from the profiler: as I said, for some reason RGP gives me errors like:

[RGP] Failed to finish executing profile with code: 0
[RGP] Counters not supported on current device (asic_device_id=[26751]. asic_family=[141])

I'll be able to retry profiling on a clean system in January. Once the RGP issues are resolved and I have profiling data, I'll post it here.

Device enqueue poor performance