hashyboy
Journeyman III

Why is the AMD OpenCL compiler taking excessive build time with complex conditional filters?

If you write an OpenCL filter kernel in which each work item reads more than 8x8 neighboring pixels of its associated pixel and conditionally applies static weights to sum up the output pixel value, the AMD OpenCL compiler appears to fall into a bad optimization path and takes an excessive amount of time to compile the kernel (over a minute in some cases). The same kernels compile in under a second with the Intel and NVIDIA compilers. Note: once the kernel is compiled, its execution speed is competitive (tested on a Radeon R7).

This video shows an example with 17x17GaussThresh filter in Blurate:

Haswell vs. Broxton vs. Radeon R7 vs. GeForce GTX Anisotropic Diffusion - YouTube

More details, in case changing the code layout or the default optimization options would help the compiler (note that this is auto-generated code that cannot be dramatically restructured)...

This is the simplified implementation of the filter:

float4 center_pix = read_imagef(x,y);

float4 neighbor_1 = read_imagef(x-8,y-8);

float4 neighbor_2 = read_imagef(x-7,y-8);

float4 neighbor_3 = read_imagef(x-6,y-8);

...

float4 neighbor_289 = read_imagef(x+8,y+8);

float4 output_pix = (float4)(0,0,0,0);

if (ABS(neighbor_1.R-center_pix.R)<Thresh) output_pix += neighbor_1*Weight_1;

if (ABS(neighbor_2.R-center_pix.R)<Thresh) output_pix += neighbor_2*Weight_2;

...

if (ABS(neighbor_289.R-center_pix.R)<Thresh) output_pix += neighbor_289*Weight_289;

write_image(x,y, output_pix);

5 Replies
dipak
Big Boss

Hi,

Thanks for reporting the problem. Please share that kernel code and setup details (OS, driver, gpu). We'll take a look.

Regards,

dipak
Big Boss

Thanks for sharing the code. We'll check and get back to you shortly.

Regards,


Hi,

It looks like a compiler optimization issue. Without any optimization (i.e. with the build flag "-O0" or "-cl-opt-disable"), the build is much faster (only a few seconds). I'll forward this issue to the compiler team.

Regards,


Here is an update.

After investigation, the engineering team has found that the issue is in one of the core compiler modules and is caused by the huge number of variables declared in the single kernel. As a lot of other code depends on that module, it may take time to fix the bug without compromising code quality elsewhere. So, here are the workarounds:

  • Rewrite the kernel into something much smaller with a few loops, because it is in fact a few loops that have been fully unrolled by hand.
  • Use pre-compiled kernel code, as other apps do for huge kernels with long compilation times (e.g. call clCreateProgramWithBinary on a pre-compiled kernel).
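
The first workaround can be sketched roughly as follows. This is a plain C reference for a single-channel image, not the actual kernel: filter_pixel, clampi and the weights table (standing in for Weight_1..Weight_289) are hypothetical names, and the clamped array read only approximates what read_imagef would do with a clamp-to-edge sampler. In an OpenCL kernel the same two loops would wrap the image read instead of the array access.

```c
#include <math.h>
#include <stddef.h>

#define RADIUS 8                  /* 17x17 window, as in the original post */
#define SIDE   (2 * RADIUS + 1)

/* Clamp a coordinate to [0, n-1]; stands in for a clamp-to-edge sampler. */
static int clampi(int v, int n) { return v < 0 ? 0 : (v >= n ? n - 1 : v); }

/* Thresholded weighted sum for one pixel of a single-channel image.
 * weights is a SIDE*SIDE table (hypothetical; it replaces the 289
 * separate Weight_N constants of the unrolled code). */
float filter_pixel(const float *img, int width, int height,
                   int x, int y, const float *weights, float thresh)
{
    float center = img[(size_t)y * width + x];
    float sum = 0.0f;

    for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            float n = img[(size_t)clampi(y + dy, height) * width
                          + clampi(x + dx, width)];
            float w = weights[(dy + RADIUS) * SIDE + (dx + RADIUS)];
            /* Same test as the unrolled version: only neighbors close
             * to the center value contribute. */
            if (fabsf(n - center) < thresh)
                sum += n * w;
        }
    }
    return sum;
}
```

The point of the sketch is only that the compiler sees one loop body and one table instead of 289 named variables; whether the generator can emit this form is a separate question.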

Regards,

realhet
Miniboss

Hi,

I've just checked the video about this filter and I gotta say that it's as pretty as Lana Del Rey.

For this kind of code: there are only about 5 ALU instructions for each memory instruction, so this kernel's bottleneck is clearly memory I/O. On GCN there should be around 32 ALU instructions for every memory instruction to be balanced, so in this case it will not harm performance if you make one FOR loop. This way the code that OpenCL must deal with would be about 1/17 the size. And don't be afraid to use little constant tables for the convolution matrix, as they will be accessed in no time using the highest cache.

But if I'd really wanted to optimize this, I'd do every row in a separate thread. Because when you go one pixel to the right, it needs only 17 reads, not 17^2 like in the current situation. And to optimize further, I'd need at least 4x gpu-streams to be effective, so it would be reasonable to split the lines in half or even more to accomplish the minimum thread count on the GPU.
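
A rough sketch of the row-per-thread idea, again as plain C on a single-channel image rather than real kernel code. filter_row, load_column and the ring-buffer layout are assumptions made for illustration: one call plays the role of one thread, keeps the 17 window columns in a small ring buffer, and reads only the new right-hand column when it moves one pixel to the right.

```c
#include <math.h>
#include <stddef.h>

#define RADIUS 8
#define SIDE   (2 * RADIUS + 1)

static int clampi(int v, int n) { return v < 0 ? 0 : (v >= n ? n - 1 : v); }

/* Load the SIDE pixels of image column cx (rows y-RADIUS..y+RADIUS). */
static void load_column(const float *img, int width, int height,
                        int cx, int y, float *col, long *reads)
{
    cx = clampi(cx, width);
    for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
        col[dy + RADIUS] = img[(size_t)clampi(y + dy, height) * width + cx];
        ++*reads;
    }
}

/* Filter one full row y, reusing the window between neighboring pixels.
 * Returns the number of pixel reads performed. */
long filter_row(const float *img, int width, int height, int y,
                const float *weights, float thresh, float *out_row)
{
    float win[SIDE][SIDE];  /* win[slot][row]; slot is a ring index over columns */
    long reads = 0;

    /* Prime the window with every column needed at x = 0 except the last. */
    for (int dx = -RADIUS; dx < RADIUS; ++dx)
        load_column(img, width, height, dx, y, win[(dx + RADIUS) % SIDE], &reads);

    for (int x = 0; x < width; ++x) {
        /* Only the new right-hand column is read at each step; it
         * overwrites the slot of the column that just left the window. */
        load_column(img, width, height, x + RADIUS, y,
                    win[(x + 2 * RADIUS) % SIDE], &reads);

        float center = win[(x + RADIUS) % SIDE][RADIUS];
        float sum = 0.0f;
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            const float *col = win[(x + dx + RADIUS) % SIDE];
            for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
                float n = col[dy + RADIUS];
                if (fabsf(n - center) < thresh)
                    sum += n * weights[(dy + RADIUS) * SIDE + (dx + RADIUS)];
            }
        }
        out_row[x] = sum;
    }
    return reads;
}
```

For a row of width W this does 16*17 + 17*W reads instead of 289*W. On a GPU the dynamically indexed window may not stay in registers, so whether this wins in practice would need measuring.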

A side note: On GCN4 there is Data Parallel Processing, so you can make FOR loops that index registers in no time by using the loop's counter variable. I dunno how to do it in OCL, but I'm sure you can access DPP through ASM.

Also use max(RDiff, GDiff, BDiff)<threshold instead of 3 separate compares! There is a 3 operand instruction just for this inside the gpu.
