5 Replies Latest reply on Aug 6, 2016 2:23 PM by realhet

    Why is AMD OpenCL compiler taking excessive build time with complex conditional filters?

    hashyboy

      If you write an OpenCL filter kernel in which each work item reads more than 8x8 neighboring pixels of its associated pixel and conditionally applies static weights to sum up the output pixel value, the AMD OpenCL compiler appears to fall into a bad optimization path and takes an excessive amount of time to compile the kernel (over a minute in some cases). The same kernels compile in under a second with the Intel and NVIDIA compilers. Note: execution of the kernel itself is competitive once the kernel is compiled (tested on a Radeon R7).

      This video shows an example with 17x17GaussThresh filter in Blurate:

      Haswell vs. Broxton vs. Radeon R7 vs. GeForce GTX Anisotropic Diffusion - YouTube

       

      More details, in case changing the code layout or the default optimization options would help the compiler (note that this is auto-generated code that cannot be dramatically restructured)...

      This is the simplified implementation of the filter:

      float4 center_pix = read_imagef(x,y);

      float4 neighbor_1 = read_imagef(x-8,y-8);

      float4 neighbor_2 = read_imagef(x-7,y-8);

      float4 neighbor_3 = read_imagef(x-6,y-8);

      ...

      float4 neighbor_289 = read_imagef(x+8,y+8);

      float4 output_pix = (float4)(0,0,0,0);

      if (ABS(neighbor_1.R-center_pix.R)<Threash) output_pix += neighbor_1*Weight_1;

      if (ABS(neighbor_2.R-center_pix.R)<Threash) output_pix += neighbor_2*Weight_2;

      ...

      if (ABS(neighbor_289.R-center_pix.R)<Threash) output_pix += neighbor_289*Weight_289;

      write_image(x,y, output_pix);

        • Re: Why is AMD OpenCL compiler taking excessive build time with complex conditional filters?
          dipak

          Hi,

          Thanks for reporting the problem. Please share the kernel code and setup details (OS, driver, GPU). We'll take a look.

           

          Regards,

          • Re: Why is AMD OpenCL compiler taking excessive build time with complex conditional filters?
            dipak

            Thanks for sharing the code. We'll check and get back to you shortly.

             

            Regards,

            • Re: Why is AMD OpenCL compiler taking excessive build time with complex conditional filters?
              realhet

              Hi,

               

              I've just checked the video about this filter and I gotta say that it's as pretty as Lana Del Rey.

               

              For this kind of code: there are only about 5 ALU instructions for each memory instruction, so this kernel's bottleneck is clearly memory I/O. On GCN there should be around 32 ALU instructions for every memory instruction to be balanced, so in this case it will not harm performance if you make one FOR loop. That way the code the OpenCL compiler has to deal with would be about 1/17 the size. And don't be afraid to use small constant tables for the convolution matrix, as they will be served in no time from the fastest cache level.
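A rough sketch of the loop form suggested above, written as a plain C reference rather than an OpenCL kernel so it can be checked on a CPU. The names `WeightTable`, `read_clamped`, and `filter_pixel`, the single-channel image layout, and the clamp-to-edge addressing are all assumptions made for illustration; in OpenCL C the table would presumably be a `__constant` array and the reads would be `read_imagef` calls.

```c
#include <stddef.h>

#define RADIUS 8                 /* 17x17 window, as in the original post */
#define WIN    (2 * RADIUS + 1)

/* Hypothetical weight table, indexed as w[row][column]. */
typedef struct { float w[WIN][WIN]; } WeightTable;

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Clamp-to-edge read from a single-channel image (assumed addressing mode). */
static float read_clamped(const float *img, int width, int height, int x, int y)
{
    if (x < 0) x = 0; else if (x >= width)  x = width - 1;
    if (y < 0) y = 0; else if (y >= height) y = height - 1;
    return img[(size_t)y * (size_t)width + (size_t)x];
}

/* Thresholded weighted sum for one output pixel, written as two loops
   instead of 289 unrolled statements. */
static float filter_pixel(const float *img, int width, int height,
                          int x, int y, const WeightTable *t, float thresh)
{
    float center = read_clamped(img, width, height, x, y);
    float sum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            float n = read_clamped(img, width, height, x + dx, y + dy);
            if (absf(n - center) < thresh)
                sum += n * t->w[dy + RADIUS][dx + RADIUS];
        }
    }
    return sum;
}
```

Whether the compiler keeps the loops or unrolls them again is up to its heuristics; the point of the sketch is only that the source it has to process shrinks to a few lines plus a table.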

               

              But if I really wanted to optimize this, I'd process every row in a separate thread, because when you move one pixel to the right you need only 17 new reads, not 17^2 as in the current situation. And to optimize further, I'd need at least 4x as many threads as the GPU has streams to be effective, so it would be reasonable to split the lines in half, or into even more pieces, to reach the minimum thread count on the GPU.
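The row-per-thread idea can be sketched in plain C as a sliding window: one call to the hypothetical `filter_row` stands in for one thread, and each step to the right loads a single new 17-pixel column into a ring buffer. The names `slot`, `load_column`, `filter_row`, and `WeightTable`, the single-channel layout, and the clamp-to-edge handling are assumptions for illustration, not the poster's code. Note that only the reads are reused; the thresholded sum still has to be recomputed per pixel because the center value changes.

```c
#include <stddef.h>

#define RADIUS 8
#define WIN    (2 * RADIUS + 1)

/* Hypothetical weight table, indexed as w[row][column]. */
typedef struct { float w[WIN][WIN]; } WeightTable;

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Ring-buffer slot for image column c (c may be negative near the left edge). */
static int slot(int c) { return ((c % WIN) + WIN) % WIN; }

/* Read rows y-RADIUS..y+RADIUS of column x into col[], clamping at the edges. */
static void load_column(const float *img, int width, int height,
                        int x, int y, float col[WIN])
{
    if (x < 0) x = 0; else if (x >= width) x = width - 1;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
        int yy = y + dy;
        if (yy < 0) yy = 0; else if (yy >= height) yy = height - 1;
        col[dy + RADIUS] = img[(size_t)yy * (size_t)width + (size_t)x];
    }
}

/* Filter one whole row y; returns how many pixel reads were issued. */
static long filter_row(const float *img, int width, int height, int y,
                       const WeightTable *t, float thresh, float *out_row)
{
    float window[WIN][WIN];   /* window[slot][row] */
    long reads = 0;

    /* Prime the window with columns -RADIUS .. RADIUS-1 (16 columns). */
    for (int c = -RADIUS; c < RADIUS; ++c) {
        load_column(img, width, height, c, y, window[slot(c)]);
        reads += WIN;
    }
    for (int x = 0; x < width; ++x) {
        /* Moving one pixel right brings in exactly one new column,
           which overwrites the column that just left the window. */
        load_column(img, width, height, x + RADIUS, y, window[slot(x + RADIUS)]);
        reads += WIN;

        float center = window[slot(x)][RADIUS];
        float sum = 0.0f;
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            const float *col = window[slot(x + dx)];
            for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
                float n = col[dy + RADIUS];
                if (absf(n - center) < thresh)
                    sum += n * t->w[dy + RADIUS][dx + RADIUS];
            }
        }
        out_row[x] = sum;
    }
    return reads;
}
```

For a 20-pixel-wide row this issues 17 * (20 + 16) = 612 reads, against 20 * 289 = 5780 when every pixel fetches its full window. On a GPU the window would have to live in registers or local memory, which this CPU sketch does not model.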

               

              A side note: on GCN4 there is Data Parallel Processing (DPP), so you can write FOR loops that index registers in no time using the loop's counter variable. I dunno how to do it in OCL, but I'm sure you can access DPP through ASM.

               

              Also use max(RDiff, GDiff, BDiff) < threshold instead of 3 separate compares! There is a 3-operand instruction just for this inside the GPU.
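A minimal plain-C sketch of the single-compare test, assuming all three color channels should be checked (the simplified kernel in the question compares only R). `Pixel`, `max3f`, and `within_threshold` are hypothetical names. In OpenCL C the built-in `fmax` takes two arguments, so the nested form `fmax(fmax(rd, gd), bd)` would be the equivalent; whether the compiler folds that into one three-operand max instruction is an assumption here, not something verified.

```c
/* Hypothetical RGBA pixel type standing in for float4. */
typedef struct { float r, g, b, a; } Pixel;

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Maximum of three values. */
static float max3f(float a, float b, float c)
{
    float m = a > b ? a : b;
    return m > c ? m : c;
}

/* Returns 1 when every channel difference is below thresh, using one
   compare on the largest difference instead of three separate compares. */
static int within_threshold(Pixel n, Pixel center, float thresh)
{
    float rd = absf(n.r - center.r);
    float gd = absf(n.g - center.g);
    float bd = absf(n.b - center.b);
    return max3f(rd, gd, bd) < thresh;
}
```

The result is the same as `rd < thresh && gd < thresh && bd < thresh`, since the largest difference is below the threshold exactly when all three are.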