If you write an OpenCL filter kernel in which each work item reads more than 8x8 neighboring pixels around its associated pixel and conditionally applies static weights to accumulate the output pixel value, the AMD OpenCL compiler appears to fall into a bad optimization path and takes an excessive amount of time to compile the kernel (over a minute in some cases). The same kernels compile in under a second with the Intel and NVIDIA compilers. Note: the execution speed of the kernel itself is competitive once the kernel is compiled (tested on a Radeon R7).
This video shows an example with the 17x17GaussThresh filter in Blurate:
More details follow, in case changing the code layout or the default optimization options would help the compiler (note that this is auto-generated code that cannot be dramatically restructured).
This is a simplified implementation of the filter:
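// Simplified: the image and sampler arguments of read_imagef are omitted.
float4 center_pix = read_imagef(x,y);
float4 neighbor_1 = read_imagef(x-8,y-8);
float4 neighbor_2 = read_imagef(x-7,y-8);
float4 neighbor_3 = read_imagef(x-6,y-8);
// ... neighbor_4 through neighbor_288 follow the same pattern ...
float4 neighbor_289 = read_imagef(x+8,y+8);
float4 output_pix = (float4)(0,0,0,0);
// .x is the red channel; a neighbor contributes only if it is close to the center.
if (fabs(neighbor_1.x-center_pix.x)<Thresh) output_pix += neighbor_1*Weight_1;
if (fabs(neighbor_2.x-center_pix.x)<Thresh) output_pix += neighbor_2*Weight_2;
// ... one conditional accumulate per neighbor ...
if (fabs(neighbor_289.x-center_pix.x)<Thresh) output_pix += neighbor_289*Weight_289;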
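For anyone who wants to check a kernel's output independently of the compile-time problem, the same thresholded weighted sum can be written as a loop on the host. This is only a rough sketch in plain C, not the generated kernel code: the names Pixel, clamp_coord, filter_pixel, RADIUS and TAPS are made up for illustration, and it assumes clamp-to-edge addressing and a row-major weight table starting at the top-left tap (the post does not say which sampler mode or weight layout is used).

```c
#include <assert.h>
#include <stddef.h>

#define RADIUS 8                  /* 17x17 window, as in the post */
#define TAPS   (2 * RADIUS + 1)

/* Hypothetical stand-in for an OpenCL float4 pixel. */
typedef struct { float r, g, b, a; } Pixel;

/* Clamp a coordinate to [0, n-1] (assumed clamp-to-edge addressing). */
static int clamp_coord(int v, int n) {
    if (v < 0) return 0;
    if (v >= n) return n - 1;
    return v;
}

/* Thresholded weighted sum for the output pixel at (x, y).
   weights holds TAPS*TAPS values, row-major, top-left tap first (assumed). */
Pixel filter_pixel(const Pixel *img, int width, int height,
                   int x, int y, const float *weights, float thresh) {
    assert(x >= 0 && x < width && y >= 0 && y < height);
    Pixel center = img[(size_t)y * width + x];
    Pixel out = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            int sx = clamp_coord(x + dx, width);
            int sy = clamp_coord(y + dy, height);
            Pixel n = img[(size_t)sy * width + sx];
            float w = weights[(dy + RADIUS) * TAPS + (dx + RADIUS)];
            /* Compare red channels only, as in the simplified kernel. */
            float diff = n.r - center.r;
            if (diff < 0.0f) diff = -diff;
            if (diff < thresh) {
                out.r += n.r * w;
                out.g += n.g * w;
                out.b += n.b * w;
                out.a += n.a * w;
            }
        }
    }
    return out;
}
```

Calling filter_pixel once per output pixel gives a reference image to compare against the GPU result. It says nothing about why the AMD compiler is slow; it only mirrors the arithmetic of the unrolled code.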