18 Replies Latest reply on Feb 1, 2011 4:36 AM by diedalusus

    Why does the Compiler behave this way?

    diedalusus
      gpu compilation fails

      Hi,

      i am almost new to OpenCL and have following problems:

      1.  I wrote a long kernel (mostly containing 32 while-loops doing binary search over a table which calculates than interpolations over the found values). The problem is following: I seem to reach a limit of the gpu-compiler, where it trys to compile the code(allocating more and more memory, than seems to fail(memory freed), and than trys it again and again) until it finishes with following outputs:

                   clBuildProgram fails with CL_BUILD_PROGRAM_FAILURE

                   and the Build Log says: Error: Creating kernel get_Values failed!

                The code compiles fine on CPU. I can't shorten my code because of the limitations of RAM and non-scalar vectors. So it would be nice to get a hint which problem let the Compiler behave this way or to get more data maybe a verbose-mode of the compiler.

            2. Is there any coincidence between the bug where the clock frequency of the HD5870 isn't read out correctly (0Mhz) and the not-working function barrier on the gpu?(I assume that because CLK_LOCAL_MEM_FENCE where CLK stands for clock?, but maybe iam wrong.). Iam asking this because, the barrier-function is completly ignored on gpu(works fine on cpu).

      Thanks for the answers.

      P.S.: my system: ubuntu x64 10.04LTS + HD5870

              Unfortunatly I can not attach the kernel, because the code is under copyright but if no one knows an answer i try to reproduce the problem with another kernel

      So I figured out when I change this code:

      less_even.s0 = fabs(table[start_ind_even.s0*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s0*NMB_COLUMN]-param);
        less_even.s1 = fabs(table[start_ind_even.s1*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s1*NMB_COLUMN]-param);
        less_even.s2 = fabs(table[start_ind_even.s2*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s2*NMB_COLUMN]-param);
        less_even.s3 = fabs(table[start_ind_even.s3*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s3*NMB_COLUMN]-param);
        less_even.s4 = fabs(table[start_ind_even.s4*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s4*NMB_COLUMN]-param);
        less_even.s5 = fabs(table[start_ind_even.s5*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s5*NMB_COLUMN]-param);
        less_even.s6 = fabs(table[start_ind_even.s6*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s6*NMB_COLUMN]-param);
        less_even.s7 = fabs(table[start_ind_even.s7*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s7*NMB_COLUMN]-param);
        less_even.s8 = fabs(table[start_ind_even.s8*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s8*NMB_COLUMN]-param);
        less_even.s9 = fabs(table[start_ind_even.s9*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.s9*NMB_COLUMN]-param);
        less_even.sA = fabs(table[start_ind_even.sA*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.sA*NMB_COLUMN]-param);
        less_even.sB = fabs(table[start_ind_even.sB*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.sB*NMB_COLUMN]-param);
        less_even.sC = fabs(table[start_ind_even.sC*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.sC*NMB_COLUMN]-param);
        less_even.sD = fabs(table[start_ind_even.sD*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.sD*NMB_COLUMN]-param);
        less_even.sE = fabs(table[start_ind_even.sE*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.sE*NMB_COLUMN]-param);
        less_even.sF = fabs(table[start_ind_even.sF*NMB_COLUMN]-param) <= fabs(table[stop_ind_even.sF*NMB_COLUMN]-param);

      to this one:

      start_ind_even *= NMB_COLUMN;

      less_even.s0 = fabs(table[start_ind_even.s0]-param) <= fabs(table[stop_ind_even.s0*NMB_COLUMN]-param);
        less_even.s1 = fabs(table[start_ind_even.s1]-param) <= fabs(table[stop_ind_even.s1*NMB_COLUMN]-param);
        less_even.s2 = fabs(table[start_ind_even.s2]-param) <= fabs(table[stop_ind_even.s2*NMB_COLUMN]-param);
        less_even.s3 = fabs(table[start_ind_even.s3]-param) <= fabs(table[stop_ind_even.s3*NMB_COLUMN]-param);
        less_even.s4 = fabs(table[start_ind_even.s4]-param) <= fabs(table[stop_ind_even.s4*NMB_COLUMN]-param);
        less_even.s5 = fabs(table[start_ind_even.s5]-param) <= fabs(table[stop_ind_even.s5*NMB_COLUMN]-param);
        less_even.s6 = fabs(table[start_ind_even.s6]-param) <= fabs(table[stop_ind_even.s6*NMB_COLUMN]-param);
        less_even.s7 = fabs(table[start_ind_even.s7]-param) <= fabs(table[stop_ind_even.s7*NMB_COLUMN]-param);
        less_even.s8 = fabs(table[start_ind_even.s8]-param) <= fabs(table[stop_ind_even.s8*NMB_COLUMN]-param);
        less_even.s9 = fabs(table[start_ind_even.s9]-param) <= fabs(table[stop_ind_even.s9*NMB_COLUMN]-param);
        less_even.sA = fabs(table[start_ind_even.sA]-param) <= fabs(table[stop_ind_even.sA*NMB_COLUMN]-param);
        less_even.sB = fabs(table[start_ind_even.sB]-param) <= fabs(table[stop_ind_even.sB*NMB_COLUMN]-param);
        less_even.sC = fabs(table[start_ind_even.sC]-param) <= fabs(table[stop_ind_even.sC*NMB_COLUMN]-param);
        less_even.sD = fabs(table[start_ind_even.sD]-param) <= fabs(table[stop_ind_even.sD*NMB_COLUMN]-param);
        less_even.sE = fabs(table[start_ind_even.sE]-param) <= fabs(table[stop_ind_even.sE*NMB_COLUMN]-param);
        less_even.sF = fabs(table[start_ind_even.sF]-param) <= fabs(table[stop_ind_even.sF*NMB_COLUMN]-param);

      where table is __global float* buffer, NMB_COLUMN a constant given over at compile time and param a float,

      than the compiler show this behaviour(it is not the complete code as you can imagine).

      UPDATE:  Compilation on CPU runs fine but Execution stops with a SIGSEGV acessing adress 0x0 in __OpenCL_get_Values_stub(). What is this function for?

      backtrace:

      #0  0x00007fffeb750bc0 in __OpenCL_get_values_stub () from /tmp/OCLbMaXbq.so
      #1  0x00007fffe3828c2d in ?? () from /usr/local/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
      #2  0x00007fffe3829791 in ?? () from /usr/local/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
      #3  0x00007fffe3878c2c in ?? () from /usr/local/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
      #4  0x00007fffe3876edd in ?? () from /usr/local/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
      #5  0x00007ffff6a849ca in start_thread (arg=) at pthread_create.c:300
      #6  0x00007ffff718d70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
      #7  0x0000000000000000 in ?? ()

        • Why does the Compiler behave this way?
          MicahVillmow
          CLK stands for CL Kernel, not clock
            • Why does the Compiler behave this way?
              diedalusus

              Thanks Micah,

              so I assume that it has nothing to do with that and I have to do a further look on that topic.

              Regarding to point one i can add following thing to it:

              Same behaviour(compiler allocates more and more memory on host, free it and does same thing a couple of times) on Windows 7 32-bit, with SDK 2.3 and CCC 10.12. So it seems to be OS independent.

              I used the KernelAnalyzer and got following result on the unchanged code from the first post:

              Name,                 ALU,  Fetch,CF    ,Write,GPR,Scratch Reg  ,Min          ,Max         ,Avg       ,Est Cycles,ALU:Fetch,BottleNeck,  Throughput,         Thread\Clock
              Radeon HD 5870,2699,1020,2497,183  ,62  ,132              ,0.70        ,22122.80,2827.82,2495.14  ,0.37        ,Global Fetch,11 M Threads\Sec,0.01

              The scratch registers are red highlighted and the change in the code uses more GPRs and because there are already scratch registers used the change maps thats as well to the scratch registers.

              So can be the problem there that no more scratch registers are left?

                • Why does the Compiler behave this way?
                  himanshu.gautam

                  Using scratch registers is quite strongly disproved. I think you can find if that is the problem by using global variablesinstead of local variables(which would be stored in registers).

                  ALthough I am not sure, I don't think reaching the limit of scratch registers is possible. IIRC scratch registers are memory units having local scope but are actually stored in global memory itself.

                  Also It appears your ALU:Fetch ratio is very low(.37 if i see correctly). The fact is that it would be very difficult to get some performance out with this ALU:Fetch ratio as the ALU would remain stalled most of the time.

                    • Why does the Compiler behave this way?
                      Jawed

                      Try using:

                      #pragma unroll 1

                      in the line before one or more of your loops. This theoretically prevents the compiler from unrolling the loop that follows.

                      In general the compiler likes to analyse loops to unroll them if possible, i.e. if there are literals that define the loop's start and end then it will examine the possibility of unrolling the loop.

                      • Why does the Compiler behave this way?
                        diedalusus

                         

                        Originally posted by: himanshu.gautam Using scratch registers is quite strongly disproved. I think you can find if that is the problem by using global variablesinstead of local variables(which would be stored in registers).

                         

                        ALthough I am not sure, I don't think reaching the limit of scratch registers is possible. IIRC scratch registers are memory units having local scope but are actually stored in global memory itself.



                        Yes you are right. Today I was getting the same strange error, where in the unchanged code no scratch registers were used.

                         

                        Originally posted by: himanshu.gautam

                         

                        Also It appears your ALU:Fetch ratio is very low(.37 if i see correctly). The fact is that it would be very difficult to get some performance out with this ALU:Fetch ratio as the ALU would remain stalled most of the time.

                        That is not yet such a problem. I still try to optimize this but changing something getting to non-compilable code doesn't help me

                         

                        Originally posted by: Jawed

                        Try using:

                        #pragma unroll 1

                        in the line before one or more of your loops. This theoretically prevents the compiler from unrolling the loop that follows.

                        In general the compiler likes to analyse loops to unroll them if possible, i.e. if there are literals that define the loop's start and end then it will examine the possibility of unrolling the loop.

                        Thanks for the advice. Unfortanetly, this did not changed anything.

                        Tommorrow, I'll try to give a example-kernel for this.

                  • Why does the Compiler behave this way?
                    MicahVillmow
                    diedalusus,
                    The loop unrolling is occuring at the lower level compiler which the pragma does not affect. A way you can work around this is by passing in as an argument to the kernel the loop count instead of having it hard coded in the kernel.
                      • Why does the Compiler behave this way?
                        diedalusus

                         

                        Originally posted by: MicahVillmow diedalusus, The loop unrolling is occuring at the lower level compiler which the pragma does not affect. A way you can work around this is by passing in as an argument to the kernel the loop count instead of having it hard coded in the kernel.


                        Tried this too, but same behaviour. I now try to reproduce this with a "small example".

                          • Why does the Compiler behave this way?
                            diedalusus

                            So here is an example kernel:

                            If you comment out the enclosed part of the code with the comments the compiler compiles. The while loops are changed binary searches. NMB_COLUMN=23

                            Maybe some one can find out what the problem is

                            #pragma OPENCL EXTENSION cl_amd_printf : enable __constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST; uchar find(uint min, uint max, uint* mid, float param, __global float* table){ while(min < max){ *mid=(max + min)>>1; if((table[*mid*NMB_COLUMN] > param && table[(*mid-1)*NMB_COLUMN] <= param) || (table[*mid*NMB_COLUMN] < param && table[(*mid-1)*NMB_COLUMN] >= param)){ return 1; } (param < table[(*mid-1)*NMB_COLUMN]) ? max=*mid-1 : min=*mid+1; } return 0; } __kernel void get_values(__global float* a, __global float* b, __global float* c, __global float* d, __global float* e, __global float* f, int nmb, __constant float* aGet, __constant float* bGet, __constant float* cGet, __constant float* dGet, __constant float* eGet, __global float* table, __read_only image2d_t start, __read_only image2d_t stop, __global float* dest){ uchar factor, factor2; uint i,index; index = get_global_id(0)*get_global_size(1) + get_global_id(1); int2 coord; uint4 ind; float16 even,odd; float tmp,param = f[index]; float8 params = (float8)(a[index], b[index], c[index], d[index], e[index], 0,0,0); index*=NMB_COLUMN; uint16 start_even, start_odd, stop_even, stop_odd; uint min,mid, max; uint16 mid_even, mid_odd; uchar16 found_even, found_odd, less_even, less_odd; found_even = (uchar16)0; found_odd = (uchar16)0; uint16 indvector = (uint16)0; float8 dists = (float8)0; float8 half_int1, half_int2; float4 quart_int1, quart_int2; float2 eighth_int1, eighth_int2; float16 lini,tmp1,tmp2; factor = 5*(params.s0>aGet[0]); factor2 = factor*(params.s0<aGet[5]); if(factor2) for(i=1;i<=5;i++){ tmp = aGet[i]; if(params.s0<tmp){ indvector.s4 = i-1; dists.s4 = (tmp - params.s0) / (tmp - aGet[i-1]); break; } } indvector.s4 += factor*(indvector.s4 == 0); indvector.s9 = factor2?indvector.s4+1:indvector.s4; factor = 9*(params.s1>bGet[0]); factor2 = factor*(params.s1<bGet[9]); if(factor2) for(i=1;i<=9;i++){ tmp = bGet[i]; if(params.s1<tmp){ indvector.s3 = i-1; dists.s3 = (tmp - params.s1) / (tmp - bGet[i-1]); break; } } indvector.s3 += factor*(indvector.s3 == 0); indvector.s8 = factor2?indvector.s3+1:indvector.s3; factor = 0*(params.s2>cGet[0]); factor2 *= factor*(params.s2<cGet[0]); if(factor2) for(i=1;i<=0;i++){ tmp = cGet[i]; if(params.s2<tmp){ indvector.s2 = i-1; dists.s2 = (tmp - params.s2) / (tmp - cGet[i-1]); break; } } indvector.s2 += factor*(indvector.s2 == 0); indvector.s7 = factor2?indvector.s2+1:indvector.s2; factor = 100*(params.s3>dGet[0]); factor2 = factor*(params.s3<dGet[100]); if(factor2) for(i=1;i<=100;i++){ tmp = dGet[i]; if(params.s3<tmp){ indvector.s1 = i-1; dists.s1 = (tmp - params.s3) / (tmp - dGet[i-1]); break; } } indvector.s1 += factor*(indvector.s1 == 0); indvector.s6 = factor2?(indvector.s1+1)*4:indvector.s1*4; indvector.s1 *= 4; factor = 3*(params.s4>eGet[0]); factor2 = factor*(params.s4<eGet[3]); if(factor2) for(i=1;i<=3;i++){ tmp = eGet[i]; if(params.s4<tmp){ indvector.s0 = i-1; dists.s0 = (tmp - params.s4) / (tmp - eGet[i-1]); break; } } indvector.s0 += factor*(indvector.s0 == 0); indvector.s5 = factor2?indvector.s0+1:indvector.s0; coord = (int2)(indvector.s2+indvector.s1+indvector.s0,indvector.s4+indvector.s3); ind = read_imageui(start, imageSampler, coord); start_even.s0 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_even.s0 = ind.x-1; coord.even += indvector.s5-indvector.s0; ind = read_imageui(start, imageSampler, coord); start_odd.s0 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_odd.s0 = ind.x-1; mid=start_even.s0; found_even.s0 = find(start_even.s0, stop_even.s0, &mid, param, table); mid_even.s0=mid; mid=start_odd.s0; found_odd.s0 = find(start_odd.s0, stop_odd.s0, &mid, param, table); mid_odd.s0=mid; //******get_Values()************* coord.even = indvector.s2+indvector.s6+indvector.s0; ind = read_imageui(start, imageSampler, coord); start_even.s1 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_even.s1 = ind.x-1; coord.even += indvector.s5-indvector.s0; ind = read_imageui(start, imageSampler, coord); start_odd.s1 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_odd.s1 = ind.x-1; mid=start_even.s1; found_even.s1 = find(start_even.s1, stop_even.s1, &mid, param, table); mid_even.s1=mid; mid=start_odd.s0; found_odd.s1 = find(start_odd.s1, stop_odd.s1, &mid, param, table); mid_odd.s1=mid; //******get_Values()************* coord.even = indvector.s7+indvector.s1+indvector.s0; ind = read_imageui(start, imageSampler, coord); start_even.s2 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_even.s2 = ind.x-1; coord.even += indvector.s5-indvector.s0; ind = read_imageui(start, imageSampler, coord); start_odd.s2 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_odd.s2 = ind.x-1; mid=start_even.s2; found_even.s2 = find(start_even.s2, stop_even.s2, &mid, param, table); mid_even.s2=mid; mid=start_odd.s2; found_odd.s2 = find(start_odd.s2, stop_odd.s2, &mid, param, table); mid_odd.s2=mid; //******get_Values()************* coord.even = indvector.s7+indvector.s6+indvector.s0; ind = read_imageui(start, imageSampler, coord); start_even.s3 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_even.s3 = ind.x-1; coord.even += indvector.s5-indvector.s0; ind = read_imageui(start, imageSampler, coord); start_odd.s3 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_odd.s3 = ind.x-1; min=start_even.s3; mid=start_even.s3; max=stop_even.s3; while(min < max){ mid=(max + min)>>1; if((table[mid*NMB_COLUMN] > param && table[(mid-1)*NMB_COLUMN] <= param) || (table[mid*NMB_COLUMN] < param && table[(mid-1)*NMB_COLUMN] >= param)){ found_even.s3 = 1; break; } (param < table[(mid-1)*NMB_COLUMN]) ? max=mid-1 : min=mid+1; } mid_even.s3=mid; min=start_odd.s3; mid=start_odd.s3; max=stop_odd.s3; while(min < max){ mid=(max + min)>>1; if((table[mid*NMB_COLUMN] > param && table[(mid-1)*NMB_COLUMN] <= param) || (table[mid*NMB_COLUMN] < param && table[(mid-1)*NMB_COLUMN] >= param)){ found_odd.s3 = 1; break; } (param < table[(mid-1)*NMB_COLUMN]) ? max=mid-1 : min=mid+1; } mid_odd.s3=mid; //******get_Values()************* coord = (int2)(indvector.s0+indvector.s1+indvector.s2, indvector.s4+indvector.s8); ind = read_imageui(start, imageSampler, coord); start_even.s4 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_even.s4 = ind.x-1; coord.even += indvector.s5-indvector.s0; ind = read_imageui(start, imageSampler, coord); start_odd.s4 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_odd.s4 = ind.x-1; min=start_even.s4; mid=start_even.s4; max=stop_even.s4; while(min < max){ mid=(max + min)>>1; if((table[mid*NMB_COLUMN] > param && table[(mid-1)*NMB_COLUMN] <= param) || (table[mid*NMB_COLUMN] < param && table[(mid-1)*NMB_COLUMN] >= param)){ found_even.s4 = 1; break; } (param < table[(mid-1)*NMB_COLUMN]) ? max=mid-1 : min=mid+1; } mid_even.s4=mid; min=start_odd.s4; mid=start_odd.s4; max=stop_odd.s4; while(min < max){ mid=(max + min)>>1; if((table[mid*NMB_COLUMN] > param && table[(mid-1)*NMB_COLUMN] <= param) || (table[mid*NMB_COLUMN] < param && table[(mid-1)*NMB_COLUMN] >= param)){ found_odd.s4 = 1; break; } (param < table[(mid-1)*NMB_COLUMN]) ? max=mid-1 : min=mid+1; } mid_odd.s4=mid; //start comment out //******get_Values()************* coord.even = indvector.s2+indvector.s6+indvector.s0; ind = read_imageui(start, imageSampler, coord); start_even.s5= ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_even.s5 = ind.x-1; coord.even += indvector.s5-indvector.s0; ind = read_imageui(start, imageSampler, coord); start_odd.s5 = ind.x-1; ind = read_imageui(stop, imageSampler, coord); stop_odd.s5 = ind.x-1; min=start_even.s5; mid=start_even.s5; max=stop_even.s5; while(min < max){ mid=(max + min)>>1; if((table[mid*NMB_COLUMN] > param && table[(mid-1)*NMB_COLUMN] <= param) || (table[mid*NMB_COLUMN] < param && table[(mid-1)*NMB_COLUMN] >= param)){ found_even.s5 = 1; break; } (param < table[(mid-1)*NMB_COLUMN]) ? max=mid-1 : min=mid+1; } mid_even.s5=mid; min=start_odd.s5; mid=start_odd.s5; max=stop_odd.s5; while(min < max){ mid=(max + min)>>1; if((table[mid*NMB_COLUMN] > param && table[(mid-1)*NMB_COLUMN] <= param) || (table[mid*NMB_COLUMN] < param && table[(mid-1)*NMB_COLUMN] >= param)){ found_odd.s5 = 1; break; } (param < table[(mid-1)*NMB_COLUMN]) ? max=mid-1 : min=mid+1; } mid_odd.s5=mid; //end of commenting it out mid_even *= NMB_COLUMN; mid_odd *= NMB_COLUMN; tmp1.s0 = table[mid_even.s0-NMB_COLUMN]; tmp1.s1 = table[mid_even.s1-NMB_COLUMN]; tmp1.s2 = table[mid_even.s2-NMB_COLUMN]; tmp1.s3 = table[mid_even.s3-NMB_COLUMN]; tmp1.s4 = table[mid_even.s4-NMB_COLUMN]; tmp1 *= -1; tmp2 = tmp1; tmp1 += param; tmp1 = fabs(tmp1); tmp2.s0 += table[mid_even.s0]; tmp2.s1 += table[mid_even.s1]; tmp2.s2 += table[mid_even.s2]; tmp2.s3 += table[mid_even.s3]; tmp2.s4 += table[mid_even.s4]; tmp2.s5 += table[mid_even.s5]; tmp2 = fabs(tmp2); even = tmp1 / tmp2; tmp1.s0 = table[mid_odd.s0-NMB_COLUMN]; tmp1.s1 = table[mid_odd.s1-NMB_COLUMN]; tmp1.s2 = table[mid_odd.s2-NMB_COLUMN]; tmp1.s3 = table[mid_odd.s3-NMB_COLUMN]; tmp1.s4 = table[mid_odd.s4-NMB_COLUMN]; tmp1.s5 = table[mid_odd.s5-NMB_COLUMN]; tmp1 *= -1; tmp2 = tmp1; tmp1 += param; tmp1 = fabs(tmp1); tmp2.s0 += table[mid_odd.s0]; tmp2.s1 += table[mid_odd.s1]; tmp2.s2 += table[mid_odd.s2]; tmp2.s3 += table[mid_odd.s3]; tmp2.s4 += table[mid_odd.s4]; tmp2.s5 += table[mid_odd.s5]; tmp2 = fabs(tmp2); odd = tmp1 / tmp2; start_odd *= NMB_COLUMN; stop_odd *= NMB_COLUMN; less_odd.s0 = fabs(table[start_odd.s0]-param) <= fabs(table[stop_odd.s0]-param); less_odd.s1 = fabs(table[start_odd.s1]-param) <= fabs(table[stop_odd.s1]-param); less_odd.s2 = fabs(table[start_odd.s2]-param) <= fabs(table[stop_odd.s2]-param); less_odd.s3 = fabs(table[start_odd.s3]-param) <= fabs(table[stop_odd.s3]-param); less_odd.s4 = fabs(table[start_odd.s4]-param) <= fabs(table[stop_odd.s4]-param); less_odd.s5 = fabs(table[start_odd.s5]-param) <= fabs(table[stop_odd.s5]-param); less_even.s0 = fabs(table[start_even.s0*NMB_COLUMN]-param) <= fabs(table[stop_even.s0*NMB_COLUMN]-param); less_even.s1 = fabs(table[start_even.s1*NMB_COLUMN]-param) <= fabs(table[stop_even.s1*NMB_COLUMN]-param); less_even.s2 = fabs(table[start_even.s2*NMB_COLUMN]-param) <= fabs(table[stop_even.s2*NMB_COLUMN]-param); less_even.s3 = fabs(table[start_even.s3*NMB_COLUMN]-param) <= fabs(table[stop_even.s3*NMB_COLUMN]-param); less_even.s4 = fabs(table[start_even.s4*NMB_COLUMN]-param) <= fabs(table[stop_even.s4*NMB_COLUMN]-param); less_even.s5 = fabs(table[start_even.s5*NMB_COLUMN]-param) <= fabs(table[stop_even.s5*NMB_COLUMN]-param); start_even *= convert_uint16(less_even)*NMB_COLUMN; stop_even *= convert_uint16((!less_even))*NMB_COLUMN; start_even += stop_even; start_odd *= convert_uint16(less_odd); stop_odd *= convert_uint16((!less_odd)); start_odd += stop_odd; for(i=0;i<nmb;i++){ tmp1.s0 = table[mid_even.s0+i]; tmp1.s1 = table[mid_even.s1+i]; tmp1.s2 = table[mid_even.s2+i]; tmp1.s3 = table[mid_even.s3+i]; tmp1.s4 = table[mid_even.s4+i]; tmp1.s5 = table[mid_even.s5+i]; tmp1.s6 = table[mid_even.s6+i]; tmp1.s7 = table[mid_even.s7+i]; tmp1.s8 = table[mid_even.s8+i]; tmp1.s9 = table[mid_even.s9+i]; tmp1.sA = table[mid_even.sA+i]; tmp1.sB = table[mid_even.sB+i]; tmp1.sC = table[mid_even.sC+i]; tmp1.sD = table[mid_even.sD+i]; tmp1.sE = table[mid_even.sE+i]; tmp1.sF = table[mid_even.sF+i]; tmp1 *= even; tmp2.s0 = table[mid_even.s0-NMB_COLUMN+i]; tmp2.s1 = table[mid_even.s1-NMB_COLUMN+i]; tmp2.s2 = table[mid_even.s2-NMB_COLUMN+i]; tmp2.s3 = table[mid_even.s3-NMB_COLUMN+i]; tmp2.s4 = table[mid_even.s4-NMB_COLUMN+i]; tmp2.s5 = table[mid_even.s5-NMB_COLUMN+i]; tmp2.s6 = table[mid_even.s6-NMB_COLUMN+i]; tmp2.s7 = table[mid_even.s7-NMB_COLUMN+i]; tmp2.s8 = table[mid_even.s8-NMB_COLUMN+i]; tmp2.s9 = table[mid_even.s9-NMB_COLUMN+i]; tmp2.sA = table[mid_even.sA-NMB_COLUMN+i]; tmp2.sB = table[mid_even.sB-NMB_COLUMN+i]; tmp2.sC = table[mid_even.sC-NMB_COLUMN+i]; tmp2.sD = table[mid_even.sD-NMB_COLUMN+i]; tmp2.sE = table[mid_even.sE-NMB_COLUMN+i]; tmp2.sF = table[mid_even.sF-NMB_COLUMN+i]; tmp2 *= (1-even); tmp1 += tmp2; tmp1 *= convert_float16(found_even); tmp2.s0 = table[start_even.s0+i]; tmp2.s1 = table[start_even.s1+i]; tmp2.s2 = table[start_even.s2+i]; tmp2.s3 = table[start_even.s3+i]; tmp2.s4 = table[start_even.s4+i]; tmp2.s5 = table[start_even.s5+i]; tmp2.s6 = table[start_even.s6+i]; tmp2.s7 = table[start_even.s7+i]; tmp2.s8 = table[start_even.s8+i]; tmp2.s9 = table[start_even.s9+i]; tmp2.sA = table[start_even.sA+i]; tmp2.sB = table[start_even.sB+i]; tmp2.sC = table[start_even.sC+i]; tmp2.sD = table[start_even.sD+i]; tmp2.sE = table[start_even.sE+i]; tmp2.sF = table[start_even.sF+i]; tmp2 *= convert_float16((!found_even)); lini = tmp1 + tmp2; lini *= dists.s0; tmp1.s0 = table[mid_odd.s0+i]; tmp1.s1 = table[mid_odd.s1+i]; tmp1.s2 = table[mid_odd.s2+i]; tmp1.s3 = table[mid_odd.s3+i]; tmp1.s4 = table[mid_odd.s4+i]; tmp1.s5 = table[mid_odd.s5+i]; tmp1.s6 = table[mid_odd.s6+i]; tmp1.s7 = table[mid_odd.s7+i]; tmp1.s8 = table[mid_odd.s8+i]; tmp1.s9 = table[mid_odd.s9+i]; tmp1.sA = table[mid_odd.sA+i]; tmp1.sB = table[mid_odd.sB+i]; tmp1.sC = table[mid_odd.sC+i]; tmp1.sD = table[mid_odd.sD+i]; tmp1.sE = table[mid_odd.sE+i]; tmp1.sF = table[mid_odd.sF+i]; tmp1 *= odd; tmp2.s0 = table[mid_odd.s0-NMB_COLUMN+i]; tmp2.s1 = table[mid_odd.s1-NMB_COLUMN+i]; tmp2.s2 = table[mid_odd.s2-NMB_COLUMN+i]; tmp2.s3 = table[mid_odd.s3-NMB_COLUMN+i]; tmp2.s4 = table[mid_odd.s4-NMB_COLUMN+i]; tmp2.s5 = table[mid_odd.s5-NMB_COLUMN+i]; tmp2.s6 = table[mid_odd.s6-NMB_COLUMN+i]; tmp2.s7 = table[mid_odd.s7-NMB_COLUMN+i]; tmp2.s8 = table[mid_odd.s8-NMB_COLUMN+i]; tmp2.s9 = table[mid_odd.s9-NMB_COLUMN+i]; tmp2.sA = table[mid_odd.sA-NMB_COLUMN+i]; tmp2.sB = table[mid_odd.sB-NMB_COLUMN+i]; tmp2.sC = table[mid_odd.sC-NMB_COLUMN+i]; tmp2.sD = table[mid_odd.sD-NMB_COLUMN+i]; tmp2.sE = table[mid_odd.sE-NMB_COLUMN+i]; tmp2.sF = table[mid_odd.sF-NMB_COLUMN+i]; tmp2 *= (1-odd); tmp1 += tmp2; tmp1 *= convert_float16(found_odd); tmp2.s0 = table[start_odd.s0+i]; tmp2.s1 = table[start_odd.s1+i]; tmp2.s2 = table[start_odd.s2+i]; tmp2.s3 = table[start_odd.s3+i]; tmp2.s4 = table[start_odd.s4+i]; tmp2.s5 = table[start_odd.s5+i]; tmp2.s6 = table[start_odd.s6+i]; tmp2.s7 = table[start_odd.s7+i]; tmp2.s8 = table[start_odd.s8+i]; tmp2.s9 = table[start_odd.s9+i]; tmp2.sA = table[start_odd.sA+i]; tmp2.sB = table[start_odd.sB+i]; tmp2.sC = table[start_odd.sC+i]; tmp2.sD = table[start_odd.sD+i]; tmp2.sE = table[start_odd.sE+i]; tmp2.sF = table[start_odd.sF+i]; tmp2 *= convert_float16((!found_odd)); tmp1 += tmp2; lini += (1-dists.s0)*tmp1; half_int1 = dists.s1*lini.even; half_int2 = (1-dists.s1)*lini.odd; half_int1 += half_int2; quart_int1 = dists.s2*half_int1.even; quart_int2 = (1-dists.s2)*half_int1.odd; quart_int1 += quart_int2; eighth_int1 = dists.s3*quart_int1.even; eighth_int2 = (1-dists.s3)*quart_int1.odd; eighth_int1 += eighth_int2; dest[index+i] = dists.s4*eighth_int1.even + (1-dists.s4)*eighth_int1.odd; } }

                              • Why does the Compiler behave this way?
                                Jawed

                                I just tried making NMB_COLUMN a parameter to your kernel, instead of a #define and then made it a parameter to the find function, too. Then edited every instance of the find call to pass NMB_COLUMN from the kernel get_values into find.

                                It still fails.

                                I used Stream Kernel Analyzer which is now called APP Kernel Analyzer, I think.

                                I'm unsure whether Catalyst 11.1 with APP contains a revised version of the OpenCL compiler. It is AMD's OpenCL compiler that's failing. I'm not sure if Catalyst 11.1 has a newer OpenCL compiler than the compiler in SDK 2.3.

                                Basically, you've broken the compiler. Welcome to the club!

                                  • Why does the Compiler behave this way?
                                    diedalusus

                                     

                                    Originally posted by: Jawed

                                     

                                    I'm unsure whether Catalyst 11.1 with APP contains a revised version of the OpenCL compiler. It is AMD's OpenCL compiler that's failing. I'm not sure if Catalyst 11.1 has a newer OpenCL compiler than the compiler in SDK 2.3.

                                     

                                     

                                     

                                    I tried it with 11.1 too: no difference.

                            • Why does the Compiler behave this way?
                              MicahVillmow
                              Jawed,
                              Catalyst uses the compiler from the most recent SDK release.
                              • Why does the Compiler behave this way?
                                MicahVillmow
                                diedalusus,
                                Your kernel compiles fine with the compiler that will be included in SDK 2.4 for all devices that support images.
                                • Why does the Compiler behave this way?
                                  MicahVillmow
                                  Sorry, we aren't allowed to give out release dates.