9 Replies Latest reply on Sep 29, 2016 4:10 AM by dipak

    confirm bugs in cl_amd_printf

    mkohler

      Hello OpenCLers,


      I'm seeing two problems with cl_amd_printf. These problems only occur when running on the GPU.


      1. If the input string doesn't include a % format specifier, the input string doesn't get printed.
      2. If the built kernel is retrieved from cache, a Segmentation fault occurs.


      The details of my system are:

      - Ubuntu 14.04 LTS, x86_64

      - processor: AMD A10 7850K

      - Catalyst 4.4.12874

      - clinfo version: OpenCL 1.2 AMD-APP (1445.5)

      - running on the GPU

       

      I've pasted code to demonstrate these problems below. On my system, the first time the code is run it will print the "lucky number" line. The second time the code is run, I see the message that "Built kernel retrieved from cache" and a segmentation fault.

       

      Mark

      ---------------------------------------------------------------------

      import sys

      import pyopencl as cl

       

      KERNEL_STR = r'''

      #pragma OPENCL EXTENSION cl_amd_printf : enable

       

      __kernel void cl_kernel() {

          // The line below does not get printed.

          printf("Hello, world\n");

       

          // This line will get printed.

          printf("My lucky number is %d\n", 7);

       

          // The line below does not get printed.

          printf("Hello again.\n");

      }

      '''

       

       

      def main():

          platform = cl.get_platforms()[0]

          devices = platform.get_devices(cl.device_type.GPU)

          ctx = cl.Context(devices)

          device = devices[0]

          cmd_q = cl.CommandQueue(ctx, device)

       

          program = cl.Program(ctx, KERNEL_STR)

          program.build(devices=[device])

       

          program.cl_kernel(cmd_q, (4,), None)

       

       

      if __name__ == '__main__':

          sys.exit(main())

        • Re: confirm bugs in cl_amd_printf
          dipak

          Hi,

           

          Thanks for reporting. We'll try to reproduce the error and get back to you.

           

          Regards,

          • Re: confirm bugs in cl_amd_printf
            dipak

            Hi,

            I am able to reproduce the printf error (but under different system setup and driver). I'll do some more test and, if required, I'll file an internal bug report.

             

            Regards,

              • Re: confirm bugs in cl_amd_printf
                mkohler

                You were able to reproduce both the missing output problem, and the segmentation fault?

                 

                Mark

                  • Re: confirm bugs in cl_amd_printf
                    dipak

                    Hi Mark,

                    I was able to reproduce the missing output problem, but not the segmentation fault.

                    [FYI: I used VS2012 and AMD HD6970 using a different version of driver (greater than your, as per "clinfo").]

                     

                    Regards,

                    • Re: confirm bugs in cl_amd_printf
                      Raistmer

                      I can confirm segmentation fault too:

                       

                      - Unhandled Exception Record -

                      Reason: Access Violation (0xc0000005) at address 0x54DCC4C9 read attempt to address 0x00000010

                       

                       

                      - Registers -

                      eax=00000000 ebx=00000000 ecx=0676fd2c edx=0666e8b8 esi=00000000 edi=006555a8

                      eip=54dcc4c9 esp=0676fcd0 ebp=0676fcdc

                      cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246

                       

                       

                      - Callstack -

                      ChildEBP RetAddr  Args to Child

                      0676fcdc 54e35b80 00000000 00000000 ffffffff 00000002 amdocl!oclGetAsic+0x0

                      0676fd88 54e35258 00000000 03313000 0676fdc0 0846c1c8 amdocl!clSetKernelExecInfo+0x0

                      0676fdd0 54e19db2 05043ba8 003f4b01 00000000 04fde860 amdocl!clSetKernelExecInfo+0x0

                      0676fe88 54e19993 0846c1b4 00000000 00000000 00000000 amdocl!clSetKernelExecInfo+0x0

                      0676febc 54ded396 0846c0f0 04dcb608 04dcb588 04dcb608 amdocl!clSetKernelExecInfo+0x0

                      0676ff14 54ded456 05043ba8 04dcb608 00658a08 0676ff4c amdocl!clSetKernelExecInfo+0x0

                      0676ff34 54de91cd 04dcb500 00000000 00000000 00658a08 amdocl!clSetKernelExecInfo+0x0

                      0676ff58 54dfe2fc 00000000 00000000 04dcb608 00000000 amdocl!clSetKernelExecInfo+0x0

                      0676ff88 7561338a 04dcb608 0676ffd4 779d9f72 04dcb608 amdocl!clSetKernelExecInfo+0x0

                      0676ff94 779d9f72 04dcb608 4d5e5df8 00000000 00000000 kernel32!BaseThreadInitThunk+0x0

                      0676ffd4 779d9f45 54dfe2c0 04dcb608 00000000 00000000 ntdll!RtlInitializeExceptionChain+0x0

                      0676ffec 00000000 54dfe2c0 04dcb608 00000000 00000006 ntdll!RtlInitializeExceptionChain+0x0

                       

                      The reason: kernel was read from binary cache and it contained: printf("idx=%d\n",idx);

                      With this line commented out crash doesn't happen. With this line present but kernel built from source- crash doesn't happen.

                      Driver/device:

                        Name: Loveland
                        Vendor: Advanced Micro Devices, Inc.
                        Driver version: 1642.5 (VM)
                        Version: OpenCL 1.2 AMD-APP (1642.5)
                        • Re: confirm bugs in cl_amd_printf
                          dipak

                          Thanks for reporting. Could you please share your setup details where the segfault was observed?

                          Please share your kernel code too if it's different than the above one.

                           

                          Regards,

                            • Re: confirm bugs in cl_amd_printf
                              Raistmer

                              Besides already listed: it's windows 7 x64 host.

                               

                              Kernel is:

                              __attribute__((reqd_work_group_size(RESULT_SIZE, 2, 1)))

                              __kernel void Gaussian_logging_kernel_twin(__global const float* restrict PoT, __global const float* restrict NormMaxPower,

                                  int ul_FftLength,

                                __constant ocl_GaussFit_t* restrict settings, __global uint* restrict result_flag,

                                __global float4* restrict GaussFitResults,__global GPUState* restrict gpu_state,

                                __global const float* restrict GaussFitScores){

                                __local Gaussian_core signals[RESULT_SIZE][2][MAX_GPU_GAUSSIANS+1];

                                __local int res_size[RESULT_SIZE][2];

                                __local float best_score[RESULT_SIZE][2];

                                const int tid=get_global_id(0);

                                const int neg=get_global_id(1);

                                int bfft_ind;

                                int bbin;

                                float bscore=-12.f;

                                int idx=gpu_state->gaussians.index;

                                int local_idx=0;

                                int start=settings->GaussTOffsetStart;

                                int stop=settings->GaussTOffsetStop;

                                int stride=(RESULT_SIZE>ul_FftLength?1:ul_FftLength/RESULT_SIZE);

                                if(result_flag[tid+RESULT_SIZE*neg]==1){

                                result_flag[tid+RESULT_SIZE*neg]=0;

                                bscore=gpu_state->gaussians.gaussian[0].score;

                                //max(gpu_state->gaussians.bscore,gpu_state->gaussians.gaussian[0].score);//R take "best"

                                for(int j=max(tid*stride,1);j<(tid+1)*stride;j++)//R: DC PoT always skipped in Gaussian search

                                for(int k=start;k<stop;k++){

                                if(idx+local_idx>=MAX_GPU_GAUSSIANS)break;//R:overflow

                                int coord=ul_FftLength*k+j+ul_FftLength*GAUSS_POT_LENGTH*neg;

                                float4 tmp=GaussFitResults[coord];

                                if(tmp.x==0.0f)continue;//R: nothing here

                                float score = (idx?-12.f:GaussFitScores[coord]);

                                //R: best update

                                if( (score > bscore)&&(tmp.z<=settings->gauss_chi_sq_thresh) ){

                                bscore=score;

                                bfft_ind=k;

                                bbin=j;

                                }

                                //R: new gaussian logging locally

                                if( (tmp.z<=settings->GaussChiSqThresh) && (tmp.w >= settings->gauss_null_chi_sq_thresh) &&

                                (tmp.y>=tmp.x*settings->GaussPeakPowerThresh)){

                                local_idx++;

                                signals[tid][neg][local_idx].fft_ind=k;

                                signals[tid][neg][local_idx].bin=j;

                                }

                                }

                                   //R: best gaussian update in global memory

                                signals[tid][neg][0].fft_ind=bfft_ind;

                                signals[tid][neg][0].bin=bbin;

                                }

                                //R: these 2 updated always, even if nothing in current chunk

                                best_score[tid][neg]=bscore;

                                res_size[tid][neg]=local_idx;

                                barrier(CLK_LOCAL_MEM_FENCE);

                                //R: now complete reduction from local memory

                                if(tid+neg == 0){//R: single first workitem does the reduction job

                                bscore=gpu_state->gaussians.gaussian[0].score;

                                for(int i=0;i<2;i++){for(int j=0;j<RESULT_SIZE;j++){

                                //R:best update

                                if(best_score[j][i]>bscore){

                                bscore=best_score[j][i];

                                bfft_ind=signals[j][i][0].fft_ind;

                                bbin=signals[j][i][0].bin;

                                }

                                //R: new gaussian logging

                                for(int k=1;k<res_size[j][i];k++){

                                if(idx>=MAX_GPU_GAUSSIANS){//R: update global mem data and return on overflow

                                gpu_state->gaussians.index=idx;

                                if(bscore>gpu_state->gaussians.gaussian[0].score){//R: best gaussian update in global memory

                                float4 tmp=GaussFitResults[ul_FftLength*bfft_ind+bbin+ul_FftLength*GAUSS_POT_LENGTH*i];

                                gpu_state->gaussians.gaussian[0].score=bscore;

                                gpu_state->gaussians.gaussian[0].icfft=(i?-gpu_state->icfft:gpu_state->icfft);

                                gpu_state->gaussians.gaussian[0].fft_ind=bfft_ind;

                                gpu_state->gaussians.gaussian[0].bin=bbin;

                                gpu_state->gaussians.gaussian[0].peak_power=tmp.y;

                                gpu_state->gaussians.gaussian[0].mean_power=tmp.x;

                                gpu_state->gaussians.gaussian[0].chisqr=tmp.z;

                                gpu_state->gaussians.gaussian[0].null_chisqr=tmp.w;

                                gpu_state->gaussians.gaussian[0].max_power=NormMaxPower[bbin+i*ul_FftLength];

                                for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)

                                gpu_state->gaussians.gaussian[0].pot[m]=PoT[m*ul_FftLength+bbin+i*ul_FftLength*GAUSS_POT_LENGTH];

                                }

                                return;

                                }

                                idx++;

                                gpu_state->gaussians.gaussian[idx].icfft=(i?-gpu_state->icfft:gpu_state->icfft);

                                gpu_state->gaussians.gaussian[idx].fft_ind=signals[j][i][k].fft_ind;

                                gpu_state->gaussians.gaussian[idx].bin=signals[j][i][k].bin;

                                float4 tmp=

                                GaussFitResults[ul_FftLength*signals[j][i][k].fft_ind+signals[j][i][k].bin+ul_FftLength*GAUSS_POT_LENGTH*i];

                                gpu_state->gaussians.gaussian[idx].peak_power=tmp.y;

                                gpu_state->gaussians.gaussian[idx].mean_power=tmp.x;

                                gpu_state->gaussians.gaussian[idx].chisqr=tmp.z;

                                gpu_state->gaussians.gaussian[idx].null_chisqr=tmp.w;

                                gpu_state->gaussians.gaussian[idx].max_power=NormMaxPower[j+i*ul_FftLength];

                                //R: TODO evaluate async workgroup copy

                                for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)

                                gpu_state->gaussians.gaussian[idx].pot[m]=PoT[m*ul_FftLength+signals[j][i][k].bin+i*ul_FftLength*GAUSS_POT_LENGTH];

                               

                               

                                }

                                }

                                if(bscore>gpu_state->gaussians.gaussian[0].score){//R: best gaussian update in global memory

                                gpu_state->gaussians.gaussian[0].score=bscore;

                                gpu_state->gaussians.gaussian[0].icfft=(i?-gpu_state->icfft:gpu_state->icfft);

                                gpu_state->gaussians.gaussian[0].fft_ind=bfft_ind;

                                gpu_state->gaussians.gaussian[0].bin=bbin;

                                float4 tmp=GaussFitResults[ul_FftLength*bfft_ind+bbin+ul_FftLength*GAUSS_POT_LENGTH*i];

                                gpu_state->gaussians.gaussian[0].peak_power=tmp.y;

                                gpu_state->gaussians.gaussian[0].mean_power=tmp.x;

                                gpu_state->gaussians.gaussian[0].chisqr=tmp.z;

                                gpu_state->gaussians.gaussian[0].null_chisqr=tmp.w;

                                gpu_state->gaussians.gaussian[0].max_power=NormMaxPower[bbin+i*ul_FftLength];

                                for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)

                                gpu_state->gaussians.gaussian[0].pot[m]=PoT[m*ul_FftLength+bbin+i*ul_FftLength*GAUSS_POT_LENGTH];

                                }

                                }

                                gpu_state->gaussians.index=idx;

                                //printf("idx=%d\n",idx);

                                }//end common work

                              }

                                • Re: confirm bugs in cl_amd_printf
                                  dipak

                                  Hi,

                                  AFAIK, "Loveland" belongs to TeraScale series. Here is the latest driver available for that series: Legacy. From your clinfo output, it seems that your driver is older than that. Did you check with this latest driver?

                                  Btw, TeraScale or any other pre-GCN products have been moved to a legacy support model and no additional driver releases are planned. Please check the release note for details.

                                   

                                  Regards,

                        • Re: confirm bugs in cl_amd_printf
                          alexfd7

                          Here on my computer, I do so and it works:

                           

                              // The line below does not get printed.

                              printf("Hello, world%s\n"," ");

                           

                              // The line below does not get printed.

                              printf("Hello again.%s\n", " ");

                           

                           

                          And a line break like this:         printf("%s\n", " ");