19 Replies Latest reply on Apr 27, 2015 4:07 AM by dipak

    Kernel with local memory usage gives different results on some hardware

    Raistmer

      Trying to speedup processing of few large arrays I used shared/local memory for splittling arrays to smaller blocks and to increase execution domain of kernel.

      It wroks on on my dev host (C-60 Loveland) and also gives correct results on HD6950 GPU. But some testers report wrong computations on some GPUs.

       

      So far tested:

      C-60 Loveland with OpenCL 1.2 AMD-APP (1268.1) driver (Windows) - correct results

      HD6950 with OpenCL 1.2 AMD-APP (1348.5) driver (Windows) - correct results

      HD7970/Tahiti with Catalyst 14.9 (Windows) - invalid results

      Tahiti LE with Catalyst 14.12/ OpenCL 1.2 AMD-APP (1642.5) driver  (Linux) - correct results

      Hawaii Pro with Catalyst 14.9/  OpenCL 1.2 AMD-APP (1526.3) driver  (Linux)- invalid results

       

      Not too clear is it driver version related issue or card architecture related or some issue with kernel's code itself.

       

      Here is the kernel under question: http://pastebin.com/c9sX8Xwj

      It has debug output enabled and different cards provide quite different outputs.

       

      What is wrong here?

       

      P.S. kernel's local domain is always {x,1,z} hence no local id(1) used inside kernel. Also, kernel produced correct results on HD7970 with workgroups/local domain of (1,1,64) and (4,1,1)(this one means no array splitting at all) but generated wrong results with (1,1,128).

      Did not find any allowed WG configs that would fail on C-60 so far...

        • Re: Kernel with local memory usage gives different results on some hardware
          Raistmer

          Additional tests were made on Tahity, Tahity LE and Hawaii devices under Windows and Linux.

          While Tahity LE worked with all possible workgroup geometry, both Tahity and Hawaii work correctly only when workgroup size less or equal to wave size (that is WGsize<=64). And for all possible kernel geometries. That is 2x1x32 works as well as 4x1x16, but 1x1x128 will not go.

           

          All this points to some issues with synchronization between waves. Some required barriers missed? Or some issue on another than source code level?...

            • Re: Kernel with local memory usage gives different results on some hardware
              ravkum

              Hi,

               

              Would like to check the source code here. The shared path is not accessible here. Could you check that?

               

              Regards,

              Ravi

                • Re: Kernel with local memory usage gives different results on some hardware
                  Raistmer

                  Actually, very high probability that this issue has same roots as described in this thread: possible OpenCl compiler bug few months ago. Cause we tried latest available drivers it means issue not fixed still.

                  Please do fix to already known and CONFIRMED by your staff issue first. This would save lot of time both users and support staff not to re-check and re-report already detected bugs over and over.

                   

                  And full kernel code in case I'm mistaken and this is another issue:

                   

                  1. __kernel __attribute__((vec_type_hint(float4)))
                  2. void PC_find_triplets_avg_kernel_HD5_cl(int ul_FftLength, int len_power, float triplet_thresh_base, int AdvanceBy, int PoTLen,
                  3.                                                                           __global float4* PoT,__global uint* result_flag,__global float4* PulsePoT_average,
                  4.                                                                           __local float4* tmp) {
                  5. //R: this one uses local memory to increase number of separate workitems hence number of workgroups for better load of
                  6. // multi-CU devices
                  7.                                                                                 
                  8. //R: difference from original kernel: this one just doing fast precheck and relies on CPU for real triplet analysis
                  9. //R: this kernel does 4 PoT chunks at once.
                  10. //R: cause workitems can write flag in arbitrary order it's not possible to set error code actually (original CUDA code
                  11. // missed this fact and tries to return error code from kernel. That is, different variable should be used for setting
                  12. // state flags.
                  13.         int ul_PoT = get_global_id(0);//R: 4 PoTs at once!
                  14.         int y = get_global_id(1);//R: index of offset chunk
                  15.         int tid=get_local_id(2);
                  16.         int fft_len4=ul_FftLength>>2;
                  17.         int TOffset = y * AdvanceBy;
                  18. //R: each wave of third index works on single PoT array
                  19. //      local float4 local_sum[64/*can be get_local_size(2) if variable length allowed*/];
                  20.         if(TOffset + len_power > PoTLen) {            
                  21.                 TOffset = PoTLen - len_power;
                  22.         }
                  23.         __global float4* fp_PulsePot= PoT + ul_PoT + TOffset * (fft_len4);
                  24.         // Clear the result array
                  25.         //int4 numBinsAboveThreshold_private=(int4)0;
                  26.         float4 tmp_private=(float4)0.f,triplet_thresh=(float4)triplet_thresh_base,pp;
                  27.         __local float4* tmp_local=tmp+get_local_size(2)*get_local_id(0);
                  28.         /* Get all the bins that are above the threshold, and find the power array mean value */
                  29.         for( int i=tid;i<len_power;i+=get_local_size(2)/*can be get_local_size(2) if variable length allowed*/ ) {
                  30.                 tmp_private += fp_PulsePot[i*fft_len4];
                  31.         }
                  32.         //R: here can be one of new reduce operations but this will require higher CL version
                  33.         tmp_local[tid]=tmp_private;
                  34.         for(int i=(get_local_size(2)>>1); i>0;i>>=1){
                  35.                 barrier(CLK_LOCAL_MEM_FENCE);
                  36.                 if(tid<i){
                  37.                         tmp_local[tid]+=tmp_local[tid+i];
                  38.                 }
                  39.         }
                  40.         barrier(CLK_LOCAL_MEM_FENCE);
                  41.         if(tid==0){
                  42.                 tmp_private=tmp_local[0];
                  43.                 tmp_private/= (float4)len_power;
                  44.                 PulsePoT_average[ul_PoT+y*fft_len4]=tmp_private;//R: this avg will be needed later, at pulse finding
                  45.                 tmp_local[0]=tmp_private;//R: to  share with other threads
                  46.         }
                  47.         barrier(CLK_LOCAL_MEM_FENCE);
                  48.         tmp_private=tmp_local[0];//R: broadcast reduced value to all threads for further use
                  49.         triplet_thresh*=tmp_private;
                  50.         tmp_private=(float4)0.f;
                  51.         for( int i=tid;i<len_power;i+=get_local_size(2)) {
                  52.                 pp= fp_PulsePot[i*fft_len4];
                  53.                 if(  pp.x>= triplet_thresh.x ) {
                  54.                         tmp_private.x+=1.f;
                  55.                         printf("X BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
                  56.                                 get_global_id(0),get_global_id(1),get_global_id(2),
                  57.                         get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
                  58.                         printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
                  59.                 }
                  60.                 if(  pp.y>= triplet_thresh.y ) {
                  61.                         tmp_private.y+=1.f;
                  62.                         printf("Y BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
                  63.                                 get_global_id(0),get_global_id(1),get_global_id(2),
                  64.                                 get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
                  65.                         printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
                  66.                 }
                  67.                 if(  pp.z>= triplet_thresh.z ) {
                  68.                         tmp_private.z+=1.f;
                  69.                         printf("Z BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
                  70.                                 get_global_id(0),get_global_id(1),get_global_id(2),
                  71.                                 get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
                  72.                         printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
                  73.                 }
                  74.                 if(  pp.w>= triplet_thresh.w ) {
                  75.                         tmp_private.w+=1.f;
                  76.                         printf("W BRANCH: global:(%d,%d,%d); local:(%d,%d,%d); tmp_private:(%v4g)\n",
                  77.                                 get_global_id(0),get_global_id(1),get_global_id(2),
                  78.                                 get_local_id(0),get_local_id(1),get_local_id(2),tmp_private);
                  79.                         printf("pp:(%v4g); triplet_thresh:(%v4g)\n",pp,triplet_thresh);
                  80.                 }
                  81.         }
                  82. //R: again need to reduce values
                  83.         tmp_local[tid]=tmp_private;
                  84.         for(int i=(get_local_size(2)>>1)/*can be get_local_size(2) if variable length allowed*/; i>0;i>>=1){
                  85.                 barrier(CLK_LOCAL_MEM_FENCE);
                  86.                 if(tid<i){
                  87.                         tmp_local[tid]+=tmp_local[tid+i];
                  88.                 }
                  89.         }
                  90.         barrier(CLK_LOCAL_MEM_FENCE);
                  91.   if(tid==0){
                  92.                 tmp_private=tmp_local[0];
                  93.         if(tmp_private.x>2.f || tmp_private.y>2.f || tmp_private.z>2.f || tmp_private.w>2.f){
                  94. //R: global size is power of 2 so it should be safe to perform integer division here
                  95. //              printf("Resulting numbers of peaks: (%v4g)\n",tmp_private);
                  96.                 int result_coordinate=(get_global_size(0)>RESULT_SIZE)?
                  97.                         ((RESULT_SIZE*get_global_id(0))/get_global_size(0)):get_global_id(0);
                  98.                 result_flag[result_coordinate]=1;
                  99.         }
                  100.   }
                  101. }
                    • Re: Kernel with local memory usage gives different results on some hardware
                      ravkum

                      Thanks for the code. Could you please also tell me the size of local memory allocated in the host code?

                       

                      Also what is the global_work_size for all the different local_work_sizes you have mentioned in your earlier posts?

                       

                      Regards,

                      Ravi

                        • Re: Kernel with local memory usage gives different results on some hardware
                          Raistmer

                          Thanks for looking into this issue.

                          Requested data:

                           

                          1) List of global kernel sizes for run with failures (received on Tahity LE host that able to do this kernel properly):

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (4,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (128,15,64); local (4,1,64)

                          host: launching PC_find_triplets_avg_kernel_HD5_cl with next domains: global (256,15,64); local (4,1,64)

                           

                          2)Allocated local memory:

                           

                          err |= clSetKernelArg(PC_find_triplets_avg_kernel_HD5_cl,8,sizeof(cl_float4)*64*4,NULL);

                           

                          That is, local memory area enough for storing max possible workgroup size (256) of  cl_float4 values is allocated.

                          Maybe worth to make it tunable to real WG size used, but for now just max possible amount allocated. In cases where WG size smaller than 256 just not all allocated amount really used.

                           

                          3) Global domain vs local domain sizes.

                          Global domain depends on data being processed. I listed global sizes for very that task that has failures on some GPUs but processed OK on others.

                          local domain currently tunable. Listed one (4,1,64) will work OK on some GPUs but will fail on Tahity and Hawaii.

                          If one chose something like (2,1,32) or (1,1,64) (with very same first two dimensions of global domains) task will be finished OK on ALL tested devices. As one can see, WG has different geometry but always have to be of size of single wavefront to work everywhere.

                           

                          EDIT:

                          4) There is another very similar kernel that processes some sizes. and can result in failure too. The difference from listed one - there is no write into global memory for averaged value. All other just the same.

                          I'll list global domains used there soon too.

                          EDIT2:

                          And here are missed sizes from secondary kernel:

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (8,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (4,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (4,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (32,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (16,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                          host: launching PC_find_triplets_kernel_HD5_cl with next domains: global (64,15,64); local (4,1,64)

                    • Re: Kernel with local memory usage gives different results on some hardware
                      jason

                      this  sounds like you arent using a barrier somewhere.  64 is the magic wavefront size,<= it all work items execute in lock step on amd gpus.

                       

                      however after glancing it over i wasnt sure your offset calculation of tmp_local made sense... things like that can also produce this kind of problem,

                        • Re: Kernel with local memory usage gives different results on some hardware
                          Raistmer

                          Yes, some synching missing was the first thing I thought about.

                          But so far I can't find where barriers missed. Also, keep in mind that this kernel works perfectly on HD6950 for example. And on C-60 APU too. Both devices would experience same issues as some of GCN ones in case of missing barrier... but they don't.

                           

                          And what exactly you don't like in tmp_local? There are get_local_size(2) number of threads/work items that work cooperatively on single array. Kernel processes few such arrays hence single workgroup handles few independend teams of threads (governed by get_local_id(0) index). Also, get_local_id(1) always zero cause WG dimensions always x*1*z. So get_local_id(1) doesn't participate in calculations.

                          Please be more specific what exactly you consider as wrong there.

                            • Re: Kernel with local memory usage gives different results on some hardware
                              jason

                              reposting - first post got rejected because of a sentence containing AMD and the word fail.  Lighten up mods.  AMD has caused people headaches repeatedly so it is a fact of life to consider in debugging.

                               

                              To remove thread / race conditions / AMD originating failures from the problem, try validating your code in python with something like numpy trying to emulate what the threads should be doing and all the id/index computations.  I've found you can get pretty close mapping in most places but you still must do it carefully.

                               

                              Btw I don't know if it makes a bit of difference in performance these days but generally you would want swap your semantics dimensions of x and z because x is the fastest moving, z is the slowest.

                               

                              I was also not able to figure why you would want an x dimension larger >1, not that necessarily undermines your issue.  The hardware scheduler should pretty much do the job of x for you.

                               

                              You might also try declaring the workgroup shared memory locally and see if that changes anything (I wouldn't be surprised if the compiler emits different instructions) due to that alone.

                               

                              Except for the semantics mapping of x and z and what you are currently using x for I have a ton of code just like this function.  We all know the compiler sucks alot but generally I don't have issues with it on simple stuff like this which leads me to believe its one of those "everything looks correct but there's just one problem hiding in plain sight".  Indexing bugs can also alias like this which is part of why I suggest try writing the code in python.  You'll at least have more proof/peace of mind and have an easier environment to inspect what's going on for most issues (not all).

                               

                              Another thing you could do is write out intermediate data to GDS so you can double check and bisect the range of where the calculations go bad...

                               

                              Also, you remember float parallel reductions are not equivalent to serial reductions right?  This is due to associativity of floating point math.  You can s/float/int/ maybe and test to see if you get the same result across different work group sizes as long as there's enough numerical stability..

                                • Re: Kernel with local memory usage gives different results on some hardware
                                  Raistmer

                                  I'm  too big fan of Occam razor principle to not going into troubles of re-implementing kernel on another language knowing that it works already OK on some hardware until explanation why it works only on such hardware will be received. Current most simple explanation is bug in compiler that emits synching machine code for subset of GCN devices. In few days I will able to test same kernel on iGPU and nVidia. It would be quite enough to rule out algorithmic issues IMO (even if having working kernel on VLIW ATi GPUs isn't enough).

                                   

                                  Regarding x and z dimensions - historical reasons mostly - there is similar kernel that doesn't use local memory. Taking into account strided access to global memory and low computational density of kernel I'm not expect big speed difference from reordering local memory accesses. Biggest issue that this kernel solves is to load all CUs with work, that's not the case sometimes with older one.

                                   

                                  Why x-dim in workgroup: cause I want flexibility in WG geometry (and this flexibility will gone if local memory will be allocated inside kernel, BTW). App processes different numbers of arrays at once with different sizes on devices with different numbers of CUs, hence I need different number of workgroups and waves. Having workgroup of single thread team, especially when this team equal or less than wavefront size will limit waves in flight on CU that will reduce occupancy and performance.

                                   

                                  Possibility of  rounding issue can be ruled out cause on devices where kernel gives incorrect results it gives _different_results from run to run, not just invalid but stable ones (runs with identical WG geometries set of course).

                                    • Re: Kernel with local memory usage gives different results on some hardware
                                      jason

                                      right, well...

                                       

                                      Also, did you compare results against a regular old CPU target with your problematic WG sizes?  Does commenting out the printfs you have change outcomes any either - I've seen it's presence/absence influence some strange things and not just limited to race conditions.  Storing intermediate results to GDS would allow you to bisect the range down too where the problem occurs - it can help you figure out where to look in IR/ISA results.

                                       

                                      Also I noticed on the cards you tested with problems used 14.9 - I know you noted this but did you bother retesting against 14.12?

                                       

                                      For local memory declaration, it's just a debugging suggestion - not a perm change suggestion.  Again: tweak -> run -> analyze & infer.

                                        • Re: Kernel with local memory usage gives different results on some hardware
                                          Raistmer

                                          printfs were added because of this issue, it exists (on some GPUs with and w/o those printfs).

                                          For now I successfully ran that kernel on iGPU HD2500 too - no issues. So, VLIW AMD GPUs, lesser (Tahiti LE) GCN GPUs, iGPUs - all free from this issue.

                                          Wasn't able to check with nVidia ones so far but variance between hardware quite big already.

                                           

                                          AFAIK tester who reported this issue first did test under few different driver versions. Would be good to find working driver indeed, but it can be only workaround, if latest driver has this bug...

                                            • Re: Kernel with local memory usage gives different results on some hardware
                                              jason

                                              Raistmer,

                                               

                                              I'm not expecting this to work but on multigpu systems right now a problem like this also exists.  I reencountered the problem for several hours yesterday so I wanted to see if the kinda-fix changes anything for you. This might help AMD find and fix 2 problems.  The test-fix is setting environmental variable GPU_NUM_COMPUTE_RINGS=1

                                                • Re: Kernel with local memory usage gives different results on some hardware
                                                  Raistmer

                                                  Thanks for suggestion but both Windows and Linux testers reported negative results. That env variable has no influence on this particular issue.

                                                    • Re: Kernel with local memory usage gives different results on some hardware
                                                      dipak

                                                      Hi Raistmer,

                                                      My apologies for this delay.

                                                      From your posts, it seems that your issue is a platform specific one ( particularly with few GCN cards). As you pointed out whether your problem has anything to do with this one: possible OpenCl compiler bug or not. If so, then I'm sorry, because that issue has not been resolved yet.

                                                      However, at this moment, I'm not sure whether both are same or not. That's why, I would like to forward this issue to concerned team by filing an bug report against it. To do so, I need a complete reproducible test-case. Could you please provide such one?


                                                      Regards,

                                                        • Re: Kernel with local memory usage gives different results on some hardware
                                                          Raistmer

                                                          Hello. Yes, you summarized issue right. Only some of GCN-family cards are affected, but those who are affected both under Windows and Linux it seems. Other platforms (nVidia, Intel GPU) just as older AMD cards seems not affected.

                                                           

                                                          Since initial guess about wave size involvement we did more comprehensive testing of all possible workgroup sizes.

                                                          Issue sweems more complex than just smaller or bigger WG size regarding wave size.

                                                          Here is full table:

                                                           

                                                          x/z1248163264128256
                                                          1+++++++++
                                                          2++++++++0
                                                          4+++++++00
                                                          8+++++-000
                                                          16+++--0000
                                                          32++--00000
                                                          64+--000000
                                                          128--0000000
                                                          256+00000000

                                                           

                                                          "+" means kernels work OK with sich workgroup. "-" - false detections. 0 - such WG size not supported on AMD hardware.

                                                          As one can see some configs that exceed wave size work OK still. And indeed, all that smaller than wave works OK.

                                                           

                                                          Also, from all listed earlier domain sizes for those 2 kernels only sizes with x-dim equal 256 and 512 give false detections.

                                                          All other sizes are silent besides 6 small ones that give true detections at any config just as on all other hardware 9hence, we did not miss valid detection, we just get false ones and at 2 specific domain sizes only). Number of false detections differ between runs but in all runs I saw only (256,y,z) (and only in single case) and (512,y,z) domain (global) sizes lead to failures.

                                                          I'll construct test case for this issue illustration and upload in separate post with description how to use it.

                                                          • Re: Re: Kernel with local memory usage gives different results on some hardware
                                                            Raistmer

                                                            Here is the test case

                                                             

                                                            To run just launch executable with desired kernel workgroup configuration.

                                                            Example for (4,1,64) workgroup:

                                                             

                                                            start MB7_win_x86_SSE_OpenCL_ATi_HD5_r2889.exe -tune 1 4 1 64

                                                             

                                                            y-component should always be 1 and this particular kernel num 1 too (first number).

                                                            App will produce few different files but you need to look for stderr.txt only.

                                                            reference one zipped inside archive.

                                                             

                                                            Relevant part of it (check that first listed line present to ensure app got desired option):

                                                             

                                                            TUNE: kernel 1 now has workgroup size of (4,1,64)

                                                             

                                                            Autocorr: peak=19.20864, time=20.13, delay=6.6902, d_freq=1419769860.95, chirp=-1.8134, fft_len=128k

                                                            TripletFind miss: domain(32,15,64), (local)_(with_average) kernel

                                                            TripletFind miss: domain(8,15,64), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(32,15,64), (local)_(with_average) kernel

                                                            Gaussian: peak=3.140635, mean=0.5500718, ChiSq=1.353511, time=76.34, d_freq=1419770570.67,

                                                              score=1.136018, null_hyp=2.090153, chirp=-4.5252, fft_len=16k

                                                            TripletFind miss: domain(8,15,64), (local)_(with_average) kernel

                                                            TripletFind miss: domain(16,15,64), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(32,15,64), (local)_(with_average) kernel

                                                             

                                                            and:

                                                             

                                                            class PC_triplet_find_miss:total=6,N=6,<>=1,min=1max=1

                                                             

                                                            class PoT_transfer_needed:total=11,N=11,<>=1,min=1max=1

                                                             

                                                            If you see different number of reported misses and increased number of needed transfer that means you see bug under consideration.

                                                             

                                                            Example of bad behaving config:

                                                            TUNE: kernel 1 now has workgroup size of (2,1,128)

                                                             

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            ......

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            Autocorr: peak=19.20864, time=20.13, delay=6.6902, d_freq=1419769860.95, chirp=-1.8134, fft_len=128k

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            ....

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            Gaussian: peak=3.140636, mean=0.5500715, ChiSq=1.353511, time=76.34, d_freq=1419770570.67,

                                                              score=1.13603, null_hyp=2.090154, chirp=-4.5252, fft_len=16k

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            ....

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            TripletFind miss: domain(16,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(32,15,128), (local)_(with_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(256,15,128), (local)_(wo_average) kernel

                                                            TripletFind miss: domain(512,15,128), (local)_(with_average) kernel

                                                            ....

                                                             

                                                            class PC_triplet_find_miss:total=181,N=181,<>=1,min=1max=1
                                                            class PoT_transfer_needed:total=186,N=186,<>=1,min=1max=1

                                                             

                                                            BTW, I have some report that for Linux bug fixed for Hawaii (at least) in 15.3 Beta driver, though that driver has another issues (power-safe low-freq not rised for first GPU).

                                                            Looking forward to get fix in Windows driver too.