11 Replies Latest reply on Feb 22, 2011 10:11 AM by Raistmer

    Driver restarts on HD6970

    Raistmer
      works ok on HD5xxx GPUs

      My app experiences driver restarts on HD6970 GPUs time to time.

      Looks like one of kernels sometimes takes too long to execute:


      PC_find_triplets_avg_kernel_cl__k11_Cayman1 15809 { 256 17 1} NULL 0,19282 0 14 0 3
      ReadBuffer 15810 0,25789 0,02
      set_mem_kernel_cl__k4_Cayman1 15811 { 1 1 1} NULL 0,00339 0 3 0 1
      PC_find_pulse_local_kernel_cl__k21_Cayman1 15812 { 1024 17 1} { 64 1 1} 20628,77946 10496 62 0 7

      Such huge time causes driver restart.
      Usually this kernel take much lower amount of time:


      PC_find_triplets_avg_kernel_cl__k11_Cayman1 155 { 256 17 1} NULL 0,19387 0 14 0 3
      ReadBuffer 156 25,51814 0,02
      set_mem_kernel_cl__k4_Cayman1 157 { 1 1 1} NULL 0,00349 0 3 0 1
      PC_find_pulse_local_kernel_cl__k21_Cayman1 158 { 1024 17 1} { 64 1 1} 10,72624 10496 62 0 7
      ReadBuffer 159 0,31345 0,02

      This doesn't happen on HD5xxx GPUs, only on HD6xxx ones.

      Is it driver problem? What is wrong in this particular kernel for HD6xxx GPUs ?

      //R: this kernel should be used for small power arrays that fit in local memory comletely __kernel void PC_find_pulse_local_kernel_cl(float best_pulse_score, int PulsePotLen, int AdvanceBy, int fft_len, int ndivs, __global PulseFind_t* settings,__global float* t_funct_cache_fp, __global float* PoT,__local float* tmp_PoT,__global uint4* result_flag, __global float* PulsePoT_average) { //if(result_flag[0].x == 1) return;//R: if CPU processing already needed - abort kernel ASAP const int PoTLen = 1024*1024/fft_len; int ul_PoT = get_global_id(0); int y = get_global_id(1); int z=get_local_id(0); int TOffset1 = y * AdvanceBy; float rcfg_dis_thresh=settings->rcfg_dis_thresh; int PulseMax=settings->PulseMax; //if (ul_PoT < 1) return; // Original find_pulse, omits first element //if (ul_PoT >= fft_len) return; if(TOffset1 + PulsePotLen > PoTLen) { TOffset1 = PoTLen - PulsePotLen; } __global float* fp_PulsePot = PoT + ul_PoT + TOffset1 * (fft_len); __local float* tmp_pot = tmp_PoT +z; //float* best_pot = cudaAcc_PulseFind_settings.best_pot_fp + ul_PoT + TOffset2 * fft_len; //float* report_pot = cudaAcc_PulseFind_settings.report_pot_fp + ul_PoT + TOffset2 * fft_len; int di; float avg;//,maxp=0;//, snr=0, fthresh=0; float tmp_max; avg=PulsePoT_average[ul_PoT+y*fft_len]; // Periods from PulsePotLen/3 to PulsePotLen/4, and power of 2 fractions of. // then (len/4 to len/5) and finally (len/5 to len/6) // //int num_adds = pass; for(int num_adds = 3; num_adds <= 5; num_adds++) { int firstP, lastP; switch(num_adds) { case 3: lastP = (PulsePotLen*2)/3; firstP = (PulsePotLen*1)/2; break; case 4: lastP = (PulsePotLen*3)/4; firstP = (PulsePotLen*3)/5; break; case 5: lastP = (PulsePotLen*4)/5; firstP = (PulsePotLen*4)/6; break; } int num_adds_minus1 = num_adds - 1; int p = lastP; for (p = lastP ; p > firstP ; p--) { float cur_thresh, dis_thresh; int /*tabofst, */mper, perdiv; int tmp0, tmp1, tmp2, tmp3; //tabofst = ndivs*3+2-num_adds; mper = p * (12/num_adds_minus1); perdiv = num_adds_minus1; tmp0 = (int)((mper + 6)/12); // round(period) tmp1 = (int)((mper * 2 + 6)/12); // round(period*2) di = (int)p/perdiv; // (int)period //dis_thresh = cudaAcc_t_funct(di, num_adds)*avg; dis_thresh = t_funct_f(di, num_adds, 0, PulseMax, t_funct_cache_fp) * avg; switch(num_adds) { case 3: tmp_max = sumtop3_local(fp_PulsePot, tmp_pot, di, tmp0, tmp1,fft_len); break; case 4: tmp2 = (int)((mper * 3 + 6)/12); // round(period*3) tmp_max = sumtop4_local(fp_PulsePot, tmp_pot, di, tmp0, tmp1, tmp2,fft_len); break; case 5: tmp2 = (int)((mper * 3 + 6)/12); // round(period*3) tmp3 = (int)((mper * 4 + 6)/12); // round(period*4) tmp_max = sumtop5_local(fp_PulsePot, tmp_pot, di, tmp0, tmp1, tmp2, tmp3,fft_len); break; } if (tmp_max>dis_thresh) { // unscale for reporting tmp_max /= num_adds; cur_thresh = (dis_thresh / num_adds - avg) * rcfg_dis_thresh + avg; float _snr = (tmp_max-avg)/(cur_thresh-avg); if (_snr > best_pulse_score ) { result_flag[0].x = 1;return; } if( (tmp_max>cur_thresh) ) { result_flag[0].x = 1;return; } } int num_adds_2 = 2* num_adds; // int j = 1; //float tmp_max2=(float4)0.f; for (int j = 1; j < ndivs ; j++) { //perdiv *=2; tmp0 = di & 1; di >>= 1; tmp0 += di; dis_thresh = t_funct_f(di, num_adds, j,PulseMax,t_funct_cache_fp) * avg; { tmp_max = sumtop2_local(tmp_pot, tmp_pot, di,tmp0); } if (tmp_max>dis_thresh) { // unscale for reporting tmp_max /= num_adds_2; cur_thresh = (dis_thresh / num_adds_2 - avg) * rcfg_dis_thresh + avg; float _snr = (tmp_max-avg)/(cur_thresh-avg); if (_snr > best_pulse_score) { result_flag[0].x = 1;return; } if ((tmp_max>cur_thresh) ) { result_flag[0].x = 1;return; } } num_adds_2 *=2; } // for (j = 1; j < ndivs } // for (p = lastP } // for(num_adds = }

        • Driver restarts on HD6970
          aheirich

          You have multiple data-dependent loops in your kernel and it is not surprising that the runtime may be variable.  To debug this, first get a reproducible case, then use printf to debug the variables that control the loops.  I would also suggest unrolling the outermost loop to speed up your kernel.

          • Driver restarts on HD6970
            Raistmer
            I have not bad performance for this kernel on HD5xxx GPUs. This behavior is HD6xxx specific, seems you missed this fact.
            And execution time change from ~10ms to 20 secodns (!) I would not call this "not surprising" Data flow of app can't explain such huge difference.
              • Driver restarts on HD6970
                genaganna

                 

                Originally posted by: Raistmer I have not bad performance for this kernel on HD5xxx GPUs. This behavior is HD6xxx specific, seems you missed this fact. And execution time change from ~10ms to 20 secodns (!) I would not call this "not surprising" Data flow of app can't explain such huge difference.


                Please send your runtime code and system information(OS, SDK version, Driver version)

                • Driver restarts on HD6970
                  aheirich

                  I think the reason for the timeout in this kernel has to be the data dependent loops in the code.  Have you confirmed that it is indeed this kernel that is timing out?  To do that, put a call to clFinish right after the enqueue and see if you get past it.

                    • Driver restarts on HD6970
                      Raistmer
                      Originally posted by: aheirich
                      Have you confirmed that it is indeed this kernel that is timing out? 


                      I posted profiler data. That kernel took 20 seconds accordingly to profiler. And this was last string that profiler gave. AFAIK watchdog timer setted to 2 seconds.
                      So, either ATi Stream Profiler reports trash (then it's field of work for your profiler team) or it's exactly this kernel.


                  • Driver restarts on HD6970
                    Raistmer
                    I got new reports from card owner (it's hard to debug things remotely, but...) - when he placed his GPU in another host there were no driver restarts. Also, card begin to report different core freq - 950MHz instead of 880MHz.
                    Is it possible that on particular kernel GPU power consumption increased a lot then card dropped its freq then kernel took to long and cause driver restart?
                    That is, looks like it's hardware problem, not software one...
                    In old host he was able to reproduce driver restart on the same input data, in new host he can't recive driver restart (under both Cat 11.1 and 11.2) on the same input data.

                    So there were definitely not data-dependent loops as I already said before.
                    If I get another reproducible report of such failure I'll post here, but for now looks like it more hardware (card was underpowered) issue than software one. Perhaps your driver team could incorporate better power-monitoring logic in driver...
                    • Driver restarts on HD6970
                      MicahVillmow
                      Raistmer,
                      Seems like the power frequency issue is a result of hardware improvements to control performance/power consumption.
                      Power efficiency and power management
                      "
                      Lastly, AMD has also worked on power efficiency and power management. With PowerTune Technology, the GPU TDP is clamped to a pre-determined level. The GPU includes counters across all blocks which are monitored and applied to an algorithm to infer power draw. The core clock is then adjusted dynamically to enforce the TDP level.



                      Read more: http://vr-zone.com/articles/am...74.html#ixzz1EcV0xFxp"
                      • Driver restarts on HD6970
                        Raistmer
                        How much clocks can be dropped down ? 2 times? 3 times? order of magnitude?
                        • Driver restarts on HD6970
                          Raistmer
                          Hosts owner provided full info about both old and new hosts.
                          Hope it will help to identify problem


                          ---------------------------------------The old host:

                          CPUID: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
                          [Family 15 Model 35 Stepping 2]
                          Speed: 2 x 2289 MHz, Cache: L1=64K L2=1024K
                          Motherboard: ASUS A8N-SLI Premium
                          Chipset: NVIDIA nForce 4 SLI MCP
                          Memory: 2 GB, DDR 400
                          GPU: PCI-E 2048 Sapphire "Radeon HD 6970" 21187-00-40R
                          OpenCL Platform Name: ATI Stream
                          Number of devices: 1
                          Max compute units: 24
                          Max work group size: 256
                          Max clock frequency: 880Mhz (In Catalyst Control Center it is set 880Mhz)
                          Name: Cayman
                          Vendor: Advanced Micro Devices, Inc.
                          Driver version: CAL 1.4.900
                          Version: OpenCL 1.1 ATI-Stream-v2.3 (451)
                          HDD: 5 pieces
                          DVD-R/RW: 2 pieces
                          Sound Card (installed in a PCI slot): Sound Blaster Audigy 4 Pro with extra power and the external module

                          PSU: IN-WIN 450W
                          +3.3V 24A
                          +5V 15A
                          +12V 18A
                          -12V 0.3A

                          OS: Microsoft Windows Vista Ultimate x64 Edition, Service Pack 2, (06.00.6002.00)
                          The second OS: Microsoft Windows 7 Ultimate x64 Edition, (06.01.7600.00)

                          Cat 11.1 - driver restarts.



                          The new host:

                          CPUID: AMD Phenom(tm) II X6 1100T Processor [Family 16 Model 10 Stepping 0]
                          Speed: 6 x 3325 MHz, Cache: L1=64K L2=512K
                          Motherboard: GIGABYTE "GA-870A-UD3" rev.2.1 (With the latest BIOS version)
                          Chipset: AMD 870 / SB850
                          Memory: 2x2GB DDR3 SDRAM Kingston "Hyper X" KHX1600C7D3K2/4GX
                          (PC12800, 1600MHz, CL7).
                          Timings set to 7-8-7-20-1
                          GPU: PCI-E 2048?? Sapphire "Radeon HD 6970" 21187-00-40R
                          OpenCL Platform Name: ATI Stream
                          Number of devices: 1
                          Max compute units: 24
                          Max work group size: 256
                          Max clock frequency: 950Mhz (In Catalyst Control Center it is set 880Mhz)!
                          Name: Cayman
                          Vendor: Advanced Micro Devices, Inc.
                          Driver version: CAL 1.4.1016
                          Version: OpenCL 1.1 ATI-Stream-v2.3 (451)
                          HDD: 3 pieces
                          DVD-R/RW: 2 pieces
                          Sound Card (installed in a PCI slot): Sound Blaster Audigy 4 Pro with extra power and the external module

                          PSU: 750W Chieftec "Nitro 85+ BPS-750C" ATX12V V2.2
                          +3.3 V 25 A
                          +5 V 25 A
                          +12 V 60 A !!!
                          -12 V 0.8 A
                          +5 V SB 3 A
                          OS: Microsoft Windows Vista Ultimate x64 Edition, Service Pack 2, (06.00.6002.00)
                          The same OS that on the previous host!! Didn't reinstall, only has updated drivers for new hardware support (Motherboard, CPU).
                          The second OS: Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
                          The new!! Reinstalled!!

                          Cat 11.1 - No driver restarts.
                          Cat 11.2 - No driver restarts.