cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Raistmer
Adept II

Driver restarts on HD6970

works ok on HD5xxx GPUs

My app experiences driver restarts on HD6970 GPUs time to time.

Looks like one of kernels sometimes takes too long to execute:


PC_find_triplets_avg_kernel_cl__k11_Cayman1 15809 { 256 17 1} NULL 0,19282 0 14 0 3
ReadBuffer 15810 0,25789 0,02
set_mem_kernel_cl__k4_Cayman1 15811 { 1 1 1} NULL 0,00339 0 3 0 1
PC_find_pulse_local_kernel_cl__k21_Cayman1 15812 { 1024 17 1} { 64 1 1} 20628,77946 10496 62 0 7

Such huge time causes driver restart.
Usually this kernel take much lower amount of time:


PC_find_triplets_avg_kernel_cl__k11_Cayman1 155 { 256 17 1} NULL 0,19387 0 14 0 3
ReadBuffer 156 25,51814 0,02
set_mem_kernel_cl__k4_Cayman1 157 { 1 1 1} NULL 0,00349 0 3 0 1
PC_find_pulse_local_kernel_cl__k21_Cayman1 158 { 1024 17 1} { 64 1 1} 10,72624 10496 62 0 7
ReadBuffer 159 0,31345 0,02

This doesn't happen on HD5xxx GPUs, only on HD6xxx ones.

Is it driver problem? What is wrong in this particular kernel for HD6xxx GPUs ?

//R: this kernel should be used for small power arrays that fit in local memory comletely __kernel void PC_find_pulse_local_kernel_cl(float best_pulse_score, int PulsePotLen, int AdvanceBy, int fft_len, int ndivs, __global PulseFind_t* settings,__global float* t_funct_cache_fp, __global float* PoT,__local float* tmp_PoT,__global uint4* result_flag, __global float* PulsePoT_average) { //if(result_flag[0].x == 1) return;//R: if CPU processing already needed - abort kernel ASAP const int PoTLen = 1024*1024/fft_len; int ul_PoT = get_global_id(0); int y = get_global_id(1); int z=get_local_id(0); int TOffset1 = y * AdvanceBy; float rcfg_dis_thresh=settings->rcfg_dis_thresh; int PulseMax=settings->PulseMax; //if (ul_PoT < 1) return; // Original find_pulse, omits first element //if (ul_PoT >= fft_len) return; if(TOffset1 + PulsePotLen > PoTLen) { TOffset1 = PoTLen - PulsePotLen; } __global float* fp_PulsePot = PoT + ul_PoT + TOffset1 * (fft_len); __local float* tmp_pot = tmp_PoT +z; //float* best_pot = cudaAcc_PulseFind_settings.best_pot_fp + ul_PoT + TOffset2 * fft_len; //float* report_pot = cudaAcc_PulseFind_settings.report_pot_fp + ul_PoT + TOffset2 * fft_len; int di; float avg;//,maxp=0;//, snr=0, fthresh=0; float tmp_max; avg=PulsePoT_average[ul_PoT+y*fft_len]; // Periods from PulsePotLen/3 to PulsePotLen/4, and power of 2 fractions of. // then (len/4 to len/5) and finally (len/5 to len/6) // //int num_adds = pass; for(int num_adds = 3; num_adds <= 5; num_adds++) { int firstP, lastP; switch(num_adds) { case 3: lastP = (PulsePotLen*2)/3; firstP = (PulsePotLen*1)/2; break; case 4: lastP = (PulsePotLen*3)/4; firstP = (PulsePotLen*3)/5; break; case 5: lastP = (PulsePotLen*4)/5; firstP = (PulsePotLen*4)/6; break; } int num_adds_minus1 = num_adds - 1; int p = lastP; for (p = lastP ; p > firstP ; p--) { float cur_thresh, dis_thresh; int /*tabofst, */mper, perdiv; int tmp0, tmp1, tmp2, tmp3; //tabofst = ndivs*3+2-num_adds; mper = p * (12/num_adds_minus1); perdiv = num_adds_minus1; tmp0 = (int)((mper + 6)/12); // round(period) tmp1 = (int)((mper * 2 + 6)/12); // round(period*2) di = (int)p/perdiv; // (int)period //dis_thresh = cudaAcc_t_funct(di, num_adds)*avg; dis_thresh = t_funct_f(di, num_adds, 0, PulseMax, t_funct_cache_fp) * avg; switch(num_adds) { case 3: tmp_max = sumtop3_local(fp_PulsePot, tmp_pot, di, tmp0, tmp1,fft_len); break; case 4: tmp2 = (int)((mper * 3 + 6)/12); // round(period*3) tmp_max = sumtop4_local(fp_PulsePot, tmp_pot, di, tmp0, tmp1, tmp2,fft_len); break; case 5: tmp2 = (int)((mper * 3 + 6)/12); // round(period*3) tmp3 = (int)((mper * 4 + 6)/12); // round(period*4) tmp_max = sumtop5_local(fp_PulsePot, tmp_pot, di, tmp0, tmp1, tmp2, tmp3,fft_len); break; } if (tmp_max>dis_thresh) { // unscale for reporting tmp_max /= num_adds; cur_thresh = (dis_thresh / num_adds - avg) * rcfg_dis_thresh + avg; float _snr = (tmp_max-avg)/(cur_thresh-avg); if (_snr > best_pulse_score ) { result_flag[0].x = 1;return; } if( (tmp_max>cur_thresh) ) { result_flag[0].x = 1;return; } } int num_adds_2 = 2* num_adds; // int j = 1; //float tmp_max2=(float4)0.f; for (int j = 1; j < ndivs ; j++) { //perdiv *=2; tmp0 = di & 1; di >>= 1; tmp0 += di; dis_thresh = t_funct_f(di, num_adds, j,PulseMax,t_funct_cache_fp) * avg; { tmp_max = sumtop2_local(tmp_pot, tmp_pot, di,tmp0); } if (tmp_max>dis_thresh) { // unscale for reporting tmp_max /= num_adds_2; cur_thresh = (dis_thresh / num_adds_2 - avg) * rcfg_dis_thresh + avg; float _snr = (tmp_max-avg)/(cur_thresh-avg); if (_snr > best_pulse_score) { result_flag[0].x = 1;return; } if ((tmp_max>cur_thresh) ) { result_flag[0].x = 1;return; } } num_adds_2 *=2; } // for (j = 1; j < ndivs } // for (p = lastP } // for(num_adds = }

0 Likes
11 Replies
aheirich
Staff

You have multiple data-dependent loops in your kernel and it is not surprising that the runtime may be variable.  To debug this, first get a reproducible case, then use printf to debug the variables that control the loops.  I would also suggest unrolling the outermost loop to speed up your kernel.

0 Likes

Another suggestion: change every if statement to use the select function.

 

0 Likes
Raistmer
Adept II

I have not bad performance for this kernel on HD5xxx GPUs. This behavior is HD6xxx specific, seems you missed this fact.
And execution time change from ~10ms to 20 secodns (!) I would not call this "not surprising" Data flow of app can't explain such huge difference.
0 Likes

Originally posted by: Raistmer I have not bad performance for this kernel on HD5xxx GPUs. This behavior is HD6xxx specific, seems you missed this fact. And execution time change from ~10ms to 20 secodns (!) I would not call this "not surprising" Data flow of app can't explain such huge difference.


Please send your runtime code and system information(OS, SDK version, Driver version)

0 Likes

I think the reason for the timeout in this kernel has to be the data dependent loops in the code.  Have you confirmed that it is indeed this kernel that is timing out?  To do that, put a call to clFinish right after the enqueue and see if you get past it.

0 Likes

Originally posted by: aheirich
Have you confirmed that it is indeed this kernel that is timing out? 


I posted profiler data. That kernel took 20 seconds accordingly to profiler. And this was last string that profiler gave. AFAIK watchdog timer setted to 2 seconds.
So, either ATi Stream Profiler reports trash (then it's field of work for your profiler team) or it's exactly this kernel.


0 Likes
Raistmer
Adept II

I got new reports from card owner (it's hard to debug things remotely, but...) - when he placed his GPU in another host there were no driver restarts. Also, card begin to report different core freq - 950MHz instead of 880MHz.
Is it possible that on particular kernel GPU power consumption increased a lot then card dropped its freq then kernel took to long and cause driver restart?
That is, looks like it's hardware problem, not software one...
In old host he was able to reproduce driver restart on the same input data, in new host he can't recive driver restart (under both Cat 11.1 and 11.2) on the same input data.

So there were definitely not data-dependent loops as I already said before.
If I get another reproducible report of such failure I'll post here, but for now looks like it more hardware (card was underpowered) issue than software one. Perhaps your driver team could incorporate better power-monitoring logic in driver...
0 Likes

Raistmer,
Seems like the power frequency issue is a result of hardware improvements to control performance/power consumption.
Power efficiency and power management
"
Lastly, AMD has also worked on power efficiency and power management. With PowerTune Technology, the GPU TDP is clamped to a pre-determined level. The GPU includes counters across all blocks which are monitored and applied to an algorithm to infer power draw. The core clock is then adjusted dynamically to enforce the TDP level.



Read more: http://vr-zone.com/articles/am...74.html#ixzz1EcV0xFxp"
0 Likes
Raistmer
Adept II

How much clocks can be dropped down ? 2 times? 3 times? order of magnitude?
0 Likes

According to the link http://vr-zone.com/articles/amd-radeon-hd-6970-and-hd-6950-review/10474-12.html

The GPU 6970 can be overclocked to about 950MHz core and 1450MHz (5800MHz effective) memory that is a modest increase of 70MHz to the core and 75MHz to the memory

Can you please provide the system configuration of both the old host(where driver restarts) and of new host(where it doesn't restart).

0 Likes
Raistmer
Adept II

Hosts owner provided full info about both old and new hosts.
Hope it will help to identify problem


---------------------------------------The old host:

CPUID: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
[Family 15 Model 35 Stepping 2]
Speed: 2 x 2289 MHz, Cache: L1=64K L2=1024K
Motherboard: ASUS A8N-SLI Premium
Chipset: NVIDIA nForce 4 SLI MCP
Memory: 2 GB, DDR 400
GPU: PCI-E 2048 Sapphire "Radeon HD 6970" 21187-00-40R
OpenCL Platform Name: ATI Stream
Number of devices: 1
Max compute units: 24
Max work group size: 256
Max clock frequency: 880Mhz (In Catalyst Control Center it is set 880Mhz)
Name: Cayman
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.900
Version: OpenCL 1.1 ATI-Stream-v2.3 (451)
HDD: 5 pieces
DVD-R/RW: 2 pieces
Sound Card (installed in a PCI slot): Sound Blaster Audigy 4 Pro with extra power and the external module

PSU: IN-WIN 450W
+3.3V 24A
+5V 15A
+12V 18A
-12V 0.3A

OS: Microsoft Windows Vista Ultimate x64 Edition, Service Pack 2, (06.00.6002.00)
The second OS: Microsoft Windows 7 Ultimate x64 Edition, (06.01.7600.00)

Cat 11.1 - driver restarts.



The new host:

CPUID: AMD Phenom(tm) II X6 1100T Processor [Family 16 Model 10 Stepping 0]
Speed: 6 x 3325 MHz, Cache: L1=64K L2=512K
Motherboard: GIGABYTE "GA-870A-UD3" rev.2.1 (With the latest BIOS version)
Chipset: AMD 870 / SB850
Memory: 2x2GB DDR3 SDRAM Kingston "Hyper X" KHX1600C7D3K2/4GX
(PC12800, 1600MHz, CL7).
Timings set to 7-8-7-20-1
GPU: PCI-E 2048?? Sapphire "Radeon HD 6970" 21187-00-40R
OpenCL Platform Name: ATI Stream
Number of devices: 1
Max compute units: 24
Max work group size: 256
Max clock frequency: 950Mhz (In Catalyst Control Center it is set 880Mhz)!
Name: Cayman
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.1016
Version: OpenCL 1.1 ATI-Stream-v2.3 (451)
HDD: 3 pieces
DVD-R/RW: 2 pieces
Sound Card (installed in a PCI slot): Sound Blaster Audigy 4 Pro with extra power and the external module

PSU: 750W Chieftec "Nitro 85+ BPS-750C" ATX12V V2.2
+3.3 V 25 A
+5 V 25 A
+12 V 60 A !!!
-12 V 0.8 A
+5 V SB 3 A
OS: Microsoft Windows Vista Ultimate x64 Edition, Service Pack 2, (06.00.6002.00)
The same OS that on the previous host!! Didn't reinstall, only has updated drivers for new hardware support (Motherboard, CPU).
The second OS: Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
The new!! Reinstalled!!

Cat 11.1 - No driver restarts.
Cat 11.2 - No driver restarts.
0 Likes