I also thought of reverting to some older Catalyst, but the problem was that on Ubuntu it regularly crashed the machine when I ran multi-GPU applications. 11.11 is the first release that is stable in that sense, but unfortunately it fails to boot a significant fraction of the time.
{
    int i=0;
    size_t localThreads[1]={1};
    size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
    err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
    err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
    if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
    err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl,1,NULL,globalThreads,localThreads,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
}
#endif
#if OCL_WDM
err=clFlush(cq);
if(err) fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
#if __Win32
Sleep(OCL_WDM_SLEEP);
#elif _GNU_SOURCE
usleep(OCL_WDM_SLEEP);
#endif
#endif
err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0,NULL,NULL);
if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
#if OCL_VERBOSE
fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
#endif
#if 1 //dump
fprintf(stderr,"fully reduced:\n");
for(int i=0;i<8;i++) fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
exit(0);
#endif
}

The kernel:

__kernel void FindAutoCorrelation_reduce1_kernel_cl(__global float* PartialSumMaxIdx,
                                                    __global float2* result){
    const int sid=256*get_global_id(0);//only 8 workitems here
    float full_max=0.0f;
    float idx=0.0f;
    float Sum=0.0f;
    for(int i=0;i<256;i++){
        Sum+=PartialSumMaxIdx[3*(sid+i)];
        float tmp=PartialSumMaxIdx[(sid+i)*3+1];
        if(tmp>full_max){
            full_max=tmp;
            idx=PartialSumMaxIdx[(sid+i)*3+2];
        }
    }
    result[get_global_id(0)].x=full_max*(128.0f*1024.0f)/Sum;//peak value
    result[get_global_id(0)].y=idx;//bin
}
Hi Raistmer,
Thanks for sending the code. I was trying to see what type of output the code generates on my system. The kernel posted appears to run fine with the internal driver I have installed on my machine, for some sets of values.
I will try to install the 11.11b performance driver and try that too. It would be helpful if you could tell me what kind of values the FFTbuf buffer contains.
Originally posted by: himanshu.gautam
Hi Raistmer,
Thanks for sending the code. I was trying to see what type of output the code generates on my system. The kernel posted appears to run fine with the internal driver I have installed on my machine, for some sets of values.
I will try to install the 11.11b performance driver and try that too. It would be helpful if you could tell me what kind of values the FFTbuf buffer contains.
void PC_FindAutoCorrelation_cl(float* PowerBin){
    // Because positive and negative delays are the same
    // we only have to look at half the points
    //Jason: Use 4N-FFT method for Type 2 Discrete Cosine Transform for now, to match fftw's REDFT10
    // 1 Autocorrelation from global powerspectrum at fft_num*ac_fft_len (fft_num*ul_NumDataPoints)
    //R: all transforms on whole 2D matrix instead of single power spectrum.
#if 1 //for dump
    const int ac_fftlen=128*1024;
    float* b=new float[ac_fftlen*4*2];
    if(b==NULL) fprintf(stderr,"Can't allocate buffer\n");
#endif
    //Step 1: Preprocessing - repack relevant powerspectrum into a 4N array with 'real-even symmetry'
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={128*1024,8};//R: hardwired FFT len and number of FFTs for now
#if 1 //dump
        err=clEnqueueReadBuffer(cq,gpu_PowerSpectrum,CL_TRUE,sizeof(float)*ac_fftlen*1,sizeof(float)*ac_fftlen,b,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
        fprintf(stderr,"power spectrum:\n");
        for(int i=0;i<ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g\n",i,b[i]);
#endif
        int i=0;
        err = clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_PowerSpectrum);
        err |= clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        //int size=128*1024;
        //err |=clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:RepackInput_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,RepackInput_kernel_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: RepackInput_kernel_cl ok\n");
#endif
    }
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after inflation:\n");
    for(int i=0;i<4*ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
    //Step 2: Process the 4N-FFT (Complex to Complex, size is 4 * ac_fft_len)
    err |= clFFT_ExecuteInterleaved_mb(cq,autocorr_plan,8,clFFT_Forward,gpu_WorkData,gpu_WorkData,FFTbuf,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: autocorr fft: %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: autocorr fft ok\n");
#endif
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after fft:\n");
    for(int i=0;i<4*ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
#if 0 //R: this step removed now
    //Step 3: Postprocess the FFT result (Scale, take powers & normalize),
    //discarding unused data, packing into AutoCorr_in first half for VRAM reuse
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={32*1024,8};//R: hardwired FFT len and number of FFTs for now
        int i=0;
        err = clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        //int size=64*1024;
        //err |=clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackOutput_kernel_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,RepackOutput_kernel_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackOutput_kernel_cl: %d\n",err);
    }
#endif
#if 0
    err=clEnqueueReadBuffer(cq,FFTbuf,CL_TRUE,1*64*1024*sizeof(float),sizeof(float)*ac_fftlen/2,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after packing:\n");
    for(int j=0;j<ac_fftlen/2;j++) fprintf(stderr,"AutoCorrelation[%d]=%.4g\n",j,b[j]);
    //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    exit(0);
#endif
#if 0 //USE_OPENCL_NV || USE_OPENCL_HD5xxx
    {
        size_t localThreads[2]={256,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_kernel1_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_kernel1_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
    }
#else
    // const int len = 64*1024;//autocorr_fft_len/2;
    // const int blksize = UNSTDMAX(4, UNSTDMIN(pow2a((unsigned int) sqrt((float) (len / 32)) * 32), 512));
    {
        size_t localThreads[2]={64,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce0_kernel_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
    }
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,0*256*sizeof(float),sizeof(float)*256*8*3,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"partially reduced:\n");
    for(int j=0;j<256*8;j++)
        fprintf(stderr,"j=%d,partial_sum=%.4g,partial_max=%.4g, idx=%.4g\n",j,b[3*j],b[3*j+1],b[3*j+2]);
    //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    //exit(0);
#endif
    {
        int i=0;
        size_t localThreads[1]={1};
        size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
        err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl,1,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
    }
#endif
#if OCL_WDM
    err=clFlush(cq);
    if(err) fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
#if __Win32
    Sleep(OCL_WDM_SLEEP);
#elif _GNU_SOURCE
    usleep(OCL_WDM_SLEEP);
#endif
#endif
    err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
#endif
#if 1 //dump
    fprintf(stderr,"fully reduced:\n");
    for(int i=0;i<8;i++) fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
    exit(0);
#endif
}
Hi raistmer,
I am not able to access the logs you provided a link for, as the page is not in English.
Can you please provide an English link?
Originally posted by: himanshu.gautam
Hi raistmer,
I am not able to access the logs you provided a link for, as the page is not in English.
Can you please provide an English link?
Got them 🙂
Thanks for info about which SDK features are supported by XP.
Seti@home apps do not use any SDK 2.6 features currently.
Back to sub-topic :
Found a freeware tool called "BlueScreenView" to decipher
bluescreens on Windows XP and later. Hopefully that can help ease the bug hunt a little.
Dump File : Mini121511-01.dmp
Crash Time : 15.12.2011 02:00:54
Bug Check String : THREAD_STUCK_IN_DEVICE_DRIVER
Caused By Driver : ati2cqag.dll
Caused By Address : ati2cqag.dll+1aeb4
File Description : Central Memory Manager / Queue Server Module
Product Name : ATI Radeon Family
Company : ATI Technologies Inc.
File Version : 6.14.10.0534
Processor : 32-bit
Crash Address : ati2cqag.dll+1ad92
"Seti@home apps do not use any SDK 2.6 features currently."
What are SDK 2.6 features exactly???
Originally posted by: MicahVillmow
freighter,
While we still include the OpenCL runtime components in the graphics driver for pre-SDK 2.6 releases, SDK 2.6 features are not supported on Windows XP.
Originally posted by: MicahVillmow Raistmer, It was in response to raistmer/freighter. XP support is being dropped, so if issues arise, they probably won't be fixed.
Will repeat the tests after the newer Win7x64 install has finished on the XP host I'm currently running tests on.
Originally posted by: freighter Originally posted by: MicahVillmow Raistmer, It was in response to raistmer/freighter. XP support is being dropped, so if issues arise, they probably won't be fixed.
Will repeat the tests after the newer Win7x64 install has finished on the XP host I'm currently running tests on.
Finally I was able to reproduce the issue, first seen on WinXPsp3, now on Win7x64sp1, Cat 11.12:
Some bluescreen data (after adding TdrLevel=1 to the Windows registry), taken right from the crash:
Technical Info:
STOP: 0x0116 (0xFFFFFA8001B044E0, 0xFFFFF88002C078C8, 0x0000000000000000, 0x0000000000000001)
atikmpag.sys - Address FFFFF88002C078C8 base at FFFFF88002C00000, DateStamp 0x4ebb331c
AND additionally, after restarting the test machine, found in the dump:
STOP: 0x0116 (0xFFFFFA8001B044E0, 0xFFFFF88002C078C8, 0x0000000000000000, 0x0000000000000001)
dxgmms1.sys - Address FFFFF88003F8FF07 base at FFFFF88003F80000, DateStamp 0x4ce799c1
Will retry with Cat 12.1preview also.
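For anyone else reproducing this: the TdrLevel tweak used above, which makes a GPU hang bug-check (bluescreen with a minidump) instead of silently resetting the driver, is a single registry value. The key path and values below come from Microsoft's TDR registry-key documentation; run in an elevated prompt and reboot afterwards.

```shell
rem TdrLevel: 0 = detection off, 1 = bug check on timeout (used here),
rem 3 = timeout detection and recovery (the default).
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 1 /f
```

Remember to set it back to 3 (or delete the value) once the dumps are collected, or every driver hiccup will bluescreen the machine.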
Originally posted by: MicahVillmow
Raistmer, just a test case that shows the differences is enough.
I bought an HD 4670 AGP to have a machine to test OCL programs, OS W7 x32. The latest driver http://support.amd.com/de/kbarticles/Pages/CatalystAGPHotfix.aspx forces the CPU to 100% at all times. This is an error in the latest driver; CPU usage should be 5% at most. AMD produces good hardware, but it is useless if the driver fails.
I'm very disappointed in AMD. I will reconsider my decision to buy new AMD servers if driver support is this bad. This way you have lost all my trust in AMD.
_heinz
You bought the worst card for OpenCL. First, it is a 4xxx card, which can hardly manage OpenCL. Second, it is AGP, which has its own additional set of problems.
Indeed. There have been a few topics where people asked what sort of card to buy for testing; consulting some of these before buying would have saved money and time. The 4670 does support OpenCL and might seem an economically wise choice, however 4xxx cards only emulate some features, such as __local memory (shared memory in CUDA terms): it actually resides in __global (device) memory, which hurts performance badly.
I don't know much about the AGP connector, but naturally, it is badly deprecated.
Testing performance on 3-year-old hardware is somewhat self-defeating. If you don't want to spend a lot of money on a test machine, consider buying a 6850, which is a downclocked version of the second-largest chip of the current generation, supports all features, and shows decent performance. 7000-series cards most likely won't hit the market until mid-January.
It is as if I tested NVIDIA on a GT220 and concluded it performs badly. Naturally, because it is 3 years old and the lowest of its class.
Originally posted by: Meteorhead Indeed. There have been a few topics where people asked what sort of card to buy for testing; consulting some of these before buying would have saved money and time. The 4670 does support OpenCL and might seem an economically wise choice, however 4xxx cards only emulate some features, such as __local memory (shared memory in CUDA terms): it actually resides in __global (device) memory, which hurts performance badly.
I don't know much about the AGP connector, but naturally, it is badly deprecated.
Testing performance on 3-year-old hardware is somewhat self-defeating. If you don't want to spend a lot of money on a test machine, consider buying a 6850, which is a downclocked version of the second-largest chip of the current generation, supports all features, and shows decent performance. 7000-series cards most likely won't hit the market until mid-January.
It is as if I tested NVIDIA on a GT220 and concluded it performs badly. Naturally, because it is 3 years old and the lowest of its class.
Reasoning: if an OpenCL app works on one of these lower-end HD4xxx GPUs, it will run on all the newer ones, too. That seems to be true at least for Seti@home's OpenCL app. This is not about performance or new features; it is about providing compatibility for a wider userbase.
Originally posted by: MicahVillmow freighter, If you want compatibility, don't target a specific device (HD4XXX), but target the language version. You can buy a higher-end card but target only core OpenCL 1.0 (without images), and it should work on the lower-end cards. If it doesn't work, then it is a problem with our software and should be reported as such.
MicahVillmow,
sorry if my issue is still a problem.
Where did I state that I target a "specific device (HD4XXX)"? I am only using a specific device for testing an OpenCL (1.0, without images) version of Seti@home that tries to follow AMD's recommendation for HD4xxx GPUs with a max workgroup size of 128: use workgroup sizes of no more than 32 if you want to be on the safe side; workgroup sizes of 64 and higher can lead to incorrect results.
So I made sure that AMD's recommendation is in use and works OK on at least one of the (formerly) supported platforms: openSUSE 11.3, 64-bit, with Catalyst driver version 11.9 and older. Newer drivers lead to the same failure I reported earlier in this thread for the Windows version.
???
Originally posted by: MicahVillmow freighter/Raistmer, Does the crashing occur on anything outside of the HD4XXX series? We believe we have the performance issue understood and are working on a fix.
No, only HD4xxx GPUs with a max workgroup size of 128 show the crashes.
Hi,
it has been about 3 months now, and after trying out all the drivers I could get hold of, up to the 12.4 preview, the crashing issue that happens only on HD4xxx GPUs with a max workgroup size of 128 is still not resolved. Hopefully the fix you were talking about in January will make it into one of the next drivers.
Patiently (a)waiting ...
freighter,
Are you querying the 'CL_KERNEL_WORK_GROUP_SIZE' attribute of the kernel for the device you are executing on? In some cases we must limit the size on a per-kernel basis.
MicahVillmow,
the 'CL_KERNEL_WORK_GROUP_SIZE' attribute is queried inside Apple's OpenCL_FFT, which we adapted to our needs. That library contains most of the modifications for the HD4xxx GPUs with a max workgroup size of 128.
In fft_setup.cpp, in getMaxKernelWorkGroupSize(cl_fft_plan *plan, unsigned int *max_wg_size, unsigned int num_devices, cl_device_id *devices), the 'CL_KERNEL_WORK_GROUP_SIZE' attribute is queried, but its value is only used for comparisons.
In that function the code line "if(wg_size < kInfo->num_workitems_per_workgroup)" is replaced with "if(wg_size < plan->max_work_item_per_workgroup)", where "plan->max_work_item_per_workgroup" is constantly set to 32 for the HD4xxx GPUs with a max workgroup size of 128.
As mentioned earlier, this works up to Catalyst 11.9 and AMD APP SDK 2.4 on Linux.
Ok,
Can you supply us with a test case that shows the issue? We have plenty of samples running on the HD4XXX series of GPUs without issue.
Micah Villmow,
as Raistmer has already sent you the sources, I've put together an example that should make it possible to reproduce the crashes (Linux Testcase for LHD4K crashes):
edit: link removed
Please let me know once you have got it, so I can free the server space again.
Message was edited by: Urs Echternacht to remove a link.
I have the test case, thanks.