cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Raistmer
Adept II

Catalyst 11.11 is broken too

Invalid results produced under this driver

Unfortunately, not all users read message boards.
Hosts with this driver already appeared in list of participants, producing incorrect results.

When next Catalyst version, or at least withdrawal of this one can be expected? Having it online and downloadable hurts project I worked on.
Tags (1)
0 Likes
75 Replies
timattox
Adept I

Catalyst 11.10preview2 is broken

Raistmer,

I'm curious what project (some BOINC project perhaps?) your participants are generating incorrect results for.  You don't give a googleable name, and you don't give a URL.

If you want AMD to help out, you need to be a bit more understanding that they are dealing with hundreds (thousands?) of different users/developers.  To expect them to just remember you and your unnamed project is a bit much.

Without details on what is broken, and any hope of AMD reproducing the problem, how can they tell if your problem is real, or is simply PEBKAC.  (Google it...)

I am also frustrated in the slow pace of support from AMD for OpenCL issues.  But being rude in the forums won't help your case.

-- Tim

0 Likes
FrodoTheGiant
Journeyman III

Catalyst 11.10preview2 is broken

PEBKAC: Poorly Educated Bored Know-all AMD Coder

... and AMD seems to have lots of those in their driver department.

0 Likes
Raistmer
Adept II

Catalyst 11.10preview2 is broken

1) BOINC project is SETI@home
2) There are thousand participants already. Good deal of them trying to use ATi GPUs (even more, some specially buy such GPUs, i.e. generate direct revenue for AMD). But with such nice drivers and only generic support (there was no patch ever released to fix numerous reported problems over time, we should be happy if reported problem was fixed in one of regular, i.e. taking MONTHS (!) releases)....
3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging. From other side, I posted link to application itself. Kernels are supplied in text CL file... Use them for debugging issue, why not. OR application developer should debug drivers instead? Sorry, I have other paid work to do... And I got no profit trying (as one of results of my OpenCL programming) to increase AMD revenues (biggest BOINC project needs good hardware and our participants pay for that hardware good money!).
4) citations from this forum posted on AMDs main pages. If recent thread will report about broken driver... hm, yes, I think it can speedup bugfix in some way... No need to being rude for this, but keeping thread on top list is essintial.
0 Likes
Raistmer
Adept II

Catalyst 11.10preview2 is broken

Originally posted by: timattox

Without details on what is broken, and any hope of AMD reproducing the problem, how can they tell if your problem is real, or is simply PEBKAC.  (Google it...)



-- Tim



LoL, I googled it 🙂
Well, what I did to get errors:
1) uninstalled old driver.
2)rebooted
4)installed Cat11.9 release
3)rebooted
4)ran test - ALL FINE (not quite all, speed issues of course, but correct results at least).
5)uninstalled Cat 11.9
6)rebooted
7)installed Cat11.10 preview2
8)rebooted
9)ran test - ERRORS !

Maybe it's PEBKAC and I did something wrong still, but...
0 Likes
genaganna
Journeyman III

Catalyst 11.10preview2 is broken

Originally posted by: Raistmer 3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging.


Raistmer,

Would it be possible to send AutoCorrelation search code which helps us to look issue?

0 Likes
Raistmer
Adept II

Catalyst 11.10preview2 is broken

Originally posted by: genaganna

Originally posted by: Raistmer 3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging.





Raistmer,




Would it be possible to send AutoCorrelation search code which helps us to look issue?



Sure. Actually you already have corresponding kernels if you downloaded posted benchmark.
Look for these 3 kernels in CL file there:

RepackInput_kernel_cl, FindAutoCorrelation_reduce0_kernel1_cl,FindAutoCorrelation_reduce1_kernel_cl

if you need corresponding host code let me know.
0 Likes
Raistmer
Adept II

Catalyst 11.10preview2 is broken

(but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed)

EDIT: full function as is:

void PC_FindAutoCorrelation_cl(float* PowerBin){ // Because positive and negative delays are the same // we only have to look at half the points //Jason: Use 4N-FFT method for Type 2 Discrete Cosine Tranform for now, to match fftw's REDFT10 // 1 Autocorrelation from global powerspectrum at fft_num*ac_fft_len (fft_num*ul_NumDataPoints ) //R: all transforms on whole 2D matrix instead of single power specrum. #if 0 const int ac_fftlen=128*1024; float* b=new float[ac_fftlen*4*2]; if(b==NULL)fprintf(stderr,"Can't allocate buffer\n"); #endif //Step 1: Preprocessing - repack relevant powerspectrum into a 4N array with 'real-even symmetry' { size_t localThreads[2]={64,1};//R: TODO tune size_t globalThreads[2]={128*1024,8};//R: hardwired FFT len and number of FFTs for now #if 0 err=clEnqueueReadBuffer(cq,gpu_PowerSpectrum,CL_TRUE,sizeof(float)*ac_fftlen*1,sizeof(float)*ac_fftlen,b,0, NULL,NULL); if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err); fprintf(stderr,"power spectrum:\n"); for(int i=0;i<ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g\n",i,b); #endif int i=0; err = clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_PowerSpectrum); err |= clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData); //int size=128*1024; //err |=clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_int),(void *)&size); if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackInput_kernel_cl: %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: Setting kernel argument:RepackInput_kernel_cl ok\n"); #endif err = clEnqueueNDRangeKernel(cq,RepackInput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL); if(err) fprintf(stderr,"ERROR: RepackInput_kernel_cl: %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: RepackInput_kernel_cl ok\n"); #endif } #if 0 err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL); if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err); fprintf(stderr,"after inflation:\n"); for(int i=0;i<4*ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]); #endif //Step 2: Process the 4N-FFT (Complex to Complex, size is 4 * ac_fft_len) err |= clFFT_ExecuteInterleaved_mb(cq, autocorr_plan,8, clFFT_Forward, gpu_WorkData,gpu_WorkData, FFTbuf, 0, NULL, NULL); if(err) fprintf(stderr,"ERROR: autocorr fft: %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: autocorr fft ok\n"); #endif #if 0 err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL); if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err); fprintf(stderr,"after fft:\n"); for(int i=0;i<4*ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]); #endif #if 0 //R: this step removed now //Step 3: Postprocess the FFT result (Scale, take powers & normalize), discarding unused data packing into AutoCorr_in first half for VRAM reuse { size_t localThreads[2]={64,1};//R: TODO tune size_t globalThreads[2]={32*1024,8};//R: hardwired FFT len and number of FFTs for now int i=0; err = clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData); err |= clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf); //int size=64*1024; //err |=clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_int),(void *)&size); if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackOutput_kernel_cl: %d\n",err); err = clEnqueueNDRangeKernel(cq,RepackOutput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL); if(err) fprintf(stderr,"ERROR: RepackOutput_kernel_cl: %d\n",err); } #endif #if 0 err=clEnqueueReadBuffer(cq,FFTbuf,CL_TRUE,1*64*1024*sizeof(float),sizeof(float)*ac_fftlen/2,b,0, NULL,NULL); if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err); fprintf(stderr,"after packing:\n"); for(int j=0;j<ac_fftlen/2;j++) fprintf(stderr,"AutoCorrelation[%d]=%.4g\n",j,b); //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]); exit(0); #endif #if 0 //USE_OPENCL_NV || USE_OPENCL_HD5xxx {size_t localThreads[2]={256,1}; size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now int i=0; err = clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData); err |= clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults); if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_kernel1_cl: %d\n",err); err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_kernel1_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL); if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err); } #else // const int len = 64*1024;//autocorr_fft_len/2; // const int blksize = UNSTDMAX(4, UNSTDMIN(pow2a((unsigned int) sqrt((float) (len / 32)) * 32), 512)); {size_t localThreads[2]={64,1}; size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now int i=0; err = clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData); err |= clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf); if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl: %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl ok\n"); #endif err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce0_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL); if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: FindAutoCorrelation_reduce0_kernel_cl ok\n"); #endif } #if 0 err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,0*256*sizeof(float),sizeof(float)*256*8*3,b,0, NULL,NULL); if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err); fprintf(stderr,"partially reduced:\n"); for(int j=0;j<256*8;j++) fprintf(stderr,"j=%d,partial_sum=%.4g,partial_max=%.4g, idx=%.4g\n",j,b[3*j],b[3*j+1],b[3*j+2]); //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]); exit(0); #endif { int i=0; size_t localThreads[1]={1}; size_t globalThreads[1]={8};//R: hardwired number of FFTs for now err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf); err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults); if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n"); #endif err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl, 1, NULL,globalThreads, localThreads, 0,NULL,NULL); if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n"); #endif } #endif #if OCL_WDM err=clFlush(cq); if(err)fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err); #if __Win32 Sleep(OCL_WDM_SLEEP); #elif _GNU_SOURCE usleep(OCL_WDM_SLEEP); #endif #endif err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0, NULL,NULL); if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err); #if OCL_VERBOSE else fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n"); #endif #if 0 for(int i=0;i<8;i++)fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]); exit(0); #endif }

0 Likes
genaganna
Journeyman III

Catalyst 11.10preview2 is broken

Originally posted by: Raistmer (but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed) EDIT: full function as is:


Thank you for giving some code.  Could you please copy oclFFT kernels also or tell us with what options oclFFT kernels were generated?

0 Likes
Raistmer
Adept II

Catalyst 11.10preview2 is broken

Originally posted by: genaganna

Originally posted by: Raistmer (but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed) EDIT: full function as is:





Thank you for giving some code.  Could you please copy oclFFT kernels also or tell us with what options oclFFT kernels were generated?


Sure.
Here is slightly modded oclFFT call I used:
(batch size should be 8 and FFT length 4*128k=512k)

cl_int clFFT_ExecuteInterleaved_mb( cl_command_queue queue, clFFT_Plan Plan, cl_int batchSize, clFFT_Direction dir, cl_mem data_in, cl_mem data_out, cl_mem temp_buf, cl_int num_events, cl_event *event_list, cl_event *event ) { int s; cl_fft_plan *plan = (cl_fft_plan *) Plan; if(plan->format != clFFT_InterleavedComplexFormat) return CL_INVALID_VALUE; cl_int err=CL_SUCCESS; size_t gWorkItems, lWorkItems; int inPlaceDone; cl_int isInPlace = data_in == data_out ? 1 : 0; cl_mem memObj[3]; memObj[0] = data_in; memObj[1] = data_out; memObj[2] = temp_buf; cl_fft_kernel_info *kernelInfo = plan->kernel_info; int numKernels = plan->num_kernels; //fprintf(stderr,"ExecuteInterleaved_mb: plan->num_kernels =%d\n",(int)plan->num_kernels); int numKernelsOdd = numKernels & 1; int currRead = 0; int currWrite = 1; #if 0 cl_event fft_event=NULL; unsigned int pass=0; #endif // at least one external dram shuffle (transpose) required if(plan->temp_buffer_needed) { // in-place transform if(isInPlace) { inPlaceDone = 0; currRead = 1; currWrite = 2; } else { currWrite = (numKernels & 1) ? 1 : 2; } while(kernelInfo){ if( isInPlace && numKernelsOdd && !inPlaceDone && kernelInfo->in_place_possible) { currWrite = currRead; inPlaceDone = 1; } #if 0 if(fft_event){ cl_ulong start,end; err=clWaitForEvents(1,&fft_event); err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL); err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL); //Counters<T_oclFFT1_ns,cl_ulong>::update(end-start); fprintf(stderr,"Pass %u: kernel took: %.2e ns, s=%d\n",pass,float(end-start),batchSize); err|=clReleaseEvent(fft_event);fft_event=NULL; if(err != CL_SUCCESS) fprintf(stderr,"ERROR: mb oclFFT_1 event: %d\n",err); else fprintf(stderr,"INFO: mb oclFFT_1 event done ok.\n"); } #endif s = batchSize; getKernelWorkDimensions(plan, kernelInfo, &s, &gWorkItems, &lWorkItems); //fprintf(stderr,"After getKernelWorkDimensions:\nbatchSize s =%d, gWorkItems =%d, lWorkItems =%d, dir =%d\n", s, gWorkItems, lWorkItems, dir); err |= clSetKernelArg(kernelInfo->kernel, 0, sizeof(cl_mem), &memObj[currRead]); err |= clSetKernelArg(kernelInfo->kernel, 1, sizeof(cl_mem), &memObj[currWrite]); err |= clSetKernelArg(kernelInfo->kernel, 2, sizeof(cl_int), &dir); err |= clSetKernelArg(kernelInfo->kernel, 3, sizeof(cl_int), &s); err |= clEnqueueNDRangeKernel(queue, kernelInfo->kernel, 1, NULL, &gWorkItems, &lWorkItems, 0, NULL,NULL/*&fft_event*/); if(err){ fprintf(stderr,"Error in mb oclFFT_1: %d\n",err); return err; } #if OCL_VERBOSE else fprintf(stderr,"INFO: in mb oclFFT_1 ok\n"); #endif currRead = (currWrite == 1) ? 1 : 2; currWrite = (currWrite == 1) ? 2 : 1; kernelInfo = kernelInfo->next; #if 0 pass++; #endif } } // no dram shuffle (transpose required) transform // all kernels can execute in-place. else { while(kernelInfo){ #if 0 if(fft_event){ cl_ulong start,end; err=clWaitForEvents(1,&fft_event); err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL); err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL); //Counters<T_oclFFT2_ns,cl_ulong>::update(end-start); fprintf(stderr,"Pass %u: kernel took: %.2e ns, s=%d\n",pass,float(end-start),batchSize); err|=clReleaseEvent(fft_event);fft_event=NULL; if(err != CL_SUCCESS) fprintf(stderr,"ERROR: oclFFT_2 event.\n"); else fprintf(stderr,"INFO: mb oclFFT_2 event done ok\n"); } #endif s = batchSize; getKernelWorkDimensions(plan, kernelInfo, &s, &gWorkItems, &lWorkItems); //fprintf(stderr,"After getKernelWorkDimensions:\nbatchSize s =%d, gWorkItems =%d, lWorkItems =%d, dir =%d\n", s, gWorkItems, lWorkItems, dir); err |= clSetKernelArg(kernelInfo->kernel, 0, sizeof(cl_mem), &memObj[currRead]); err |= clSetKernelArg(kernelInfo->kernel, 1, sizeof(cl_mem), &memObj[currWrite]); err |= clSetKernelArg(kernelInfo->kernel, 2, sizeof(cl_int), &dir); err |= clSetKernelArg(kernelInfo->kernel, 3, sizeof(cl_int), &s); err |= clEnqueueNDRangeKernel(queue, kernelInfo->kernel, 1, NULL, &gWorkItems, &lWorkItems, 0, NULL,NULL/*&fft_event*/); if(err){ fprintf(stderr,"Error in mb oclFFT_2: %d\n",err); return err; } #if OCL_VERBOSE else fprintf(stderr,"INFO: in mb oclFFT_2 ok\n"); #endif currRead = 1; currWrite = 1; //fprintf(stderr,"INFO: before mb oclFFT_2 kernel_info->next\n"); kernelInfo = kernelInfo->next; //fprintf(stderr,"INFO: after mb oclFFT_2 kernel_info->next\n"); #if 0 pass++; #endif } } #if 0 if(fft_event){ cl_ulong start,end; err=clWaitForEvents(1,&fft_event); err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL); err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL); //Counters<T_oclFFT1_ns,cl_ulong>::update(end-start); fprintf(stderr,"Final pass: kernel took: %.2e ns, s=%d\n",float(end-start),batchSize); err|=clReleaseEvent(fft_event);fft_event=NULL; if(err != CL_SUCCESS)fprintf(stderr,"ERROR: Final mb oclFFT event: %d\n",err); else fprintf(stderr,"INFO: Final mb oclFFT event done ok.\n",err); } #endif // if(batchSize>1) Sleep(2); // else Sleep(1); //fprintf(stderr,"ExecuteInterleaved_mb: about to return with err=%d\n",(int)err); return err; }

0 Likes