Raistmer · ‎10-11-2011

Invalid results produced under this driver

Unfortunately, not all users read message boards.
Hosts with this driver have already appeared in the list of participants and are producing incorrect results.

When can the next Catalyst version, or at least the withdrawal of this one, be expected? Having it online and downloadable hurts the project I work on.

timattox · ‎10-11-2011

Raistmer,

I'm curious what project (some BOINC project perhaps?) your participants are generating incorrect results for. You don't give a googleable name, and you don't give a URL.

If you want AMD to help out, you need to be a bit more understanding that they are dealing with hundreds (thousands?) of different users/developers. To expect them to just remember you and your unnamed project is a bit much.

Without details on what is broken, and without any hope of AMD reproducing the problem, how can they tell if your problem is real or simply PEBKAC? (Google it...)

I am also frustrated with the slow pace of support from AMD for OpenCL issues. But being rude in the forums won't help your case.

-- Tim

FrodoTheGiant · ‎10-11-2011

PEBKAC: Poorly Educated Bored Know-all AMD Coder

... and AMD seems to have lots of those in their driver department.

Raistmer · ‎10-11-2011

Originally posted by: timattox

Without details on what is broken, and without any hope of AMD reproducing the problem, how can they tell if your problem is real or simply PEBKAC? (Google it...)

-- Tim

LoL, I googled it 🙂
Well, here is what I did to get the errors:
1) uninstalled the old driver
2) rebooted
3) installed the Cat 11.9 release
4) rebooted
5) ran the test - ALL FINE (not quite all, there are speed issues of course, but correct results at least)
6) uninstalled Cat 11.9
7) rebooted
8) installed Cat 11.10 preview2
9) rebooted
10) ran the test - ERRORS!

Maybe it's PEBKAC and I did something wrong still, but...

Raistmer · ‎10-11-2011

1) The BOINC project is SETI@home.
2) There are thousands of participants already. A good deal of them are trying to use ATI GPUs (moreover, some buy such GPUs specifically for this, i.e. they generate direct revenue for AMD). But with such nice drivers and only generic support (no patch was ever released to fix the numerous problems reported over time; we should be happy if a reported problem gets fixed in one of the regular releases, i.e. after MONTHS (!))....
3) Well, I can say what is broken: the AutoCorrelation search is broken and produces overflow in results (too many false positives reported). But I hardly think this app-specific info can help in driver debugging. On the other hand, I posted a link to the application itself. The kernels are supplied in a text CL file... Use them for debugging the issue, why not. Or should the application developer debug the drivers instead? Sorry, I have other paid work to do... And I get no profit from trying (as one result of my OpenCL programming) to increase AMD's revenue (the biggest BOINC project needs good hardware, and our participants pay good money for that hardware!).
4) Citations from this forum are posted on AMD's main pages. If a recent thread reports a broken driver... hm, yes, I think it can speed up the bugfix in some way... No need to be rude for this, but keeping the thread at the top of the list is essential.

genaganna · ‎10-11-2011

Originally posted by: Raistmer 3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging.

Raistmer,

Would it be possible to send the AutoCorrelation search code, which would help us look into the issue?

Raistmer · ‎10-11-2011

Originally posted by: genaganna

Originally posted by: Raistmer 3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging.

Raistmer,

Would it be possible to send the AutoCorrelation search code, which would help us look into the issue?

Sure. Actually, you already have the corresponding kernels if you downloaded the posted benchmark.
Look for these 3 kernels in the CL file there:

RepackInput_kernel_cl, FindAutoCorrelation_reduce0_kernel1_cl, FindAutoCorrelation_reduce1_kernel_cl

If you need the corresponding host code, let me know.

Raistmer · ‎10-11-2011

(But it uses oclFFT with a size of 512k. It is quite possible that one of the FFT kernels failed.)

EDIT: full function as is:

void PC_FindAutoCorrelation_cl(float* PowerBin){
    // Because positive and negative delays are the same
    // we only have to look at half the points
    //Jason: Use 4N-FFT method for Type 2 Discrete Cosine Transform for now, to match fftw's REDFT10
    // 1 Autocorrelation from global powerspectrum at fft_num*ac_fft_len (fft_num*ul_NumDataPoints )
    //R: all transforms on whole 2D matrix instead of single power spectrum.
#if 0
    const int ac_fftlen=128*1024;
    float* b=new float[ac_fftlen*4*2];
    if(b==NULL)fprintf(stderr,"Can't allocate buffer\n");
#endif
    //Step 1: Preprocessing - repack relevant powerspectrum into a 4N array with 'real-even symmetry'
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={128*1024,8};//R: hardwired FFT len and number of FFTs for now
#if 0
        err=clEnqueueReadBuffer(cq,gpu_PowerSpectrum,CL_TRUE,sizeof(float)*ac_fftlen*1,sizeof(float)*ac_fftlen,b,0, NULL,NULL);
        if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
        fprintf(stderr,"power spectrum:\n");
        for(int i=0;i<ac_fftlen;i++)
            fprintf(stderr,"i=%d,x=%g\n",i,b[i]);
#endif
        int i=0;
        err = clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_PowerSpectrum);
        err |= clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        //int size=128*1024;
        //err |=clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        else fprintf(stderr,"INFO: Setting kernel argument:RepackInput_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,RepackInput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        else fprintf(stderr,"INFO: RepackInput_kernel_cl ok\n");
#endif
    }
#if 0
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after inflation:\n");
    for(int i=0;i<4*ac_fftlen;i++)
        fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
    //Step 2: Process the 4N-FFT (Complex to Complex, size is 4 * ac_fft_len)
    err |= clFFT_ExecuteInterleaved_mb(cq, autocorr_plan,8, clFFT_Forward, gpu_WorkData,gpu_WorkData, FFTbuf, 0, NULL, NULL);
    if(err) fprintf(stderr,"ERROR: autocorr fft: %d\n",err);
#if OCL_VERBOSE
    else fprintf(stderr,"INFO: autocorr fft ok\n");
#endif
#if 0
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after fft:\n");
    for(int i=0;i<4*ac_fftlen;i++)
        fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
#if 0 //R: this step removed now
    //Step 3: Postprocess the FFT result (Scale, take powers & normalize), discarding unused data packing into AutoCorr_in first half for VRAM reuse
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={32*1024,8};//R: hardwired FFT len and number of FFTs for now
        int i=0;
        err = clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        //int size=64*1024;
        //err |=clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackOutput_kernel_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,RepackOutput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackOutput_kernel_cl: %d\n",err);
    }
#endif
#if 0
    err=clEnqueueReadBuffer(cq,FFTbuf,CL_TRUE,1*64*1024*sizeof(float),sizeof(float)*ac_fftlen/2,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after packing:\n");
    for(int j=0;j<ac_fftlen/2;j++)
        fprintf(stderr,"AutoCorrelation[%d]=%.4g\n",j,b[j]);
        //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    exit(0);
#endif
#if 0 //USE_OPENCL_NV || USE_OPENCL_HD5xxx
    {
        size_t localThreads[2]={256,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_kernel1_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_kernel1_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
    }
#else
    // const int len = 64*1024;//autocorr_fft_len/2;
    // const int blksize = UNSTDMAX(4, UNSTDMIN(pow2a((unsigned int) sqrt((float) (len / 32)) * 32), 512));
    {
        size_t localThreads[2]={64,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        else fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce0_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        else fprintf(stderr,"INFO: FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
    }
#if 0
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,0*256*sizeof(float),sizeof(float)*256*8*3,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"partially reduced:\n");
    for(int j=0;j<256*8;j++)
        fprintf(stderr,"j=%d,partial_sum=%.4g,partial_max=%.4g, idx=%.4g\n",j,b[3*j],b[3*j+1],b[3*j+2]);
        //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    exit(0);
#endif
    {
        int i=0;
        size_t localThreads[1]={1};
        size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
        err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        else fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl, 1, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        else fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
    }
#endif
#if OCL_WDM
    err=clFlush(cq);
    if(err)fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
#if __Win32
    Sleep(OCL_WDM_SLEEP);
#elif _GNU_SOURCE
    usleep(OCL_WDM_SLEEP);
#endif
#endif
    err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0, NULL,NULL);
    if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
#if OCL_VERBOSE
    else fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
#endif
#if 0
    for(int i=0;i<8;i++)fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
    exit(0);
#endif
}
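
For readers following Step 1 of the function above: the source of RepackInput_kernel_cl is not shown in this thread, so the following is only a minimal host-side C sketch of what a 4N "real-even" repack for a DCT-II via a complex FFT typically looks like. The function name and the interleaved (re, im) layout are assumptions made for illustration, not the project's actual code.

```c
#include <stddef.h>

/*
 * Hypothetical host-side illustration (NOT the RepackInput_kernel_cl from
 * the benchmark). Packs n real samples into a 4n-point interleaved complex
 * array (re, im, re, im, ...) with real-even symmetry:
 *   complex element 2k+1 and its mirror 4n-(2k+1) both hold in[k],
 *   every other element is zero.
 * A forward complex FFT of length 4n over `out` then yields, in the real
 * parts of bins 0..n-1, the unnormalized DCT-II (FFTW's REDFT10) of `in`.
 * `out` must have room for 2 * 4n floats.
 */
static void repack_real_even_4n(const float *in, size_t n, float *out)
{
    size_t total = 4 * n;                 /* number of complex elements */

    for (size_t m = 0; m < 2 * total; m++)
        out[m] = 0.0f;                    /* clear real and imaginary parts */

    for (size_t k = 0; k < n; k++) {
        size_t lo = 2 * k + 1;            /* odd index in the first half */
        size_t hi = total - lo;           /* mirrored index, 4n - (2k+1) */
        out[2 * lo] = in[k];              /* real part; imaginary stays 0 */
        out[2 * hi] = in[k];
    }
}
```

With in = {1, 2, 3, 4}, the real parts land at complex indices 1, 3, 5, 7 and are mirrored at 15, 13, 11, 9. Comparing a host-side reference like this against the GPU buffer after Step 1 may be one way to tell whether the repack kernel or a later FFT pass is the one going wrong.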

genaganna · ‎10-12-2011

Originally posted by: Raistmer (but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed) EDIT: full function as is:

Thank you for giving some code. Could you please copy oclFFT kernels also or tell us with what options oclFFT kernels were generated?

Raistmer · ‎10-12-2011

Originally posted by: genaganna

Originally posted by: Raistmer (but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed) EDIT: full function as is:

Thank you for giving some code. Could you please copy oclFFT kernels also or tell us with what options oclFFT kernels were generated?

Sure.
Here is the slightly modded oclFFT call I used:
(batch size should be 8 and FFT length 4*128k=512k)

cl_int clFFT_ExecuteInterleaved_mb( cl_command_queue queue, clFFT_Plan Plan, cl_int batchSize, clFFT_Direction dir,
                                    cl_mem data_in, cl_mem data_out, cl_mem temp_buf,
                                    cl_int num_events, cl_event *event_list, cl_event *event )
{
    int s;
    cl_fft_plan *plan = (cl_fft_plan *) Plan;
    if(plan->format != clFFT_InterleavedComplexFormat)
        return CL_INVALID_VALUE;

    cl_int err=CL_SUCCESS;
    size_t gWorkItems, lWorkItems;
    int inPlaceDone;

    cl_int isInPlace = data_in == data_out ? 1 : 0;

    cl_mem memObj[3];
    memObj[0] = data_in;
    memObj[1] = data_out;
    memObj[2] = temp_buf;
    cl_fft_kernel_info *kernelInfo = plan->kernel_info;
    int numKernels = plan->num_kernels;
    //fprintf(stderr,"ExecuteInterleaved_mb: plan->num_kernels =%d\n",(int)plan->num_kernels);
    int numKernelsOdd = numKernels & 1;
    int currRead = 0;
    int currWrite = 1;
#if 0
    cl_event fft_event=NULL;
    unsigned int pass=0;
#endif
    // at least one external dram shuffle (transpose) required
    if(plan->temp_buffer_needed)
    {
        // in-place transform
        if(isInPlace)
        {
            inPlaceDone = 0;
            currRead = 1;
            currWrite = 2;
        }
        else
        {
            currWrite = (numKernels & 1) ? 1 : 2;
        }

        while(kernelInfo){
            if( isInPlace && numKernelsOdd && !inPlaceDone && kernelInfo->in_place_possible)
            {
                currWrite = currRead;
                inPlaceDone = 1;
            }
#if 0
            if(fft_event){
                cl_ulong start,end;
                err=clWaitForEvents(1,&fft_event);
                err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL);
                err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
                //Counters<T_oclFFT1_ns,cl_ulong>::update(end-start);
                fprintf(stderr,"Pass %u: kernel took: %.2e ns, s=%d\n",pass,float(end-start),batchSize);
                err|=clReleaseEvent(fft_event);fft_event=NULL;
                if(err != CL_SUCCESS) fprintf(stderr,"ERROR: mb oclFFT_1 event: %d\n",err);
                else fprintf(stderr,"INFO: mb oclFFT_1 event done ok.\n");
            }
#endif
            s = batchSize;
            getKernelWorkDimensions(plan, kernelInfo, &s, &gWorkItems, &lWorkItems);
            //fprintf(stderr,"After getKernelWorkDimensions:\nbatchSize s =%d, gWorkItems =%d, lWorkItems =%d, dir =%d\n", s, gWorkItems, lWorkItems, dir);
            err |= clSetKernelArg(kernelInfo->kernel, 0, sizeof(cl_mem), &memObj[currRead]);
            err |= clSetKernelArg(kernelInfo->kernel, 1, sizeof(cl_mem), &memObj[currWrite]);
            err |= clSetKernelArg(kernelInfo->kernel, 2, sizeof(cl_int), &dir);
            err |= clSetKernelArg(kernelInfo->kernel, 3, sizeof(cl_int), &s);

            err |= clEnqueueNDRangeKernel(queue, kernelInfo->kernel, 1, NULL, &gWorkItems, &lWorkItems, 0, NULL,NULL/*&fft_event*/);
            if(err){
                fprintf(stderr,"Error in mb oclFFT_1: %d\n",err);
                return err;
            }
#if OCL_VERBOSE
            else fprintf(stderr,"INFO: in mb oclFFT_1 ok\n");
#endif
            currRead = (currWrite == 1) ? 1 : 2;
            currWrite = (currWrite == 1) ? 2 : 1;

            kernelInfo = kernelInfo->next;
#if 0
            pass++;
#endif
        }
    }
    // no dram shuffle (transpose required) transform
    // all kernels can execute in-place.
    else {
        while(kernelInfo){
#if 0
            if(fft_event){
                cl_ulong start,end;
                err=clWaitForEvents(1,&fft_event);
                err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL);
                err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
                //Counters<T_oclFFT2_ns,cl_ulong>::update(end-start);
                fprintf(stderr,"Pass %u: kernel took: %.2e ns, s=%d\n",pass,float(end-start),batchSize);
                err|=clReleaseEvent(fft_event);fft_event=NULL;
                if(err != CL_SUCCESS) fprintf(stderr,"ERROR: oclFFT_2 event.\n");
                else fprintf(stderr,"INFO: mb oclFFT_2 event done ok\n");
            }
#endif
            s = batchSize;
            getKernelWorkDimensions(plan, kernelInfo, &s, &gWorkItems, &lWorkItems);
            //fprintf(stderr,"After getKernelWorkDimensions:\nbatchSize s =%d, gWorkItems =%d, lWorkItems =%d, dir =%d\n", s, gWorkItems, lWorkItems, dir);
            err |= clSetKernelArg(kernelInfo->kernel, 0, sizeof(cl_mem), &memObj[currRead]);
            err |= clSetKernelArg(kernelInfo->kernel, 1, sizeof(cl_mem), &memObj[currWrite]);
            err |= clSetKernelArg(kernelInfo->kernel, 2, sizeof(cl_int), &dir);
            err |= clSetKernelArg(kernelInfo->kernel, 3, sizeof(cl_int), &s);

            err |= clEnqueueNDRangeKernel(queue, kernelInfo->kernel, 1, NULL, &gWorkItems, &lWorkItems, 0, NULL,NULL/*&fft_event*/);
            if(err){
                fprintf(stderr,"Error in mb oclFFT_2: %d\n",err);
                return err;
            }
#if OCL_VERBOSE
            else fprintf(stderr,"INFO: in mb oclFFT_2 ok\n");
#endif
            currRead = 1;
            currWrite = 1;
            //fprintf(stderr,"INFO: before mb oclFFT_2 kernel_info->next\n");
            kernelInfo = kernelInfo->next;
            //fprintf(stderr,"INFO: after mb oclFFT_2 kernel_info->next\n");
#if 0
            pass++;
#endif
        }
    }
#if 0
    if(fft_event){
        cl_ulong start,end;
        err=clWaitForEvents(1,&fft_event);
        err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL);
        err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
        //Counters<T_oclFFT1_ns,cl_ulong>::update(end-start);
        fprintf(stderr,"Final pass: kernel took: %.2e ns, s=%d\n",float(end-start),batchSize);
        err|=clReleaseEvent(fft_event);fft_event=NULL;
        if(err != CL_SUCCESS)fprintf(stderr,"ERROR: Final mb oclFFT event: %d\n",err);
        else fprintf(stderr,"INFO: Final mb oclFFT event done ok.\n",err);
    }
#endif
    // if(batchSize>1) Sleep(2);
    // else Sleep(1);
    //fprintf(stderr,"ExecuteInterleaved_mb: about to return with err=%d\n",(int)err);
    return err;
}

Raistmer · ‎10-15-2011

any news on topic?

freighter · ‎11-02-2011

Originally posted by: Raistmer any news on topic?

Only some more bad news, using the AMD Catalyst 11.10 release x86_64 Linux driver.

Everything that Raistmer reported earlier in this thread for the Windows drivers also affects the Linux version. Windows and Linux share the same codebase.

The autocorrelation kernels produced the same type of incorrect results on openSUSE 11.4 64-bit and Kubuntu 10.10 64-bit.

Another problem was also introduced with the new driver version, Catalyst 11.10, when using the Linux OpenCL version for HD4xxx with a max workgroup size of 128. A severe host system freeze (bluescreen in ati2dvag.dll) when running this application version has affected all Windows versions (tested since Cat 10.12) and now newly appears on 64-bit Linux too (tested OK up to Cat 11.9). As Linux does not throw "bluescreens", the computer becomes completely unresponsive while still showing the frozen desktop, and needs to be rebooted. This happens reproducibly at the same place in the code (see below: POINT OF FAILURE). This was tested on openSUSE 11.3 64-bit with an HD4670, 1GB, OpenCL 1.0 device.

....

    localThreads[0]=32; //uje: needed for lower HD4xxx GPUs
#else
    localThreads[0]=host.GetWGSize();
#endif
    //fprintf(stderr,"localThreads[0]=%d\n",localThreads[0]);
    err = clEnqueueNDRangeKernel(cq,PC_find_spike_kernel_cl,
             1,//R: 1D execution domain used, each work item works with 4 data elements
             NULL,globalThreads,
             localThreads,
             0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: Enqueueing kernel:PC_find_spike_kernel_cl:%d\n",err);
#if OCL_VERBOSE
    else fprintf(stderr,"INFO: Enqueueing kernel: PC_find_spike_kernel_cl done ok\n");
#endif
    //clFinish(cq);
    //R: checking if any CPU reprocessing/logging needed and if yes, retrieve results
    cl_uint cpu_result_flag[RESULT_SIZE];
    fprintf(stderr,"INFO: Before clEnqueueReadBuffer\n");

#if OCL_WDM
    err=clFlush(cq);
    if(err)fprintf(stderr,"ERROR: submitting kernels for non-strip Spike search: %d\n",err);
#if __Win32
    Sleep(OCL_WDM_SLEEP);
#elif _GNU_SOURCE
    usleep(OCL_WDM_SLEEP);
#endif
#endif
    err=clEnqueueReadBuffer(cq,gpu_result_flag,CL_TRUE,0,
        sizeof(cl_uint)*((globalThreads[0]<RESULT_SIZE)?globalThreads[0]:RESULT_SIZE),
        &cpu_result_flag,0, NULL,NULL);    //<--- POINT OF FAILURE
    if(err) fprintf(stderr,"ERROR: ReadBuffer(gpu_result_flag,spike):%d\n",err);
#if OCL_VERBOSE
    else fprintf(stderr,"INFO: ReadBuffer(gpu_result_flag,spike) done ok\n");
#endif
    fprintf(stderr,"spike search results (main path): ");

....
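
The snippet above picks the work-group size per device (32 on lower HD4xxx parts, host.GetWGSize() otherwise). As a minimal sketch, assuming nothing about the real SETI@home host code, this is how a requested size could be clamped to a device limit such as 128 and the global size rounded up to match. Both helper names are hypothetical. It does not address the driver freeze itself; it only illustrates the sizing rule.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not taken from the SETI@home sources. */

/* Clamp a requested work-group size to the device limit, i.e. the value a
 * host program would obtain for CL_DEVICE_MAX_WORK_GROUP_SIZE (or the
 * per-kernel CL_KERNEL_WORK_GROUP_SIZE). Assumes device_max >= 1. */
static size_t clamp_local_size(size_t requested, size_t device_max)
{
    if (requested == 0)
        return 1;                          /* never return an empty group */
    return requested < device_max ? requested : device_max;
}

/* Round the number of work items up to the next multiple of the local size,
 * because OpenCL 1.x requires the global work size to be evenly divisible
 * by the local work size. */
static size_t round_up_global_size(size_t work_items, size_t local)
{
    return ((work_items + local - 1) / local) * local;
}
```

For example, a request of 256 on a device limited to 128 is clamped to 128, and 1000 work items are padded to 1024. Because of that padding, a kernel launched this way needs a bounds check so that the extra work items do nothing.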

Raistmer · ‎11-15-2011

Tried the Catalyst 11.10 release.
Same problem: with this Catalyst version the app produces invalid results.

AMD, any chance of fixing this? Have you found the problem? A few weeks have passed...

freighter · ‎11-16-2011

Retested on openSUSE 11.3 64-bit with Cat 11.11 and still found my host crashing the same way as described earlier.

How about a patch for this issue?

gat3way · ‎11-16-2011

If you have previously installed SDK 2.5, please make sure you remove everything its installer placed in /etc/ld.so.conf.d

Otherwise you are using the libOpenCL.so from your SDK2.5 directory, not the one provided with Catalyst 11.11. I haven't tested it yet (I am redoing the offline compilation of all my kernels). The difference is two new targets (Scrapper and Devastator), faster compilation time, mad24 broken for some reason (didn't investigate that though, just replaced it with a slower equivalent).

MicahVillmow · ‎11-15-2011

Raistmer,
The 11.10 release is built from internal code from the August/September timeframe, so it wouldn't have this fix. I'm trying to find out the status of this specific issue and will get back to you when I can find out more.

gat3way · ‎11-15-2011

Well, apparently Catalyst 11.11 download link is broken too 🙂

MicahVillmow · ‎11-16-2011

mad24 should map directly to the hardware instruction, do you have a test case that shows the failure?

gat3way · ‎11-16-2011

Yep. This is a bad thing to do I know, but it worked before:

#define getglobalid(a) (mad24(get_group_id(0), 64, get_local_id(0)))

...

found_ind[getglobalid(0)] = 1;

Error is:

/tmp/OCLYsVbCt.cl(786): error: more than one instance of overloaded function "mad24" matches the argument list:
    function "mad24(int, int, int) C++"
    function "mad24(uint, uint, uint) C++"
    argument types are: (uint, int, uint)
found_ind[getglobalid(0)] = 1;

MicahVillmow · ‎11-16-2011

gat3way,
This is because our compiler is now more compliant with the OpenCL spec; you need to use 64U instead of 64. This change was made at the behest of our customers so that we had the same error behavior as other vendors that were correct.
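
To illustrate the point above with a small sketch in plain C: in the kernel, the fix is to write the constant as 64U so that all three mad24 arguments are unsigned. The host-side mad24_u below is only a stand-in written for this example (an assumption, not an OpenCL API), used to show the value the corrected macro computes.

```c
#include <stdint.h>

/*
 * Host-side stand-in for OpenCL's mad24(uint, uint, uint), written for
 * illustration only. It multiplies the low 24 bits of a and b and adds c,
 * wrapping modulo 2^32 like a 32-bit unsigned result.
 */
static uint32_t mad24_u(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu) + c;
}

/*
 * Mirrors the corrected kernel macro, where every argument is unsigned:
 *   #define getglobalid(a) (mad24(get_group_id(0), 64U, get_local_id(0)))
 * group_id and local_id stand in for get_group_id(0) and get_local_id(0).
 */
static uint32_t global_id_from_group(uint32_t group_id, uint32_t local_id)
{
    return mad24_u(group_id, 64u, local_id);
}
```

With a work-group size of 64, work item 5 of group 2 maps to index 2*64 + 5 = 133, which is the same value the original macro produced before the compiler started rejecting the mixed (uint, int, uint) call.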

gat3way · ‎11-16-2011

Thanks, Micah!

Raistmer · ‎11-21-2011

Well, let's return to the original problem.
I just tested the app's behavior under Win7 x64 + Catalyst 11.11: the same, the app returns incorrect results with this driver too.
When can we expect a fix for that, or some workaround suggestions?

FrodoTheGiant · ‎11-22-2011

There is a fix. It is called NVIDIA.

My company decided last week to dump AMD and switch to NVIDIA.

AMD is simply too unreliable when it comes to serious GPGPU computing.

CaptainN · ‎11-22-2011

2 FrodoTheGiant:

Can you stop hysterics?

Raistmer · ‎11-22-2011

Originally posted by: FrodoTheGiant

There is a fix. It is called NVIDIA.

My company decided last week to dump AMD and switch to NVIDIA.

AMD is simply too unreliable when it comes to serious GPGPU computing.

Unfortunately, it's not a solution in my case:
1) The latest NV drivers (270.xx and up) show the same CPU time increase as AMD's. And it looks like 285.xx produces incorrect results on Quadro GPUs, though it produces VALID results on all other tested GPUs.
2) I write a program that should be used mostly on already purchased GPUs, to enable them to participate in the world's largest distributed computing project, SETI@home.
Regarding new GPUs: it's well known on our boards that if one wants to make good progress, one should buy an nVidia GPU for SETI, not an AMD GPU.
IMO AMD lost money on the 10^5 $ scale, maybe more (maybe less, I'm not interested in economic calculations), by ignoring SETI.
nVidia was kind enough to provide programming support, so we have a "stock" CUDA app at least, and work to improve it continues.
From AMD's side I see mostly driver degradation (now it can't even perform correct computations, not to mention the increased CPU time).
It's very sad, because AMD (ATI) hardware could win in the SETI area. We already have a very fast application for one type of computation, but the requirement to install quite old drivers repels most gamers, who want to have drivers as new as possible...
AMD's software (in particular, driver) division plays a very bad joke on the hardware one, negating all their achievements!
Not to mention the disgusting software support for the most advanced AMD hardware.
I have a tester with 2 HD6990s, dual-GPU cards. He can barely use them. So how can we build high-performance computational clusters based on AMD GPUs?...
Definitely all new hardware purchases will go with nVidia mostly.
But there are many AMD GPUs bought already... The SETI community donated an HD6950 GPU to me to improve the application for ATI GPUs... and I still can't use its whole power, being restricted to 11.2 drivers that will freeze the whole system if I (for example) accidentally open a YouTube link on that host...

AMD, it's time to think about your drivers, really. You lose money, you lose reputation (hence you lose money again). BattleField3 compatibility is not all that users want....

NURBS · ‎11-22-2011

$10^5 is spare change to AMD. If the rumors are real, the loss of their contract with Apple will be significant. If you guys have worked for big corporations, you know AMD engineers are, and should be, spending more time on Indeed than fixing anything. Why? Fixing bugs won't save you from a layoff. What we need for OpenCL to succeed are killer applications used by consumers.

I feel bad for the engineers being axed, but not for the corporations making dumb decisions.

Raistmer · ‎11-22-2011

Originally posted by: NURBS

$10^5 is spare change to AMD.

Maybe I'm off by a few orders of magnitude 😉

According to boincstats, SETI@home now has 3,034,351 hosts.

If only 0.1% of them are equipped with AMD GPUs, that is 3*10^3 hosts.
If each such GPU costs $100, that gives $3*10^5. But if 1% were equipped with AMD GPUs, we would easily have $3*10^6.
Taking into account that our top participants have multi-GPU hosts (the SETI project is currently down for maintenance, but in a few hours everyone can see the top hosts for himself), it really can be something like 10^6 or more.

But I don't think that this issue affects only this single app. So the driver should be fixed anyway.
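
The back-of-envelope estimate above can be written out as a short C sketch. The function name is hypothetical, and the share is expressed in tenths of a percent (per mille) purely to keep the arithmetic in integers; the host count and the $100 price are the figures quoted in the post.

```c
/*
 * Rough revenue estimate: hosts * share * price.
 * share_per_mille is the fraction of hosts with an AMD GPU in tenths of a
 * percent (1 = 0.1%, 10 = 1%). Integer division truncates the host count.
 */
static long long estimated_gpu_revenue_usd(long long hosts,
                                           long long share_per_mille,
                                           long long price_usd)
{
    long long equipped_hosts = hosts * share_per_mille / 1000;
    return equipped_hosts * price_usd;
}
```

With 3,034,351 hosts and $100 per GPU, a 0.1% share comes to $303,400 (about 3*10^5) and a 1% share to $3,034,300 (about 3*10^6), in line with the orders of magnitude given in the post.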

Meteorhead · ‎11-22-2011

Frodo has already stated that his company decided to shift from AMD to NV. If he isn't working for MS, Apple or some other big company, AMD really doesn't have to care (taking into account that 10^5 $ is small money). It is one sad, true story like many others.

I am lucky enough to work at a research center and not at a company which is very result-oriented. I have the luxury to keep ever experimenting with alternative methods and not use factory standards.

Long ago I convinced people to invest money in a multi-GPU development node, used for exploring multi-GPU applications on the AMD side. 3X HD5970 is a lot of money for a small Hungarian research project. I have my faith in AMD engineers, who create kick@ss HW, but as Raistmer has said, it is all negated by the poor drivers. I really hate it when people write in full capitals, or when people are vulgar, but as a good friend of mine said, vulgar has its place:

Two damned GPU generations have passed (with HD7xxx just around the corner), and our 3X HD5970s ARE STILL NOT SUPPORTED!! Really... is AMD really serious about that?? Even if I have my faith in AMD, how should I convince others to invest money in it? Let me try to guess what comes into Frodo's mind: "NOHOW! Do a favor for everyone and keep them away from AMD!"

Unfortunately for AMD, I believe I can achieve more if I keep on pushing AMD, and keep showing either its impotence or its merits. It is all up to "you".

And for the higher ups. Please-please-please Micah, Genaganna or anyone else who reads this. Do convey these comments as quotations. Not summaries, or some light version, but word by word:

Dear AMD,

you're going to your doom if you keep on like this. 2 ****ing years for a GPU to be supported?? By the time I can leverage my HW, it is surpassed by competition mid-class. INVEST MORE INTO SW DEVELOPMENT!! A LOT MORE!!!!!!

Regards,

a fan.

Edit: and don't get me wrong. This is not advice... this is an ultimatum from the users (developers) as a collective.

Edit2: Not to mention I opened a topic concerning a rather fatal bug of Kernel Analyzer over 3 weeks ago, and I still got no response from anybody, although I'm pretty sure that at least one person knows where KA stores data that has to be cleaned after reinstall.

FrodoTheGiant · ‎11-22-2011

Originally posted by: Raistmer

Needless to say about disgusting software support of most advanced AMD hardware. I have tester with 2 HD6990, 2-core GPUs. He barely can use them.

It is really a slap in the face for those users spending big money on dual GPU cards like the 5970 or 6990 - and then can't use them because AMD doesn't support them, drops support or breaks drivers.

How would anyone trust AMD again?

Why would anyone ever buy a high-end AMD GPU again based on that history of scorched earth they have with existing users?

Meteorhead · ‎11-24-2011

Since the topic has been idle for a while now (I don't know if my previous post was really that effective, or people simply lost interest in dissing QoS), I thought I would post how one can waste a complete day installing Linux.

... ... ... I wanted to write a little short story, to put things nicely and not just bitch about things that don't work, but as I was thinking, I figured there's really no point, so I'll just make it short: 11.11 fails even to launch the Ubuntu desktop 80% of the time. The desktop simply freezes just a few seconds after the background is shown, the mouse being movable for a short while. Naturally this prevents remote rebooting, as most of the time the computer will not be able to boot.

For other reasons, I got fed up with Ubuntu, so I installed SLC5 (Scientific Linux CERN, a Red Hat based distro). After installing a few things and restarting the computer in between, once udev has been loaded at boot time and the GUI boot window should open, the screen changes resolution (very noticeable on the CRT monitor in the computer room) and then just stays blank and the boot halts, just like on Ubuntu (10.04.3, both 64-bit). I have done this twice already (I thought it was the fault of trying to get auto-login working, but even after undoing the changes with a live Ubuntu stick, it still fails to boot).

If I pull the computer out of the rack, I've got myself a $3000 doorstop. Or I could heat my office with it...

What is the conclusion? Life would be much better, if I could install a headless node, with no GUI whatsoever and be able to use the GPUs without having to use this crap XServer which crashes at every second corner. Yet again, one thing the community has been asking for over a year now.

Peace.

FrodoTheGiant · ‎11-25-2011

Originally posted by: Meteorhead Since the topic has been idle for a while now (don't know if my previous post was really that effective, or people just simply lost interest in dising QoS)..."

I totally agree with you, but did not post it so that your comment remained the last in this thread. The most visible.

So I was hoping someone from AMD might read it - and respond.

But ... nothing. I am done with AMD.

himanshu_gautam · ‎11-25-2011

Hi Everybody,

It is really sad that it is taking so much time to fix these reported issues. I will try to find out their status and update here. I would like to assure you that these issues are being looked into, based on their priorities.

To summarize the issues, so that I am not missing anything:

Raistmer reports an issue with the correctness of the autocorrelation function in SETI@home. This happens on both Windows and Linux.

Meteorhead is trying to run multi-GPU (x2 cards) configurations, which seem to be working. But the driver is causing boot issues on both Ubuntu and SLC5.

FrodoTheGiant · ‎11-25-2011

@himanshu.gautam: Please also add the "100% CPU load bug" to your list.

This has been broken for the last 5 Catalyst releases, including the newest one.

On (Windows?) systems with more than one graphics card installed, you always have a CPU load of 100% when running OpenCL applications. It should be ~2-3% (and was in older Catalyst versions).

freighter · ‎11-27-2011

Originally posted by: himanshu.gautam Hi Everybody,...

To summarize the issues, so that I am not missing anything:...

And not to forget the seti@home-issue with ATI GPUs with max. workgroup size of 128 (HD43xx - HD46xx), which will crash on windows (tested drivers Cat 10.12 till 11.11) but do work ok on linux (driver <= Cat 11.9). This is a different problem than the incorrectness issue.

gat3way · ‎11-28-2011

Are you sure it works OK on linux? When using local memory, workgroup size is forced to 64 on 4xxx. It might work in some cases though, but in most cases you need to write a separate 4xxx kernel, otherwise it produces wrong results. I did not know it crashes on windows though, that sounds bad...

Raistmer · ‎11-28-2011

Originally posted by: gat3way

Are you sure it works OK on linux? When using local memory, workgroup size is forced to 64 on 4xxx. It might work in some cases though, but in most cases you need to write a separate 4xxx kernel, otherwise it produces wrong results. I did not know it crashes on windows though, that sounds bad...

Actually, HD4xxx has no local memory exposed at all. It's emulated via global memory, so it is senseless to use it for performance reasons.
The single point in the program where it could be used even on HD4xxx is Apple's oclFFT library.

freighter · ‎12-13-2011

Originally posted by: freighter
Originally posted by: himanshu.gautam Hi Everybody,...

To summarize the issues, so that I am not missing anything:...

And not to forget the seti@home-issue with ATI GPUs with max. workgroup size of 128 (HD43xx - HD46xx), which will crash on windows (tested drivers Cat 10.12 till 11.11) but do work ok on linux (driver <= Cat 11.9). This is a different problem than the incorrectness issue.

Just an update: tested again today with Cat 11.12 and still see the host crashing with a total freeze (openSUSE 11.3 64-bit) or a bluescreen (Windows XP SP3), so no change on that issue yet.

nou · ‎12-14-2011

windows XP is not supported with SDK 2.6

freighter · ‎12-14-2011

Originally posted by: nou windows XP is not supported with SDK 2.6

The OpenCL runtime components are still included in driver Cat 11.12 for WinXP Pro 32-bit. So, I guess, running OpenCL apps on XP IS still supported, isn't it?

Raistmer · ‎11-24-2011

As the Linux topic was touched on: our Linux porting team has the same problems with Cat 11.10 and 11.11 that I have on Windows. The app produces invalid results under these 2 drivers. Earlier versions work OK. Maybe this will give AMD engineers more hints about what they broke in the latest Catalysts. So it's not a Windows-specific issue.
