75 Replies Latest reply on Apr 12, 2012 12:49 PM by MicahVillmow

    Catalyst 11.11 is broken too

    Raistmer
      Invalid results produced under this driver

      Unfortunately, not all users read message boards.
      Hosts with this driver have already appeared in the list of participants, producing incorrect results.

      When can the next Catalyst version, or at least the withdrawal of this one, be expected? Having it online and downloadable hurts the project I have worked on.
        • Catalyst 11.10preview2 is broken
          timattox

          Raistmer,

          I'm curious what project (some BOINC project perhaps?) your participants are generating incorrect results for.  You don't give a googleable name, and you don't give a URL.

          If you want AMD to help out, you need to be a bit more understanding that they are dealing with hundreds (thousands?) of different users/developers.  To expect them to just remember you and your unnamed project is a bit much.

          Without details on what is broken, and any hope of AMD reproducing the problem, how can they tell if your problem is real, or is simply PEBKAC.  (Google it...)

          I am also frustrated with the slow pace of support from AMD for OpenCL issues.  But being rude in the forums won't help your case.

          -- Tim

            • Catalyst 11.10preview2 is broken
              FrodoTheGiant

              PEBKAC: Poorly Educated Bored Know-all AMD Coder

              ... and AMD seems to have lots of those in their driver department.

              • Catalyst 11.10preview2 is broken
                Raistmer
                Originally posted by: timattox

                Without details on what is broken, and any hope of AMD reproducing the problem, how can they tell if your problem is real, or is simply PEBKAC.  (Google it...)



                -- Tim



                LoL, I googled it :)
                Well, what I did to get errors:
                1) uninstalled old driver.
                2) rebooted
                3) installed Cat11.9 release
                4) rebooted
                5) ran test - ALL FINE (not quite all, speed issues of course, but correct results at least).
                6) uninstalled Cat 11.9
                7) rebooted
                8) installed Cat11.10 preview2
                9) rebooted
                10) ran test - ERRORS !

                Maybe it's PEBKAC and I did something wrong still, but...
              • Catalyst 11.10preview2 is broken
                Raistmer
                1) The BOINC project is SETI@home
                2) There are thousands of participants already. A good deal of them are trying to use ATi GPUs (some even buy such GPUs specially, i.e. generate direct revenue for AMD). But with such nice drivers and only generic support (no patch was ever released to fix the numerous problems reported over time; we should be happy if a reported problem gets fixed in one of the regular releases, which take MONTHS (!))....
                3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging. On the other hand, I posted a link to the application itself. The kernels are supplied in a text CL file... Why not use them to debug the issue? Or should the application developer debug the drivers instead? Sorry, I have other paid work to do... And I get no profit from trying (as one result of my OpenCL programming) to increase AMD's revenues (the biggest BOINC project needs good hardware, and our participants pay good money for that hardware!).
                4) Citations from this forum are posted on AMD's main pages. If a recent thread reports a broken driver... hm, yes, I think it can speed up the bugfix in some way... No need to be rude for this, but keeping the thread at the top of the list is essential.
                  • Catalyst 11.10preview2 is broken
                    genaganna

                     

                    Originally posted by: Raistmer 3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging.


                    Raistmer,

                    Would it be possible to send the AutoCorrelation search code so that we can look into the issue?

                      • Catalyst 11.10preview2 is broken
                        Raistmer
                        Originally posted by: genaganna

                        Originally posted by: Raistmer 3) Well, I can say what is broken - AutoCorrelation search is broken and produces overflow in results (too many reported false positives). But I hardly think this app-specific info can help in driver debugging.





                        Raistmer,

                        Would it be possible to send the AutoCorrelation search code so that we can look into the issue?



                        Sure. Actually, you already have the corresponding kernels if you downloaded the posted benchmark.
                        Look for these 3 kernels in the CL file there:

                        RepackInput_kernel_cl, FindAutoCorrelation_reduce0_kernel1_cl,FindAutoCorrelation_reduce1_kernel_cl

                        If you need the corresponding host code, let me know.
                    • Catalyst 11.10preview2 is broken
                      Raistmer
                      (but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed)

                      EDIT: full function as is:

                      void PC_FindAutoCorrelation_cl(float* PowerBin){
                          // Because positive and negative delays are the same
                          // we only have to look at half the points
                          //Jason: Use 4N-FFT method for Type 2 Discrete Cosine Tranform for now, to match fftw's REDFT10
                          // 1 Autocorrelation from global powerspectrum at fft_num*ac_fft_len (fft_num*ul_NumDataPoints )
                          //R: all transforms on whole 2D matrix instead of single power specrum.
                      #if 0
                          const int ac_fftlen=128*1024;
                          float* b=new float[ac_fftlen*4*2];
                          if(b==NULL)fprintf(stderr,"Can't allocate buffer\n");
                      #endif
                          //Step 1: Preprocessing - repack relevant powerspectrum into a 4N array with 'real-even symmetry'
                          {
                              size_t localThreads[2]={64,1};//R: TODO tune
                              size_t globalThreads[2]={128*1024,8};//R: hardwired FFT len and number of FFTs for now
                      #if 0
                              err=clEnqueueReadBuffer(cq,gpu_PowerSpectrum,CL_TRUE,sizeof(float)*ac_fftlen*1,sizeof(float)*ac_fftlen,b,0, NULL,NULL);
                              if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
                              fprintf(stderr,"power spectrum:\n");
                              for(int i=0;i<ac_fftlen;i++)
                                  fprintf(stderr,"i=%d,x=%g\n",i,b[i]);
                      #endif
                              int i=0;
                              err = clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_PowerSpectrum);
                              err |= clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
                              //int size=128*1024;
                              //err |=clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
                              if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackInput_kernel_cl: %d\n",err);
                      #if OCL_VERBOSE
                              else fprintf(stderr,"INFO: Setting kernel argument:RepackInput_kernel_cl ok\n");
                      #endif
                              err = clEnqueueNDRangeKernel(cq,RepackInput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
                              if(err) fprintf(stderr,"ERROR: RepackInput_kernel_cl: %d\n",err);
                      #if OCL_VERBOSE
                              else fprintf(stderr,"INFO: RepackInput_kernel_cl ok\n");
                      #endif
                          }
                      #if 0
                          err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL);
                          if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
                          fprintf(stderr,"after inflation:\n");
                          for(int i=0;i<4*ac_fftlen;i++)
                              fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
                      #endif
                          //Step 2: Process the 4N-FFT (Complex to Complex, size is 4 * ac_fft_len)
                          err |= clFFT_ExecuteInterleaved_mb(cq, autocorr_plan,8, clFFT_Forward, gpu_WorkData,gpu_WorkData, FFTbuf, 0, NULL, NULL);
                          if(err) fprintf(stderr,"ERROR: autocorr fft: %d\n",err);
                      #if OCL_VERBOSE
                          else fprintf(stderr,"INFO: autocorr fft ok\n");
                      #endif
                      #if 0
                          err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL);
                          if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
                          fprintf(stderr,"after fft:\n");
                          for(int i=0;i<4*ac_fftlen;i++)
                              fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
                      #endif
                      #if 0 //R: this step removed now
                          //Step 3: Postprocess the FFT result (Scale, take powers & normalize), discarding unused data packing into AutoCorr_in first half for VRAM reuse
                          {
                              size_t localThreads[2]={64,1};//R: TODO tune
                              size_t globalThreads[2]={32*1024,8};//R: hardwired FFT len and number of FFTs for now
                              int i=0;
                              err = clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
                              err |= clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
                              //int size=64*1024;
                              //err |=clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
                              if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackOutput_kernel_cl: %d\n",err);
                              err = clEnqueueNDRangeKernel(cq,RepackOutput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
                              if(err) fprintf(stderr,"ERROR: RepackOutput_kernel_cl: %d\n",err);
                          }
                      #endif
                      #if 0
                          err=clEnqueueReadBuffer(cq,FFTbuf,CL_TRUE,1*64*1024*sizeof(float),sizeof(float)*ac_fftlen/2,b,0, NULL,NULL);
                          if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
                          fprintf(stderr,"after packing:\n");
                          for(int j=0;j<ac_fftlen/2;j++)
                              fprintf(stderr,"AutoCorrelation[%d]=%.4g\n",j,b[j]);
                          //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
                          exit(0);
                      #endif
                      #if 0 //USE_OPENCL_NV || USE_OPENCL_HD5xxx
                          {size_t localThreads[2]={256,1};
                              size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
                              int i=0;
                              err = clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
                              err |= clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
                              if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_kernel1_cl: %d\n",err);
                              err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_kernel1_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
                              if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
                          }
                      #else
                          // const int len = 64*1024;//autocorr_fft_len/2;
                          // const int blksize = UNSTDMAX(4, UNSTDMIN(pow2a((unsigned int) sqrt((float) (len / 32)) * 32), 512));
                          {size_t localThreads[2]={64,1};
                              size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
                              int i=0;
                              err = clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
                              err |= clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
                              if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
                      #if OCL_VERBOSE
                              else fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl ok\n");
                      #endif
                              err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce0_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
                              if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
                      #if OCL_VERBOSE
                              else fprintf(stderr,"INFO: FindAutoCorrelation_reduce0_kernel_cl ok\n");
                      #endif
                          }
                      #if 0
                          err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,0*256*sizeof(float),sizeof(float)*256*8*3,b,0, NULL,NULL);
                          if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
                          fprintf(stderr,"partially reduced:\n");
                          for(int j=0;j<256*8;j++)
                              fprintf(stderr,"j=%d,partial_sum=%.4g,partial_max=%.4g, idx=%.4g\n",j,b[3*j],b[3*j+1],b[3*j+2]);
                          //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
                          exit(0);
                      #endif
                          {
                              int i=0;
                              size_t localThreads[1]={1};
                              size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
                              err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
                              err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
                              if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
                      #if OCL_VERBOSE
                              else fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
                      #endif
                              err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl, 1, NULL,globalThreads, localThreads, 0,NULL,NULL);
                              if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
                      #if OCL_VERBOSE
                              else fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
                      #endif
                          }
                      #endif
                      #if OCL_WDM
                          err=clFlush(cq);
                          if(err)fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
                      #if __Win32
                          Sleep(OCL_WDM_SLEEP);
                      #elif _GNU_SOURCE
                          usleep(OCL_WDM_SLEEP);
                      #endif
                      #endif
                          err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0, NULL,NULL);
                          if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
                      #if OCL_VERBOSE
                          else fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
                      #endif
                      #if 0
                          for(int i=0;i<8;i++)fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
                          exit(0);
                      #endif
                      }
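                      For readers following Step 1 of the function above, here is a minimal CPU-side sketch of what "repack into a 4N array with real-even symmetry" can look like. It is the textbook construction, not the project's code: RepackInput_kernel_cl itself is not shown in this thread and may lay the data out differently, and the names repack_real_even_4n and is_real_even are made up for illustration. The idea is that x[j] is placed at odd index 2j+1 and mirrored at 4N-(2j+1), with everything else zero; bin k of a 4N-point complex FFT of that sequence is then 2*sum_j x[j]*cos(pi*(j+1/2)*k/N), which is the definition FFTW documents for REDFT10.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): repack n real samples into a
   4n-point interleaved complex buffer (re, im pairs; 2*4n floats in total)
   so that the sequence is real and even: y[m] == y[4n - m]. */
static void repack_real_even_4n(const float *x, size_t n, float *y)
{
    size_t len = 4 * n;
    for (size_t m = 0; m < 2 * len; ++m)
        y[m] = 0.0f;                    /* even slots and imaginary parts stay 0 */
    for (size_t j = 0; j < n; ++j) {
        size_t odd = 2 * j + 1;
        y[2 * odd] = x[j];              /* real part at index 2j+1 */
        y[2 * (len - odd)] = x[j];      /* mirrored copy at 4n-(2j+1) */
    }
}

/* Hypothetical check: returns 1 if the buffer is real and even, else 0. */
static int is_real_even(const float *y, size_t n)
{
    size_t len = 4 * n;
    for (size_t m = 0; m < len; ++m) {
        size_t mirror = (len - m) % len;
        if (y[2 * m] != y[2 * mirror]) return 0;  /* real part must be even */
        if (y[2 * m + 1] != 0.0f) return 0;       /* imaginary part must be 0 */
    }
    return 1;
}
```

                      A check like is_real_even, run on a read-back of the repacked buffer, is one possible way to tell whether wrong results come from the repack step or from a later stage such as the FFT.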

                        • Catalyst 11.10preview2 is broken
                          genaganna

                           

                          Originally posted by: Raistmer (but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed) EDIT: full function as is:


                          Thank you for giving some code.  Could you please copy oclFFT kernels also or tell us with what options oclFFT kernels were generated?

                            • Catalyst 11.10preview2 is broken
                              Raistmer
                              Originally posted by: genaganna

                              Originally posted by: Raistmer (but it uses oclFFT on size of 512k. Quite possible that one of FFT kernels failed) EDIT: full function as is:





                              Thank you for giving some code.  Could you please copy oclFFT kernels also or tell us with what options oclFFT kernels were generated?


                              Sure.
                              Here is the slightly modded oclFFT call I used:
                              (batch size should be 8 and FFT length 4*128k=512k)

                              cl_int clFFT_ExecuteInterleaved_mb( cl_command_queue queue, clFFT_Plan Plan, cl_int batchSize, clFFT_Direction dir,
                                                                  cl_mem data_in, cl_mem data_out, cl_mem temp_buf,
                                                                  cl_int num_events, cl_event *event_list, cl_event *event )
                              {
                                  int s;
                                  cl_fft_plan *plan = (cl_fft_plan *) Plan;
                                  if(plan->format != clFFT_InterleavedComplexFormat)
                                      return CL_INVALID_VALUE;
                                  cl_int err=CL_SUCCESS;
                                  size_t gWorkItems, lWorkItems;
                                  int inPlaceDone;
                                  cl_int isInPlace = data_in == data_out ? 1 : 0;
                                  cl_mem memObj[3];
                                  memObj[0] = data_in;
                                  memObj[1] = data_out;
                                  memObj[2] = temp_buf;
                                  cl_fft_kernel_info *kernelInfo = plan->kernel_info;
                                  int numKernels = plan->num_kernels;
                                  //fprintf(stderr,"ExecuteInterleaved_mb: plan->num_kernels =%d\n",(int)plan->num_kernels);
                                  int numKernelsOdd = numKernels & 1;
                                  int currRead = 0;
                                  int currWrite = 1;
                              #if 0
                                  cl_event fft_event=NULL;
                                  unsigned int pass=0;
                              #endif
                                  // at least one external dram shuffle (transpose) required
                                  if(plan->temp_buffer_needed)
                                  {
                                      // in-place transform
                                      if(isInPlace)
                                      {
                                          inPlaceDone = 0;
                                          currRead = 1;
                                          currWrite = 2;
                                      }
                                      else
                                      {
                                          currWrite = (numKernels & 1) ? 1 : 2;
                                      }
                                      while(kernelInfo){
                                          if( isInPlace && numKernelsOdd && !inPlaceDone && kernelInfo->in_place_possible)
                                          {
                                              currWrite = currRead;
                                              inPlaceDone = 1;
                                          }
                              #if 0
                                          if(fft_event){
                                              cl_ulong start,end;
                                              err=clWaitForEvents(1,&fft_event);
                                              err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL);
                                              err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
                                              //Counters<T_oclFFT1_ns,cl_ulong>::update(end-start);
                                              fprintf(stderr,"Pass %u: kernel took: %.2e ns, s=%d\n",pass,float(end-start),batchSize);
                                              err|=clReleaseEvent(fft_event);fft_event=NULL;
                                              if(err != CL_SUCCESS) fprintf(stderr,"ERROR: mb oclFFT_1 event: %d\n",err);
                                              else fprintf(stderr,"INFO: mb oclFFT_1 event done ok.\n");
                                          }
                              #endif
                                          s = batchSize;
                                          getKernelWorkDimensions(plan, kernelInfo, &s, &gWorkItems, &lWorkItems);
                                          //fprintf(stderr,"After getKernelWorkDimensions:\nbatchSize s =%d, gWorkItems =%d, lWorkItems =%d, dir =%d\n", s, gWorkItems, lWorkItems, dir);
                                          err |= clSetKernelArg(kernelInfo->kernel, 0, sizeof(cl_mem), &memObj[currRead]);
                                          err |= clSetKernelArg(kernelInfo->kernel, 1, sizeof(cl_mem), &memObj[currWrite]);
                                          err |= clSetKernelArg(kernelInfo->kernel, 2, sizeof(cl_int), &dir);
                                          err |= clSetKernelArg(kernelInfo->kernel, 3, sizeof(cl_int), &s);
                                          err |= clEnqueueNDRangeKernel(queue, kernelInfo->kernel, 1, NULL, &gWorkItems, &lWorkItems, 0, NULL,NULL/*&fft_event*/);
                                          if(err){
                                              fprintf(stderr,"Error in mb oclFFT_1: %d\n",err);
                                              return err;
                                          }
                              #if OCL_VERBOSE
                                          else fprintf(stderr,"INFO: in mb oclFFT_1 ok\n");
                              #endif
                                          currRead = (currWrite == 1) ? 1 : 2;
                                          currWrite = (currWrite == 1) ? 2 : 1;
                                          kernelInfo = kernelInfo->next;
                              #if 0
                                          pass++;
                              #endif
                                      }
                                  }
                                  // no dram shuffle (transpose required) transform
                                  // all kernels can execute in-place.
                                  else {
                                      while(kernelInfo){
                              #if 0
                                          if(fft_event){
                                              cl_ulong start,end;
                                              err=clWaitForEvents(1,&fft_event);
                                              err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL);
                                              err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
                                              //Counters<T_oclFFT2_ns,cl_ulong>::update(end-start);
                                              fprintf(stderr,"Pass %u: kernel took: %.2e ns, s=%d\n",pass,float(end-start),batchSize);
                                              err|=clReleaseEvent(fft_event);fft_event=NULL;
                                              if(err != CL_SUCCESS) fprintf(stderr,"ERROR: oclFFT_2 event.\n");
                                              else fprintf(stderr,"INFO: mb oclFFT_2 event done ok\n");
                                          }
                              #endif
                                          s = batchSize;
                                          getKernelWorkDimensions(plan, kernelInfo, &s, &gWorkItems, &lWorkItems);
                                          //fprintf(stderr,"After getKernelWorkDimensions:\nbatchSize s =%d, gWorkItems =%d, lWorkItems =%d, dir =%d\n", s, gWorkItems, lWorkItems, dir);
                                          err |= clSetKernelArg(kernelInfo->kernel, 0, sizeof(cl_mem), &memObj[currRead]);
                                          err |= clSetKernelArg(kernelInfo->kernel, 1, sizeof(cl_mem), &memObj[currWrite]);
                                          err |= clSetKernelArg(kernelInfo->kernel, 2, sizeof(cl_int), &dir);
                                          err |= clSetKernelArg(kernelInfo->kernel, 3, sizeof(cl_int), &s);
                                          err |= clEnqueueNDRangeKernel(queue, kernelInfo->kernel, 1, NULL, &gWorkItems, &lWorkItems, 0, NULL,NULL/*&fft_event*/);
                                          if(err){
                                              fprintf(stderr,"Error in mb oclFFT_2: %d\n",err);
                                              return err;
                                          }
                              #if OCL_VERBOSE
                                          else fprintf(stderr,"INFO: in mb oclFFT_2 ok\n");
                              #endif
                                          currRead = 1;
                                          currWrite = 1;
                                          //fprintf(stderr,"INFO: before mb oclFFT_2 kernel_info->next\n");
                                          kernelInfo = kernelInfo->next;
                                          //fprintf(stderr,"INFO: after mb oclFFT_2 kernel_info->next\n");
                              #if 0
                                          pass++;
                              #endif
                                      }
                                  }
                              #if 0
                                  if(fft_event){
                                      cl_ulong start,end;
                                      err=clWaitForEvents(1,&fft_event);
                                      err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&start,NULL);
                                      err|=clGetEventProfilingInfo (fft_event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&end,NULL);
                                      //Counters<T_oclFFT1_ns,cl_ulong>::update(end-start);
                                      fprintf(stderr,"Final pass: kernel took: %.2e ns, s=%d\n",float(end-start),batchSize);
                                      err|=clReleaseEvent(fft_event);fft_event=NULL;
                                      if(err != CL_SUCCESS)fprintf(stderr,"ERROR: Final mb oclFFT event: %d\n",err);
                                      else fprintf(stderr,"INFO: Final mb oclFFT event done ok.\n",err);
                                  }
                              #endif
                                  // if(batchSize>1) Sleep(2);
                                  // else Sleep(1);
                                  //fprintf(stderr,"ExecuteInterleaved_mb: about to return with err=%d\n",(int)err);
                                  return err;
                              }
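                              The buffer bookkeeping in the temp_buffer_needed branch above is easy to misread, so here is a small sketch that reproduces only the currRead/currWrite index logic on the CPU, with no OpenCL calls. The function name trace_pingpong and its array parameters are made up for illustration; the index meaning (0 = data_in, 1 = data_out, 2 = temp_buf) follows memObj[] in the posted code.

```c
#include <stddef.h>

/* Hypothetical tracer (not part of oclFFT): records which memObj[] index
   each FFT pass reads from and writes to when a temporary buffer is needed.
   in_place_possible[k] is nonzero if pass k may run in place.
   reads[] and writes[] must each hold num_kernels entries. */
static void trace_pingpong(int num_kernels, int is_in_place,
                           const int *in_place_possible,
                           int *reads, int *writes)
{
    int curr_read = 0;
    int curr_write = 1;
    int in_place_done = 0;
    int odd = num_kernels & 1;

    if (is_in_place) {
        curr_read = 1;
        curr_write = 2;
    } else {
        curr_write = odd ? 1 : 2;
    }

    for (int k = 0; k < num_kernels; ++k) {
        /* With an odd pass count, one pass runs in place so that the
           last pass still writes to data_out. */
        if (is_in_place && odd && !in_place_done && in_place_possible[k]) {
            curr_write = curr_read;
            in_place_done = 1;
        }
        reads[k] = curr_read;
        writes[k] = curr_write;
        curr_read = (curr_write == 1) ? 1 : 2;
        curr_write = (curr_write == 1) ? 2 : 1;
    }
}
```

                              In the cases traced by hand (two passes out of place, three passes in place with the first pass allowed to run in place), the last pass writes to index 1, i.e. data_out. Comparing such a trace against intermediate read-backs might help narrow down which FFT pass first produces wrong data.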

                          • Catalyst 11.10preview2 is broken
                            Raistmer
                            any news on topic?
                              • Catalyst 11.10preview2 and release are broken
                                freighter

                                 

                                Originally posted by: Raistmer any news on topic?


                                 

                                Only some more bad news, this time with the AMD Catalyst 11.10 release x86_64 Linux driver.

                                Everything that Raistmer reported earlier in this thread for the Windows drivers also affects the Linux version. Windows and Linux share the same codebase.

                                The autocorrelation kernels produced the same type of incorrect results on openSUSE 11.4 (64-bit) and Kubuntu 10.10 (64-bit).

                                Another problem was also introduced with Catalyst 11.10 when using the Linux OpenCL version of the application for HD4xxx GPUs (max workgroup size 128). A severe host system freeze (bluescreen in ati2dvag.dll) when running this application version already affected all Windows versions (tested since Cat 10.12) and now newly appears on 64-bit Linux too (it tested OK up to Cat 11.9). As Linux does not throw "bluescreens", the computer becomes completely unresponsive while still showing the frozen desktop, and needs to be rebooted. This happens reproducibly at the same place in the code (see below: POINT OF FAILURE). This was tested on openSUSE 11.3, 64-bit, with an HD4670, 1GB, OpenCL 1.0 device.

                                [code]

                                ....

                                    localThreads[0]=32; //uje: needed for lower HD4xxx GPUs
                                #else
                                    localThreads[0]=host.GetWGSize();
                                #endif
                                    //fprintf(stderr,"localThreads[0]=%d\n",localThreads[0]);
                                    err = clEnqueueNDRangeKernel(cq,PC_find_spike_kernel_cl,
                                             1,//R: 1D execution domain used, each work item works with 4 data elements
                                             NULL,globalThreads,
                                             localThreads,
                                             0,NULL,NULL);
                                    if(err) fprintf(stderr,"ERROR: Enqueueing kernel:PC_find_spike_kernel_cl:%d\n",err);
                                #if OCL_VERBOSE
                                    else fprintf(stderr,"INFO: Enqueueing kernel: PC_find_spike_kernel_cl done ok\n");
                                #endif
                                    //clFinish(cq);
                                    //R: checking if any CPU reprocessing/logging needed and if yes, retrieve results
                                    cl_uint cpu_result_flag[RESULT_SIZE];
                                    fprintf(stderr,"INFO: Before clEnqueueReadBuffer\n");

                                #if OCL_WDM
                                    err=clFlush(cq);
                                    if(err)fprintf(stderr,"ERROR: submitting kernels for non-strip Spike search: %d\n",err);
                                #if __Win32
                                    Sleep(OCL_WDM_SLEEP);
                                #elif _GNU_SOURCE
                                    usleep(OCL_WDM_SLEEP);
                                #endif
                                #endif
                                    err=clEnqueueReadBuffer(cq,gpu_result_flag,CL_TRUE,0,
                                             sizeof(cl_uint)*((globalThreads[0]<RESULT_SIZE)?globalThreads[0]:RESULT_SIZE),
                                             &cpu_result_flag,0, NULL,NULL);    //<--- POINT OF FAILURE
                                    if(err) fprintf(stderr,"ERROR: ReadBuffer(gpu_result_flag,spike):%d\n",err);
                                #if OCL_VERBOSE
                                    else fprintf(stderr,"INFO: ReadBuffer(gpu_result_flag,spike) done ok\n");
                                #endif
                                    fprintf(stderr,"spike search results (main path): ");

                                ....

                                [/code]

                              • Catalyst 11.10preview2 is broken
                                Raistmer
                                Tried Catalyst 11.10 release.
                                Same problem - with this Catalyst version the app produces invalid results.

                                AMD, any chance of a fix for this? Did you find the problem? A few weeks have passed...
                                  • Catalyst 11.10preview2 is broken
                                    freighter

                                    Retested on openSUSE 11.3, 64-bit, with Cat 11.11, and my host still crashes in the same way as described earlier.

                                     

                                    How about a patch for this issue ?

                                      • Catalyst 11.10preview2 is broken
                                        gat3way

                                        If you have previously installed SDK 2.5, please make sure you remove everything its installer placed in /etc/ld.so.conf.d

                                        Otherwise you are using the libOpenCL.so from your SDK 2.5 directory, not the one provided with Catalyst 11.11. I haven't tested it yet (I am redoing the offline compilation of all my kernels). The differences are two new targets (Scrapper and Devastator), faster compilation time, and mad24 being broken for some reason (I didn't investigate that, though; I just replaced it with a slower equivalent).

                                    • Catalyst 11.10preview2 is broken
                                      MicahVillmow
                                      Raistmer,
                                      The 11.10 release is built from internal code from the August/September timeframe, so it wouldn't have this fix. I'm trying to find out the status of this specific issue and will get back to you when I can find out more.
                                      • Catalyst 11.10preview2 is broken
                                        MicahVillmow
                                        mad24 should map directly to the hardware instruction. Do you have a test case that shows the failure?
                                          • Catalyst 11.10preview2 is broken
                                            gat3way

                                            Yep. This is a bad thing to do I know, but it worked before:

                                             

                                            #define getglobalid(a) (mad24(get_group_id(0), 64, get_local_id(0)))

                                            ...

                                            found_ind[getglobalid(0)] = 1;

                                             

                                            Error is:

                                            /tmp/OCLYsVbCt.cl(786): error: more than one instance of overloaded function
                                                      "mad24" matches the argument list:
                                                        function "mad24(int, int, int) C++"
                                                        function "mad24(uint, uint, uint) C++"
                                                        argument types are: (uint, int, uint)
                                              found_ind[getglobalid(0)] = 1;



                                          • Catalyst 11.10preview2 is broken
                                            MicahVillmow
                                            gat3way,
                                            This is because our compiler is now more compliant with the OpenCL spec; you need to use 64U instead of 64. This change was made at the behest of our customers so that we had the same error behavior as other vendors that were correct.
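                                            To make the point about argument types concrete, here is a plain-C sketch (not OpenCL, so it can be compiled anywhere) of what the macro computes once all three arguments are unsigned. mad24_u and global_id_from_group are hypothetical stand-ins invented for this illustration; mad24_u only mimics mad24(uint, uint, uint) for inputs that fit in 24 bits, since the OpenCL spec leaves out-of-range inputs implementation-defined. The 64 is the work-group size assumed by gat3way's macro.

```c
#include <stdint.h>

/* In the kernel source, the change suggested in the reply above would be
   (shown for reference only, not compiled here):
     #define getglobalid(a) (mad24(get_group_id(0), 64U, get_local_id(0)))
   With 64U every argument is unsigned, so only one overload can match. */

/* Hypothetical plain-C stand-in for mad24(uint, uint, uint):
   multiply the low 24 bits of a and b, then add c. */
static uint32_t mad24_u(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu) + c;
}

/* Global id rebuilt from group id and local id, assuming a
   work-group size of 64 as in the macro above. */
static uint32_t global_id_from_group(uint32_t group_id, uint32_t local_id)
{
    return mad24_u(group_id, 64u, local_id);
}
```

                                            For example, group 3 with local id 10 maps to 3*64 + 10 = 202, which is what get_global_id(0) would be expected to return for a 1D range with local size 64 and no global offset.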
                                            • Catalyst 11.10preview2 is broken
                                              Raistmer
                                              Well, let's return to the original problem.
                                              I just tested the app's behavior under Win7 x64 + Catalyst 11.11 - it's the same; the app returns incorrect results with this driver too.
                                              When can we expect a fix for that, or some workaround suggestions?
                                                • Catalyst 11.10preview2 is broken
                                                  FrodoTheGiant

                                                  There is a fix. It is called NVIDIA.

                                                  My company decided last week to dump AMD and switch to NVIDIA.

                                                  AMD is simply too unreliable when it comes to serious GPGPU computing.

                                                    • Catalyst 11.10preview2 is broken
                                                      CaptainN

                                                      2 FrodoTheGiant:

                                                      Can you stop the hysterics?

                                                      • Catalyst 11.11 is broken too
                                                        Raistmer
                                                        Originally posted by: FrodoTheGiant

                                                        There is a fix. It is called NVIDIA.


                                                        My company decided last week to dump AMD and switch to NVIDIA.


                                                        AMD is simply too unreliable when it comes to serious GPGPU computing.


                                                        Unfortunately, that's not a solution in my case:
                                                        1) The latest NV drivers (270.xx and up) show the same CPU time increase as AMD's. And it looks like 285.xx produces incorrect results on Quadro GPUs, though it produces VALID results on all other tested GPUs.
                                                        2) I write a program that should be used mostly on GPUs that have already been bought, to enable them to participate in the world's largest distributed computing project, SETI@home.
                                                        Regarding new GPUs - it's well known on our boards that if one wants to make good progress, one should buy an nVidia GPU for SETI, not an AMD GPU.
                                                        IMO AMD has lost money on the scale of 10^5 $, maybe more (maybe less, I'm not interested in economic calculations), by ignoring SETI.
                                                        nVidia was kind enough to provide programming support, so we have a "stock" CUDA app at least, and work to improve it continues.
                                                        From AMD's side I see mostly driver degradation (now it can't even perform correct computations, not to mention the increased CPU time).
                                                        It's very sad, because AMD (ATI) hardware could win in the SETI area: we already have a very fast application for one type of computation, but the requirement to install quite old drivers repels most gamers, who want drivers that are as new as possible...
                                                        AMD's software (in particular, driver) division is playing a very bad joke on the hardware division, negating all its achievements!
                                                        Not to mention the disgusting software support for the most advanced AMD hardware.
                                                        I have a tester with two HD6990s (2-core GPUs). He can barely use them. So how can we build high-performance computational clusters based on AMD GPUs?...
                                                        New hardware purchases will definitely go mostly to nVidia.
                                                        But there are many AMD GPUs bought already... The SETI community donated an HD6950 GPU to me to improve the application for ATI GPUs... and I still can't use its full power, being restricted to the 11.2 drivers, which will freeze the whole system if I accidentally open (for example) a YouTube link on that host...

                                                        AMD, it's time to think about your drivers, really. You lose money, you lose reputation (hence you lose money again). Battlefield 3 compatibility is not all that users want....
                                                          • Catalyst 11.11 is broken too
                                                            NURBS

                                                            $10^5 is spare change to AMD. If the rumors are true, the loss of their contract with Apple will be significant. If you have worked for a big corporation, you know that AMD engineers are, and should be, spending more time on Indeed than fixing anything. Why? Fixing bugs won't save you from a layoff. What OpenCL needs in order to succeed are killer applications used by consumers.

                                                            I feel bad for the engineers being axed, but not for the corporations making dumb decisions.

                                                              • Catalyst 11.11 is broken too
                                                                Raistmer
                                                                Originally posted by: NURBS

                                                                $10^5 is spared change to AMD.


                                                                Maybe I'm off by a few orders of magnitude ;)

                                                                According to boincstats, SETI@home now has 3,034,351 hosts.

                                                                If only 0.1% of them are equipped with AMD GPUs, that is 3*10^3 hosts.
                                                                If each such GPU costs $100, that gives $3*10^5. But if 1% were equipped with AMD GPUs, we would easily have $3*10^6.
                                                                Taking into account that our top participants have multi-GPU hosts (the SETI project is currently down for maintenance, but in a few hours everyone can see the top hosts for himself), it really can be something like 10^6 or more.

                                                                But I don't think this issue affects only this single app. So the driver should be fixed anyway.
                                                                  • Catalyst 11.11 is broken too
                                                                    Meteorhead

                                                                    Frodo has already stated that his company decided to shift from AMD to NV. If he isn't working for MS, Apple or some other big company, AMD really doesn't have to care (taking into account that 10^5 $ is small money). It is one sad, true story like many others.

                                                                    I am lucky enough to work at a research center and not at a company which is very result-oriented. I have the luxury of endlessly experimenting with alternative methods rather than using factory standards.

                                                                    Long ago I convinced people to invest money in a multi-GPU development node, used for exploring multi-GPU applications on the AMD side. 3X HD5970 is a lot of money for a small Hungarian research project. I have faith in the AMD engineers, who create kick@ss HW, but as Raistmer has said, it is all negated by the poor drivers. I really hate it when people write in full capitals, or when people are vulgar, but as a good friend of mine said, vulgarity has its place:

                                                                    Two damned GPU generations have passed (with HD7xxx just around the corner), and our 3X HD5970s ARE STILL NOT SUPPORTED!! Really... is AMD really serious about that?? Even if I have faith in AMD, how should I convince others to invest money in it? Let me try to guess what comes into Frodo's mind: "NOHOW! Do everyone a favor and keep them away from AMD!"

                                                                    Unfortunately for AMD, I believe I can achieve more if I keep on pushing AMD, and keep showing either its impotence or its merits. It is all up to "you".

                                                                    And for the higher ups. Please-please-please Micah, Genaganna or anyone else who reads this. Do convey these comments as quotations. Not summaries, or some light version, but word by word:

                                                                    Dear AMD,

                                                                    you're going to your doom if you keep on like this. 2 ****ing years for a GPU to be supported?? By the time I can leverage my HW, it is surpassed by the competition's mid-range. INVEST MORE INTO SW DEVELOPMENT!! A LOT MORE!!!!!!

                                                                    Regards,

                                                                    a fan.

                                                                    Edit: and don't get me wrong. This is not advice... this is an ultimatum from the users (developers) as a collective.

                                                                    Edit2: Not to mention I opened a topic concerning a rather fatal bug of Kernel Analyzer over 3 weeks ago, and I still got no response from anybody, although I'm pretty sure that at least one person knows where KA stores data that has to be cleaned after reinstall.

                                                                • Catalyst 11.11 is broken too
                                                                  FrodoTheGiant

                                                                   

                                                                  Originally posted by: Raistmer

                                                                  Needless to say about disgusting software support of most advanced AMD hardware. I have tester with 2 HD6990, 2-core GPUs. He barely can use them.



                                                                  It is really a slap in the face for those users who spend big money on dual-GPU cards like the 5970 or 6990 and then can't use them because AMD doesn't support them, drops support or breaks drivers.

                                                                  How would anyone trust AMD again?

                                                                  Why would anyone ever buy a high-end AMD GPU again, given the scorched-earth history they have with existing users?

                                                                   

                                                                    • Catalyst 11.11 is broken too
                                                                      Meteorhead

                                                                      Since the topic has been idle for a while now (I don't know if my previous post was really that effective, or people simply lost interest in dissing QoS), I thought I would post how one can waste a complete day installing Linux.

                                                                      ... ... ... I wanted to write a little short story, to put things nicely and not just bitch about things that don't work, but as I was thinking, I figured there's really no point, so I'll just make it short: 11.11 fails even to launch the Ubuntu desktop 80% of the time. The desktop simply freezes a few seconds after the background is shown and the mouse can be moved for a while. Naturally this prevents remote rebooting, as most of the time the computer will not be able to boot.

                                                                      For other reasons, I got fed up with Ubuntu, so I installed SLC5 (Scientific Linux CERN, a Red Hat based distro). After installing a few things and restarting the computer in between, once udev has been loaded at boot time and the GUI boot window should open, the screen changes resolution (very noticeable on the CRT monitor in the computer room), and then it just stays blank and boot halts, just like on Ubuntu (10.04.3, both 64-bit). I have done this twice already (I thought it was the fault of trying to get auto-login working, but even after undoing the changes with a live Ubuntu stick, it still fails to boot).

                                                                      If I pull the computer out of the rack, I've got myself a $3000 doorstop. Or I could heat my office with it...

                                                                      What is the conclusion? Life would be much better if I could install a headless node, with no GUI whatsoever, and be able to use the GPUs without having to use this crap XServer which crashes at every second corner. Yet again, something the community has been asking for for over a year now.

                                                                      Peace.

                                                                        • Catalyst 11.11 is broken too
                                                                          FrodoTheGiant

                                                                           

                                                                          Originally posted by: Meteorhead Since the topic has been idle for a while now (don't know if my previous post was really that effective, or people just simply lost interest in dising QoS)..."


                                                                          I totally agree with you, but did not post it so that your comment remained the last in this thread. The most visible.

                                                                          So I was hoping someone from AMD might read it - and respond.

                                                                          But ... nothing. I am done with AMD.

                                                                            • Catalyst 11.11 is broken too
                                                                              himanshu.gautam

                                                                              Hi Everybody,

                                                                              It is really sad that fixing these reported issues is taking so much time. I will try to find out their status and update here. I would like to assure you that these issues are being looked into, based on their priorities.

                                                                              To summarize the issues, so that I am not missing anything:

                                                                              Raistmer reports an issue with the correctness of the correlation function in SETI@home. This happens on both Windows and Linux.

                                                                              Meteorhead is trying to run multi-GPU (x2 cards) configurations, which seem to be working. But the driver is causing boot issues on both Ubuntu and SLC5.

                                                                                • Catalyst 11.11 is broken too
                                                                                  FrodoTheGiant

                                                                                  @himanshu.gautam: Please also add the "100% CPU load bug" to your list.

                                                                                  This has been broken for the last 5 Catalyst releases, including the newest one.

                                                                                  On (Windows?) systems with more than one graphics card installed, you always have a CPU load of 100% when running OpenCL applications. It should be ~2-3% (and was in older Catalyst versions).

                                                                                  • Catalyst 11.11 is broken too
                                                                                    freighter

                                                                                     

                                                                                    Originally posted by: himanshu.gautam Hi Everybody,...

                                                                                    To summarize the issues, so that I am not missing anything:...



                                                                                    And not to forget the seti@home-issue with ATI GPUs with max. workgroup size of 128 (HD43xx - HD46xx), which will crash on windows (tested drivers Cat 10.12 till 11.11) but do work ok on linux (driver <= Cat 11.9). This is a different problem than the incorrectness issue.

                                                                                      • Catalyst 11.11 is broken too
                                                                                        gat3way

                                                                                        Are you sure it works OK on linux? When using local memory, workgroup size is forced to 64 on 4xxx. It might work in some cases though, but in most cases you need to write a separate 4xxx kernel, otherwise it produces wrong results. I did not know it crashes on windows though, that sounds bad...

                                                                                          • Catalyst 11.11 is broken too
                                                                                            Raistmer
                                                                                            Originally posted by: gat3way

                                                                                            Are you sure it works OK on linux? When using local memory, workgroup size is forced to 64 on 4xxx. It might work in some cases though, but in most cases you need to write a separate 4xxx kernel, otherwise it produces wrong results. I did not know it crashes on windows though, that sounds bad...



                                                                                            Actually HD4xxx has no local memory exposed at all. It's emulated via global memory, so it makes no sense to use it for performance reasons.
                                                                                            The only place in the program where it could be used even on HD4xxx is Apple's oclFFT library.
                                                                                          • Catalyst 11.11 is broken too
                                                                                            freighter

                                                                                             

                                                                                            Originally posted by: freighter
                                                                                            Originally posted by: himanshu.gautam Hi Everybody,...

                                                                                             

                                                                                            To summarize the issues, so that I am not missing anything:...



                                                                                             

                                                                                            And not to forget the seti@home-issue with ATI GPUs with max. workgroup size of 128 (HD43xx - HD46xx), which will crash on windows (tested drivers Cat 10.12 till 11.11) but do work ok on linux (driver <= Cat 11.9). This is a different problem than the incorrectness issue.

                                                                                             

                                                                                            Just an update: tested again today with Cat 11.12 and still see the host crashing with a total freeze (openSUSE 11.3 64-bit) or a bluescreen (Windows XP SP3), so no change on that issue yet.

                                                                            • Catalyst 11.11 is broken too
                                                                              Raistmer
                                                                              Since the Linux topic was touched on: our Linux porting team has the same problems with Cat 11.10 and 11.11 that I have on Windows. The app produces invalid results under these 2 drivers. Earlier versions work OK. Maybe this will give AMD engineers more hints about what they broke in the latest Catalysts. So it's not a Windows-specific issue.
                                                                                • Catalyst 11.11 is broken too
                                                                                  Meteorhead

                                                                                  I also thought of reverting to some older Catalyst, but the problem was that on Ubuntu it regularly crashed the machine when I tried to run multi-GPU applications. 11.11 is the first that was stable in that sense, but unfortunately it fails to boot a significant fraction of the time.

                                                                                • Catalyst 11.11 is broken too
                                                                                  Raistmer
                                                                                  About those crashes: so far (~10 identical runs completed) there are 2 locations in the program's computational loop where the crash occurs.
                                                                                  But the number of loop iterations before the crash varies. Also, in 2 cases incorrect data was detected (so the memory buffer was damaged before the crash).
                                                                                  If AMD wants an executable for testing, I can provide it.
                                                                                  It's a separate problem (a different modification of the application) from the problem under discussion in this topic.
                                                                                    • Catalyst 11.11 is broken too
                                                                                      Raistmer
                                                                                      Hope this speeds up AMD's solution:

                                                                                      I narrowed down the location of the error to a single kernel invocation:

                                                                                      Results of kernel call under Cat 11.2:

                                                                                      INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok
                                                                                      INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok
                                                                                      INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok
                                                                                      fully reduced:
                                                                                      i=0, power=7.331,bin=9861
                                                                                      i=1, power=8.084,bin=1.983e+004
                                                                                      i=2, power=8.756,bin=4.039e+004
                                                                                      i=3, power=8.399,bin=1.305e+004
                                                                                      i=4, power=8.169,bin=3.056e+004
                                                                                      i=5, power=8.838,bin=1.377e+004
                                                                                      i=6, power=7.687,bin=6.322e+004
                                                                                      i=7, power=6.46,bin=4.259e+004

                                                                                      Under Cat 11.11:

                                                                                      INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok
                                                                                      INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok
                                                                                      INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok
                                                                                      fully reduced:
                                                                                      i=0, power=150,bin=2.456e+004
                                                                                      i=1, power=135.1,bin=3.156e+004
                                                                                      i=2, power=139.5,bin=3.984e+004
                                                                                      i=3, power=141.9,bin=2.395e+004
                                                                                      i=4, power=139.9,bin=3.459e+004
                                                                                      i=5, power=141.9,bin=4.022e+004
                                                                                      i=6, power=140.4,bin=3.872e+004
                                                                                      i=7, power=138.1,bin=2.452e+004

                                                                                      With power values this large, the program ends up finding too many false positives.

                                                                                      The code:

                                                                                      {
                                                                                          int i=0;
                                                                                          size_t localThreads[1]={1};
                                                                                          size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
                                                                                          err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
                                                                                          err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
                                                                                          if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
                                                                                      #if OCL_VERBOSE
                                                                                          fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
                                                                                      #endif
                                                                                          err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl, 1, NULL,globalThreads, localThreads, 0,NULL,NULL);
                                                                                          if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
                                                                                      #if OCL_VERBOSE
                                                                                          fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
                                                                                      #endif
                                                                                      }
                                                                                      #endif
                                                                                      #if OCL_WDM
                                                                                      err=clFlush(cq);
                                                                                      if(err)fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
                                                                                      #if __Win32
                                                                                      Sleep(OCL_WDM_SLEEP);
                                                                                      #elif _GNU_SOURCE
                                                                                      usleep(OCL_WDM_SLEEP);
                                                                                      #endif
                                                                                      #endif
                                                                                      err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0, NULL,NULL);
                                                                                      if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
                                                                                      #if OCL_VERBOSE
                                                                                      fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
                                                                                      #endif
                                                                                      #if 1 //dump
                                                                                      fprintf(stderr,"fully reduced:\n");
                                                                                      for(int i=0;i<8;i++)fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
                                                                                      exit(0);
                                                                                      #endif
                                                                                      }

                                                                                      The kernel:

                                                                                      __kernel void FindAutoCorrelation_reduce1_kernel_cl(__global float* PartialSumMaxIdx, __global float2* result){
                                                                                          const int sid=256*get_global_id(0);//only 8 workitems here
                                                                                          float full_max=0.0f;
                                                                                          float idx=0.0f;
                                                                                          float Sum=0.0f;
                                                                                          for(int i=0;i<256;i++){
                                                                                              Sum+=PartialSumMaxIdx[3*(sid+i)];
                                                                                              float tmp=PartialSumMaxIdx[(sid+i)*3+1];
                                                                                              if(tmp>full_max){
                                                                                                  full_max=tmp;
                                                                                                  idx=PartialSumMaxIdx[(sid+i)*3+2];
                                                                                              }
                                                                                          }
                                                                                          result[get_global_id(0)].x=full_max*(128.0f*1024.0f)/Sum;//peak value
                                                                                          result[get_global_id(0)].y=idx;//bin
                                                                                      }

                                                                                    • Catalyst 11.11 is broken too
                                                                                      Raistmer
                                                                                      So, what is the status of this bug report?
                                                                                      Now that Cat 11.12 is ready for preview, what do you get with this new driver or any newer internal builds? Will it fix the bug, or is it again too old to contain any fixes? ...
                                                                                        • Catalyst 11.11 is broken too
                                                                                          himanshu.gautam

                                                                                          Hi Raistmer,

                                                                                          Thanks for sending the code. I was trying to see what kind of output the code generates on my system. For some sets of values, the kernel posted appears to run fine with the internal driver I have installed on my machine.

                                                                                          I will try to install the 11.11b performance driver and try that too. It would be helpful if you could tell me what kind of values the FFTbuf buffer contains.

                                                                                           

                                                                                            • Catalyst 11.11 is broken too
                                                                                              Raistmer
                                                                                              Originally posted by: himanshu.gautam

                                                                                              Hi Raistmer,


                                                                                              Thanks for sending the code. I was trying to see what type of outputs the code generate on my system. The Kernel posted appears to be running fine with the internal driver i have installed on my machine for some set of values.


                                                                                              I wil try to install 11.11b performance driver and try that too. It would be helpful if you can give what kind of values the FFTBuf buffer contain.


                                                                                               



                                                                                              Hi Himanshu,
                                                                                              Here I attached full logs for Cat 11.2 and Cat 11.11 (as one can see, they are identical up to the very last call; the part that differs I cited in my previous post) along with a debug build of the app itself and a test workunit.
                                                                                              I will post the whole code that generates the dumps in the log. I think you don't need all the other kernels, because they work OK and the data they generate is listed in the logs.

                                                                                              EDIT:
                                                                                              and here goes the code.
                                                                                              So you can easily compare the log outputs with the places in the code that generated them.

void PC_FindAutoCorrelation_cl(float* PowerBin){
    // Because positive and negative delays are the same
    // we only have to look at half the points
    //Jason: Use 4N-FFT method for Type 2 Discrete Cosine Tranform for now, to match fftw's REDFT10
    // 1 Autocorrelation from global powerspectrum at fft_num*ac_fft_len (fft_num*ul_NumDataPoints )
    //R: all transforms on whole 2D matrix instead of single power specrum.
#if 1 //for dump
    const int ac_fftlen=128*1024;
    float* b=new float[ac_fftlen*4*2];
    if(b==NULL)fprintf(stderr,"Can't allocate buffer\n");
#endif
    //Step 1: Preprocessing - repack relevant powerspectrum into a 4N array with 'real-even symmetry'
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={128*1024,8};//R: hardwired FFT len and number of FFTs for now
#if 1 //dump
        err=clEnqueueReadBuffer(cq,gpu_PowerSpectrum,CL_TRUE,sizeof(float)*ac_fftlen*1,sizeof(float)*ac_fftlen,b,0, NULL,NULL);
        if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
        fprintf(stderr,"power spectrum:\n");
        for(int i=0;i<ac_fftlen;i++)
            fprintf(stderr,"i=%d,x=%g\n",i,b[i]);
#endif
        int i=0;
        err = clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_PowerSpectrum);
        err |= clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        //int size=128*1024;
        //err |=clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:RepackInput_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,RepackInput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: RepackInput_kernel_cl ok\n");
#endif
    }
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after inflation:\n");
    for(int i=0;i<4*ac_fftlen;i++)
        fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
    //Step 2: Process the 4N-FFT (Complex to Complex, size is 4 * ac_fft_len)
    err |= clFFT_ExecuteInterleaved_mb(cq, autocorr_plan,8, clFFT_Forward, gpu_WorkData,gpu_WorkData, FFTbuf, 0, NULL, NULL);
    if(err) fprintf(stderr,"ERROR: autocorr fft: %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: autocorr fft ok\n");
#endif
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after fft:\n");
    for(int i=0;i<4*ac_fftlen;i++)
        fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
#if 0 //R: this step removed now
    //Step 3: Postprocess the FFT result (Scale, take powers & normalize), discarding unused data packing into AutoCorr_in first half for VRAM reuse
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={32*1024,8};//R: hardwired FFT len and number of FFTs for now
        int i=0;
        err = clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        //int size=64*1024;
        //err |=clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackOutput_kernel_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,RepackOutput_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackOutput_kernel_cl: %d\n",err);
    }
#endif
#if 0
    err=clEnqueueReadBuffer(cq,FFTbuf,CL_TRUE,1*64*1024*sizeof(float),sizeof(float)*ac_fftlen/2,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after packing:\n");
    for(int j=0;j<ac_fftlen/2;j++)
        fprintf(stderr,"AutoCorrelation[%d]=%.4g\n",j,b[j]);
    //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    exit(0);
#endif
#if 0 //USE_OPENCL_NV || USE_OPENCL_HD5xxx
    {
        size_t localThreads[2]={256,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_kernel1_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_kernel1_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
    }
#else
    // const int len = 64*1024;//autocorr_fft_len/2;
    // const int blksize = UNSTDMAX(4, UNSTDMIN(pow2a((unsigned int) sqrt((float) (len / 32)) * 32), 512));
    {
        size_t localThreads[2]={64,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce0_kernel_cl, 2, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
    }
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,0*256*sizeof(float),sizeof(float)*256*8*3,b,0, NULL,NULL);
    if(err)fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"partially reduced:\n");
    for(int j=0;j<256*8;j++)
        fprintf(stderr,"j=%d,partial_sum=%.4g,partial_max=%.4g, idx=%.4g\n",j,b[3*j],b[3*j+1],b[3*j+2]);
    //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    //exit(0);
#endif
    {
        int i=0;
        size_t localThreads[1]={1};
        size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
        err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl, 1, NULL,globalThreads, localThreads, 0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
    }
#endif
#if OCL_WDM
    err=clFlush(cq);
    if(err)fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
#if __Win32
    Sleep(OCL_WDM_SLEEP);
#elif _GNU_SOURCE
    usleep(OCL_WDM_SLEEP);
#endif
#endif
    err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0, NULL,NULL);
    if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
#endif
#if 1 //dump
    fprintf(stderr,"fully reduced:\n");
    for(int i=0;i<8;i++)
        fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
    exit(0);
#endif
}
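The comments in the function above mention a "4N-FFT method" for the Type 2 DCT that should match fftw's REDFT10. The RepackInput_kernel_cl kernel itself is not shown in this thread, so the following is only a minimal CPU-side sketch of the textbook layout, assuming the usual real-even repacking (x[n] placed at odd slots 2n+1 and 4N-2n-1, zeros elsewhere). All function names here are hypothetical, and a naive DFT stands in for the GPU FFT.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

// Repack N real samples into a 4N complex array with real-even symmetry:
// slots 2j+1 and 4N-(2j+1) both hold x[j]; all even slots stay zero.
std::vector<cplx> repack_4n(const std::vector<double>& x) {
    assert(!x.empty());
    const std::size_t n = x.size();
    std::vector<cplx> y(4 * n);  // value-initialised to zero
    for (std::size_t j = 0; j < n; ++j) {
        y[2 * j + 1] = x[j];
        y[4 * n - 2 * j - 1] = x[j];
    }
    return y;
}

// Naive O(M^2) forward DFT, standing in for the real FFT (illustration only).
std::vector<cplx> naive_dft(const std::vector<cplx>& in) {
    const std::size_t m = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(m);
    for (std::size_t k = 0; k < m; ++k)
        for (std::size_t t = 0; t < m; ++t)
            out[k] += in[t] * std::polar(1.0, -2.0 * pi * double((k * t) % m) / double(m));
    return out;
}

// DCT-II via the 4N transform: the real parts of the first N outputs.
std::vector<double> dct2_via_4n(const std::vector<double>& x) {
    const std::vector<cplx> spectrum = naive_dft(repack_4n(x));
    std::vector<double> out(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) out[k] = spectrum[k].real();
    return out;
}

// Direct REDFT10 definition: Y[k] = 2 * sum_j x[j] * cos(pi * (j + 0.5) * k / N).
std::vector<double> redft10_direct(const std::vector<double>& x) {
    assert(!x.empty());
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<double> out(n, 0.0);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += 2.0 * x[j] * std::cos(pi * (double(j) + 0.5) * double(k) / double(n));
    return out;
}
```

Comparing dct2_via_4n(x) with redft10_direct(x) on a short input is a quick way to check that a repacking layout really reproduces REDFT10 before looking for a driver problem.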

                                                                                          • Catalyst 11.11 is broken too
                                                                                            MicahVillmow
                                                                                            freighter,
                                                                                            While we still include the OpenCL runtime components in the graphics driver for pre-SDK 2.6 releases, SDK 2.6 features are not supported on Windows XP.
                                                                                            • Catalyst 11.11 is broken too
                                                                                              MicahVillmow
                                                                                              Raistmer,
                                                                                              It was in response to raistmer/freighter. XP support is being dropped, so if issues arise, they probably won't be fixed.
                                                                                                • Catalyst 11.11 is broken too
                                                                                                  freighter

                                                                                                   

                                                                                                  Originally posted by: MicahVillmow Raistmer, It was in response to raistmer/freighter. XP support is being dropped, so if issues arise, they probably won't be fixed.


                                                                                                  I will repeat the tests once the fresh Win7 x64 install has finished on the XP host I am currently running tests on.

                                                                                                    • Catalyst 12.1 preview performance hit
                                                                                                      Raistmer
                                                                                                      Unfortunately, not all of the bugs introduced after Cat 11.2 have been fixed yet.
                                                                                                      The Catalyst 12.1 preview shows much longer execution times than Catalyst 11.2
                                                                                                      (same hardware; Cat 11.2 on Vista x86, Cat 12.1 preview on Win7 x64).
                                                                                                      CPU usage is low in both tests, so the old increased-CPU-usage bug no longer appears on a single-GPU host.

                                                                                                      Is it possible to bring the total (elapsed) execution time back to the level the app has under Catalyst 11.2?

                                                                                                      What information is required to fix this issue?
                                                                                                      • Catalyst 11.11 is broken too
                                                                                                        freighter

                                                                                                         

                                                                                                        Originally posted by: freighter
                                                                                                        Originally posted by: MicahVillmow Raistmer, It was in response to raistmer/freighter. XP support is being dropped, so if issues arise, they probably won't be fixed.


                                                                                                         

                                                                                                        I will repeat the tests once the fresh Win7 x64 install has finished on the XP host I am currently running tests on.

                                                                                                         

                                                                                                        I was finally able to reproduce the issue, first seen on WinXP SP3, now on Win7 x64 SP1 with Cat 11.12:

                                                                                                        Some bluescreen data taken right from the crash (TdrLevel=1 added to the Windows registry):

                                                                                                        Technical Info:

                                                                                                         

                                                                                                        STOP: 0x0116 (0xFFFFFA8001B044E0, 0xFFFFF88002C078C8, 0x0000000000000000, 0x0000000000000001)

                                                                                                        atikmpag.sys - Address FFFFF88002C078C8 base at FFFFF88002C00000, DateStamp 0x4ebb331c

                                                                                                        And additionally, found in the dump after restarting the test machine:

                                                                                                        STOP: 0x0116 (0xFFFFFA8001B044E0, 0xFFFFF88002C078C8, 0x0000000000000000, 0x0000000000000001)

                                                                                                         

                                                                                                        dxgmms1.sys - Address FFFFF88003F8FF07 base at FFFFF88003F80000, DateStamp 0x4ce799c1

                                                                                                         

                                                                                                        I will retry with the Cat 12.1 preview as well.
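
When reading the two STOP records above, it can help to turn each faulting address into an offset inside its module, since the offset is what stays comparable across reboots and load addresses. A small sketch of that subtraction follows; the function name is made up for illustration, and the numbers in the usage note are the ones from the crash data above.

```cpp
#include <cassert>
#include <cstdint>

// Offset of a faulting address inside its module: address minus load base.
// (Hypothetical helper; it only does the arithmetic, nothing driver-specific.)
std::uint64_t module_offset(std::uint64_t address, std::uint64_t base) {
    assert(address >= base);
    return address - base;
}
```

For atikmpag.sys, module_offset(0xFFFFF88002C078C8, 0xFFFFF88002C00000) gives 0x78C8; for dxgmms1.sys, module_offset(0xFFFFF88003F8FF07, 0xFFFFF88003F80000) gives 0xFF07.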

                                                                                                    • Catalyst 11.11 is broken too
                                                                                                      MicahVillmow
                                                                                                      Raistmer, just a test case that shows the differences is enough.
                                                                                                        • Catalyst 11.11 is broken too
                                                                                                          Raistmer
                                                                                                          Originally posted by: MicahVillmow

                                                                                                          Raistmer, just a test case that shows the differences is enough.


                                                                                                          Ok.
                                                                                                          I just found that there is quite a large spread in timings under Cat 11.2 as well when the CPU is busy.

                                                                                                          For example:

                                                                                                          App Name                          Task name       AR        CPU time  Elapsed
                                                                                                          MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7.wu    0.008955  35.974    132.818
                                                                                                          MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_1.wu  0.008955  36.535    126.451
                                                                                                          MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_2.wu  0.008955  36.535    126.236
                                                                                                          MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_3.wu  0.008955  33.899    279.133
                                                                                                          MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_4.wu  0.008955  36.582    127.559
                                                                                                          MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_5.wu  0.008955  34.554    301.493

                                                                                                          These are 6 successive runs on the same input data.
                                                                                                          Look at the last 2 columns for CPU and elapsed time.
                                                                                                          While CPU time does not differ much, elapsed time differs strongly from run to run.
                                                                                                          I will acquire some statistics under both drivers.

                                                                                                          P.S. Interestingly, the lower CPU times correspond to the greatly increased elapsed times.
                                                                                                          Elapsed time varies 2-fold (!)
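
To put a number on the run-to-run spread described above, one could compute the fastest and slowest elapsed time and their ratio. This is only a sketch; the struct and function names are invented for illustration and are not part of the app.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Fastest run, slowest run, and slowest/fastest ratio for a set of elapsed times.
struct Spread {
    double fastest;
    double slowest;
    double ratio;
};

Spread elapsed_spread(const std::vector<double>& elapsed) {
    assert(!elapsed.empty());
    const auto mm = std::minmax_element(elapsed.begin(), elapsed.end());
    return Spread{*mm.first, *mm.second, *mm.second / *mm.first};
}
```

Fed with the six elapsed times listed above (132.818, 126.451, 126.236, 279.133, 127.559, 301.493), the ratio comes out at roughly 2.39.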
                                                                                                            • Catalyst 11.11 is broken too
                                                                                                              Raistmer
                                                                                                              Here is an example of the difference between 11.2 and 12.1 preview elapsed times for my app:

                                                                                                              11.2:
                                                                                                              App Name                          Task name       AR        CPU time  Elapsed
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7.wu    0.008955  34.975    129.685
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_1.wu  0.008955  36.879    125.781
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_2.wu  0.008955  32.511    251.097
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_3.wu  0.008955  32.823    274.248
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_4.wu  0.008955  35.085    124.304
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_5.wu  0.008955  31.731    276.672
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_6.wu  0.008955  31.762    291.919
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_7.wu  0.008955  33.025    259.631
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_8.wu  0.008955  32.76     256.995
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7.wu    0.008955  35.35     132.97
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_1.wu  0.008955  37.596    127.986
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_2.wu  0.008955  34.024    252.894
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_3.wu  0.008955  36.707    125.958
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_4.wu  0.008955  35.802    125.538
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_5.wu  0.008955  35.319    124.983
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_6.wu  0.008955  33.837    259.316
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_7.wu  0.008955  35.802    126.343
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_8.wu  0.008955  36.411    126.186

                                                                                                              12.1p:
                                                                                                              App Name                          Task name       AR        CPU time  Elapsed
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7.wu    0.008955  29.734    470.622
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_1.wu  0.008955  29.718    497.89
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_2.wu  0.008955  29.812    471.589
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_3.wu  0.008955  29.562    474.88
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_4.wu  0.008955  34.632    116.392
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_5.wu  0.008955  34.648    116.47
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_6.wu  0.008955  30.763    388.191
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_7.wu  0.008955  34.897    116.855
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_8.wu  0.008955  34.866    116.674
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7.wu    0.008955  30.795    583.831
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_1.wu  0.008955  30.311    576.561
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_2.wu  0.008955  31.091    499.032
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_3.wu  0.008955  32.745    173.332
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_4.wu  0.008955  33.384    186.435
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_5.wu  0.008955  35.506    117.972
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_6.wu  0.008955  33.01     185.255
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_7.wu  0.008955  35.677    116.281
                                                                                                              MB7_win_x86_SSE3_OpenCL_ATi_r390  PG0009_v7_8.wu  0.008955  31.887    250.409

                                                                                                              As one can see, 12.1p is sometimes much slower.

                                                                                                              This points to 2 issues:

                                                                                                              1) under both drivers, elapsed time sometimes increases a lot
                                                                                                              2) under the new 12.1p driver, this increase can be much bigger

                                                                                                              The testing environment included a few idle-priority CPU-intensive processes plus the GPU app running at increased (above normal) process priority.
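
One way to compare the two drivers on the tables above is to count how many runs fall well behind the fastest run of the same series. The sketch below does that; the function name and the choice of a 1.5x cut-off are assumptions made for illustration, not something the app or the driver defines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Count runs whose elapsed time exceeds `factor` times the fastest run.
// `factor` is an arbitrary cut-off chosen by the caller (assumption).
std::size_t count_slow_runs(const std::vector<double>& elapsed, double factor) {
    assert(!elapsed.empty() && factor > 1.0);
    const double fastest = *std::min_element(elapsed.begin(), elapsed.end());
    const double limit = fastest * factor;
    return static_cast<std::size_t>(
        std::count_if(elapsed.begin(), elapsed.end(),
                      [limit](double t) { return t > limit; }));
}
```

With a factor of 1.5, the 18 elapsed times listed for 11.2 give 8 slow runs and the 18 listed for 12.1p give 11, which matches both observations: the slowdown exists under both drivers and is more frequent under 12.1p.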
                                                                                                                • Catalyst 11.11 is broken too
                                                                                                                  _heinz

                                                                                                                  I bought an HD 4670 AGP to have a machine for testing OCL programs; the OS is W7 x32. The latest driver, http://support.amd.com/de/kbarticles/Pages/CatalystAGPHotfix.aspx, always forces the CPU to 100%. This is an error in the latest driver; CPU usage should be 5% at most. AMD produces good hardware, but it is useless if the driver fails.

                                                                                                                  I'm very disappointed in AMD as a company. I will reconsider my decision to buy new AMD servers if driver support is this bad. This way you have lost all my trust in AMD.

                                                                                                                  _heinz

                                                                                                                   

                                                                                                                   

                                                                                                                    • Catalyst 11.11 is broken too
                                                                                                                      nou

                                                                                                                      You bought the worst card for OpenCL. First, it is a 4xxx card, which can hardly manage OpenCL. Second, it is AGP, which has its own set of problems.

                                                                                                                        • Catalyst 11.11 is broken too
                                                                                                                          Meteorhead

                                                                                                                          Indeed. There have been a few topics where people asked what sort of card to buy for testing. Consulting some of these before buying would have saved some money and time. The 4670 does support OpenCL and might seem an economically wise decision; however, 4xxx cards only emulate some features, such as __local (or shared, in CUDA terms) memory. It resides in __global (or device) memory, and therefore hurts performance badly.

                                                                                                                          I don't know about the AGP connector, but naturally, it is badly deprecated.

                                                                                                                          Testing performance on 3-year-old HW is somewhat self-contradictory. If you don't want to spend a lot of money on a test machine, consider buying a 6850, which is a downclocked version of the second-largest chip of the current generation, has all the features, and shows decent performance. The 7000 cards most likely won't hit the market until the middle of January.

                                                                                                                          It is as if I tested NV on a GT220 and said that it is bad perf-wise. Naturally, because it is 3 years old and it is the lowest of its class.

                                                                                                                            • Catalyst 11.11 is broken too
                                                                                                                              freighter

                                                                                                                               

                                                                                                                              Originally posted by: Meteorhead Indeed. There have been a few topics where people asked what sort of card to buy for testing. Consulting some of these before buying would have saved some money and time. The 4670 does support OpenCL and might seem an economically wise decision; however, 4xxx cards only emulate some features, such as __local (or shared, in CUDA terms) memory. It resides in __global (or device) memory, and therefore hurts performance badly.

                                                                                                                              I don't know about the AGP connector, but naturally, it is badly deprecated.

                                                                                                                              Testing performance on 3-year-old HW is somewhat self-contradictory. If you don't want to spend a lot of money on a test machine, consider buying a 6850, which is a downclocked version of the second-largest chip of the current generation, has all the features, and shows decent performance. The 7000 cards most likely won't hit the market until the middle of January.

                                                                                                                              It is as if I tested NV on a GT220 and said that it is bad perf-wise. Naturally, because it is 3 years old and it is the lowest of its class.

                                                                                                                               

                                                                                                                              Reasoning: if an OpenCL app works on one of these lower-end HD4xxx GPUs, it will run on all the newer ones too. That seems to be true at least for Seti@home's OpenCL app. This is not about performance or new features; it is about providing compatibility for a wider user base.

                                                                                                                  • Catalyst 11.11 is broken too
                                                                                                                    MicahVillmow
                                                                                                                    freighter,
                                                                                                                    If you want compatibility, don't target a specific device (HD4XXX), but target the language version. You can buy a higher-end card but target only core OpenCL 1.0 (without images), and it should work on the lower-end cards. If it doesn't work, then it is a problem with our software and should be reported as such.
                                                                                                                      • Catalyst 11.11 is broken too
                                                                                                                        freighter

                                                                                                                         

                                                                                                                        Originally posted by: MicahVillmow freighter, If you want compatibility, don't target a specific device (HD4XXX), but target the language version. You can buy a higher-end card but target only core OpenCL 1.0 (without images), and it should work on the lower-end cards. If it doesn't work, then it is a problem with our software and should be reported as such.


                                                                                                                        MicahVillmow,

                                                                                                                        Sorry if my issue is still a problem.

                                                                                                                        Where did I state that I target a "specific device (HD4xxx)"? I am only using a specific device for testing an OpenCL (1.0 without images) version of Seti@home that tries to follow AMD's recommendation for "HD4xxx GPUs with a max. workgroup size of 128": use workgroup sizes of no more than 32 if you want to be on the safe side. Workgroup sizes of 64 and higher can lead to incorrect results.

                                                                                                                        So, I made sure that AMD's recommendation is in use and works OK on at least one of the (formerly) supported platforms: openSUSE 11.3, 64-bit, with Catalyst driver version 11.9 and older. Newer drivers lead to the same failure that I reported earlier in this thread for the Windows version.

                                                                                                                        ???
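
The workgroup-size recommendation quoted above can be expressed as a small host-side helper. This is only a sketch under stated assumptions: the function names are hypothetical, the fallback of 64 for other devices is simply the value the host code earlier in this thread hardwires, and in real host code the device limit would come from clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE.

```cpp
#include <cassert>
#include <cstddef>

// Pick a local work size from the device's maximum work-group size.
// Rule from the post above: devices limited to 128 should use at most 32.
// The 64 used otherwise is an assumption taken from the hardwired host code.
std::size_t safe_local_size(std::size_t device_max_wg) {
    assert(device_max_wg > 0);
    const std::size_t preferred = (device_max_wg <= 128) ? 32 : 64;
    return preferred < device_max_wg ? preferred : device_max_wg;
}

// OpenCL 1.x requires the global size to be a multiple of the local size,
// so round it up; the kernel must then ignore the padding work-items.
std::size_t round_up_global(std::size_t global, std::size_t local) {
    assert(local > 0);
    return ((global + local - 1) / local) * local;
}
```

For a device reporting 128, safe_local_size returns 32; a global size such as 128*1024 is already a multiple of 32, so round_up_global leaves it unchanged.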

                                                                                                                      • Catalyst 11.11 is broken too
                                                                                                                        MicahVillmow
                                                                                                                        freighter/Raistmer,
                                                                                                                        Does the crashing occur on anything outside of the HD4XXX series? We believe we have the performance issue understood and are working on a fix.
                                                                                                                          • Catalyst 11.11 is broken too
                                                                                                                            freighter

                                                                                                                             

                                                                                                                            Originally posted by: MicahVillmow freighter/Raistmer, Does the crashing occur on anything outside of the HD4XXX series? We believe we have the performance issue understood and are working on a fix.


                                                                                                                            No, only HD4xxx GPUs with a max. workgroup size of 128 show the crashes.

                                                                                                                            • Re: Catalyst 11.11 is broken too
                                                                                                                              freighter

                                                                                                                              Hi,

                                                                                                                              It has been about 3 months now, and after trying all the drivers I could get hold of, up to the 12.4 preview, the crashing issue that happens only on HD4xxx GPUs with a max. workgroup size of 128 is still not resolved. Hopefully the fix you were talking about in January makes it into one of the next drivers.

                                                                                                                               

                                                                                                                              Patiently (a)waiting ...

                                                                                                                                • Re: Catalyst 11.11 is broken too
                                                                                                                                  MicahVillmow

                                                                                                                                  freighter,

                                                                                                                                  Are you querying the 'CL_KERNEL_WORK_GROUP_SIZE' attribute of the device you are executing on? In some cases we must limit the size on a per kernel basis.
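For readers following along, here is a minimal sketch of how a per-kernel limit could be applied once it has been queried. The helper name `pick_local_size` and its parameters are hypothetical and not taken from any code in this thread. It assumes the limit was obtained with the standard `clGetKernelWorkGroupInfo` call shown in the comment, and it is kept free of OpenCL headers so that it compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the thread): choose a local work size
 * that respects the per-kernel limit reported by the driver.
 *
 * kernel_limit is assumed to come from the standard OpenCL query
 *   clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
 *                            sizeof(size_t), &kernel_limit, NULL);
 * which can return a smaller value than CL_DEVICE_MAX_WORK_GROUP_SIZE.
 */
size_t pick_local_size(size_t requested, size_t kernel_limit, size_t global_size)
{
    assert(requested > 0 && kernel_limit > 0 && global_size > 0);

    /* Never exceed what this particular kernel supports on this device. */
    size_t local = requested < kernel_limit ? requested : kernel_limit;
    if (local > global_size)
        local = global_size;

    /* OpenCL 1.0 requires the global size to be a multiple of the local
       size, so shrink until it divides evenly (1 always does). */
    while (global_size % local != 0)
        --local;

    return local;
}
```

The returned value would then be passed as the local work size to `clEnqueueNDRangeKernel`. A hard-coded size such as 32 would pass through this helper unchanged only if the queried per-kernel limit is at least 32 and the global size is a multiple of it.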

                                                                                                                                    • Re: Catalyst 11.11 is broken too
                                                                                                                                      freighter

                                                                                                                                      In int getMaxKernelWorkGroupSize(cl_fft_plan *plan, unsigned int *max_wg_size, unsigned int num_devices, cl_device_id *devices), the check "if(wg_size < kInfo->num_workitems_per_workgroup)" is replaced with "if(wg_size < plan->max_work_item_per_workgroup)", where "plan->max_work_item_per_workgroup" is constantly set to 32, for