I also thought of reverting to some older Catalyst, but the problem was that on Ubuntu it regularly crashed the machine when I ran multi-GPU applications. 11.11 is the first release that is stable in that sense, but unfortunately it fails to boot a significant fraction of the time.
{
    int i=0;
    size_t localThreads[1]={1};
    size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
    err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
    err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
    if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
    err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl,1,NULL,globalThreads,localThreads,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
}
#endif
#if OCL_WDM
err=clFlush(cq);
if(err) fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
#if __Win32
Sleep(OCL_WDM_SLEEP);
#elif _GNU_SOURCE
usleep(OCL_WDM_SLEEP);
#endif
#endif
err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0,NULL,NULL);
if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
#if OCL_VERBOSE
fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
#endif
#if 1 //dump
fprintf(stderr,"fully reduced:\n");
for(int i=0;i<8;i++) fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
exit(0);
#endif
}

The kernel:

__kernel void FindAutoCorrelation_reduce1_kernel_cl(__global float* PartialSumMaxIdx,
                                                    __global float2* result){
    const int sid=256*get_global_id(0);//only 8 workitems here
    float full_max=0.0f;
    float idx=0.0f;
    float Sum=0.0f;
    for(int i=0;i<256;i++){
        Sum+=PartialSumMaxIdx[3*(sid+i)];
        float tmp=PartialSumMaxIdx[(sid+i)*3+1];
        if(tmp>full_max){
            full_max=tmp;
            idx=PartialSumMaxIdx[(sid+i)*3+2];
        }
    }
    result[get_global_id(0)].x=full_max*(128.0f*1024.0f)/Sum;//peak value
    result[get_global_id(0)].y=idx;//bin
}
Hi Raistmer,
Thanks for sending the code. I was trying to see what type of output the code generates on my system. The kernel posted appears to run fine with the internal driver I have installed on my machine, for some sets of values.
I will try to install the 11.11b performance driver and try that too. It would be helpful if you could tell me what kind of values the FFTbuf buffer contains.
Originally posted by: himanshu.gautam
Hi Raistmer,
Thanks for sending the code. I was trying to see what type of output the code generates on my system. The kernel posted appears to run fine with the internal driver I have installed on my machine, for some sets of values.
I will try to install the 11.11b performance driver and try that too. It would be helpful if you could tell me what kind of values the FFTbuf buffer contains.
void PC_FindAutoCorrelation_cl(float* PowerBin){
    // Because positive and negative delays are the same
    // we only have to look at half the points
    //Jason: Use 4N-FFT method for Type 2 Discrete Cosine Transform for now, to match fftw's REDFT10
    // 1 Autocorrelation from global powerspectrum at fft_num*ac_fft_len (fft_num*ul_NumDataPoints)
    //R: all transforms on whole 2D matrix instead of single power spectrum.
#if 1 //for dump
    const int ac_fftlen=128*1024;
    float* b=new float[ac_fftlen*4*2];
    if(b==NULL) fprintf(stderr,"Can't allocate buffer\n");
#endif
    //Step 1: Preprocessing - repack relevant powerspectrum into a 4N array with 'real-even symmetry'
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={128*1024,8};//R: hardwired FFT len and number of FFTs for now
#if 1 //dump
        err=clEnqueueReadBuffer(cq,gpu_PowerSpectrum,CL_TRUE,sizeof(float)*ac_fftlen*1,sizeof(float)*ac_fftlen,b,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
        fprintf(stderr,"power spectrum:\n");
        for(int i=0;i<ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g\n",i,b[i]);
#endif
        int i=0;
        err = clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_PowerSpectrum);
        err |= clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        //int size=128*1024;
        //err |=clSetKernelArg(RepackInput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:RepackInput_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,RepackInput_kernel_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackInput_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: RepackInput_kernel_cl ok\n");
#endif
    }
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after inflation:\n");
    for(int i=0;i<4*ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
    //Step 2: Process the 4N-FFT (Complex to Complex, size is 4 * ac_fft_len)
    err |= clFFT_ExecuteInterleaved_mb(cq,autocorr_plan,8,clFFT_Forward,gpu_WorkData,gpu_WorkData,FFTbuf,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: autocorr fft: %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: autocorr fft ok\n");
#endif
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,1*4*2*sizeof(float)*ac_fftlen,4*2*sizeof(float)*ac_fftlen,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after fft:\n");
    for(int i=0;i<4*ac_fftlen;i++) fprintf(stderr,"i=%d,x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
#endif
#if 0 //R: this step removed now
    //Step 3: Postprocess the FFT result (Scale, take powers & normalize),
    //discarding unused data, packing into AutoCorr_in first half for VRAM reuse
    {
        size_t localThreads[2]={64,1};//R: TODO tune
        size_t globalThreads[2]={32*1024,8};//R: hardwired FFT len and number of FFTs for now
        int i=0;
        err = clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        //int size=64*1024;
        //err |=clSetKernelArg(RepackOutput_kernel_cl,i++,sizeof(cl_int),(void *)&size);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:RepackOutput_kernel_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,RepackOutput_kernel_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: RepackOutput_kernel_cl: %d\n",err);
    }
#endif
#if 0
    err=clEnqueueReadBuffer(cq,FFTbuf,CL_TRUE,1*64*1024*sizeof(float),sizeof(float)*ac_fftlen/2,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"after packing:\n");
    for(int j=0;j<ac_fftlen/2;j++) fprintf(stderr,"AutoCorrelation[%d]=%.4g\n",j,b[j]);
    //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    exit(0);
#endif
#if 0 //USE_OPENCL_NV || USE_OPENCL_HD5xxx
    {
        size_t localThreads[2]={256,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_kernel1_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_kernel1_cl: %d\n",err);
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_kernel1_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
    }
#else
    // const int len = 64*1024;//autocorr_fft_len/2;
    // const int blksize = UNSTDMAX(4, UNSTDMIN(pow2a((unsigned int) sqrt((float) (len / 32)) * 32), 512));
    {
        size_t localThreads[2]={64,1};
        size_t globalThreads[2]={256,8};//R: hardwired number of FFTs for now
        int i=0;
        err = clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_WorkData);
        err |= clSetKernelArg(FindAutoCorrelation_reduce0_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce0_kernel_cl,2,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: FindAutoCorrelation_reduce0_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: FindAutoCorrelation_reduce0_kernel_cl ok\n");
#endif
    }
#if 1 //dump
    err=clEnqueueReadBuffer(cq,gpu_WorkData,CL_TRUE,0*256*sizeof(float),sizeof(float)*256*8*3,b,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: ReadBuffer:%d\n",err);
    fprintf(stderr,"partially reduced:\n");
    for(int j=0;j<256*8;j++)
        fprintf(stderr,"j=%d,partial_sum=%.4g,partial_max=%.4g, idx=%.4g\n",j,b[3*j],b[3*j+1],b[3*j+2]);
    //fprintf(stderr,"i=%d, x=%g, y=%g\n",i,b[2*i],b[2*i+1]);
    //exit(0);
#endif
    {
        int i=0;
        size_t localThreads[1]={1};
        size_t globalThreads[1]={8};//R: hardwired number of FFTs for now
        err = clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&FFTbuf);
        err |= clSetKernelArg(FindAutoCorrelation_reduce1_kernel_cl,i++,sizeof(cl_mem),(void *)&gpu_AutoCorrelationResults);
        if(err) fprintf(stderr,"ERROR: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Setting kernel argument:FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
        err = clEnqueueNDRangeKernel(cq,FindAutoCorrelation_reduce1_kernel_cl,1,NULL,globalThreads,localThreads,0,NULL,NULL);
        if(err) fprintf(stderr,"ERROR: Enqueueing FindAutoCorrelation_reduce1_kernel_cl: %d\n",err);
#if OCL_VERBOSE
        fprintf(stderr,"INFO: Enqueuing FindAutoCorrelation_reduce1_kernel_cl ok\n");
#endif
    }
#endif
#if OCL_WDM
    err=clFlush(cq);
    if(err) fprintf(stderr,"ERROR: submitting kernels autocorr search: %d\n",err);
#if __Win32
    Sleep(OCL_WDM_SLEEP);
#elif _GNU_SOURCE
    usleep(OCL_WDM_SLEEP);
#endif
#endif
    err=clEnqueueReadBuffer(cq,gpu_AutoCorrelationResults,CL_TRUE,0,sizeof(cl_float2)*8,PowerBin,0,NULL,NULL);
    if(err) fprintf(stderr,"ERROR: clReadBuffer(gpu_AutoCorrelationResults,PowerBin): %d\n",err);
#if OCL_VERBOSE
    fprintf(stderr,"INFO: clReadBuffer(gpu_AutoCorrelationResults,PowerBin) ok\n");
#endif
#if 1 //dump
    fprintf(stderr,"fully reduced:\n");
    for(int i=0;i<8;i++) fprintf(stderr,"i=%d, power=%.4g,bin=%.4g\n",i,PowerBin[2*i],PowerBin[2*i+1]);
    exit(0);
#endif
}
Hi raistmer,
I am not able to access the logs you provided a link for, as the page is not in English.
Can you please provide an English link?
Originally posted by: himanshu.gautam
Hi raistmer,
I am not able to access the logs you provided a link for, as the page is not in English.
Can you please provide an English link?
Got them 🙂
Thanks for info about which SDK features are supported by XP.
Seti@home apps do not use any SDK 2.6 features currently.
Back to sub-topic :
Found a freeware tool called "BlueScreenView" to decipher
bluescreens on Windows XP and later. Hopefully that can help ease the bug hunt a little.
Dump File : Mini121511-01.dmp
Crash Time : 15.12.2011 02:00:54
Bug Check String : THREAD_STUCK_IN_DEVICE_DRIVER
Caused By Driver : ati2cqag.dll
Caused By Address : ati2cqag.dll+1aeb4
File Description : Central Memory Manager / Queue Server Module
Product Name : ATI Radeon Family
Company : ATI Technologies Inc.
File Version : 6.14.10.0534
Processor : 32-bit
Crash Address : ati2cqag.dll+1ad92
"Seti@home apps do not use any SDK 2.6 features currently."
What are SDK 2.6 features exactly???
Originally posted by: MicahVillmow
freighter,
While we still include the OpenCL runtime components in the graphics driver for pre-SDK 2.6 releases, SDK 2.6 features are not supported on Windows XP.
Originally posted by: MicahVillmow Raistmer, It was in response to raistmer/freighter. XP support is being dropped, so if issues arise, they probably won't be fixed.
Will repeat the tests after the newer Win7x64 install has finished on the XP host I'm currently running tests on.
Originally posted by: freighter Originally posted by: MicahVillmow Raistmer, It was in response to raistmer/freighter. XP support is being dropped, so if issues arise, they probably won't be fixed.
Will repeat the tests after the newer Win7x64 install has finished on the XP host I'm currently running tests on.
Finally I was able to reproduce the issue, first seen on WinXPsp3, now on Win7x64sp1, Cat 11.12:
Some bluescreen data (after adding TdrLevel=1 to the Windows registry), taken right from the crash:
Technical Info:
STOP: 0x0116 (0xFFFFFA8001B044E0, 0xFFFFF88002C078C8, 0x0000000000000000, 0x0000000000000001)
atikmpag.sys - Address FFFFF88002C078C8 base at FFFFF88002C00000, DateStamp 0x4ebb331c
AND additionally, after restarting the test machine, found in the dump:
STOP: 0x0116 (0xFFFFFA8001B044E0, 0xFFFFF88002C078C8, 0x0000000000000000, 0x0000000000000001)
dxgmms1.sys - Address FFFFF88003F8FF07 base at FFFFF88003F80000, DateStamp 0x4ce799c1
Will retry with Cat 12.1preview also.
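For anyone else reproducing this: the TdrLevel tweak used above, which makes a GPU hang bug-check (bluescreen with a minidump) instead of silently resetting the driver, is a single registry value. The key path and values below come from Microsoft's TDR registry-key documentation; run in an elevated prompt and reboot afterwards.

```shell
rem TdrLevel: 0 = detection off, 1 = bug check on timeout (used here),
rem 3 = timeout detection and recovery (the default).
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 1 /f
```

Remember to set it back to 3 (or delete the value) once the dumps are collected, or every driver hiccup will bluescreen the machine.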
Originally posted by: MicahVillmow
Raistmer, just a test case that shows the differences is enough.
I bought an HD 4670 AGP to have a machine to test OCL programs, OS W7 x32. The latest driver http://support.amd.com/de/kbarticles/Pages/CatalystAGPHotfix.aspx forces the CPU to 100% at all times. This is an error in the latest driver; CPU usage should be 5% at most. AMD produces good hardware, but it is useless if the driver fails.
I'm very disappointed in AMD. I will reconsider my decision to buy new AMD servers if driver support is this bad. This way you have lost all my trust in AMD.
_heinz
You bought the worst card for OpenCL. First, it is a 4xxx card, which can hardly manage OpenCL. Second, it is AGP, which has its own additional set of problems.
Indeed. There have been a few topics where people asked what sort of card to buy for testing; consulting some of these before buying would have saved money and time. The 4670 does support OpenCL and might seem an economically wise choice, however 4xxx cards only emulate some features, such as __local memory (shared memory in CUDA terms): it actually resides in __global (device) memory, which hurts performance badly.
I don't know much about the AGP connector, but naturally, it is badly deprecated.
Testing performance on 3-year-old hardware is somewhat self-defeating. If you don't want to spend a lot of money on a test machine, consider buying a 6850, which is a downclocked version of the second-largest chip of the current generation, supports all features, and shows decent performance. 7000-series cards most likely won't hit the market until mid-January.
It is as if I tested NVIDIA on a GT220 and concluded it performs badly. Naturally, because it is 3 years old and the lowest of its class.
Originally posted by: Meteorhead Indeed. There have been a few topics where people asked what sort of card to buy for testing; consulting some of these before buying would have saved money and time. The 4670 does support OpenCL and might seem an economically wise choice, however 4xxx cards only emulate some features, such as __local memory (shared memory in CUDA terms): it actually resides in __global (device) memory, which hurts performance badly.
I don't know much about the AGP connector, but naturally, it is badly deprecated.
Testing performance on 3-year-old hardware is somewhat self-defeating. If you don't want to spend a lot of money on a test machine, consider buying a 6850, which is a downclocked version of the second-largest chip of the current generation, supports all features, and shows decent performance. 7000-series cards most likely won't hit the market until mid-January.
It is as if I tested NVIDIA on a GT220 and concluded it performs badly. Naturally, because it is 3 years old and the lowest of its class.
Reasoning: if an OpenCL app works on one of these lower-end HD4xxx GPUs, it will run on all the newer ones, too. That seems to be true at least for Seti@home's OpenCL app. This is not about performance or new features; it is about providing compatibility for a wider userbase.
Originally posted by: MicahVillmow freighter, If you want compatibility, don't target a specific device (HD4XXX), but target the language version. You can buy a higher-end card but target only core OpenCL 1.0 (without images), and it should work on the lower-end cards. If it doesn't work, then it is a problem with our software and should be reported as such.
MicahVillmow,
sorry if my issue is still a problem.
Where did I state that I target a "specific device (HD4XXX)"? I am only using a specific device for testing an OpenCL (1.0, without images) version of Seti@home that tries to follow AMD's recommendation for HD4xxx GPUs with a max workgroup size of 128: use workgroup sizes of no more than 32 if you want to be on the safe side; workgroup sizes of 64 and higher can lead to incorrect results.
So I made sure that AMD's recommendation is in use and works OK on at least one of the (formerly) supported platforms: openSUSE 11.3, 64-bit, with Catalyst driver version 11.9 and older. Newer drivers lead to the same failure I reported earlier in this thread for the Windows version.
???
Originally posted by: MicahVillmow freighter/Raistmer, Does the crashing occur on anything outside of the HD4XXX series? We believe we have the performance issue understood and are working on a fix.
No, only HD4xxx GPUs with a max workgroup size of 128 show the crashes.
Hi,
it has been about 3 months now, and after trying out all the drivers I could get hold of, up to the 12.4 preview, the crashing issue that happens only on HD4xxx GPUs with a max workgroup size of 128 is still not resolved. Hopefully the fix you were talking about in January will make it into one of the next drivers.
Patiently (a)waiting ...
freighter,
Are you querying the 'CL_KERNEL_WORK_GROUP_SIZE' attribute of the kernel for the device you are executing on? In some cases we must limit the size on a per-kernel basis.
MicahVillmow,
the 'CL_KERNEL_WORK_GROUP_SIZE' attribute is queried inside Apple's OpenCL_FFT, which we adapted to our needs. That library contains most of the modifications for the HD4xxx GPUs with a max workgroup size of 128.
In fft_setup.cpp, in getMaxKernelWorkGroupSize(cl_fft_plan *plan, unsigned int *max_wg_size, unsigned int num_devices, cl_device_id *devices), the 'CL_KERNEL_WORK_GROUP_SIZE' attribute is queried, but its value is only used for comparisons.
In that function the code line "if(wg_size < kInfo->num_workitems_per_workgroup)" is replaced with "if(wg_size < plan->max_work_item_per_workgroup)", where "plan->max_work_item_per_workgroup" is constantly set to 32 for the HD4xxx GPUs with a max workgroup size of 128.
As mentioned earlier, this works up to Catalyst 11.9 and AMD APP SDK 2.4 on Linux.
Ok,
Can you supply us with a test case that shows the issue? We have plenty of samples running on the HD4XXX series of GPUs without issue.
Micah Villmow,
as Raistmer has already sent you the sources, I've put together an example that should make it possible to reproduce the crashes (Linux Testcase for LHD4K crashes):
edit: link removed
Please let me know once you have got it, so I can free the server space again.
Message was edited by: Urs Echternacht to remove a link.
I have the test case, thanks.