Archives Discussions

Raistmer · ‎01-20-2010

FFT only computes correctly up to size 1024 on GPU, but bigger sizes work on CPU!

I am trying to use code from Apple's OpenCL_FFT sample for OS X to get an FFT running on ATI's GPU.
OpenCL_FFT
For a correctness check I use the FFTW CPU library to compute the FFT of the same data.

I need an FFT of size 32k (32768). The results from oclFFT are completely different from the FFTW results (they differ in the first digit).
Then I tried other FFT sizes to check whether the sample was built correctly at all, and found that sizes up to 1024 (I tried 32 and 1024) compute just fine. The results agree in the first 4 or more digits; beyond that small differences appear, probably from different rounding.
But bigger sizes are completely broken. For example, with a size of 2048 oclFFT changes only the first 8 elements of the input array, then leaves the input data unchanged, then changes another 8 elements at index 128, then leaves the data unchanged again, then changes some at index 384, and so on. The changed elements are nothing like the FFTW results in this case (the first digit differs).
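A check like the one described here could be sketched as follows. This is only a minimal illustration in plain C, not code from the sample: the helper names `max_relative_error` and `count_unchanged` are hypothetical, and both arrays are assumed to be in the interleaved complex layout (re, im, re, im, ...) used by clFFT_InterleavedComplexFormat and by FFTW's fftwf_complex.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in libm. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Largest component-wise difference between the tested output and the
   reference, relative to the largest reference component.
   n is the number of complex points, so each array holds 2*n floats. */
double max_relative_error(const float *test, const float *ref, size_t n)
{
    assert(test != NULL && ref != NULL && n > 0);
    double max_err = 0.0, max_ref = 0.0;
    for (size_t k = 0; k < 2 * n; ++k) {
        double err = abs_d((double)test[k] - (double)ref[k]);
        double mag = abs_d((double)ref[k]);
        if (err > max_err) max_err = err;
        if (mag > max_ref) max_ref = mag;
    }
    return max_ref > 0.0 ? max_err / max_ref : max_err;
}

/* Number of complex points the transform left bit-identical to the input.
   A large count points at kernels that never touched most of the buffer. */
size_t count_unchanged(const float *out, const float *in, size_t n)
{
    assert(out != NULL && in != NULL);
    size_t unchanged = 0;
    for (size_t i = 0; i < n; ++i)
        if (out[2 * i] == in[2 * i] && out[2 * i + 1] == in[2 * i + 1])
            ++unchanged;
    return unchanged;
}
```

With the reference taken from FFTW, a small value from `max_relative_error` would correspond to the several agreeing digits seen up to size 1024, while a value near 1 would correspond to the first-digit mismatch. `count_unchanged` compares the GPU output against a saved copy of the input, which is one way to quantify the "mostly unchanged data" symptom.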

Something is wrong with the kernel sequence that is used for sizes bigger than 1024.
But no errors are reported.

Could someone experienced in OpenCL please look at the sample's code for clues as to why it works for small FFT sizes and breaks above 1024? Help needed.

P.S. I tried to run it on an HD4870.
P.P.S.
From the FFT plan setup for oclFFT:
plan->max_localmem_fft_size = 2048;
plan->max_work_item_per_workgroup = 256;
plan->max_radix = 16;
plan->min_mem_coalesce_width = 16;
plan->num_local_mem_banks = 16;
Can something here be so wrong for the ATI GPU that sizes of 2048 and more fail?

n0thing · ‎01-20-2010

Can you post the ported sample?

Raistmer · ‎01-20-2010

In the sample itself mostly only main.cpp was changed; device initialization was replaced by the equivalent code from the TemplateC sample.
fft_setup.cpp is unchanged.
In the other files, where required, all log2() calls were replaced with int_log2() calls, where:
inline int int_log2(int input) {
    int i = 0;
    while (input >>= 1) i++;
    return i;
}
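One thing worth keeping in mind with this replacement: the shift loop returns floor(log2(input)), so it agrees with log2() only for powers of two. A small sketch of a guard, assuming FFT lengths are always meant to be powers of two; `is_power_of_two` and `fft_log2` are hypothetical helpers, not part of the sample:

```c
#include <assert.h>

/* Same loop as in the post: floor(log2(input)) for input >= 1. */
static inline int int_log2(int input)
{
    int i = 0;
    while (input >>= 1) i++;
    return i;
}

/* Hypothetical helper: true only for 1, 2, 4, 8, ... */
static inline int is_power_of_two(int n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

/* Hypothetical wrapper: exact log2 for FFT lengths. It asserts on any
   length that int_log2 would silently round down. */
int fft_log2(int n)
{
    assert(is_power_of_two(n));
    return int_log2(n);
}
```

For the sizes discussed in this thread, `fft_log2(2048)` gives 11 and `fft_log2(32768)` gives 15, while a non-power-of-two length such as 1000 would trip the assertion instead of quietly being treated as 512.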

I have already incorporated the needed FFT call into my app, where FFTW was used before.
Relevant places are:

fft plan init:

#if USE_FFTW
    wisdom.load();
    fp = fftwf_plan_dft_1d(2048/*fft_len*/, data, data, FFTW_FORWARD, FFTW_MEASURE);
#endif
#if USE_OPENCL //R: OpenCL related FFT
    clFFT_Dim3 n;
    n.x = 2048; //fft_len;
    n.y = n.z = 1;
    cl_int err = CL_SUCCESS;
    plan = clFFT_CreatePlan( context, n, clFFT_1D, clFFT_InterleavedComplexFormat, &err );
    if(!plan || err)
    {
        fprintf(stderr,"ERROR: clFFT_CreatePlan failed\n");
        exit(0);
    }
#endif

fft call:

#elif USE_OPENCL
    cl_int err = CL_SUCCESS;
    data_in = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2048/*fft_len*/*sizeof(float)*2, data, &err);
    if(!data_in)
    {
        fprintf(stderr,"ERROR: clCreateBuffer failed\n");
        goto cleanup;
    }
    data_out = data_in; //R: in-place transform for now
    err |= clFFT_ExecuteInterleaved(commandQueue, plan, 1, clFFT_Forward, data_in, data_out, 0, NULL, NULL);
    err |= clFinish(commandQueue);
    if(err)
    {
        fprintf(stderr,"ERROR: clFFT_Execute\n");
        goto cleanup;
    }
    err |= clEnqueueReadBuffer(commandQueue, data_out, CL_TRUE, 0, 2048/*fft_len*/*sizeof(float)*2, data, 0, NULL, NULL);
    if(err)
    {
        fprintf(stderr,"ERROR: clEnqueueReadBuffer failed\n");
        goto cleanup;
    }
cleanup:
    if(data_in)
        clReleaseMemObject(data_in);
#elif USE_FFTW
    fftwf_execute(fp);

Raistmer · ‎01-20-2010

More info about the issue:
I just completed a size 2048 FFT using oclFFT on the Q9450 CPU device instead of the HD4870 ATI GPU device.
The results are very similar to the FFTW ones!
That is, IMO nothing is wrong with the code per se; something goes wrong specifically when it is executed on ATI's GPU!
ATI OpenCL crew, your turn.

MicahVillmow · ‎01-20-2010

Raistmer,
This is a known issue that we are working on a fix for.

Raistmer · ‎01-20-2010

Originally posted by: MicahVillmow

Raistmer,

This is a known issue that we are working on a fix for.

Ah, thanks for the info. Please keep me informed on progress; you know how badly I need FFT for ATI GPUs.

Raistmer · ‎02-15-2010

Originally posted by: MicahVillmow

Raistmer,

This is a known issue that we are working on a fix for.

The GPU build still doesn't work under SDK 2.01 either.
The CPU one works OK with SDK 2.01, as it did with SDK 2.0.

Tristan23 · ‎02-16-2010

Originally posted by: MicahVillmow Raistmer, This is a known issue that we are working on a fix for.

This would be very much appreciated - since there's a huge community out there waiting for this: people running Seti@home.

Currently the vast majority of them are using nVidia cards.

This could be an excellent chance for ATI to get new customers.

Regards,

Tristan

genaganna · ‎02-16-2010

This issue is fixed internally. The upcoming release includes this fix.

gapon · ‎02-16-2010

I downloaded Apple's OpenCL FFT and ported it to Linux a month ago, so I had a chance to try the code on both an nVidia C1060 and an AMD HD5870, and I'm seeing a number of issues with this code. In my tests I was only interested in 2D FFTs of relatively large images (around 1024x1024).

The first observation was made on the nVidia C1060. It turns out that the OpenCL FFT implementation is 2-3 times slower than CUFFT (depending on the problem size). I presume this is a general problem of Apple's OpenCL FFT implementation.

The second issue: when I moved my tests to SDK 2.01 & Ubuntu 9.04 & HD5870, the performance got even worse, which was a big surprise to me as I was expecting the opposite. In particular, Apple's OpenCL FFT was 8x slower on 512x512 images on the HD5870 (AMD Streams SDK 2.01) compared with the same algorithm run on the C1060.

The next problem became a real show-stopper for me. In my SDK 2.01 & HD5870 tests I could not test 1024x1024 or anything bigger due to an apparent hard kernel lockup happening within clFlush or clFinish! Interestingly enough, SDK 2.00 had a similar lockup with smaller images of size 512x512. Is there any explanation for this?

Thanks!

Tristan23 · ‎02-17-2010

Originally posted by: genaganna

This issue is fixed internally. upcoming release includes this fix.

Can you please tell us when this release will be publicly available?

Would it be possible to have access to a beta version?

genaganna · ‎02-17-2010

I can't give an exact date, but it should be in the next few months.

Tristan23 · ‎02-17-2010

Originally posted by: genaganna

I can't give an exact date but should be in the next few months.

In a few months??? In a few months nVidia's Fermi cards will be available. If I were AMD I would get my sh*t sorted ASAP!

Fr4nz · ‎02-17-2010

I agree, Nvidia is going to massacre ATI in the OpenCL field if ATI doesn't hurry up with the development of their OpenCL implementation and release bugfixes and new features more often...

afo · ‎02-17-2010

Hi,

I think that AMD is not scared of Fermi, because if you look at the performance specifications for the Tesla series, nVidia says that it will have 600 GFlops peak in double precision and that the board will be available in Q2 2010

(http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html)

AMD, by contrast, has its HD5970 today with 928 GFlops peak in double precision

(http://www.amd.com/la/products/desktop/graphics/ati-radeon-hd-5000/hd-5970/Pages/ati-radeon-hd-5970-specifications.aspx)

Of course we agree that ATI's OpenCL implementation is not the most beautiful girl in town right now, but I think that their strategy is to have OpenCL in production state before Fermi's launch.

best regards,

Alfonso

Tristan23 · ‎02-17-2010

> I think that AMD is not scared of Fermi ...

I fear so too - but I'd say they'd better be.

> ... AMD, by contrast, has its HD5970 today with 928 GFlops peak in double precision

GFlops are only theoretical as long as the software/driver sucks.

> ... but I think that their strategy is to have OpenCL in production state before Fermi's launch.

Doesn't look like that's going to happen.

gapon · ‎02-19-2010

Apparently here is a reason why AMD isn't concerned too much:

http://techreport.com/discussions.x/18492

Raistmer · ‎01-22-2010

BTW, I tried to run it on an nVidia GPU and received the following error:

FFT program build log on device GeForce 9400 GT
:248: error: cannot codegen this l-value expression yet
fftKernel16(a, dir);
^~~~~~~~~~~

Raistmer · ‎01-23-2010

After this correction the 32k FFT works fine on the GT9400.
That is, CPU & nVidia GPUs are OK, the ATI GPU is still in question. Please fix the issue ASAP.

#if USE_OPENCL_NV
"float2 complexMul(float2 a,float2 b) { return (float2)(mad(-(a).y, (b).y, (a).x * (b).x), mad((a).y, (b).x, (a).x * (b).y));}\n"
#else
"#define complexMul(a,b) ((float2)(mad(-(a).y, (b).y, (a).x * (b).x), mad((a).y, (b).x, (a).x * (b).y)))\n"
#endif
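Both variants of complexMul compute the ordinary complex product; the only difference is that the nVidia branch makes it a real function, so each argument is evaluated once, instead of a macro that expands its arguments textually. A host-side sketch of the same arithmetic in plain C, for reference only: the `cfloat` struct is a hypothetical stand-in for OpenCL's float2, and plain multiply/add is used in place of mad(), so rounding may differ slightly from the kernel.

```c
#include <assert.h>

/* Hypothetical stand-in for OpenCL's float2: x = real part, y = imaginary part. */
typedef struct { float x, y; } cfloat;

/* (a.x + i*a.y) * (b.x + i*b.y)
   real part: a.x*b.x - a.y*b.y
   imag part: a.x*b.y + a.y*b.x */
cfloat complex_mul(cfloat a, cfloat b)
{
    cfloat r;
    r.x = a.x * b.x - a.y * b.y;
    r.y = a.x * b.y + a.y * b.x;
    return r;
}
```

For example, multiplying 1+2i by 3+4i this way gives -5+10i, which is a quick sanity value to compare against when checking a kernel-side complexMul.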

Raistmer · ‎04-17-2010

There are already many owners of 5xxx cards who want to run my app, but without the promised SDK update they can't produce valid results with their GPUs.
A few months have passed already; now we are approaching "many months" territory.

When can we expect the new SDK release? Or could I at least get some kind of hotfix for the described issue?

omkaranathan · ‎04-20-2010

Raistmer,

The new SDK is going to be released soon.

fulcrum_xyz · ‎10-07-2010

hi

it would be really great if you could post your ported OpenCL FFT code...

thanks

Raistmer · ‎10-07-2010

The new SDK works with the default parameter values.
The updated oclFFT sample can be obtained here:
http://developer.apple.com/lib...troduction/Intro.html

fulcrum_xyz · ‎10-08-2010

Thanks Raistmer, I have the Apple version... and am currently porting it to run on my OpenSUSE 11.2.

So I was wondering if you had already ported it to Linux (a non-MacOS version) and if you could share that?

thanks again...

P.S.: I have taken a look at the OpenCL SDK FFT sample; it seems to be very preliminary and supports only minimal parameters (only 1D, no batching, no complex)...

Raistmer · ‎10-08-2010

The SDK sample is actually just not worth mentioning. It's hardwired to a single FFT size; it's just a technique demonstration, not a useful piece of FFT code.
A usable FFT was promised in the next SDK release, we'll see.

About the Linux port: there was an attempt with the earlier buggy SDK (2.0), and as far as I can remember it worked even better than the Windows part. So there should be no problems on Linux with the current SDK.
With SDK 2.0 the default base radix of 128 failed, so a value of 32 was used. But currently I see better performance on a 1D 32k-size transform with the old value of 128 (and it works).
A smaller base radix of 32 is better suited for an app that uses 1D FFTs of different sizes from 8 to 128k.
There are a few parameters to play with. I use an HD4870 GPU, obsolete hardware from AMD's point of view, so someone with a newer HD5xxx card could see a different performance optimum.

fulcrum_xyz · ‎10-09-2010

hey thanks for the info...

I wanted to benchmark some (mostly 2^x) 2D FFTs on OpenCL on the GPU.

On the NVIDIA cards, I think we can safely assume that the performance with OpenCL is <= CUFFT performance (~20-40%). I am not sure if NVIDIA is even thinking of an OpenCL version of their library anytime soon...

But with the ATI cards it's not at all clear... so I was looking to get an estimate for the same (it would also be great if someone from AMD could fill us in if they have any information in this regard...)

So I've concluded that porting the Apple OpenCL FFT and benchmarking it on both kinds of hardware is the best way to go (given the lack of any further info...)

Raistmer · ‎10-09-2010

You may also find this article helpful:
http://www.bealto.com/gpu-fft_ref.html


porting OpenCL_FFT Apple's sample to ATI GPU