I am trying to use the code from Apple's OpenCL_FFT sample for OS X to run an FFT on an ATI GPU.

OpenCL_FFT

As a correctness check, I use the FFTW CPU library to compute the FFT of the same data.

I need an FFT size of 32k (32768). The results from oclFFT are completely different from the FFTW results (they differ already in the first digit).

Then I started trying other FFT sizes to check whether the sample was built correctly at all, and found that sizes up to 1024 (I tried 32 and 1024) compute just fine. The results match in the first 4 or more digits; the small differences beyond that are probably due to different rounding.
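To put a number on "the same for 4 or more digits", the comparison can be sketched roughly as below. This is only a minimal sketch: the function name `max_rel_error` is made up for illustration, and it assumes both results are stored as interleaved single-precision complex data (re, im, re, im, ...), which may not match the layout actually used in the test.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in libm. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical helper: largest per-component difference between two
 * interleaved complex arrays, divided by the largest component magnitude
 * in the reference. n is the number of complex points, so each array
 * holds 2*n floats. Assumes interleaved float data (an assumption). */
double max_rel_error(const float *test, const float *ref, size_t n)
{
    assert(test != NULL && ref != NULL && n > 0);

    double max_diff = 0.0;
    double max_ref = 0.0;
    for (size_t i = 0; i < 2 * n; ++i) {
        double diff = abs_d((double)test[i] - (double)ref[i]);
        double mag = abs_d((double)ref[i]);
        if (diff > max_diff) max_diff = diff;
        if (mag > max_ref) max_ref = mag;
    }
    /* If the reference is all zeros, fall back to the absolute error. */
    return max_ref > 0.0 ? max_diff / max_ref : max_diff;
}
```

With single-precision data, a value around 1e-4 or smaller would correspond to the "4 or more matching digits" seen for sizes up to 1024, while a value near 1 means the output is unrelated to the reference.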

But bigger sizes are completely broken. For example, with a size of 2048, oclFFT changes only the first 8 elements of the input array; then the input data is left unchanged until index 128, where 8 elements are changed again, then unchanged data again, then changes at index 384, and so on. The changed elements are nothing like the FFTW results in this case (the first digit differs).
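The pattern of touched elements can be listed automatically by comparing a saved copy of the input with the in-place output. The following is a minimal sketch only: `print_changed_runs` is a made-up name, and interleaved single-precision complex data is assumed. A point whose correct output happens to equal its input would be counted as unchanged, so the listing is a diagnostic aid and not proof of anything.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: prints every run of consecutive complex points
 * whose value differs between `before` (saved copy of the input) and
 * `after` (buffer read back from the device). Returns the number of
 * runs. n is the number of complex points; data is assumed interleaved. */
size_t print_changed_runs(const float *before, const float *after, size_t n)
{
    size_t runs = 0;
    size_t i = 0;

    while (i < n) {
        int changed = before[2 * i] != after[2 * i] ||
                      before[2 * i + 1] != after[2 * i + 1];
        if (!changed) {
            ++i;
            continue;
        }
        /* Start of a run: advance until the first unchanged point. */
        size_t start = i;
        while (i < n && (before[2 * i] != after[2 * i] ||
                         before[2 * i + 1] != after[2 * i + 1])) {
            ++i;
        }
        printf("changed points [%zu, %zu), length %zu\n", start, i, i - start);
        ++runs;
    }
    return runs;
}
```

The start indices and lengths of the runs show which points the kernels actually wrote, which can then be compared with the work-group and work-item layout the plan is supposed to use.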

Something seems to be wrong with the kernel sequence that is used for sizes bigger than 1024.

But no errors are reported.

Can someone experienced in OpenCL please look at the sample's code for clues as to why it works for small FFT sizes and breaks above a size of 1024? Help needed.

P.S. I tried to run it on an HD4870.

P.P.S.

From the FFT plan setup for oclFFT:

plan->max_localmem_fft_size = 2048;

plan->max_work_item_per_workgroup = 256;

plan->max_radix = 16;

plan->min_mem_coalesce_width = 16;

plan->num_local_mem_banks = 16;

Can something here be so wrong for an ATI GPU that sizes of 2048 and above fail?
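One way to test that suspicion is to compare the hard-coded plan values with what the device itself reports. The sketch below is plain C and deliberately does not call OpenCL: it assumes the two device numbers were already obtained with `clGetDeviceInfo` using `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_LOCAL_MEM_SIZE`. The struct and function names are made up for illustration, and the local-memory estimate (at least one float per point) is my own crude lower bound, not something taken from the sample.

```c
#include <stddef.h>

/* Hypothetical mirror of two of the plan fields quoted above. */
typedef struct {
    size_t max_localmem_fft_size;        /* in complex points */
    size_t max_work_item_per_workgroup;  /* work-items per work-group */
} plan_limits;

/* Values assumed to come from clGetDeviceInfo on the target device. */
typedef struct {
    size_t max_work_group_size;          /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
    unsigned long long local_mem_bytes;  /* CL_DEVICE_LOCAL_MEM_SIZE */
} device_limits;

enum {
    PLAN_OK = 0,
    PLAN_WORKGROUP_TOO_LARGE = 1,
    PLAN_LOCALMEM_TOO_LARGE = 2
};

/* Returns a bitmask of the plan assumptions the device cannot satisfy. */
int check_plan(const plan_limits *p, const device_limits *d)
{
    int flags = PLAN_OK;

    if (p->max_work_item_per_workgroup > d->max_work_group_size) {
        flags |= PLAN_WORKGROUP_TOO_LARGE;
    }
    /* Assumption: a local-memory FFT stages at least one float per
     * point, so this is a lower bound on the bytes it needs. */
    unsigned long long needed =
        (unsigned long long)p->max_localmem_fft_size * sizeof(float);
    if (needed > d->local_mem_bytes) {
        flags |= PLAN_LOCALMEM_TOO_LARGE;
    }
    return flags;
}
```

If the device reports a maximum work-group size below 256, or less local memory than the plan assumes, that would be one candidate explanation for the larger sizes failing while smaller ones work. It would still be worth checking the return code of every kernel enqueue, since a launch the device rejects can otherwise go unnoticed.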

Can you post the ported sample?