26 Replies Latest reply on Oct 9, 2010 8:53 AM by Raistmer

    porting OpenCL_FFT Apple's sample to ATI GPU

    Raistmer
      Only FFTs up to size 1024 compute correctly on the GPU, but bigger sizes work on the CPU!

      I am trying to use the code from Apple's OpenCL_FFT sample for OS X to get FFT running on an ATI GPU.
      As a correctness check I use the FFTW CPU library to compute the FFT of the same data.

      I need an FFT of size 32k (32768). The results from oclFFT are completely different from the FFTW results (they differ in the first digit).
      Then I tried other FFT sizes to check whether the sample was built correctly at all, and found that sizes up to 1024 (I tried 32 and 1024) compute just fine. The results agree to 4 or more leading digits; the small errors beyond that probably come from different rounding.
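A check like the one described above ("results agree to 4 or more leading digits") can be made numeric. The following is only a minimal sketch, not code from the thread: `max_rel_error` and `abs_f` are hypothetical helper names, and both result arrays are assumed to hold interleaved complex floats, as in the oclFFT and FFTW calls shown later.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in <math.h>. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/*
 * Largest difference between a test result and a reference result,
 * scaled by the largest magnitude found in the reference.
 * Both arrays hold n complex points as interleaved floats
 * (re, im, re, im, ...), so each has 2*n floats.
 * Scaling by the peak keeps near-zero bins from inflating the ratio.
 */
static float max_rel_error(const float *test, const float *ref, size_t n)
{
    assert(test != NULL && ref != NULL && n > 0);
    float peak = 0.0f, worst = 0.0f;
    for (size_t i = 0; i < 2 * n; i++) {
        if (abs_f(ref[i]) > peak) peak = abs_f(ref[i]);
    }
    if (peak == 0.0f) peak = 1.0f; /* all-zero reference: use absolute error */
    for (size_t i = 0; i < 2 * n; i++) {
        float d = abs_f(test[i] - ref[i]) / peak;
        if (d > worst) worst = d;
    }
    return worst;
}
```

A value around 1e-4 or smaller would roughly correspond to agreement in the first four digits; a value near 1 means the outputs are unrelated.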
      But bigger sizes are completely broken. For example, with a size of 2048 oclFFT changes only the first 8 elements of the input array, followed by unchanged input data; at index 128 there are some changes (again 8 elements), then unchanged data again, then changes at index 384, and so on. The changed elements are in no way similar to the FFTW results in this case (the first digit differs).
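The pattern described above (a few changed elements, then long stretches of untouched input) can be located automatically by keeping a copy of the input and comparing it with the buffer read back from the device. This is a sketch only; `report_changed_runs` is a hypothetical helper, not part of the sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Compare the buffer before and after the transform and print each run of
 * floats whose value changed, as a half-open index range.
 * Returns the number of runs found.
 */
static size_t report_changed_runs(const float *before, const float *after,
                                  size_t count)
{
    assert(before != NULL && after != NULL);
    size_t runs = 0;
    size_t i = 0;
    while (i < count) {
        if (before[i] == after[i]) { i++; continue; }
        size_t start = i;
        while (i < count && before[i] != after[i]) i++;
        printf("changed: [%zu, %zu)\n", start, i);
        runs++;
    }
    return runs;
}
```

For a correct in-place FFT of typical data, nearly every element changes, so one would expect a single run covering the whole buffer; many short runs point at work-groups that never wrote their output.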

      Something is wrong with the kernel sequence that is used for sizes bigger than 1024.
      But no errors are reported.

      Can someone experienced in OpenCL please look at the sample's code for clues as to why it works for small FFT sizes and breaks above size 1024? Help needed.

      P.S. I tried to run this on an HD4870.
      P.P.S.
      From the FFT plan setup for oclFFT:
      plan->max_localmem_fft_size = 2048;
      plan->max_work_item_per_workgroup = 256;
      plan->max_radix = 16;
      plan->min_mem_coalesce_width = 16;
      plan->num_local_mem_banks = 16;
      Can something here be so wrong for an ATI GPU that sizes of 2048 and above fail?
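One way to approach the question above is to compare the hard-coded plan limits with what the device itself reports, for example the values returned by `clGetDeviceInfo` for `CL_DEVICE_LOCAL_MEM_SIZE` and `CL_DEVICE_MAX_WORK_GROUP_SIZE`. The sketch below only does the arithmetic. `plan_fits_device` and `complex_fft_bytes` are hypothetical helpers, and it is an assumption here that `max_localmem_fft_size` counts complex single-precision points, which makes the byte figure an upper-bound estimate rather than what the generated kernels necessarily allocate.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to hold n complex single-precision points (re + im). */
static size_t complex_fft_bytes(size_t n)
{
    return n * 2 * sizeof(float);
}

/*
 * Hypothetical sanity check (not part of the Apple sample): returns 1 when
 * the plan limits stay within the limits the device reports, 0 otherwise.
 * ASSUMPTION: max_localmem_fft_size counts complex points, so the estimate
 * is an upper bound on the local memory one work-group might need.
 */
static int plan_fits_device(size_t max_localmem_fft_size,
                            size_t max_work_item_per_workgroup,
                            size_t device_local_mem_bytes,
                            size_t device_max_work_group_size)
{
    assert(max_localmem_fft_size > 0 && max_work_item_per_workgroup > 0);
    if (complex_fft_bytes(max_localmem_fft_size) > device_local_mem_bytes)
        return 0;
    if (max_work_item_per_workgroup > device_max_work_group_size)
        return 0;
    return 1;
}
```

With the values quoted above, 2048 complex points come to 16384 bytes, so the defaults would only fit a device reporting at least 16 KB of local memory and a work-group size of at least 256. Whether that has anything to do with the failure on the HD4870 is not established in this thread.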
        • porting OpenCL_FFT Apple's sample to ATI GPU
          n0thing

          Can you post the ported sample?

          • porting OpenCL_FFT Apple's sample to ATI GPU
            Raistmer
            In the sample itself mostly only main.cpp changed; the device initialization was replaced by the equivalent code from the TemplateC sample.
            fft_setup.cpp is unchanged.
            In the other files, where required, all log2() calls were replaced with int_log2() calls, where:
            inline int int_log2(int input) {
                int i = 0;
                while (input >>= 1) i++;
                return i;
            }
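A note on the replacement above: it returns the floor of the base-2 logarithm, which equals log2() exactly only for power-of-two sizes. The sketch below restates the function in plain C with a guard and adds a hypothetical `is_power_of_two` helper (not from the sample) for checking that precondition.

```c
#include <assert.h>

/* Integer replacement for log2(): returns floor(log2(input)) for input > 0. */
static int int_log2(int input)
{
    /* Zero has no logarithm, and on most compilers a negative value
       would never shift down to zero, so reject both. */
    assert(input > 0);
    int i = 0;
    while (input >>= 1) i++;
    return i;
}

/* 1 when n is a power of two, so that int_log2(n) is the exact logarithm. */
static int is_power_of_two(int n)
{
    return n > 0 && (n & (n - 1)) == 0;
}
```

For the sizes discussed in this thread, int_log2(2048) is 11 and int_log2(32768) is 15; a non-power-of-two such as 1000 gives 9.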

            I have already incorporated the needed FFT call into my app, where FFTW was used before.
            The relevant places are:

            fft plan init:

            #if USE_FFTW
                wisdom.load();
                fp = fftwf_plan_dft_1d(2048/*fft_len*/, data, data, FFTW_FORWARD, FFTW_MEASURE);
            #endif
            #if USE_OPENCL // OpenCL related FFT
                clFFT_Dim3 n;
                n.x = 2048; // fft_len
                n.y = n.z = 1;
                cl_int err = CL_SUCCESS;
                plan = clFFT_CreatePlan(context, n, clFFT_1D, clFFT_InterleavedComplexFormat, &err);
                if (!plan || err)
                {
                    fprintf(stderr, "ERROR: clFFT_CreatePlan failed\n");
                    exit(1); // non-zero exit status so the failure is visible to the caller
                }
            #endif

            fft call:

            #elif USE_OPENCL
                cl_int err = CL_SUCCESS;
                // Device buffer initialized from the host array: 2 floats (re, im) per point
                data_in = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2048/*fft_len*/*sizeof(float)*2, data, &err);
                if (!data_in)
                {
                    fprintf(stderr, "ERROR: clCreateBuffer failed\n");
                    goto cleanup;
                }
                data_out = data_in; // R: in-place transform for now
                err |= clFFT_ExecuteInterleaved(commandQueue, plan, 1, clFFT_Forward, data_in, data_out, 0, NULL, NULL);
                err |= clFinish(commandQueue);
                if (err)
                {
                    fprintf(stderr, "ERROR: clFFT_Execute\n");
                    goto cleanup;
                }
                // Blocking read of the result back into the host array
                err |= clEnqueueReadBuffer(commandQueue, data_out, CL_TRUE, 0, 2048/*fft_len*/*sizeof(float)*2, data, 0, NULL, NULL);
                if (err)
                {
                    fprintf(stderr, "ERROR: clEnqueueReadBuffer failed\n");
                    goto cleanup;
                }
            cleanup:
                if (data_in)
                    clReleaseMemObject(data_in);
            #elif USE_FFTW
                fftwf_execute(fp);
            • porting OpenCL_FFT Apple's sample to ATI GPU
              Raistmer
              More info about the issue:
              I just completed a size-2048 FFT using oclFFT on the Q9450 CPU device instead of the HD4870 ATI GPU device.
              The results are very similar to the FFTW ones!
              That is, IMO nothing is wrong with the code per se; something goes wrong when it is executed on ATI's GPU specifically!
              ATI OpenCL crew, your turn.
              • porting OpenCL_FFT Apple's sample to ATI GPU
                MicahVillmow
                Raistmer,
                This is a known issue that we are working on a fix for.
                  • porting OpenCL_FFT Apple's sample to ATI GPU
                    Raistmer
                    Originally posted by: MicahVillmow

                    Raistmer,

                    This is a known issue that we are working on a fix for.


                    Ah, thanks for the info. Please keep me informed on progress; you know how badly I need FFT for ATI GPUs.
                    • porting OpenCL_FFT Apple's sample to ATI GPU
                      Raistmer
                      Originally posted by: MicahVillmow

                      Raistmer,

                      This is a known issue that we are working on a fix for.


                      The GPU build still doesn't work under SDK 2.01 either.
                      The CPU one works OK with SDK 2.01, as it did with SDK 2.0.
                      • porting OpenCL_FFT Apple's sample to ATI GPU
                        Tristan23

                         

                        Originally posted by: MicahVillmow Raistmer, This is a known issue that we are working on a fix for.


                        This would be very much appreciated, since there's a huge community out there waiting for this: people running SETI@home.

                        Currently the vast majority of them are using nVidia cards.

                        This could be an excellent chance for ATI to get new customers.

                         

                        Regards,

                        Tristan

                          • porting OpenCL_FFT Apple's sample to ATI GPU
                            genaganna

                             

                            Originally posted by: Tristan23
                            Originally posted by: MicahVillmow Raistmer, This is a known issue that we are working on a fix for.

                            This would be very much appreciated - since there's huge community out there waiting for this: People running Seti@home

                            Currently the vast majority of them is using nVidia cards.

                            This could be an excellent chance for ATI to get new customers.


                            This issue is fixed internally. The upcoming release includes this fix.

                              • porting OpenCL_FFT Apple's sample to ATI GPU
                                gapon

                                I downloaded Apple's OpenCL FFT and ported it to Linux a month ago, so I had a chance to try the code on both an nVidia C1060 and an AMD HD5870. I'm seeing a number of issues with this code. In my tests I was only interested in 2D FFTs of relatively large images (around 1024x1024).

                                The first observation was made on the nVidia C1060. It turns out that the OpenCL FFT implementation is 2-3 times slower than CUFFT (depending on the problem size). I presume this is a general problem of Apple's OpenCL FFT implementation.

                                The second issue: when I moved my tests to SDK 2.01, Ubuntu 9.04 and the HD5870, the performance got even worse, which was a big surprise to me as I was expecting the opposite. In particular, Apple's OpenCL FFT was 8x slower on 512x512 images on the HD5870 (AMD Stream SDK 2.01) compared with the same algorithm run on the C1060.

                                The next problem became a real show-stopper for me. In my SDK 2.01 and HD5870 tests I could not test 1024x1024 or anything bigger, due to an apparent hard kernel lockup happening within clFlush or clFinish! Interestingly enough, SDK 2.00 had a similar lockup with smaller images of size 512x512. Is there any explanation for this?

                                Thanks!

                                • porting OpenCL_FFT Apple's sample to ATI GPU
                                  Tristan23

                                   

                                  Originally posted by: genaganna

                                  This issue is fixed internally. upcoming release includes this fix.


                                  Can you please tell us when this release will be publicly available?

                                  Would it be possible to have access to a beta version?

                            • porting OpenCL_FFT Apple's sample to ATI GPU
                              Raistmer
                              BTW, I tried to run it on an nVidia GPU and received the following error:

                              FFT program build log on device GeForce 9400 GT
                              :248: error: cannot codegen this l-value expression yet
                              fftKernel16(a, dir);
                              ^~~~~~~~~~~

                              • porting OpenCL_FFT Apple's sample to ATI GPU
                                Raistmer
                                After the following correction the 32k FFT works fine on the GeForce 9400 GT.
                                That is, the CPU and nVidia GPUs are OK; the ATI GPU is still in question. Please fix the issue ASAP.

                                #if USE_OPENCL_NV
                                "float2 complexMul(float2 a,float2 b) { return (float2)(mad(-(a).y, (b).y, (a).x * (b).x), mad((a).y, (b).x, (a).x * (b).y));}\n"
                                #else
                                "#define complexMul(a,b) ((float2)(mad(-(a).y, (b).y, (a).x * (b).x), mad((a).y, (b).x, (a).x * (b).y)))\n"
                                #endif
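For reference, both variants above compute the ordinary complex product (a.x*b.x - a.y*b.y, a.x*b.y + a.y*b.x). A host-side sketch in plain C can be used to cross-check kernel output; `cfloat2`, `complex_mul` and `check_complex_mul` are hypothetical names, and plain multiply-add stands in for OpenCL's mad() here, so the last bits of rounding may differ from the device.

```c
#include <assert.h>

/* Host-side stand-in for OpenCL's float2. */
typedef struct { float x, y; } cfloat2;

/* Same arithmetic as the complexMul kernel helper, term by term. */
static cfloat2 complex_mul(cfloat2 a, cfloat2 b)
{
    cfloat2 r;
    r.x = -a.y * b.y + a.x * b.x; /* real part:      a.x*b.x - a.y*b.y */
    r.y =  a.y * b.x + a.x * b.y; /* imaginary part: a.x*b.y + a.y*b.x */
    return r;
}

/* Quick self-check: (1 + 2i) * (3 + 4i) = -5 + 10i. */
static void check_complex_mul(void)
{
    cfloat2 a = {1.0f, 2.0f}, b = {3.0f, 4.0f};
    cfloat2 p = complex_mul(a, b);
    assert(p.x == -5.0f && p.y == 10.0f);
}
```

The function form evaluates each argument once, which is presumably why it sidesteps the l-value complaint from the nVidia compiler, although the build log above does not say so explicitly.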
                                • porting OpenCL_FFT Apple's sample to ATI GPU
                                  Raistmer
                                  There are already many owners of 5xxx cards who want to run my app, but without the promised SDK update they can't produce valid results with their GPUs.
                                  A few months have passed already; now we are approaching "many months" territory.
                                  When can we expect the new SDK release? Or can I at least get some kind of hotfix for the described issue?
                                    • porting OpenCL_FFT Apple's sample to ATI GPU
                                      omkaranathan

                                      Raistmer,

                                      The new SDK is going to be released soon.

                                        • porting OpenCL_FFT Apple's sample to ATI GPU
                                          fulcrum_xyz

                                          hi

                                          it would be really great if you could post your ported OpenCL FFT code...

                                          thanks

                                            • porting OpenCL_FFT Apple's sample to ATI GPU
                                              Raistmer
                                              Originally posted by: fulcrum_xyz

                                              hi

                                              it would be really great if you could post your ported OpenCL FFT code...

                                              thanks


                                              The new SDK works with the default parameter values.
                                              The updated oclFFT sample can be obtained here:
                                              http://developer.apple.com/lib...troduction/Intro.html

                                                • porting OpenCL_FFT Apple's sample to ATI GPU
                                                  fulcrum_xyz

                                                  Thanks Raistmer, I have the Apple version and am currently porting it to run on my openSUSE 11.2.

                                                  So I was wondering if you had already ported it to Linux (a non-Mac OS version) and if you could share that?

                                                  thanks again...

                                                  P.S.: I have taken a look at the OpenCL SDK FFT sample; it seems to be very preliminary and supports very minimal parameters (only 1D, no batching, no complex)...

                                                    • porting OpenCL_FFT Apple's sample to ATI GPU
                                                      Raistmer
                                                      Originally posted by: fulcrum_xyz

                                                      Thanks Raistmer, I have the apple version...and currently porting it to run on my OpenSUSE 11.2.

                                                      So, I was wondering if you had already ported it to a linux (non MacOS version) and if you could share that ?

                                                      thanks again...

                                                      P.S: I have taken a look at the OpenCL SDK FFT sample, that seems to be very preliminary and support very minimal parameters (on 1D, no batching, no complex)...


                                                      The SDK sample is just not worth mentioning, actually. It's hardwired to a single FFT size; it's just a technique demonstration, not a useful piece of FFT code.
                                                      A usable FFT was promised for the next SDK release; we'll see.

                                                      About Linux porting: there was an attempt with the earlier, bugged SDK (2.0), and as far as I can remember it worked even better than the Windows part. So there should be no problems on Linux with the current SDK.
                                                      With SDK 2.0 the default base radix of 128 failed, so a value of 32 was used. But currently I see better performance on a 1D 32k-size transform with the old value of 128 (and it works).
                                                      A smaller base radix of 32 is better suited to an app that uses 1D FFTs of different sizes from 8 to 128k.
                                                      There are a few parameters to play with. I use an HD4870 GPU, obsolete hardware from AMD's point of view, so someone with a newer HD5xxx card could see a different performance optimum.
                                                        • porting OpenCL_FFT Apple's sample to ATI GPU
                                                          fulcrum_xyz

                                                          hey thanks for the info...

                                                          i wanted to benchmark some (mostly 2^x) 2D FFTs on OpenCL on the GPU

                                                          On the NVIDIA cards, I think we can safely assume that OpenCL performance is <= CUFFT performance (~20-40%). I am not sure if NVIDIA is even thinking of an OpenCL version of their library anytime soon...

                                                          But with the ATI cards it's not at all clear, so I was looking to get an estimate for the same (it would also be great if someone from AMD could fill us in if they have any information in this regard...).

                                                          So I've concluded that porting the Apple OpenCL FFT and benchmarking it on both kinds of hardware is the best way to go (given the lack of any further info...).

                                                           

                                              • porting OpenCL_FFT Apple's sample to ATI GPU
                                                Raistmer
                                                You may also find this article helpful:
                                                http://www.bealto.com/gpu-fft_ref.html