13 Replies Latest reply on Jul 16, 2013 10:05 AM by ash

    Apple's FFT on AMD cards with C++ OpenCL API

    gdebrecz

      Hi,

       

      I'm trying to run Apple's clFFT on AMD cards, using the OpenCL C++ API to

      create the platform, context, devices and queues. It seems to work only

      with the C API, not with the C++ one.

       

      On Nvidia cards it works with both the C and the C++ API.

       

      Here is the C way of creating the context:

      -----------------------------------------------

          cl_uint numPlatforms;

          cl_platform_id platform = NULL;

       

          err = clGetPlatformIDs(0, NULL, &numPlatforms);

          cout << "clGetPlatformIDS status : " << err << endl;

       

          if (0 < numPlatforms) {

               cl_platform_id* platforms = new cl_platform_id[numPlatforms];

               err = clGetPlatformIDs(numPlatforms, platforms, NULL);

               cout << "clGetPlatformIDs status : " << err << endl;

              platform = platforms[0];

          }

       

          cl_device_id device_ids[16];

          unsigned int num_devices;

       

          err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, device_ids, &num_devices);

          cout << "clGetDeviceIDs err: " << err << endl;

       

          cl_context_properties ctxProps[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties) platform$

          cl_context myContext = clCreateContext(ctxProps, 1, device_ids, NULL, NULL, &err);

          cout << "clCreateContext err: " << err << endl;

      -----------------------------------------------------------------

       

      and this is the C++ way:

       

      ------------------------------------------------------------

      std::vector <cl::Platform> platforms;

       

          err = cl::Platform::get(&platforms);

          cl_context_properties context_properties[3] = {

                CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[0](), 0 };

       

          cl::Context myCtx = cl::Context(CL_DEVICE_TYPE_ALL,

                context_properties, NULL, NULL, &err);

       

         cl_context myContext = myCtx();

      ------------------------------------------------------------------

       

       

      after that I call clFFT in this way:

       

      -------------------------

      clFFT_Dim3 n;

          clFFT_Plan plan;

          cl_uint plan_length;

       

          n.x = 1024;

          n.y = 1;

          n.z = 1;

      cout << "Creating plan" << endl;

       

      plan = clFFT_CreatePlan((cl_context) myContext, n,
                                  clFFT_1D, clFFT_InterleavedComplexFormat, &err);

       

      A couple of questions:

         - When I use the C++ API, the plan creation hangs and the CPU is at 100% load. Has anybody had a similar experience?

         - Is anybody using clFFT with the C++ API?

        

      thanks,

      Gergely

        • Re: Apple's FFT on AMD cards with C++ OpenCL API
          binying

          Apple's clFFT on AMD card? It might be the source of trouble.

          :-P

            • Re: Apple's FFT on AMD cards with C++ OpenCL API
              gdebrecz

              Well, since it is much faster than clAMDFFT... I'd like to use it.

                • Re: Apple's FFT on AMD cards with C++ OpenCL API
                  bragadeesh

                  Much faster? Can you elaborate on that? What problem are you running and what card did you measure this on? What version of apple FFT and clAmdFft are you using? I am particularly interested in your performance comparison statement. AMD's FFT library is very competitive in terms of performance at many problem sizes.

                    • Re: Apple's FFT on AMD cards with C++ OpenCL API
                      gdebrecz

                      Hi, thanks for your answer! Concerning the speed: these are my own measurements, and maybe I was mistaken,

                      but I'm sure that you also have these numbers...

                       

                      However, I would be interested in an answer to my original question! Could you comment on that? Could there be

                      any difference between these two contexts:

                       

                      a.) A context created with the C++ API as cl::Context, where the devices are not listed explicitly but just specified as CL_DEVICE_TYPE_ALL.

                       

                      b.) A context created with the C API as cl_context, where the devices are explicitly enumerated.

                       

                      thanks a lot,

                      Gergely

                        • Re: Apple's FFT on AMD cards with C++ OpenCL API
                          LeeHowes

                          The obvious difference is that in C version you asked for only GPUs but for C++ you asked for ALL, which would give you the CPU device too (and presumably wouldn't on NVIDIA's platform).
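To make the point above concrete, here is a minimal sketch of how the device-type filter behaves. It is a simplified model, not the OpenCL runtime's code: the constants are mirrored from CL/cl.h so that it builds without an OpenCL SDK, and the `Device` struct and `selectDevices` helper are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Values mirrored from CL/cl.h (cl_device_type is a 64-bit bitfield).
using DeviceType = std::uint64_t;
const DeviceType kTypeCpu = 1u << 1;      // CL_DEVICE_TYPE_CPU
const DeviceType kTypeGpu = 1u << 2;      // CL_DEVICE_TYPE_GPU
const DeviceType kTypeAll = 0xFFFFFFFFu;  // CL_DEVICE_TYPE_ALL

// Hypothetical stand-in for a device exposed by a platform.
struct Device {
    std::string name;
    DeviceType type;
};

// Returns the names of the devices whose type matches the requested mask.
std::vector<std::string> selectDevices(const std::vector<Device>& platform,
                                       DeviceType requested) {
    assert(requested != 0);
    std::vector<std::string> names;
    for (const Device& d : platform) {
        if ((d.type & requested) != 0) {
            names.push_back(d.name);
        }
    }
    return names;
}
```

On a platform that exposes both a CPU and a GPU device, requesting `kTypeGpu` selects only the GPU, while `kTypeAll` selects both. In the real C++ API, the corresponding change would be to pass CL_DEVICE_TYPE_GPU instead of CL_DEVICE_TYPE_ALL to the cl::Context constructor.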

                          • Re: Apple's FFT on AMD cards with C++ OpenCL API
                            yurtesen

                            gdebrecz wrote:

                             

                            Hi, thanks for your answer! Concerning the speed: these are my own measurements, and maybe I was mistaken,

                            but I'm sure that you also have these numbers...

                             

                             

                            Would you care to share your measurements and how you measured them? It would be quite interesting to see...

                              • Re: Apple's FFT on AMD cards with C++ OpenCL API
                                gdebrecz

                                OK, I'll try to re-run my benchmarks and post the results. However, could you please comment on my original question:

                                 

                                Could there be any difference between a context created using the C API and the cl_context returned by a C++ API

                                cl::Context (assuming they have identical context properties set in advance)?

                                 

                                thanks a lot,

                                Gergely

                                  • Re: Apple's FFT on AMD cards with C++ OpenCL API
                                    yurtesen

                                    Coincidentally, I was just looking at Apple's FFT recently and it does not seem to perform well or correctly on CPU devices.  Your problem is probably not related to a difference between C/C++ APIs.

                                     

                                    Apple's FFT library's plan creation won't work if the CPU is your only device (it returns an invalid context error, and this is hard coded in the implementation). I guess it might get confused if you have CPU + GPU in your context. I recently tried changing the offending code to accept the CPU device type and got strange results. It is by design: as far as I can see, it is not supposed to work on CPU devices. Use GPU devices only if you want to use Apple's FFT, or else use the clAmdFFT library.

                                     

                                    It is probably your code that does not catch the error returned by the plan creation. Check for CL_SUCCESS != err after calling it; you will find that 'err' holds -34 (CL_INVALID_CONTEXT).
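A minimal sketch of such a check, assuming only the standard library: the numeric values are mirrored from CL/cl.h, while `clErrorName` and `checkCl` are hypothetical helper names and not part of OpenCL or clFFT.

```cpp
#include <cassert>
#include <string>

// Translates a few OpenCL status codes (values as defined in CL/cl.h).
std::string clErrorName(int err) {
    switch (err) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -30: return "CL_INVALID_VALUE";
        case -32: return "CL_INVALID_PLATFORM";
        case -33: return "CL_INVALID_DEVICE";
        case -34: return "CL_INVALID_CONTEXT";
        default:  return "unknown OpenCL error";
    }
}

// Returns true on success; otherwise writes a description of the failed step.
bool checkCl(int err, const std::string& step, std::string* message) {
    assert(message != nullptr);
    if (err == 0) {
        message->clear();
        return true;
    }
    *message = step + " failed: " + clErrorName(err) +
               " (" + std::to_string(err) + ")";
    return false;
}
```

Calling `checkCl(err, "clFFT_CreatePlan", &msg)` right after the plan creation, and stopping when it returns false, would surface an invalid context error instead of letting the program continue with an unusable plan.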

                                     

                                    This probably works on Nvidia devices because you only get GPU devices if you use the Nvidia OpenCL platform, so no error is raised there.

                                     

                                    I recommend sticking to AMD's FFT library if you want to run on both CPU and GPU. I also recommend AMD hardware in that case, since the library is probably optimized for AMD cards. I have not used the FFT library directly, but I ran some tests on Tesla and Tahiti cards with clAmdBlas, and clAmdBlas was superior to cuBlas (granted, AMD's GCN Tahiti is better hardware than Nvidia's Tesla solutions, but still). Based on my experience with clAmdBlas, I would be surprised if clAmdFFT did not perform well.

                                      • Re: Apple's FFT on AMD cards with C++ OpenCL API
                                        gdebrecz

                                        Hi,

                                         

                                        Thanks for pointing all of this out. I re-ran my performance test:

                                         

                                        gdebrecz@xxxxxx:~/amdfft$ lspci | grep -i radeon

                                        04:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI Cayman XT [Radeon HD 6970]

                                         

                                        here are some timing tests for C2C ForwardFFT as a function of the length (power of 2)

                                         

                                        log2(N)         N   fft / sec   ms / fft
                                              7       128     339.259     2.9476
                                              8       256     519.462    1.92507
                                              9       512     392.807    2.54578
                                             10      1024     303.491    3.29499
                                             11      2048     275.528     3.6294
                                             12      4096      138.66    7.21186
                                             13      8192       163.3     6.1237
                                             14     16384     136.564    7.32257
                                             15     32768     181.027    5.52405
                                             16     65536     227.177    4.40185
                                             17    131072     113.805    8.78693
                                             18    262144     121.572    8.22555
                                             19    524288     97.6293    10.2428
                                             20   1048576     101.543    9.84801
                                             21   2097152     80.9414    12.3546
                                             22   4194304       78.31    12.7698
                                             23   8388608     40.6493    24.6007
                                             24  16777216     29.2014     34.245

                                         

                                         

                                        Are these numbers reasonable, or am I doing something wrong? I create a context with only one GPU device in it, yet

                                        during the test I see the CPU running at 100%. Is it possible that it runs on the CPU? And if so, why is it so slow?
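One way to sanity-check figures like these is to convert them into a rough arithmetic rate. The sketch below assumes the common benchmarking convention of counting 5 N log2(N) floating-point operations per complex transform of length N; that convention, and the helper names `msPerFft` and `estimatedGflops`, are assumptions introduced here and do not come from the test program.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Converts a throughput in transforms per second to milliseconds per transform.
double msPerFft(double fftPerSec) {
    assert(fftPerSec > 0.0);
    return 1000.0 / fftPerSec;
}

// Rough GFLOP/s estimate using the 5 * N * log2(N) operation-count convention.
double estimatedGflops(std::size_t n, double fftPerSec) {
    assert(n > 1 && fftPerSec > 0.0);
    const double size = static_cast<double>(n);
    const double flopsPerFft = 5.0 * size * std::log2(size);
    return flopsPerFft * fftPerSec / 1e9;
}
```

With the N = 1024 row (303.491 fft / sec), `msPerFft` reproduces the reported 3.29499 ms and `estimatedGflops` gives roughly 0.016 GFLOP/s. With the N = 16777216 row (29.2014 fft / sec), the estimate is roughly 58.8 GFLOP/s.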

                                         

                                        Thanks again for your help and answers.

                                        Here are the relevant pieces of code from the test:

                                         

                                          clAmdFftSetupData  fftSetupData;

                                          clAmdFftPlanHandle fftPlan;

                                          clAmdFftDim fftDim = CLFFT_1D;

                                         

                                          clAmdFftInitSetupData(&fftSetupData);  // fill the struct with defaults first

                                          clAmdFftSetup(&fftSetupData);          // then initialize the library with it

                                        .

                                        .

                                        .

                                          cl::Buffer * d_src = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);

                                          if (err != 0 ) { std::cout << "Error creating buffer1. Exiting. Error code: " << err << endl;  return -1;}

                                          cl::Buffer * d_dest = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);

                                          if (err != 0 ) { std::cout << "Error creating buffer2. Exiting. Error code: " << err << endl;  return -1;}

                                        .

                                        .

                                        .

                                        for (.....

                                        clAmdFftEnqueueTransform(fftPlan, CLFFT_FORWARD, 1,

                                                    &myRuntimeEnv.appQueues[0](), 0, NULL, NULL, &(*d_src)(), &(*d_dest)(), NULL);

                                        }

                                         

                                        Gergely

                          • Re: Apple's FFT on AMD cards with C++ OpenCL API
                            Micha_M

                            Hi,

                            I am interested in OpenCL source code that runs Apple's FFT using Visual Studio 2010 on Windows 7.

                            Does anyone have a link to it, or the source code itself?

                             

                            Thanks,

                            Micha

                            • Re: Apple's FFT on AMD cards with C++ OpenCL API
                              ash

                              Hi,

                              I'm trying to use Apple's FFT library (forward/inverse) in my project. I'm on Linux with an Nvidia GPU, and I found some code on GitHub that adapts Apple's sample to Linux. It compiles fine.

                              So I added the sample's files to my program, and that also compiles. However, at run time, when I call the createPlan function, I get this error:

                              undefined symbol: _Z5FFT1DP11cl_fft_plan12kernel_dir_t

                              Did you encounter this error too? I hope you'll be able to help. Getting this to work would be nice progress for me.
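A first step with an error like this is to demangle the symbol to see which function the loader could not find. A minimal sketch, assuming GCC or Clang on Linux (it relies on `abi::__cxa_demangle` from `<cxxabi.h>`; the `demangle` wrapper is a hypothetical helper name):

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the human-readable form of an Itanium-ABI mangled C++ symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);  // the buffer is allocated with malloc
    return result;
}
```

For the symbol above this yields `FFT1D(cl_fft_plan*, kernel_dir_t)`. That the name is mangled at all means the caller was compiled against a C++ declaration of `FFT1D`. Possible causes, not confirmed by the post, are that the source file defining it was not built into the library, or that it was built with C linkage.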

                               

                              Best regards,

                              ash