
Journeyman III

Apple's FFT on AMD cards with C++ OpenCL API


I'm trying to run Apple's clFFT on AMD cards, using the OpenCL C++ API to create the platform, context, devices and queues. It seems to work only with the C API, not with the C++ one.

On Nvidia cards it works with both the C and the C++ API.

Here is the C way of creating the context:


    cl_uint numPlatforms;
    cl_platform_id platform = NULL;

    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    cout << "clGetPlatformIDs status : " << err << endl;

    if (0 < numPlatforms) {
        cl_platform_id* platforms = new cl_platform_id[numPlatforms];
        err = clGetPlatformIDs(numPlatforms, platforms, NULL);
        cout << "clGetPlatformIDs status : " << err << endl;
        platform = platforms[0];
        delete[] platforms;
    }

    cl_device_id device_ids[16];
    unsigned int num_devices;

    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, device_ids, &num_devices);
    cout << "clGetDeviceIDs err: " << err << endl;

    cl_context_properties ctxProps[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties) platform$
    cl_context myContext = clCreateContext(ctxProps, 1, device_ids, NULL, NULL, &err);
    cout << "clCreateContext err: " << err << endl;


and this is the C++ way:


    std::vector<cl::Platform> platforms;
    err = cl::Platform::get(&platforms);

    cl_context_properties context_properties[3] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[0](), 0 };

    cl::Context myCtx = cl::Context(CL_DEVICE_TYPE_ALL,
        context_properties, NULL, NULL, &err);

    cl_context myContext = myCtx();


After that I call clFFT in this way:


    clFFT_Dim3 n;
    clFFT_Plan plan;
    cl_uint plan_length;

    n.x = 1024;
    n.y = 1;
    n.z = 1;

    cout << "Creating plan" << endl;
    plan = clFFT_CreatePlan((cl_context) myContext, n,
                            clFFT_1D, clFFT_InterleavedComplexFormat, &err);

A couple of questions:

   - When I use the C++ API, the plan creation hangs and the CPU sits at 100% load. Has anybody had a similar experience?

   - Is anybody using clFFT with the C++ API?




13 Replies

Apple's clFFT on AMD card? It might be the source of trouble.



Well, since it is much faster than clAMDFFT... I'd like to use it.


Much faster? Can you elaborate on that? What problem are you running and what card did you measure this on? What version of apple FFT and clAmdFft are you using? I am particularly interested in your performance comparison statement. AMD's FFT library is very competitive in terms of performance at many problem sizes.


Hi, thanks for your answer! Concerning the speed, these are my measurements; maybe I was mistaken, but I'm sure that you also have these numbers...

However, I would be interested in an answer to my original question. Could you comment on that? Could there be any difference between these two contexts:

  a.) A context created with the C++ API as cl::Context, where the devices are not listed explicitly but just specified as CL_DEVICE_TYPE_ALL?

  b.) A context created with the C API as cl_context, where the devices are explicitly enumerated.

Thanks a lot,



The obvious difference is that in C version you asked for only GPUs but for C++ you asked for ALL, which would give you the CPU device too (and presumably wouldn't on NVIDIA's platform).
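
The difference noted above (GPUs only versus CL_DEVICE_TYPE_ALL) can be checked before a context is handed to clFFT. What follows is only a minimal sketch: it assumes the device types have already been read, for example with clGetDeviceInfo and CL_DEVICE_TYPE. The constants repeat the bitfield values from CL/cl.h so that the snippet builds without the OpenCL headers, and the helper name only_gpus is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// cl_device_type is a 64-bit bitfield in CL/cl.h; the values below are
// repeated here only so that this sketch builds without the OpenCL headers.
using device_type = std::uint64_t;
constexpr device_type kTypeCpu = 1u << 1;  // CL_DEVICE_TYPE_CPU
constexpr device_type kTypeGpu = 1u << 2;  // CL_DEVICE_TYPE_GPU

// Hypothetical helper: true only if the list is non-empty and every
// device in it reports the GPU bit.
bool only_gpus(const std::vector<device_type>& types) {
    if (types.empty()) return false;
    for (device_type t : types) {
        if ((t & kTypeGpu) == 0) return false;
    }
    return true;
}
```

A context built with CL_DEVICE_TYPE_ALL on a platform that also exposes the CPU would fail this check, while a context built from GPU device IDs only would pass it.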


Dear Lee,

Thanks for your answer. Indeed, the specific example I copied has the difference you correctly spotted. However, I have been trying all the possible combinations, i.e. also GPUs only in both the C and the C++ API, with the same result. So this is probably not the reason.



gdebrecz wrote:

Hi, thanks for your answer! Concerning the speed, these are my measurements; maybe I was mistaken, but I'm sure that you also have these numbers...

Would you care to share your measurements and how you measured them? It would be quite interesting to see...


OK, I'll try to re-run my benchmarks and post the results. However, could you please comment on my original question: could there be any difference between a context created using the C API and the cl_context returned by a C++ API cl::Context (assuming they have identical context properties set in advance)?

Thanks a lot,



Coincidentally, I was just looking at Apple's FFT recently and it does not seem to perform well or correctly on CPU devices.  Your problem is probably not related to a difference between C/C++ APIs.

Apple's FFT library's plan creation won't work if you only have a CPU as a device (it returns an invalid context error, and this is hard-coded in the implementation). I guess it might get confused if you have CPU + GPU in your context. I recently tried to change the offending code to accept the CPU device type and got strange results. It is by design: it is not supposed to work on CPU devices (as far as I can see). You should use GPU devices only if you want to use Apple's FFT, or else use the clAmdFft library.

It is probably your code that does not catch the error returned by the plan creation. You should check for CL_SUCCESS != err after calling it; you will find that 'err' holds -34 (CL_INVALID_CONTEXT).
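
The check suggested above could look roughly like this. It is a sketch, not part of clFFT or OpenCL: the helper names cl_error_name and check_cl are hypothetical, and the numeric codes are copied from CL/cl.h so that the snippet builds without the OpenCL headers.

```cpp
#include <stdexcept>
#include <string>

// Translate a few OpenCL error codes (values as defined in CL/cl.h)
// into their names. Only a handful of codes are covered in this sketch.
std::string cl_error_name(int err) {
    switch (err) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -30: return "CL_INVALID_VALUE";
        case -31: return "CL_INVALID_DEVICE_TYPE";
        case -32: return "CL_INVALID_PLATFORM";
        case -33: return "CL_INVALID_DEVICE";
        case -34: return "CL_INVALID_CONTEXT";
        default:  return "unknown OpenCL error " + std::to_string(err);
    }
}

// Hypothetical helper: throw as soon as a call reports anything other
// than CL_SUCCESS, so that a failed step cannot go unnoticed.
void check_cl(int err, const std::string& what) {
    if (err != 0) {
        throw std::runtime_error(what + " failed: " + cl_error_name(err));
    }
}
```

Calling check_cl(err, "clFFT_CreatePlan") right after the plan creation would then stop the program with a readable message instead of letting it continue with an invalid plan.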

This probably works on Nvidia devices because you only get GPU devices if you use the Nvidia OpenCL platform, so no error is raised.

I recommend sticking to AMD's FFT library if you want to run on both CPU and GPU. I also recommend using AMD hardware in this case, since the library is probably optimized for AMD cards. I have not used the FFT library directly, but I ran some tests on Tesla cards and Tahiti cards with clAmdBlas. clAmdBlas was superior to cuBLAS (sure, AMD's GCN Tahiti is better than Nvidia's Tesla solutions hardware-wise, but still). I would be surprised if clAmdFft did not function well, based on my experience with clAmdBlas.



Thanks for pointing all of this out. I re-ran my performance test:

gdebrecz@xxxxxx:~/amdfft$ lspci | grep -i radeon

04:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI Cayman XT [Radeon HD 6970]

Here are some timing results for the C2C forward FFT as a function of the length (a power of 2):

    log2(N)          N    FFT / sec    ms / FFT
          7        128      339.259      2.9476
          8        256      519.462      1.92507
          9        512      392.807      2.54578
         10       1024      303.491      3.29499
         11       2048      275.528      3.6294
         12       4096      138.66       7.21186
         13       8192      163.3        6.1237
         14      16384      136.564      7.32257
         15      32768      181.027      5.52405
         16      65536      227.177      4.40185
         17     131072      113.805      8.78693
         18     262144      121.572      8.22555
         19     524288       97.6293    10.2428
         20    1048576      101.543      9.84801
         21    2097152       80.9414    12.3546
         22    4194304       78.31      12.7698
         23    8388608       40.6493    24.6007
         24   16777216       29.2014    34.245

Are these numbers reasonable, or am I doing something wrong? I create a context with only one GPU device in it, yet during the test I see the CPU busy. Is it possible that it runs on the CPU? And why is it so slow?
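
One way to judge the figures is to turn them into an approximate arithmetic rate. The sketch below assumes one transform per timed call and uses the conventional 5 * N * log2(N) operation count for a complex FFT of length N. Neither assumption is stated in the post, and the helper names are hypothetical.

```cpp
#include <cmath>

// Milliseconds per transform from a measured rate in transforms per second.
double ms_per_fft(double ffts_per_sec) {
    return 1000.0 / ffts_per_sec;
}

// Rough GFLOP/s estimate, assuming 5 * N * log2(N) floating-point
// operations per complex transform of length N (a common convention,
// not an exact count for any particular implementation).
double estimated_gflops(double n, double ffts_per_sec) {
    return 5.0 * n * std::log2(n) * ffts_per_sec / 1e9;
}
```

For the N = 1024 measurement (303.491 FFT / sec), estimated_gflops gives roughly 0.016 GFLOP/s. A figure that small usually points at per-call overhead (kernel launch, synchronization, or setup work inside the timed loop) rather than at the transform itself, but that would need to be confirmed against the actual timing code.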

Thanks again for your help and answers...

Here are the relevant code pieces from the test:

  clAmdFftSetupData  fftSetupData;
  clAmdFftPlanHandle fftPlan;
  clAmdFftDim fftDim = CLFFT_1D;

  cl::Buffer * d_src = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);
  if (err != 0) { std::cout << "Error creating buffer1. Exiting. Error code: " << err << endl;  return -1; }

  cl::Buffer * d_dest = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);
  if (err != 0) { std::cout << "Error creating buffer2. Exiting. Error code: " << err << endl;  return -1; }

  for (.....

      clAmdFftEnqueueTransform(fftPlan, CLFFT_FORWARD, 1,
            &myRuntimeEnv.appQueues[0](), 0, NULL, NULL, &(*d_src)(), &(*d_dest)(), NULL);




I would also be interested in what the issue is at hand here... I do not find anything wrong with the code.

Journeyman III


I am interested in OpenCL source code that runs Apple's FFT using Visual Studio 2010 on Windows 7.

Does someone have a link to it, or the source code itself?



Journeyman III


I'm trying to use Apple's library for FFT (forward/inverse) in my project. I'm on Linux working with an Nvidia GPU, and I found some code on GitHub adapted from Apple's sample for Linux. It compiles fine.

So I added the sample's files to my program, and it also compiles. However, at run time, when I call the function createPlan, I get this error:

undefined symbol: _Z5FFT1DP11cl_fft_plan12kernel_dir_t

Did you encounter this error too? I hope you'll be able to help. If I could make it work, it would be nice progress for me.

Best regards,
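
The symbol in the error message is a C++-mangled name, so decoding it shows which function could not be found. This is a small sketch, assuming GCC or Clang on Linux, where abi::__cxa_demangle is available from <cxxabi.h>. The wrapper name demangle is hypothetical.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle an Itanium-ABI C++ symbol name (as produced by GCC and Clang).
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the buffer is allocated with malloc by __cxa_demangle
    return result;
}
```

For the symbol above this yields FFT1D(cl_fft_plan*, kernel_dir_t). A mangled name here means the caller was compiled as C++ and expects a C++ definition of FFT1D. Typical causes, which would need checking against the actual build, are that the source file defining FFT1D was not compiled or linked into the program, or that it was built as C while the caller was built as C++.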