Hi,
I'm trying to run Apple's clFFT on AMD cards, using the OpenCL C++ API to
create the platform, context, devices, and queues. It seems that it works only
with the C API, but not with the C++ one.
On Nvidia cards it works with both the C and C++ APIs.
Here is the C way of creating the context:
-----------------------------------------------
cl_uint numPlatforms;
cl_platform_id platform = NULL;
err = clGetPlatformIDs(0, NULL, &numPlatforms);
cout << "clGetPlatformIDs status : " << err << endl;
if (0 < numPlatforms) {
cl_platform_id* platforms = new cl_platform_id[numPlatforms];
err = clGetPlatformIDs(numPlatforms, platforms, NULL);
cout << "clGetPlatformIDs status : " << err << endl;
platform = platforms[0];
}
cl_device_id device_ids[16];
unsigned int num_devices;
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, device_ids, &num_devices);
cout << "clGetDeviceIDs err: " << err << endl;
cl_context_properties ctxProps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties) platform, 0 };
cl_context myContext = clCreateContext(ctxProps, 1, device_ids, NULL, NULL, &err);
cout << "clCreateContext err: " << err << endl;
-----------------------------------------------------------------
and this is the C++ version:
------------------------------------------------------------
std::vector <cl::Platform> platforms;
err = cl::Platform::get(&platforms);
cl_context_properties context_properties[3] = {
CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[0](), 0 };
cl::Context myCtx = cl::Context(CL_DEVICE_TYPE_ALL,
context_properties, NULL, NULL, &err);
cl_context myContext = myCtx();
------------------------------------------------------------------
After that I call clFFT this way:
-------------------------
clFFT_Dim3 n;
clFFT_Plan plan;
cl_uint plan_length;
n.x = 1024;
n.y = 1;
n.z = 1;
cout << "Creating plan" << endl;
plan = clFFT_CreatePlan((cl_context) myContext, n,
clFFT_1D, clFFT_InterleavedComplexFormat, &err);
A couple of questions:
- When I use the C++ API, the plan creation hangs and the CPU is at 100% load... does anybody have similar experiences?
- Is anybody using clFFT with the C++ API?
thanks,
Gergely
Apple's clFFT on AMD card? It might be the source of trouble.
😛
Well, since it is much faster than clAMDFFT... I'd like to use it.
Much faster? Can you elaborate on that? What problem are you running and what card did you measure this on? What versions of Apple's FFT and clAmdFft are you using? I am particularly interested in your performance comparison statement. AMD's FFT library is very competitive in terms of performance at many problem sizes.
Hi, thanks for your answer! Concerning the speed, these are my measurements; maybe I was mistaken,
but I'm sure that you also have these numbers....
However, I would be interested in an answer to my original question! Could you comment on that? Could there be
any difference between contexts
a.) created with the C++ API as cl::Context, where the devices are not listed explicitly but just specified as CL_DEVICE_TYPE_ALL, and
b.) created with the C API as cl_context, where you have the devices explicitly enumerated?
thanks a lot,
Gergely
The obvious difference is that in the C version you asked for only GPUs, but in the C++ version you asked for ALL, which would give you the CPU device too (and presumably wouldn't on Nvidia's platform).
Dear Lee,
Thanks for your answer. Indeed, in this specific example I've copied there is the difference you correctly spotted;
however, I've been trying all the possible combinations, i.e. also GPUs only in both the C and C++ APIs, with the
same result. So probably this is not the reason.
Gergo
gdebrecz wrote:
Hi, thanks for your answer! Concerning the speed, these are my measurements; maybe I was mistaken,
but I'm sure that you also have these numbers....
Would you care to share your measurements and how you measured them? It would be quite interesting to see...
OK, I'll try to re-run my benchmarks and will post them. However, could you please comment on my original question:
could there be any difference between a context created using the C API and a context returned by the cl::Context() C++ API
(assuming they have identical context properties set in advance)?
thanks a lot,
Gergely
Coincidentally, I was just looking at Apple's FFT recently and it does not seem to perform well or correctly on CPU devices. Your problem is probably not related to a difference between C/C++ APIs.
Apple's FFT library's plan creation won't work if you only have a CPU as a device (it returns an invalid context error, and this is hard-coded in the implementation). I guess it might get somewhat confused if you have CPU + GPU in your context. I recently tried to change the offending code to accept the CPU device type and got strange results. It is by design... it is not supposed to work on CPU devices (as far as I can see). You should use GPU devices only if you want to use Apple's FFT, or else use the clAmdFft library.
It is probably your code that does not catch the error produced by plan creation. If you check for CL_SUCCESS != err after calling it, you will find 'err' holds -34 (CL_INVALID_CONTEXT).
This probably works on Nvidia devices because the Nvidia OpenCL platform only exposes GPU devices, so it will not spit out the error.
I recommend sticking to AMD's FFT library if you want to run on both CPU and GPU. I also recommend using AMD hardware in this case, since it is probably optimized for AMD cards. I did not use the FFT library directly, but I ran some tests on Tesla cards and Tahiti cards with clAmdBlas. clAmdBlas was superior to cuBlas (sure, AMD's GCN Tahiti is better than Nvidia's Tesla solutions hardware-wise, but still). I would be surprised if clAmdFft did not function well, based on my experience with clAmdBlas.
Hi,
Thanks for pointing all of this out. I re-ran my performance test:
gdebrecz@xxxxxx:~/amdfft$ lspci | grep -i radeon
04:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI Cayman XT [Radeon HD 6970]
here are some timing tests for a C2C forward FFT as a function of the length (powers of 2):
7 128 339.259 fft / sec | 2.9476 | ms / fft |
8 256 519.462 fft / sec | 1.92507 | ms / fft |
9 512 392.807 fft / sec | 2.54578 | ms / fft |
10 1024 303.491 fft / sec | 3.29499 | ms / fft |
11 2048 275.528 fft / sec | 3.6294 | ms / fft |
12 4096 138.66 fft / sec | 7.21186 | ms / fft |
13 8192 163.3 fft / sec | 6.1237 | ms / fft |
14 16384 136.564 fft / sec | 7.32257 | ms / fft |
15 32768 181.027 fft / sec | 5.52405 | ms / fft |
16 65536 227.177 fft / sec | 4.40185 | ms / fft |
17 131072 113.805 fft / sec | 8.78693 | ms / fft |
18 262144 121.572 fft / sec | 8.22555 | ms / fft |
19 524288 97.6293 fft / sec | 10.2428 | ms / fft |
20 1048576 101.543 fft / sec | 9.84801 | ms / fft |
21 2097152 80.9414 fft / sec | 12.3546 | ms / fft |
22 4194304 78.31 fft / sec | 12.7698 | ms / fft |
23 8388608 40.6493 fft / sec | 24.6007 | ms / fft |
24 16777216 29.2014 fft / sec | 34.245 | ms / fft |
Are these numbers reasonable, or am I doing something wrong? I create a context with only one GPU device in it, yet
during the test I see the CPU running at 100%... is it possible that it runs on the CPU then? Why is it so slow?
Thanks again for your help and answers...
Here are the relevant code pieces from the testing:
clAmdFftSetupData fftSetupData;
clAmdFftPlanHandle fftPlan;
clAmdFftDim fftDim = CLFFT_1D;
clAmdFftInitSetupData(&fftSetupData);  // initialize the struct first
clAmdFftSetup(&fftSetupData);
.
.
.
cl::Buffer * d_src = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);
if (err != 0 ) { std::cout << "Error creating buffer1. Exiting. Error code: " << err << endl; return -1;}
cl::Buffer * d_dest = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);
if (err != 0 ) { std::cout << "Error creating buffer2. Exiting. Error code: " << err << endl; return -1;}
.
.
.
for (.....
clAmdFftEnqueueTransform(fftPlan, CLFFT_FORWARD, 1,
&myRuntimeEnv.appQueues[0](), 0, NULL, NULL, &(*d_src)(), &(*d_dest)(), NULL);
}
Gergely
I would also be interested in what the issue here is... I do not find anything wrong with the code.
Hi,
I am interested in OpenCL source code that runs Apple's FFT using Visual Studio 2010 and Windows 7.
Does anyone have a link or this source code?
Thanks,
Micha
Hi,
I'm trying to use Apple's FFT lib (forward/inverse) in my project. I'm on Linux, working with an Nvidia GPU, and found some code on GitHub that adapts Apple's sample to Linux. It compiles fine.
So I added the sample's files to my program and it also compiles. However, at execution, when I call the function createPlan I get this error:
undefined symbol: _Z5FFT1DP11cl_fft_plan12kernel_dir_t
Did you encounter this error too? I hope you'll be able to help. If I could make it work it would be nice progress for me.
Best regards,
ash