Archives Discussions

gdebrecz · ‎10-11-2012

Hi,

I'm trying to run Apple's clFFT on AMD cards, using the OpenCL C++ API to create the platform, context, devices and queues. It seems that it works only with the C API, but not with the C++ API.

On Nvidia cards it works with both the C and the C++ API.

Here is the C way of creating the context:

-----------------------------------------------

cl_uint numPlatforms;
cl_platform_id platform = NULL;

err = clGetPlatformIDs(0, NULL, &numPlatforms);
cout << "clGetPlatformIDs status : " << err << endl;

if (0 < numPlatforms) {
    cl_platform_id* platforms = new cl_platform_id[numPlatforms];
    err = clGetPlatformIDs(numPlatforms, platforms, NULL);
    cout << "clGetPlatformIDs status : " << err << endl;
    platform = platforms[0];
    delete[] platforms;
}

cl_device_id device_ids[16];
unsigned int num_devices;

err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, device_ids, &num_devices);
cout << "clGetDeviceIDs err: " << err << endl;

cl_context_properties ctxProps[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties) platform$
cl_context myContext = clCreateContext(ctxProps, 1, device_ids, NULL, NULL, &err);
cout << "clCreateContext err: " << err << endl;

-----------------------------------------------------------------

and this is the C++ way:

------------------------------------------------------------

std::vector<cl::Platform> platforms;
err = cl::Platform::get(&platforms);

cl_context_properties context_properties[3] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[0](), 0 };

cl::Context myCtx = cl::Context(CL_DEVICE_TYPE_ALL,
    context_properties, NULL, NULL, &err);

cl_context myContext = myCtx();

------------------------------------------------------------------

after that I call clFFT in this way:

-------------------------

clFFT_Dim3 n;
clFFT_Plan plan;
cl_uint plan_length;

n.x = 1024;
n.y = 1;
n.z = 1;

cout << "Creating plan" << endl;
plan = clFFT_CreatePlan((cl_context) myContext, n,
    clFFT_1D, clFFT_InterleavedComplexFormat, &err);

A couple of questions:

- When I use the C++ API, the plan creation hangs and the CPU is at 100% load. Has anybody had a similar experience?

- Is anybody using clFFT with the C++ API?

thanks,

Gergely

binying · ‎10-12-2012

Apple's clFFT on AMD card? It might be the source of trouble.

😛

gdebrecz · ‎10-12-2012

Well, since it is much faster than clAMDFFT... I'd like to use it.

bragadeesh · ‎10-12-2012

Much faster? Can you elaborate on that? What problem are you running and what card did you measure this on? What version of apple FFT and clAmdFft are you using? I am particularly interested in your performance comparison statement. AMD's FFT library is very competitive in terms of performance at many problem sizes.

gdebrecz · ‎10-13-2012

Hi, thanks for your answer! Concerning the speed: these are my measurements, and maybe I was mistaken, but I'm sure that you also have these numbers...

However, I would be interested in an answer to my original question. Could you comment on that? Could there be any difference between

a.) a context created with the C++ API as cl::Context, where the devices are not listed explicitly but just specified as CL_DEVICE_TYPE_ALL, and

b.) a context created with the C API as cl_context, where the devices are explicitly enumerated?

thanks a lot,

Gergely

LeeHowes · ‎10-14-2012

The obvious difference is that in C version you asked for only GPUs but for C++ you asked for ALL, which would give you the CPU device too (and presumably wouldn't on NVIDIA's platform).

gdebrecz · ‎10-14-2012

Dear Lee,

Thanks for your answer. Indeed, in the specific example I copied there is the difference you correctly spotted. However, I have tried all the possible combinations, i.e. also GPUs only in both the C and the C++ API, with the same result. So this is probably not the reason.

Gergo

yurtesen · ‎10-22-2012

gdebrecz wrote:
Hi, thanks for your answer! Concerning the speed: these are my measurements, and maybe I was mistaken, but I'm sure that you also have these numbers...

Would you care to share your measurements and how you measured them? It would be quite interesting to see...

gdebrecz · ‎10-23-2012

OK, I'll try to re-run my benchmarks and will post them. However, could you please comment on my original question: could there be any difference between a context created using the C API and the cl_context returned from a C++ API cl::Context via operator() (assuming they have identical context properties set in advance)?

thanks a lot,

Gergely

yurtesen · ‎10-23-2012

Coincidentally, I was just looking at Apple's FFT recently and it does not seem to perform well or correctly on CPU devices. Your problem is probably not related to a difference between C/C++ APIs.

Apple's FFT library's plan creation won't work if you only have a CPU as a device (it returns an invalid context error, and this is hard-coded in the implementation). I guess it might get confused if you have CPU + GPU in your context. I recently tried to change the offending code to accept the CPU device type and got strange results. It is by design: it is not supposed to work on CPU devices (as far as I can see). You should use GPU devices only if you want to use Apple FFT, or else use the clAmdFFT library.

It is probably your code which does not catch the error returned by the plan creation. You should check for CL_SUCCESS != err after calling it; you will find that 'err' holds -34 (CL_INVALID_CONTEXT).

This probably works on Nvidia devices because you only get GPU devices if you use the Nvidia OpenCL platform. Therefore it will not produce an error.

I recommend sticking to AMD's FFT library if you want to run on both CPU and GPU. I also recommend using AMD hardware in this case, since the library is probably optimized for AMD cards. I have not used the FFT library directly, but I ran some tests on Tesla cards and Tahiti cards with clAmdBlas. clAmdBlas was superior to cuBlas (sure, AMD's GCN Tahiti is better than Nvidia's Tesla solutions hardware-wise, but still). I would be surprised if clAmdFFT did not perform well, based on my experience with clAmdBlas.
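
The advice above about checking 'err' after plan creation could be sketched as a small helper. This is only an illustration under stated assumptions: check_cl is a hypothetical name, and the two constants are redefined locally (with the same values as CL_SUCCESS and CL_INVALID_CONTEXT in the OpenCL headers) so that the snippet builds without an OpenCL SDK. In real code you would include the OpenCL header and use its macros instead.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Local stand-ins for the OpenCL macros, so this sketch builds without CL/cl.h.
const int kClSuccess = 0;           // same value as CL_SUCCESS
const int kClInvalidContext = -34;  // same value as CL_INVALID_CONTEXT

// Hypothetical helper: turn a non-zero OpenCL status into an exception
// instead of letting the program continue with an unusable plan.
void check_cl(int err, const char* what) {
    assert(what != nullptr);
    if (err == kClSuccess) return;
    std::string msg = std::string(what) + " failed with error " + std::to_string(err);
    if (err == kClInvalidContext) msg += " (CL_INVALID_CONTEXT)";
    throw std::runtime_error(msg);
}
```

Calling check_cl(err, "clFFT_CreatePlan") directly after clFFT_CreatePlan(...) would stop at the failed call rather than carrying on with an invalid plan.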

gdebrecz · ‎10-23-2012

Hi,

Thanks for pointing all of this out. I re-ran my performance test:

gdebrecz@xxxxxx:~/amdfft$ lspci | grep -i radeon

04:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI Cayman XT [Radeon HD 6970]

here are some timing tests for C2C ForwardFFT as a function of the length (power of 2)

log2(N)  N         fft / sec  ms / fft
7        128       339.259    2.9476
8        256       519.462    1.92507
9        512       392.807    2.54578
10       1024      303.491    3.29499
11       2048      275.528    3.6294
12       4096      138.66     7.21186
13       8192      163.3      6.1237
14       16384     136.564    7.32257
15       32768     181.027    5.52405
16       65536     227.177    4.40185
17       131072    113.805    8.78693
18       262144    121.572    8.22555
19       524288    97.6293    10.2428
20       1048576   101.543    9.84801
21       2097152   80.9414    12.3546
22       4194304   78.31      12.7698
23       8388608   40.6493    24.6007
24       16777216  29.2014    34.245

Are these numbers reasonable, or am I doing something wrong? I create a context with only one GPU device in it, yet during the test I see the CPU running at 100%. Is it possible that it runs on the CPU? And why is it so slow?
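
One way to judge whether such timings are reasonable is to convert them into a nominal throughput. The sketch below is not part of the original benchmark: the function names are made up for illustration, and it assumes each timing covers a single complex-to-complex transform. It uses the conventional 5 * N * log2(N) operation count that FFT benchmarks commonly quote, which is a nominal figure rather than the number of operations the library really executes.

```cpp
#include <cassert>
#include <cmath>

// Milliseconds per transform from a measured rate in transforms per second.
double ms_per_fft(double ffts_per_sec) {
    assert(ffts_per_sec > 0.0);
    return 1000.0 / ffts_per_sec;
}

// Nominal GFLOP/s for a complex FFT of length N = 2^log2n, using the
// conventional 5 * N * log2(N) operation count (a benchmark convention,
// not a measurement of the work actually done).
double nominal_gflops(unsigned log2n, double ffts_per_sec) {
    const double n = std::ldexp(1.0, static_cast<int>(log2n));
    return 5.0 * n * log2n * ffts_per_sec / 1e9;
}
```

For the 1024-point row, ms_per_fft(303.491) reproduces the 3.29499 ms in the table, and nominal_gflops(10, 303.491) comes out at roughly 0.016 GFLOP/s.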

Thanks again for your help and answers.

Here are the relevant code pieces from the testing:

clAmdFftSetupData fftSetupData;
clAmdFftPlanHandle fftPlan;
clAmdFftDim fftDim = CLFFT_1D;

clAmdFftSetup(&fftSetupData);
clAmdFftInitSetupData(&fftSetupData);

.

cl::Buffer * d_src = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);
if (err != 0 ) { std::cout << "Error creating buffer1. Exiting. Error code: " << err << endl; return -1;}

cl::Buffer * d_dest = new cl::Buffer(myRuntimeEnv.appContexts[0], CL_MEM_READ_WRITE, buffersize, NULL, &err);
if (err != 0 ) { std::cout << "Error creating buffer2. Exiting. Error code: " << err << endl; return -1;}

.

for (.....
    clAmdFftEnqueueTransform(fftPlan, CLFFT_FORWARD, 1,
        &myRuntimeEnv.appQueues[0](), 0, NULL, NULL, &(*d_src)(), &(*d_dest)(), NULL);
}

Gergely

Meteorhead · ‎11-16-2012

I would also be interested in what the issue at hand is here... I do not find anything wrong with the code.

Micha_M · ‎05-11-2013

Hi,

I am interested in OpenCL source code that runs Apple's FFT using Visual Studio 2010 on Windows 7.

Does anyone have a link to this source code, or the code itself?

Thanks,

Micha

ash · ‎07-16-2013

Hi,

I'm trying to use Apple's FFT library (forward/inverse) in my project. I'm on Linux working with an Nvidia GPU, and I found some code on GitHub adapted from Apple's sample for Linux. It compiles fine.

So I added the sample's files to my program, and it also compiles. However, at run time, when I call the function createPlan, I get this error:

undefined symbol: _Z5FFT1DP11cl_fft_plan12kernel_dir_t

Did you encounter this error too? I hope you'll be able to help. If I could make it work, it would be nice progress for me.

Best regards,

ash
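
The symbol in the error above is a C++-mangled name, and decoding it shows which function could not be found. A minimal sketch, assuming a GCC or Clang toolchain (the demangle wrapper is a hypothetical helper; abi::__cxa_demangle is the ABI routine those toolchains provide in <cxxabi.h>):

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium C++ ABI symbol name (GCC/Clang only).
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    assert(!mangled.empty());
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) return mangled;
    std::string result(out);
    std::free(out);  // the buffer returned by __cxa_demangle is malloc-allocated
    return result;
}
```

Passing _Z5FFT1DP11cl_fft_plan12kernel_dir_t to this helper yields FFT1D(cl_fft_plan*, kernel_dir_t), so the missing symbol is a C++ function named FFT1D that takes a cl_fft_plan pointer and a kernel_dir_t.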


Apple's FFT on AMD cards with C++ OpenCL API