I want to run my application on both an Intel CPU and an NVIDIA GPU. From reading other posts, I clearly understood that I need two SDKs for this configuration, so I chose the AMD SDK and the NVIDIA SDK. Apart from a "FATAL error: no fglrx found" message, when I run my application it can find both devices.
But how can I dynamically load the right library? I heard of the cl_khr_icd extension but I can't manage to understand how to use it. Can anybody help, please?
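For what it's worth, my understanding of cl_khr_icd is that you never load vendor libraries yourself: you link once against the ICD loader (libOpenCL.so), and at runtime it dispatches to every vendor driver registered under /etc/OpenCL/vendors. Each .icd file there is just a one-line text file naming the vendor library. A typical registry looks something like this (file names and library names are illustrative; yours may differ):

```
# /etc/OpenCL/vendors/amdocl64.icd
libamdocl64.so

# /etc/OpenCL/vendors/nvidia.icd
libnvidia-opencl.so.1
```

Your program then just calls clGetPlatformIDs() (or cl::Platform::get() with the C++ wrapper) and picks the platform it wants; no dlopen or per-vendor linking is needed.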
OK, I'll see if I can get one, but it might be hard.
I wanted to start a new discussion on the CodeXL forum but I didn't succeed. I got an error message saying I didn't have the right to do that; is that normal?
I would suggest waiting 1-2 days. If you submitted it, the issue should go live on the forum. Because of heavy spam, strict moderation was put in place a few weeks back. Otherwise, you can create a thread here.
I created a topic on the CodeXL forum, and it seems an AMD GPU is compulsory for kernel debugging with CodeXL. I got an AMD GPU (HD 6450) to test CodeXL and I'm kind of stuck. By the way, the drivers I installed are Catalyst 13.4.
First of all, while running my program on this AMD GPU I get strange results:
If I remove the printf calls in the kernel I don't get the correct results, but if I leave them in, it seems to work fine since I get the correct result.
But to use CodeXL I have to remove the printf calls. I can launch the kernel and set some breakpoints, but when I look at my variables in the buffer it's the same value everywhere, and completely wrong: something like 1501213 (instead of 0 < x < 1023).
Is there some problem with float? I'm really lost; I don't understand these errors at all. I hope someone can help.
Yeah, it seems like a bug in my code again; I'll ask if I can post it. But I had to give back the AMD GPU, and after the AMD drivers were uninstalled, the ICD file in /etc/OpenCL/vendors was missing. I put it back, but when launching the samples or some of my programs, they no longer see my CPU as an OpenCL device. What should I do? Can you please help? All the samples only see the GPU; I don't even get the message "AMD GPU not found, falling back to CPU" like before.
I don't know what the problem was, but reinstalling the AMD SDK fixed it.
You're right, there really is something wrong somewhere: I tested my code on another GPU (NVIDIA Quadro 290) and I also get wrong results for the reduction.
What I don't understand is that the strength of OpenCL is supposed to be that the same code runs on any compatible device. Does it have something to do with dimensioning, like the local and global parameters? I tried different values, which ended up freezing my GPU, and I had to reboot.
If the array to reduce has 1024 elements, then the global argument to enqueueNDRangeKernel should be of size 1024.
And for the local argument, which is also the size of my local array shared between work-items of the same work-group, I gave 64.
I get the same correct results on an NVIDIA GTX 650 GPU and an Intel Xeon E5430 CPU, but a wrong result on the small GPU. Looking at the output, the sum is much smaller, 2.6× smaller actually.
It doesn't crash with this setup, but obviously there is something wrong. I checked the error code after each step (context, program, command queue, writeBuffer, etc.) and I don't see where the problem is.
The advanced editor is not working correctly; I had to reply to my own post to get the page. And I can't upload a file either.
If you could test on an AMD GPU it would be a great help. When I enable the printf calls (#define DEBUG_OCL) it works fine, but when I remove them (comment out #define DEBUG_OCL) it gives wrong results and doesn't pass my test.
#define DEBUG_OCL // /!\ This option is only available for AMD Platform
#define OCLINTEGRITY_NUMS 1024
#define OCLINTEGRITY_WORK_ITEMS 64
#define OCLINTEGRITY_WORK_GROUPS (OCLINTEGRITY_NUMS/(OCLINTEGRITY_WORK_ITEMS*2))
int main(int argc, char* argv[])
float* m_h_input = new float[OCLINTEGRITY_NUMS];
float* m_h_output = new float[OCLINTEGRITY_WORK_GROUPS];
// Init input array
for( int i = 0; i < OCLINTEGRITY_NUMS; i++ )
m_h_input[i] = i;
// CHANGE HERE FOR THE TYPE OF DEVICE YOU WANT TO USE
cl_device_type type = CL_DEVICE_TYPE_GPU;
std::string platform_name = "AMD";
//Look for specified platform
for(size_t i=0; i <platforms.size(); ++i)
if(val.find(platform_name) != std::string::npos)
std::cout<<"Platform name found "<<val<<std::endl;
platform = platforms[i];
std::cerr<<"Error: No device found !"<<std::endl;
cl::Device m_device = devices[0];
if(m_device.getInfo(CL_DEVICE_NAME, &val) != CL_SUCCESS)
std::cerr<<"Error: Can't get device name"<<std::endl;
std::cout<<"--> Choosen Device name: "<<val<<std::endl;
// Read source file
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()+1));
// Create an OpenCL context
cl::Context context(devices, NULL, NULL, NULL, &err);
if (err != CL_SUCCESS)
std::cout << "Error: Can't create context" << std::endl;
// Create a command queue
cl::CommandQueue command_queue(context, m_device, 0, &err);
if (err != CL_SUCCESS)
std::cout << "Error: Failed to create commandQueue " << err << "\n";
# warning "DEBUG MODE : make sure you use AMD platform"
options += "-g -DDEBUG_AMD";
// Build program
cl::Program program(context, source, &err);
err = program.build(devices, options.c_str());
if (err != CL_SUCCESS)
std::cerr << "Error : Failed to build program " << std::endl;
std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(m_device)<< std::endl;
// Create memory buffers on the device for input and output values
cl::Buffer input_buffer(context, CL_MEM_READ_ONLY, OCLINTEGRITY_NUMS * sizeof(float), NULL, &err);
cl::Buffer output_buffer(context, CL_MEM_WRITE_ONLY, OCLINTEGRITY_WORK_GROUPS * sizeof(float), NULL, &err);
if (err != CL_SUCCESS)
std::cerr << "Error: Failed to create memory buffers " << err << "\n";
// Copy input to memory buffer
err = command_queue.enqueueWriteBuffer(input_buffer, CL_TRUE, 0, OCLINTEGRITY_NUMS * sizeof(float), m_h_input, NULL, NULL);
if (err != CL_SUCCESS)
std::cerr << "Error: Failed to copy to buffer " << err << "\n";
// Create Kernel
cl::Kernel kernel(program, "reduce_kernel", &err);
err = kernel.setArg(0, input_buffer);
err = kernel.setArg(1, output_buffer);
err = kernel.setArg(2, cl::Local(OCLINTEGRITY_WORK_ITEMS * sizeof(float))); // cl::Local takes a size in bytes
if (err != CL_SUCCESS)
std::cerr << "Error: Failed to set kernel arguments " << err << "\n";
// Execute the OpenCL kernel on the list
err = command_queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local, NULL, NULL); //Run the kernel
std::cerr << "Error: Failed to execute kernel " << err << "\n";
//Copy data from buffer to host memory
err = command_queue.enqueueReadBuffer(output_buffer, CL_TRUE, 0, OCLINTEGRITY_WORK_GROUPS * sizeof(float), m_h_output, NULL, NULL);
std::cerr << "Error: Failed to read buffer " << err << "\n";
err = command_queue.finish();
double gpu_sum = 0.0;
for (unsigned int i = 0; i < OCLINTEGRITY_WORK_GROUPS; ++i)
gpu_sum += m_h_output[i];
std::cout << m_h_output[i] << std::endl;
std::cout<<"parallel sum "<<std::setprecision(6)<<gpu_sum<<std::endl;
//Compute on CPU
float reference = 0.f;
for (int i = 0; i < OCLINTEGRITY_NUMS; ++i)
reference += log(exp(sqrt(i)));
// Compare CPU - OpenCL Device
const float err_sum = fabs(gpu_sum - reference);
if (err_sum < 10e-1f)
std::cout << "SUCCESS!\n";
else
std::cout << "ERROR : " << err_sum << std::endl;
__kernel void reduce_kernel(__global float *a_g_idata, __global float *a_g_odata, __local float* ocl_test_sdata)
// perform first level of reduction,
// reading from global memory, writing to shared memory
const unsigned int tid = get_local_id(0);
const unsigned int i = get_group_id(0)*(get_local_size(0)*2) + get_local_id(0);
ocl_test_sdata[tid] = log(exp(sqrt(a_g_idata[i]))) + log(exp(sqrt(a_g_idata[i+get_local_size(0)]))) ;
printf("---KERNEL input[%d] = %f \n",i, a_g_idata[i]);
// do reduction in shared mem
for(unsigned int s=get_local_size(0)/2; s>0; s>>=1)
{
// synchronize before reading partial sums written by other work-items;
// without this barrier the result is only right by luck (e.g. when
// printf happens to serialize the work-items)
barrier(CLK_LOCAL_MEM_FENCE);
if (tid < s)
ocl_test_sdata[tid] += ocl_test_sdata[tid + s];
}
// write result for this block to global mem
if (tid == 0)
a_g_odata[get_group_id(0)] = ocl_test_sdata[0];
printf("output[%d] : %f\n",get_group_id(0),a_g_odata[get_group_id(0)]);
1. I do not see the need to launch 1024 work-items to reduce 1024 elements and then use conditions inside the kernel that directly disable half the threads. Why not launch only 512 work-items?
2. Use get_global_id(0). The group_id method may be right, but it is very confusing (with that 2 inside it).
Just rewriting the small section of kernel.
Global Size:512, Local Size=64
int gid = get_global_id(0);
int lid = get_local_id(0);
int grp_id = get_group_id(0);
int grp_size = get_local_size(0); // note: there is no get_group_size() in OpenCL C
if(gid < 512)
// 3 versions for varying access pattern. Just check once before using, not tested
//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[gid]))) + log(exp(sqrt(a_g_idata[gid + get_global_size(0)]))) ;
//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[2 * gid]))) + log(exp(sqrt(a_g_idata[2 * gid + 1]))) ;
//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[(2 * grp_id) * grp_size + lid]))) +
//                      log(exp(sqrt(a_g_idata[(2 * grp_id + 1) * grp_size + lid]))) ;
Thanks for the comments, I'll try that. I think I mixed up the local parameter that we pass to enqueueNDRangeKernel and the total number of elements to be computed. I thought they were the same, but from what you told me they're not.
Another question: were you able to test my code on an AMD GPU to see if the test passes even with printf disabled?
I'd be reassured if my code ran correctly on both NVIDIA and AMD GPUs.
Also, could you please tell me how to post code as a zipped attachment?
Have a nice day.
Your code returns SUCCESS with/without printf. Here is the output with debug disabled:
Platform name found AMD Accelerated Parallel Processing
--> Choosen Device name: Capeverde
parallel sum 21829.1
Good to know, thanks a lot!
Then maybe the problem was from the AMD GPU I got. I'll try to test on another one if possible later.
I'm now porting a CUDA application to OpenCL and I've encountered some problems. I don't know if you're familiar with CUDA; I'm facing some difficulties "translating" tex3D and textures to OpenCL. I read about cl::Image, so I think I should use that to pass data to the kernel, but it's not very clear.
You are right. Look into cl::Image; you can check out some APP SDK samples (although most of them were written without the OpenCL C++ wrapper). SimpleImage and MatrixMulImage, to name a few.
I have a small question about cl::Image3D. When you call enqueueWriteImage it asks for an origin and a region.
If I want to read the whole image, then the region should be defined as (width, height, depth), shouldn't it?
region defines the (width, height, depth) in pixels of the 2D or 3D rectangle being read or written. If image is a 2D image object, the depth value given by region must be 1.
From the khronos C++ wrapper document.
Ok then it should be fine, sorry for the bother.
I have (again) another question: in the CUDA code that I'm porting there is a cudaPitchedPtr. I read the specs, and when you create a 3D image it says you can pass the row_pitch, which should be the equivalent of host_ptr.pitch.
But what about xsize and ysize? They seem related to slice_pitch, but I'm not too sure. Also, I really don't know what to give as the host_ptr when I construct the 3D image. I think I should allocate an array the size of the image, which means 3 dimensions, but it seems that in CUDA they allocate an actual 3D array. I hope you can help; I'm kind of lost.
h_ptr.pitch = volume_size.width*sizeof(float);
h_ptr.xsize = volume_size.width;
h_ptr.ysize = volume_size.height;
cl::Image3D(context, CL_MEM_READ_ONLY, fmt, width, height, depth,
row_pitch, // row_pitch = width*sizeof(float), same as h_ptr.pitch
I hope I'm not too far off, but some help would be very welcome.
Then I "only" have to allocate memory for the host pointer?
So if I have a 3D image, do I have to allocate memory for a 3D array? Sorry if my question is dumb, but I haven't really understood yet.
No, it's okay, I was just confused; it's the same as with a buffer object when you use the flag CL_MEM_ALLOC_HOST_PTR.
May I ask if you know some good sources that could help me program a kernel that applies a Gaussian smooth to a 3D image?
A very tiny question: in a for loop where I call my kernel, if I change an argument's value, do I have to reset the argument with the setArg function, or is it done automatically?
What do you mean by reset? The kernel remembers an argument until it is changed via clSetKernelArg(). You can change just one argument and enqueue the kernel, and it will run with the new value while the other arguments keep their old values.
I'm really down. I had to change the include path from /usr/local/cuda/include (the NVIDIA folder) to /opt/AMDAPP/include (the AMD folder). The thing is, both cl.hpp files are exactly the same (I copied the latest version from the Khronos registry), so why do I get errors when pointing the include path to the AMD folder?
I hope somebody can help; I have absolutely no clue.
Previously, with a test program, I had good results on NVIDIA, but the same code was giving memory leaks or wrong results on the AMD CPU. Then I found out that some objects were deallocated before the enqueueNDRangeKernel call, and corrected it. My code was then working fine on the AMD CPU and the NVIDIA GPU. Now, and I really don't know why, my code no longer runs on NVIDIA: I get a memory-corruption error from glibc on free() and it crashes.
I'll try to run my code on another computer and tell you what I find. It doesn't seem to come from the code anymore, or at least I hope so. It should be some library or system linking problem, because people seem able to run the same code on an NVIDIA GPU and an AMD CPU without any problem.
The difference from the code above is that I use #define __CL_ENABLE_EXCEPTIONS (for error handling) and that my test function takes a reference to a device as an argument. Is that a wrong thing to do?
Going on with further investigation; if somebody has any idea, please share. Meanwhile I'll keep testing and come back later.
No, it's not coming from __CL_ENABLE_EXCEPTIONS. I have other test programs using this macro that work fine.
My code runs fine on another computer, so I think it's some library linking problem. I have to focus on something else for now; I'll come back to this after I finish.
Thanks for your help.
It's been a long time. I've been doing my work, leaving aside the AMD problem I had. Now I have some questions about convolution: I want to use an FFT implementation for a convolution. Since I still work on an NVIDIA device, I read here that it's better to use Apple's clFFT. What library or implementation do you recommend for working on an NVIDIA GPU with the C++ OpenCL bindings?