Archives Discussions

ash · ‎04-17-2013

Hi,

I want to run my application on both intel CPU and nvidia GPU. As I read some other posts, I clearly understood that I need 2 SDK for this configuration so I chose : AMD SDK and NVIDIA SDK. A part from the "FATA error no flgrx found", when I run my application it can find the 2 devices.

But how can I load dynamically the good library? I heard of the extension cl_khr_icd but I can't manage to understand ho to use it. Can anybody help please?

Best regards,

Jacq

ash · ‎05-21-2013

Ok I'll see if I can get one but might be hard.

I wanted to start a new discussion on CodeXL forum but I didn't succeed. I had some message error saying I didn't have the right to do that, is that normal?

himanshu_gautam · ‎05-21-2013

I would suggest you to wait for 1-2 days. The issue should come up live on forum, if you have submitted it once. Because of heavy spam, strict moderations were done a few weeks back. Otherwise, you can create a thread here.

ash · ‎05-27-2013

Hi again,

I created a topic on CodeXL forum and it seems like an AMD GPU is compulsory for kernel debugging with CodeXL. I got an AMD GPU ( HD 6450) to test codeXL and I'm kind of stuck. By the way , the drivers I installed are Catalyst 13.4.

First of all while running my program on this AMD GPU I get strange results :

If I remove printf in the kernel I don't get the good results, but if I let them, it seems to work fine since I get the good result.

But to use codeXL I have to remove printf. I can lauch the kernel and put some breakpoints but when I look at my variables in the buffer it's the same value everywhere and completely wrong it's like 1501213 (instead of 0<x<1023).

Is there some problem with float? I'm really lost I don't understand those errors at all. Hope someone could help.

Best regards,

ash

himanshu_gautam · ‎05-27-2013

I am kindaa surprised the code works if printf is enabled...This looks like a bug in your code.

Can you post your code so that we can take a look?

ash · ‎05-29-2013

Hi,

Yeah seems like a bug in my code again I'll ask if I can post. But I had to give back the AMD GPU and after the Amd drivers were un-installed, the icd loader in /etc/OpenCL/vendors was missing. So I put it back but still when launching samples or some of my programs, it doesn't see my CPU as an OpenCL device anymore. What should I do? Can you please help. All the samples only see the GPU I don't even have the message : "AMD GPU not found falling back to CPU" like before.

Best regards,

ash

ash · ‎05-29-2013

Hi,

I don't know what was the problem but reinstalling AMD SDK fixed this problem.

You're right there really is something wrong somewhere : I tested my code on another GPU ( NVIDIA Quadro 290) and I also get wrong results for the reduction.

What I don't understand is that the power of opencl is that it should enable us to launch the same code on whatever compatible device. Does it has something to do with dimensioning, like local and global parameters? I tried to test different values resulting in freezing my GPU and had to reboot.

If the array to reduce has 1024 elements, then the global arg to the kernel in enqueueNDRangekernel should be of size 1024.

And for the local argument, which is also the size of my local array shared between threads from the same workgroup I gave 64.

I get the same good results for GPU NVIDIA GTX 650 and CPU Intel Xeon E5430 but wrong result for the small GPU. Looking at the output the sum is much more smaller , 2.6 smaller actually.

It doesn't crash with this setup but obviously there is something wrong. I checked error after each step (context, program, command queue, writtebuffer, etc...) and I don't see where the problem is.

Best regards,

ash

himanshu_gautam · ‎05-29-2013

Please post the code here. The Advanced editor is working now.

ash · ‎05-31-2013

Hi,

The advanced editor is not working " can't find the page" and I'm not able to upload the file either. How can I do?

Best regards,

ash

ash · ‎05-31-2013

The advance editor is not working correctly, I had to reply to my own post to get the page. And I can't upload a file either.

If you could test on an AMD GPU it could be a great help. Because I tested when I put the printf ( #define DEBUG_OCL) it was working fine but when I didn't put any printf( comment #DEBUG_OCL) it gave wrong results and didn't pass my test.

HOST CODE:


#include <iostream>

#include <CL/cl.hpp>

#include <iomanip>

#include <cmath>

#include <fstream>

#define DEBUG_OCL // /!\ This option is only available for AMD Platform

#define OCLINTEGRITY_NUMS 1024

#define OCLINTEGRITY_WORK_ITEMS 64

#define OCLINTEGRITY_WORK_GROUPS (OCLINTEGRITY_NUMS/(OCLINTEGRITY_WORK_ITEMS*2))



int main(int argc, char* argv[])

{

    float* m_h_input = new float[OCLINTEGRITY_NUMS];

    float* m_h_output = new float[OCLINTEGRITY_WORK_GROUPS];

    cl_int err;


    // Init input array

    for( int i = 0; i < OCLINTEGRITY_NUMS; i++ )

    {

        m_h_input = i;

    }


    /*

     *

     * CHANGE HERE FOR THE TYPE OF DEVICE YOU WANT TO USE

     *

     */

    cl_device_type type = CL_DEVICE_TYPE_GPU;

    std::string platform_name = "AMD";


    std::vector<cl::Platform> platforms;

    std::vector<cl::Device> devices;

    cl::Platform::get(&platforms);

    cl::Platform platform;


    //Look for specified platform

    for(size_t i=0; i <platforms.size(); ++i)

    {

        std::string val;

        platforms.getInfo(CL_PLATFORM_NAME, &val);

        if(val.find(platform_name) != std::string::npos)

        {

            std::cout<<"Platform name found "<<val<<std::endl;

            platform = platforms;

        }

    }


    if(platform.getDevices(type,&devices)!= CL_SUCCESS)

    {

        std::cerr<<"Error: No device found !"<<std::endl;

        return -1;

    }


    cl::Device m_device = devices[0];


    std::string val;

    if(m_device.getInfo(CL_DEVICE_NAME, &val) != CL_SUCCESS)

    {

        std::cerr<<"Error: Can't get device name"<<std::endl;

        return false;

    }

    std::cout<<"--> Choosen Device name: "<<val<<std::endl;


    // Read source file

    std::ifstream sourceFile("kernel.cl");

    std::string sourceCode(

    std::istreambuf_iterator<char>(sourceFile),(std::istreambuf_iterator<char>()));

    cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()+1));


    // Create an OpenCL context

    cl::Context context(devices, NULL, NULL, NULL, &err);

    if (err != CL_SUCCESS)

    {

        std::cout << "Error: Can't create context" << std::endl;

        return false;

    }


    // Create a command queue

     cl::CommandQueue command_queue(context, m_device, 0, &err);

    if (err != CL_SUCCESS)

    {

        std::cout << "Error: Failed to create commandQueue " << err << "\n";

        return false;

    }

    std::string options="";


#ifdef DEBUG_OCL

#    warning "DEBUG MODE :  make sure you use AMD platform"

    options += "-g -DDEBUG_AMD";

#endif


    // Build programm

    cl::Program program(context, source, &err);

    err = program.build(devices, options.c_str());

    if (err != CL_SUCCESS)

    {

        std::cerr << "Error : Failed to build program " << std::endl;

        std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(m_device)<< std::endl;

        return false;

    }


    // Create memory buffers on the device for input and output values

    cl::Buffer input_buffer(context, CL_MEM_READ_ONLY, OCLINTEGRITY_NUMS * sizeof(float), NULL, &err);

    cl::Buffer output_buffer(context, CL_MEM_WRITE_ONLY, OCLINTEGRITY_WORK_GROUPS * sizeof(float), NULL, &err);

    if (err != CL_SUCCESS)

    {

        std::cerr << "Error: Failed to create memory buffers " << err << "\n";

        return false;

    }


    // Copy input to memory buffer

    err = command_queue.enqueueWriteBuffer(input_buffer, CL_TRUE, 0, OCLINTEGRITY_NUMS * sizeof(float), m_h_input, NULL, NULL);

    if (err != CL_SUCCESS)

    {

        std::cerr << "Error: Failed to copy to buffer " << err << "\n";

        return false;

    }


    // Create Kernel

    cl::Kernel kernel(program, "reduce_kernel", &err);

    err = kernel.setArg(0, input_buffer);

    err = kernel.setArg(1, output_buffer);

    err = kernel.setArg(2, cl::Local(OCLINTEGRITY_WORK_ITEMS));

    if (err != CL_SUCCESS)

    {

        std::cerr << "Error: Failed to build kernel " << err << "\n";

        return false;

    }


    // Execute the OpenCL kernel on the list

    cl::NDRange global(OCLINTEGRITY_NUMS);

    cl::NDRange local(OCLINTEGRITY_WORK_ITEMS);


    err = command_queue.enqueueNDRangeKernel(kernel, 0, global, local, NULL, NULL); //Run the kernel

    if(err!=CL_SUCCESS)

    {

        std::cerr << "Error: Failed to execute kernel " << err << "\n";

        return -1;

    }

    //Copy data from buffer to host memory

    err = command_queue.enqueueReadBuffer(output_buffer, CL_TRUE, 0, OCLINTEGRITY_WORK_GROUPS * sizeof(float), m_h_output, NULL, NULL);

    if(err!=CL_SUCCESS)

    {

        std::cerr << "Error: Failed to read buffer " << err << "\n";

        return -1;

    }

    err = command_queue.finish();


    //Sum blocks

    double gpu_sum = 0.0;

    for (unsigned int i = 0; i < OCLINTEGRITY_WORK_GROUPS; ++i)

    {

        gpu_sum += m_h_output;

        std::cout << m_h_output << std::endl;

    }


    std::cout<<"parallel sum "<<std::setprecision(6)<<gpu_sum<<std::endl;


    //Compute on CPU

    float reference = 0.f;

    for (int i = 0; i < OCLINTEGRITY_NUMS; ++i)

        reference += log(exp(sqrt(i)));


    // Compare CPU - OpenCL Device

    const float err_sum = fabsf(gpu_sum - reference);

    if (err_sum < 10e-1f)

    {

        std::cout << "SUCCESS!\n";


    }

    else

    {

        std::cout << "ERROR : " << err_sum << std::endl;

    }


    //Release memory

    delete[] m_h_input;

    delete[] m_h_output;


    return 0;

}

KERNEL:


__kernel void reduce_kernel(__global float *a_g_idata, __global float *a_g_odata, __local float* ocl_test_sdata)

{

    // perform first level of reduction,

    // reading from global memory, writing to shared memory

    const unsigned int tid = get_local_id(0);

    const unsigned int i = get_group_id(0)*(get_local_size(0)*2) + get_local_id(0);

   


    if(i+get_local_size(0)<1024)

    {

        ocl_test_sdata[tid] = log(exp(sqrt(a_g_idata)))  +  log(exp(sqrt(a_g_idata[i+get_local_size(0)]))) ;

#ifdef DEBUG_AMD

        printf("---KERNEL input[%d] = %f \n",i, a_g_idata);

#endif


    }

    

    barrier(CLK_LOCAL_MEM_FENCE);

    

    // do reduction in shared mem   

    for(unsigned int s=get_local_size(0)/2; s>0; s>>=1) 

    {

        if (tid < s) 

        {

            ocl_test_sdata[tid] += ocl_test_sdata[tid + s];

        }

        barrier(CLK_LOCAL_MEM_FENCE);

    }


    // write result for this block to global mem 

    if (tid == 0)

    { 

        a_g_odata[get_group_id(0)] = ocl_test_sdata[0];

#ifdef DEBUG_AMD

        printf("output[%d] : %f\n",get_group_id(0),a_g_odata[get_group_id(0)]);

#endif


    }

    

    

}

Best regards,

ash

himanshu_gautam · ‎05-31-2013

Few Comments:

1. I do not see the need to launch 1024 work-items for reducing 1024 elements. And then using conditions inside kernel, which disables half the thread directly. Why not launch 512 threads only.

2. use get_global_id(0). The group_id method may be right, but is very confusing (with that 2 inside it).

Just rewriting the small section of kernel.

Global Size:512, Local Size=64

int gid = get_global_id(0);

int lid = get_local_id(0);

int grp_id = get_group_id(0);

int grp_size = get_group_size(0);

if(gid < 512)

{

// 3 versions for varying access pattern. Just check once before using, not tested

//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[gid]))) + log(exp(sqrt(a_g_idata[gid + get_global_size(0)]))) ;

//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[2 * gid]))) + log(exp(sqrt(a_g_idata[2 * gid + 1]))) ;

//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[(2 * grp_id) * grp_size + lid]))) +

log(exp(sqrt(a_g_idata[(2 * grp_id + 1) * grp_size + lid]))) ;

}

ash · ‎06-03-2013

Hi,

Thanks for the comments I'll try that, I think I mixed the local parameter that we pass to the enqueueNDRangeKernel function and the total number of elements that should be computed. I thought it was the same but from what you told it's not really the same.

Another question, were you able to test my code on an AMD GPU to see of the test passed even if you disable printf?

I'd be reassured if my code run on NVIDIA and AMD GPU correctly.

Also, could you please tell me how to post code as a zipped attachment?

Have a nice day.

Best regards,

ash

himanshu_gautam · ‎06-03-2013

Your code returns SUCCESS with/witout using printf. Here is the output when debug was disabled.

C:\Users\cas\Desktop\reduce>host.exe
Platform name found AMD Accelerated Parallel Processing
--> Choosen Device name: Capeverde
959.575
1762.89
2284.09
2705.42
3069.08
3393.84
3690.06
3964.17
parallel sum 21829.1
SUCCESS!

ash · ‎06-03-2013

Good to know, thanks a lot!

Then maybe the problem was from the AMD GPU I got. I'll try to test on another one if possible later.

I'm now porting a CUDA application to OpenCL and I encountered some problems. I don't know if you're familiar with Cuda, I'm facing some diffculties to "translate" tex3D and textures in OpenCL. I read about cl::Image so I think that I choose use that to pass data to the kernel but it's not very clear.

himanshu_gautam · ‎06-05-2013

You are right. Look into cl::image, you can checkout some APP SDK Samples (although most of them have been written without OpenCL C++ wrapper). SimpleImage, MatrixMulImage are a few to name.

ash · ‎06-07-2013

Hi,

I have a small question about cl::Image3D. When you enqueueWriteImage it asks for an origin and a region.

If I want to read the whole image, then the region should be defined as (width,height,depth), isn't it?

himanshu_gautam · ‎06-07-2013

region defines the (width, height, depth) in pixels of the 2D or 3D rectangle being read or written. If image is a 2D image object, the depth value given by region[2] must be 1.

From the khronos C++ wrapper document.

ash · ‎06-07-2013

Ok then it should be fine, sorry for the bother.

I have (again) another question : in the CUDA code that I'm porting there is a CudaPitchPtr. I read the specs and when you create a 3D image, it's said that you can pass the row_pitch which should be the equivalent of the host_ptr.pitch.

but What about the xSize and ySize seems like slice_pitch but not too sure. Also I really don't know what to give as a host_ptr when I construct the 3d Image. I think I should allocate an array for the size of the image which means 3 dimensions but seems like in cuda they allocate for a 3D array dimension. I hope you could help I'm kind of lost.

Cuda :

cudaPitchedPtr h_ptr;

h_ptr.pitch = volume_size.width*sizeof(float);

h_ptr.xsize = volume_size.width;

h_ptr.ysize = volume_size.height;

OpenCL :

cl::Image3D(context, CL_MEM_READ_ONLY, fmt, width, height, depth,

row_pitch, // = row_pitch = height*sizeof(float);

slice_pitch, //?

host_ptr); //?

I hope I'm not too far but some help would be pretty well welcomed.

best regards,

ash

nou · ‎06-07-2013

you can pass pitch parameters as 0 then OpenCL will compute proper value automatically as row_pitch = width*sizeof(pixel type) and slice_pitch=height*row_pitch

ash · ‎06-07-2013

Then I "only" have to allocate memory for the host pointer?

So If I have a 3D image I have to allocate memory for a 3D Array? Sorry if my question is dumb but i haven't really understood yet.

nou · ‎06-07-2013

yes? I am not sure what exactly are you asking. What else memory you want to allocate?

ash · ‎06-10-2013

Hi,

No it's okay I was just confused, but it's the same with buffer object, when you use the flag CL_MEM_ALLOC_HOST_PTR

May I ask if you know some good sources that could help me for programming a kernel using gaussian smooth on a 3D image ?

Best regards,

ash

ash · ‎06-11-2013

Hi,

Very tiny question : in a for loop where I call my kernel, if I change some argument' s value, do I have to reset the argument with setArg function or is it done automatically?

Best regards,

ash

nou · ‎06-11-2013

what you mean by reset. kernel remember argument until it is changed via clSetKernelArg(). you can change only one arg and enqueue kernel and it will run with this new value and other arguments will have old value.

ash · ‎06-11-2013

Ok thanks. Yes what I meant by re-set was to give a new value

Have a nice day.

ash · ‎06-13-2013

Hi everybody,

I'm really down. I had to change the include from /usr/local/cuda/include ( NVidia folder) to /opt/AMDAPP/include ( AMD folder). The thing is that, the both cl.hpp files are exactly the same (copied the latest version form Khronos registry) so why do I get errors when pointing the include path to amd folder?

I hope somebdy could help and I have absolutely no clue.

Regards,

ash

nou · ‎06-13-2013

cl.hpp include cl.h and other headers. so chect those too. and what error do you get?

himanshu_gautam · ‎06-14-2013

What is the error that you get? Without specifying the error, we really cannot help you out here -- as much as we want to.

-

Bruha

ash · ‎06-14-2013

Hi,

Previously with a test program I had good results on NVIDIA but the same code was giving memory leaks or wrong results on AMD CPU. Then I found out that some objects were desallocated before the enqueueNDRange call, and corrected it. My code was then working fine on AMD CPU and NVIDIA GPU. Now, and I really don't know why my code is not running on NVIDIA anymore, I have the message : memory corrupted free some libgcc detected error and it crash.

I'll try to run my code on another computer and tell you what I find. It doesn't seem to come from the code anymore, or at least I hope so . Should be some library or some systems linking problems. Because, it seems that people are able to run the same code on NVIDIA gpu and AMD cpu without any problem.

The difference form the code above, is that I used the #define __CL_ENABLE_EXCEPTIONS( for error handling) and that my function doing the test takes in argument a reference to a device. Is that a wrong thing to do?

Going on further investigation.

if somebody has any idea, meanwhile I'll test and come back later.

Regards,

ash

himanshu_gautam · ‎06-16-2013

If you think that __CL_ENABLE_EXCEPTIONS is not working, try this simple program at http://www.thebigblob.com/using-the-cpp-bindings-for-opencl/

Can you run it properly?

ash · ‎06-20-2013

Hi,

No it's not coming from the __CL_ENABLE_EXCEPTIONS. I have some other test program using this macro that were working fine.

My code was running fine on another computer. I think it's some lib links problems. I have to focus on something else for now, so I'll come back to this after I finish.

Thanks for your help.

Best regards,

ash

ash · ‎07-15-2013

Hi everybody,

It's been a long time. I've been doing my work leaving the AMD problem I had for now. And I have some questions about the convolution. I want to use the FFT implementation for a convolution. Since I still work on NVidia device I read here that it's better to use Apple's clFFT. What library or implementation do you recommand me to use to work on Nvidia GPU with the C++ OpenCL bindings ?

Regards,

ash

himanshu_gautam · ‎05-08-2013

Please post the code as a zipped attachment if the issue is still not resolved.

Archives Discussions

How to implement cl_khr_icd?