rick_weber

clUtil - a library for making OpenCL as easy to use as CUDA

I've written a library that makes OpenCL really easy to use. Consider this trivial program that writes the number 20 to every element in an array:

#include <CL/cl.h> //<OpenCL/cl.h> on Mac OS X

char const* kernelSource =
"__kernel void fill(__global float* array, unsigned int arrayLength, float val)"
"{"
"    if(get_global_id(0) < arrayLength)"
"    {"
"        array[get_global_id(0)] = val;"
"    }"
"}";

int main(int argc, char** argv)
{
    float val = 20.0f;
    float array[2000];
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue commandQueue;
    cl_mem buffer;
    cl_program program;
    cl_kernel kernel;
    unsigned int length = 2000;

    //Initialization
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    commandQueue = clCreateCommandQueue(context, device, 0, &err);
    program = clCreateProgramWithSource(context, 1, &kernelSource, 0, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "fill", &err);

    //Allocate memory
    buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(array), NULL, &err);

    //Actually call the kernel
    err = clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
    err = clSetKernelArg(kernel, 1, sizeof(length), &length);
    err = clSetKernelArg(kernel, 2, sizeof(val), &val);

    size_t global;
    size_t local = 64;

    //Round the global size up to a multiple of the local size
    global = length % local == 0 ? length : (length / local + 1) * local;

    err = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

    //Copy data back
    err = clEnqueueReadBuffer(commandQueue, buffer, CL_TRUE, 0, sizeof(array), array, 0, NULL, NULL);

    //Free the data on the GPU
    clReleaseMemObject(buffer);
}
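As an aside, the global size computation in the program above is just a round-up to the next multiple of the work-group size. Here is a minimal sketch of it as a standalone helper. The name roundUpToMultiple is made up for illustration and is not part of OpenCL or clUtil.

```cpp
#include <cassert>
#include <cstddef>

// Round `length` up to the nearest multiple of `local` (the work-group size),
// so that the global work size divides evenly into work-groups.
// Hypothetical helper name; not an OpenCL or clUtil function.
std::size_t roundUpToMultiple(std::size_t length, std::size_t local)
{
    assert(local > 0);
    return ((length + local - 1) / local) * local;
}
```

For 2000 elements and a work-group size of 64 this gives 2048. The extra 48 work-items do nothing because of the get_global_id(0) < arrayLength check in the kernel.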
This is the same program using clUtil:

kernel.cl:

__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
    if(get_global_id(0) < arrayLength)
    {
        array[get_global_id(0)] = val;
    }
}

main.cc:

#include <clUtil.h>

int main(int argc, char** argv)
{
    const char* kernelFiles[] = {"kernel.cl"};
    cl_mem buffer;
    float array[2000];
    unsigned int length = 2000;
    float val = 20.0f;

    clUtilInitialize(kernelFiles, 1);
    clUtilAlloc(sizeof(array), &buffer);

    clUtilEnqueueKernel("fill", clUtilGrid(length, 64), buffer, length, val);
    clUtilDeviceGet(array, sizeof(array), buffer);

    clUtilFree(buffer);
}
It makes assumptions about devices and platforms to reduce the number of handles you have floating around, and it uses C++0x constructs to make calling kernels significantly easier.
It currently runs on Linux.
Source and documentation are at http://code.google.com/p/clutil/


rick_weber

I've updated clUtil to support 1D images (emulated on 2D images) and asynchronous data transfers/executions. I've also added examples of each of these features in the examples directory. You need gcc 4.4+ to compile the library and gcc 4.5+ to use lambdas (as is done in the Asynchronous example) with -std=c++0x.


Hi rick,

It's good to see people working on this kind of thing, and particularly interesting to see C++0x features being used (variadic templates should have been in C++ years ago). Anything to get us away from C. When I have a chance I'll try to take a look at your code, though I'm not a big Linux user at the moment, so it may be a few weeks.

Do you feel that this sort of util library is useful in production (if you extend it to optionally not assume platforms and devices, anyway) or more as a learning tool?

Lee


The library significantly eases coding and debugging OpenCL programs, so I would say that it is a good learning tool. However, I would also say that it significantly reduces the amount of stuff you have to take care of as a developer. For example, every kernel is shoved into a std::map when you call clUtilInitialize(), so when you call clUtilEnqueueKernel(), it looks for the kernel by name in the map associated with the currently selected device. That way, you don't have to call clCreateKernel() every time you call a routine, and then figure out what to do with it when you're done. You also don't have to call clSetKernelArg() or any of that stupidness that comes with the territory of C. I hope clUtil is extremely useful in production by dramatically increasing programmer productivity, reducing code bloat, and improving readability.
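To give a rough idea of the pattern described above (look a kernel up by name, then set every argument with one variadic call), here is a minimal sketch. It is not clUtil's actual code: FakeKernel, kernelRegistry, setArgs and enqueueKernel are hypothetical names, and FakeKernel only records the (index, size) pairs that a real implementation would hand to clSetKernelArg().

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a cl_kernel handle: it only records the
// (index, size) pairs that would be passed to clSetKernelArg().
struct FakeKernel
{
    std::vector<std::pair<unsigned, std::size_t> > args;
};

// Kernels are registered once and then looked up by name.
std::map<std::string, FakeKernel> kernelRegistry;

// Base case: no arguments left to set.
inline void setArgs(FakeKernel&, unsigned) {}

// Recursive case: record the first argument, then recurse on the rest.
// A real version would call
// clSetKernelArg(kernel, index, sizeof(first), &first) here.
template <typename First, typename... Rest>
void setArgs(FakeKernel& kernel, unsigned index,
             const First& first, const Rest&... rest)
{
    kernel.args.push_back(std::make_pair(index, sizeof(first)));
    setArgs(kernel, index + 1, rest...);
}

// Find the kernel by name and set all of its arguments in one call.
template <typename... Args>
FakeKernel& enqueueKernel(const std::string& name, const Args&... args)
{
    std::map<std::string, FakeKernel>::iterator it = kernelRegistry.find(name);
    assert(it != kernelRegistry.end()); // kernel must have been registered
    it->second.args.clear();
    setArgs(it->second, 0, args...);
    return it->second;
}
```

After registering a kernel under the name "fill", a call such as enqueueKernel("fill", buffer, length, val) records three arguments with indices 0, 1 and 2, which is the bookkeeping the three clSetKernelArg() calls do by hand in the plain OpenCL version.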

I've been thinking about a Windows port, but I don't have Visual Studio, and I'm not sure if it supports rvalue references and variadic templates yet (both of which are required by clUtil). I know it was one of the first compilers to support C++0x lambdas (probably for their PPL library).

As for the assumptions clUtil makes about platforms and devices, they're generally pessimistic. They assume you want to compile all source files for every device in every platform. You effectively get a flat device list of all the devices on the system. This assumption has a small esoteric problem that I'm looking at addressing now: what happens if a given device is supported by more than one platform?
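The flat device list and the duplicate-device problem can be illustrated with a small sketch. This is only an illustration under assumed data structures, not how clUtil stores things: Platform, DeviceEntry, flattenDevices, duplicatedDevices and deviceAt are hypothetical names, and devices are identified by plain strings rather than cl_device_id handles.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins: a platform is a name plus its device names.
struct Platform
{
    std::string name;
    std::vector<std::string> devices;
};

struct DeviceEntry
{
    std::string platform;
    std::string device;
};

// Concatenate every platform's devices into one flat list, in platform order.
std::vector<DeviceEntry> flattenDevices(const std::vector<Platform>& platforms)
{
    std::vector<DeviceEntry> flat;
    for (std::size_t p = 0; p < platforms.size(); ++p)
    {
        for (std::size_t d = 0; d < platforms[p].devices.size(); ++d)
        {
            DeviceEntry entry;
            entry.platform = platforms[p].name;
            entry.device = platforms[p].devices[d];
            flat.push_back(entry);
        }
    }
    return flat;
}

// Device names that appear more than once in the flat list,
// e.g. a CPU exposed by two different vendors' platforms.
std::set<std::string> duplicatedDevices(const std::vector<DeviceEntry>& flat)
{
    std::set<std::string> seen;
    std::set<std::string> duplicates;
    for (std::size_t i = 0; i < flat.size(); ++i)
    {
        if (!seen.insert(flat[i].device).second)
            duplicates.insert(flat[i].device);
    }
    return duplicates;
}

// Select a device by its position in the flat list.
const DeviceEntry& deviceAt(const std::vector<DeviceEntry>& flat, std::size_t index)
{
    assert(index < flat.size());
    return flat[index];
}
```

With one platform exposing a GPU and a CPU and a second platform exposing the same CPU, the flat list has three entries and the CPU shows up twice, which is exactly the case that needs a policy.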


Excellent, it sounds like you're putting a good amount of effort in. If I get a chance at some point I may experiment with VC++ on your behalf. Keep us informed!


I've added new features to clUtil. You can now specify that you want out-of-order execution before calling clUtilInitialize, and it will be enabled on platforms that support it.

I've added clUtilFinalize, which does the opposite of clUtilInitialize; it frees all the contexts, command queues and what have you that clUtil uses behind the scenes.

Also, I've started writing examples. If anyone wants to see a specific example, just let me know; they generally aren't hard to write.

Additionally, I've started adding library functions callable from kernels that are executed on a thread block level. For example, radixSortLG sorts an array of length n using all threads in get_local_size(0). This is useful if you want to sort many arrays in parallel (e.g. sort each column or row of a matrix). I've also included sum, scan, and max. Currently, I'm just adding functions that I need for my own projects, but if anyone has requests (or wants to contribute), that would be great.

To use the aforementioned functions, just #include <clUtil.cl> in your kernel files. 
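For anyone unfamiliar with work-group level routines like the sum mentioned above, here is a CPU-side emulation of the usual tree reduction in plain C++. It is a sketch of the general technique only, not the code in clUtil.cl: workGroupSum is a hypothetical name, the vector plays the role of __local memory, and the inner loop stands in for the work-items of one group. It assumes the group size is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of a work-group sum via tree reduction.
// `scratch` plays the role of __local memory, one element per work-item.
float workGroupSum(std::vector<float> scratch)
{
    const std::size_t localSize = scratch.size();
    // This sketch assumes a non-empty, power-of-two group size.
    assert(localSize > 0 && (localSize & (localSize - 1)) == 0);

    for (std::size_t stride = localSize / 2; stride > 0; stride /= 2)
    {
        // On a device, every work-item with local id < stride performs
        // this step in parallel, followed by barrier(CLK_LOCAL_MEM_FENCE).
        for (std::size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
    }
    return scratch[0];
}
```

Each pass halves the number of active work-items, so a group of 64 finishes in 6 passes. Running one such reduction per work-group is what lets you process many rows or columns of a matrix at once.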

Also, I've updated the Makefile, so you can actually install clUtil into /usr/include and /usr/lib by running make install.


I've created a user group for clUtil at:

http://groups.google.com/group/clutil-users

If you actually use this, send me an email, as I have no idea how much interest there is in this or what people want.
