
Wheedle
Journeyman III

Newbie question about workspace, buffers and char3

Hi everybody,

I'm stuck on a small problem which is most probably due to my lack of knowledge of OpenCL. What I'm trying to do is capture a color image, send it to my GPU, convert it to grayscale, and read it back. I'm using buffers to do so because it's only an exercise. Since the mistake is most probably a logical one, I'm also writing down my reasoning. Here it is:

In order to read the picture I first use OpenCV:

IplImage* frame = cvLoadImage("D:\\Special Folders\\Pictures\\SamplePictures\\JellyfishS.jpg");       

W = frame->width;

H = frame->height;

To access the image itself, the IplImage structure provides me with a char pointer, imageData. I can also access the width and height of the image in pixels.

I also create an "empty image" with only one channel, which will receive the result of the kernel.

IplImage * op_data = cvCreateImage(cvSize(frame->width,frame->height),frame->depth,1); // last argument is the number of channels

After creating the platform, context, queue, program and so on I create the buffers like that:

size_t mem_size = (size_t)frame->imageSize; // size of the image in bytes

size_t outputSize = (size_t)op_data->imageSize;

cl_mem data = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, mem_size, (void*)frame->imageData, &ciErrNum);

cl_mem output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, outputSize, NULL, &ciErrNum);

My kernel is created like that:

cl_kernel mykernel = clCreateKernel(myprog, "luminosityGrayscale", &status);

I set the arguments:

status = clSetKernelArg(mykernel, 0, sizeof(cl_mem), &data);

status = clSetKernelArg(mykernel, 1, sizeof(cl_mem), &output);

And finally I execute like that:

size_t globalws[2] = {W, H};

cl_event evt;

status = clEnqueueNDRangeKernel(queue, mykernel, 2, NULL, globalws, NULL, 0, NULL, &evt);

I read back the data like that:

status = clEnqueueReadBuffer(queue, output, CL_TRUE, 0, outputSize, (void*)op_data->imageData, 0, NULL, NULL);

Now, the image I manipulate is a 24-bit RGB image (8 bits per channel), and OpenCV stores the pixels as an interleaved sequence, RGBRGBRGB... So for my kernel I decided to use a char3 for the input argument, thinking that this would let me manipulate each pixel easily, and a char for the output argument, since the resulting image should have only one channel (8 bits per pixel). As shown in the code above, I also created a workspace with the size of my image in pixels, which means one work-item per pixel, if I understand the concept correctly. So my kernel looks like this:

__kernel void luminosityGrayscale (__global uchar3 * data, __global uchar * output){

       const int i = get_global_id(0);
       const int j = get_global_id(1);
       const int sizei = get_global_size(0);

       int index = i + j * sizei;

       uint R = (int)data[index].s0;
       uint G = (int)data[index].s1;
       uint B = (int)data[index].s2;

       uint average = R * 0.07 + G * 0.71 + B * 0.21;

       output[index] = (uchar)average;
}

The result of this is wrong. I do get a grayscale image, but it looks as if the image has been duplicated three or four times, superposed, and shifted.

My guess is that I screwed up the indexing in the kernel, but I can't pinpoint the flaw. It's also possible that I made a mistake somewhere else. So my question is: where is my logic flawed?


antzrhere
Adept III

The reason it is wrong is that all 3-component data types (char3, int3, float3, etc.) are arranged as 4-component vectors IN MEMORY (i.e. your char3 is actually aligned to a 4-byte boundary). 3-component vectors are a special case that was added to OpenCL later in its lifecycle; the padding allows the fourth component to be ignored in arithmetic.

That's why your image is 'shifted': because of the size difference, each read is incrementally misaligned by an additional byte.

The only workaround is to input your data as an RGBA image. You can then choose to read it as a char4, or as a char3 and the alpha channel will be ignored. It is true you could also read each component as a scalar value from a char[] buffer, but that would be very inefficient and bad practice.
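You can even see the padding from the host side (a minimal sketch; it relies on the standard cl_platform.h headers, where the 3-component host types are typedef'd to their 4-component counterparts):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    /* prints 4, not 3: cl_uchar3 is an alias of cl_uchar4 in cl_platform.h */
    printf("sizeof(cl_uchar3) = %u\n", (unsigned)sizeof(cl_uchar3));
    return 0;
}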


Great! Thank you very much antzrhere.

I've modified my kernel like this:

__kernel void luminosityGrayscale (__global uchar * data, __global uchar * output, int widthStep){

          // widthStep: the length of a row in bytes
          const int i = get_global_id(0);
          const int j = get_global_id(1);
          const int sizei = get_global_size(0);

          int indexIn = 3 * i + j * widthStep;
          int indexOut = i + j * sizei;

          int average = (int)data[indexIn] * 0.07 + (int)data[indexIn + 1] * 0.71 + (int)data[indexIn + 2] * 0.21;

          output[indexOut] = (uchar)average;
}

Do you think it is still inefficient and bad practice this way? I modified it like this because adding a 4th byte to each pixel seems a bit heavy, especially since I would have to do it from the host side.

Anyhow, thanks again. I would have spent weeks figuring this out.


char3 is really char4 in memory?  weird.

anyway: yes, doing it byte by byte will be relatively inefficient (as antzrhere has already stated), but outside of a micro-benchmark or a particularly large input image, it might not matter all that much for you.

About the best I could manage was taking the input argument as a uint (or uchar4) and converting 3 consecutive uints into 4 consecutive output pixels, but that requires the total pixel count to be a multiple of 4 (or handling one edge case somehow). That way every read and every write is a 4-byte int, which is more efficient for the GPU. In this case just use a 1-D work size of width*height/4; otherwise you need a 4-pixel alignment for each row. It also lets you easily use 'persistent' kernels, which means you can specify the work size to match the device characteristics and let the work-items handle the batching, rather than specifying the work size to match the problem size - it's a lot simpler than it sounds.
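For what it's worth, a minimal sketch of the packed approach (my own naming; it assumes a little-endian device, the same byte order and weights as your kernel - here as 16.16 fixed point - and a total pixel count that is a multiple of 4):

__kernel void gray4(__global const uint * in, __global uint * out){

    const int gid = get_global_id(0); // one work-item per 4 packed pixels

    // 3 uints = 12 bytes = 4 consecutive 3-byte pixels (little-endian byte order)
    const uint w0 = in[gid * 3 + 0];
    const uint w1 = in[gid * 3 + 1];
    const uint w2 = in[gid * 3 + 2];

    // channel weights 0.07, 0.71, 0.21 scaled by 65536
    const uint g0 = ((w0 & 0xFF) * 4588 + ((w0 >> 8) & 0xFF) * 46531 + ((w0 >> 16) & 0xFF) * 13763) >> 16;
    const uint g1 = ((w0 >> 24) * 4588 + (w1 & 0xFF) * 46531 + ((w1 >> 8) & 0xFF) * 13763) >> 16;
    const uint g2 = (((w1 >> 16) & 0xFF) * 4588 + (w1 >> 24) * 46531 + (w2 & 0xFF) * 13763) >> 16;
    const uint g3 = (((w2 >> 8) & 0xFF) * 4588 + ((w2 >> 16) & 0xFF) * 46531 + (w2 >> 24) * 13763) >> 16;

    // pack the 4 grey bytes into a single 4-byte write
    out[gid] = g0 | (g1 << 8) | (g2 << 16) | (g3 << 24);
}

Launch it with a 1-D global size of width*height/4.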

If you have to convert the RGB image to RGBx on the cpu anyway, you might be better off just converting it to greyscale there, since you're already scanning the memory and the ALU work will be effectively free - and you'll have less data to send to the gpu too.
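For example, the host-side conversion is a single pass (a sketch against the IplImage fields used earlier; widthStep accounts for any row padding, and the weights match the kernel above):

for (int y = 0; y < H; ++y)
{
    const unsigned char * src = (const unsigned char *)(frame->imageData + y * frame->widthStep);
    unsigned char * dst = (unsigned char *)(op_data->imageData + y * op_data->widthStep);

    for (int x = 0; x < W; ++x)
        dst[x] = (unsigned char)(src[3*x] * 0.07f + src[3*x + 1] * 0.71f + src[3*x + 2] * 0.21f);
}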

Also note that if you're going to take the greyscale and 'do stuff' with it afterwards, it might be more efficient to store the data in a format that matches the algorithms and the hardware, e.g. float. For kernels that do only a small amount of work per pixel, the conversion overheads (and the necessary clamping and so on) can be quite onerous, not to mention the loss of precision. But obviously that depends on your application.


If you're using VLIW hardware, the compiler will probably pack the instructions pretty well whether or not you use vector operations. However, accessing individual bytes at a time isn't efficient (it will be cached, and the compiler may do a good job, but it isn't guaranteed) - that's the problem with working on unaligned data. Put it through KernelAnalyzer to get an idea of the performance. Anyway, whether it's efficient or not is probably a bit moot, as I imagine this is a test case and the whole thing could be performed faster on the host.

One thing I noticed: in the last line of code you convert data[] to (int) and then multiply it by a double-precision floating-point value (unless the compiler converts 0.71 to 0.71f, in which case single precision), and then convert back to an int. A better way would be to keep it all in integers (on Cayman, for instance, float<->int conversions can only be performed on the T-vector afaik and thus reduce performance significantly):

int average = convert_int(data[indexIn]) * 4588 + convert_int(data[indexIn + 1]) * 46531 + convert_int(data[indexIn + 2]) * 13763;

output[indexOut] = convert_uchar(average >> 16);
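(For reference, those constants are just the weights scaled to 16.16 fixed point: 0.07 * 65536 ≈ 4588, 0.71 * 65536 ≈ 46531 and 0.21 * 65536 ≈ 13763; the final >> 16 undoes the scaling.)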

antzrhere
Adept III

...just a note on my last post - I just read that on Southern Islands integer multiplication runs at 1/4 the rate of single-precision multiplication, so depending on your hardware it may be quicker to convert to floats (i.e. use floats on Southern Islands based chips, but possibly ints on Cayman chips)... however, it then all depends on how quickly float<->int conversion is performed on Southern Islands - the best bet is to test it with KernelAnalyzer.


What about using the AMD media-ops pack/unpack stuff? Does that access instructions not otherwise used in the compilation, or are they just helper macros?

And you don't need a full 32-bit multiply for 8-bit values either; the mul24 stuff would suffice, e.g. it could use fixed point and avoid the format conversion.
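Something like this, say (a sketch of the same computation; mul24 is a standard OpenCL built-in, and the constants are the 16.16 fixed-point weights from above):

int average = mul24((int)data[indexIn], 4588)
            + mul24((int)data[indexIn + 1], 46531)
            + mul24((int)data[indexIn + 2], 13763);

output[indexOut] = convert_uchar(average >> 16);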

At one point I decided I was better off using images for pixel data, as they give you the format conversion 'for free' (and also allow the same code to work with different data formats more easily, e.g. the storage might only be 8-bit), and just coding everything in floats (I'm a fan of fixed point, but it does get tedious). Although using images comes with its own performance costs and issues, which affect algorithmic choice.

Thank you both for your input. Your suggestions will be very helpful pretty soon. Until now I was only concerned with making it work, whatever the efficiency. I have some more kernels to write, because in the end I want to make a basic motion detection app (for a school project), so later on efficiency will definitely be a concern, but until then I have to master some more of the basics.


If you're planning to do some heavy image manipulation/detection stuff, then I suggest you move to 2D images instead of buffers, as caching will be more efficient for 2D lookups; plus, as notzed said, you get free image conversion to floats. However, I'm quite sure 24-bit RGB is not supported for images, leaving you with either 16-bit RGB or 32-bit RGBA. I know you may have to do some conversion host-side, but if the kernel is quite computationally intensive you will see the performance benefits of doing it this way (as opposed to a simple greyscale conversion, which is probably PCIe-transfer limited). Alternatively, you can read blocks of the image into __local memory and read from that to limit global memory access - but that may not fit your problem, whereas the image approach is more flexible.
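The greyscale kernel then becomes something like this (a sketch with my own naming; it assumes the input image was created as CL_RGBA / CL_UNORM_INT8 and the output as a single-channel float-convertible format such as CL_R / CL_UNORM_INT8):

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

__kernel void luminosityGrayscaleImage(read_only image2d_t src, write_only image2d_t dst){

    const int2 pos = (int2)(get_global_id(0), get_global_id(1));

    // read_imagef returns normalised floats in [0,1] for CL_UNORM_INT8 - the conversion is free
    const float4 px = read_imagef(src, smp, pos);

    // same channel weights as the buffer version (x = first channel, and so on)
    const float grey = px.x * 0.07f + px.y * 0.71f + px.z * 0.21f;

    write_imagef(dst, pos, (float4)(grey, grey, grey, 1.0f));
}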


I was going to say: if you implement a simple RGB-to-RGBA conversion host-side now (very simple) and start using 2D images early on (i.e. now), it will speed up your coding/development a lot, as the read_image/write_image functions are much more friendly and efficient than buffers for image manipulation, and the code is more readable. It will also save you doing it later when you have a lot more code to change... just a thought.
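The host-side creation is a single call (a sketch using the OpenCL 1.1 entry point; rgba_data is a hypothetical name for your buffer after the RGB-to-RGBA expansion):

cl_image_format fmt;
fmt.image_channel_order = CL_RGBA;
fmt.image_channel_data_type = CL_UNORM_INT8;

/* rgba_data: W*H*4 bytes of expanded pixel data (hypothetical name) */
cl_mem img = clCreateImage2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             &fmt, W, H, 0, (void*)rgba_data, &ciErrNum);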


I'm definitely going to switch to the 2D-image structure now. I was using buffers only to get a grasp on the technology... and it was quite useful, if only for this discussion and your input about data alignment.

Thanks again!


Hello Wheedle,

would you mind sharing your code? I'm trying to load images with OpenCV to process them with OpenCL, and I'm having some problems.

Thanks in advance!

Martin.


Hey mpineyro,

The code is actually in my original post. Here it is in a clearer form (I hope):

Get the image:

IplImage* frame = cvLoadImage("D:\\Special Folders\\Pictures\\SamplePictures\\JellyfishS.jpg");

Create the output image (empty):

IplImage * op_data = cvCreateImage(cvSize(frame->width,frame->height),frame->depth,1);

Get the sizes of the two images:

size_t mem_size = (size_t)frame->imageSize; // size of the image in bytes

size_t outputSize = (size_t)op_data->imageSize;

Then create the input and output buffer:

cl_mem data = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, mem_size, (void*)frame->imageData, &ciErrNum);


Now my original image is copied into the buffer, thanks to the CL_MEM_COPY_HOST_PTR flag and because I passed frame's imageData pointer as a parameter.
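(If you suspect that path, an equivalent alternative is to create the buffer empty and copy explicitly - a sketch:)

cl_mem data = clCreateBuffer(context, CL_MEM_READ_ONLY, mem_size, NULL, &ciErrNum);

/* blocking write: returns once the image bytes are in the buffer */
ciErrNum = clEnqueueWriteBuffer(queue, data, CL_TRUE, 0, mem_size, (void*)frame->imageData, 0, NULL, NULL);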

The output buffer (nothing special):

cl_mem output = clCreateBuffer(context,CL_MEM_WRITE_ONLY,outputSize,NULL,&ciErrNum);

I read back the result like that:

status = clEnqueueReadBuffer(queue, output, CL_TRUE, 0, outputSize, (void*)op_data->imageData, 0, NULL, NULL);

I pass the imageData pointer of the output image to clEnqueueReadBuffer.

Hope it'll help!


Hey Wheedle, thanks for your answer:

I used the code you included in your original post, but I can't get it to work. Right now I get a completely black output image when I run it. As there are no build errors and no execution errors, I'm kind of lost.

I was asking for the complete code to check whether I'm creating the context and passing the image data to the queue properly (the code I used for the context creation was extracted from the "template" sample of the AMD SDK).

It would be of great help if you could post your entire cpp file so I can have a look at it.

Thanks!

Martin


Hey, OK, sure, I can do that as soon as I'm home (I'll need to clean it up a lot though, because it's just a testing file where I put a lot of c**p).

But if you have no errors (provided you did what's required to catch them), I'd say check the kernel, and especially the way you assign the memory addresses each work-item works on (I once had a black screen too, because I was creating a 2D index space while my logic for accessing the right elements in the buffer was written for 1D).
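By "what's required to catch them" I mean checking every returned status code, e.g. with a little helper (a sketch; CHECK_CL is a hypothetical macro, not part of OpenCL):

#define CHECK_CL(err, msg) \
    do { if ((err) != CL_SUCCESS) { \
        fprintf(stderr, "%s failed with error %d\n", (msg), (int)(err)); \
        exit(EXIT_FAILURE); } } while (0)

/* usage: */
status = clEnqueueNDRangeKernel(queue, mykernel, 2, NULL, globalws, NULL, 0, NULL, &evt);
CHECK_CL(status, "clEnqueueNDRangeKernel");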

Anyhow, I'll send it ASAP.


OK, there you go. I warn you, it's very dirty code with nothing special at all (besides the fact that it's messy). It does something a bit different than when I wrote the original post: now I capture a video and do a naive background subtraction...


Wheedle, thanks a lot! I'll give it a try.

Martin.


Hi Wheedle, I'm still trying to get this test working.

Can you please post the background subtraction kernel code? Thanks a lot!!

Martin


Here it is:

__kernel void backgroundSubstraction(__global uchar * image, __global uchar * reference, __global uchar * output){

    const int i = get_global_id(0);
    const int j = get_global_id(1);
    const int sizei = get_global_size(0);

    int index = i + j * sizei;

    output[index] = ((uint)image[index] - (uint)reference[index]);
}
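One thing to keep in mind with this kernel: the subtraction wraps around whenever image[index] is smaller than reference[index]. If that gives you artefacts, the built-in abs_diff avoids it (a one-line variant):

output[index] = abs_diff(image[index], reference[index]);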

If it still doesn't work, maybe you could share your code... maybe I could pinpoint something.


Thanks!

Yesterday I commented out the part of your code that has to do with the background subtraction kernel (loading, execution, etc.) and it worked.

I still have a problem at the end of the execution: when it finishes playing the video (and converting it to grayscale), a message pops up saying something like "the program stopped working correctly". Anyway, I didn't have time to revise the code in detail looking for possible causes, such as memory leaks.

Your code was very helpful, thanks a lot!

Regards,

Martin.


Yep, it's because of an infinite loop. I used to capture video from my webcam, not from a file, and I didn't change the loop to add a condition that detects the last frame. As I told you, it's dirty code just for testing stuff.
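For a file, the exit condition is just a NULL check on the grabbed frame (a sketch using OpenCV's C API; capture stands for your CvCapture* handle):

IplImage * frame;

/* cvQueryFrame returns NULL once the video runs out of frames */
while ((frame = cvQueryFrame(capture)) != NULL)
{
    /* ... process the frame ... */
}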


I see!

So I'll code the detection of the last frame.

Best,

Martin
