
    Newbie question about workspace, buffers and char3

    Wheedle

      Hi everybody,

      I’m stuck with a small problem which is most probably due to my lack of knowledge of OpenCL. What I’m trying to do is capture a colored image, send it to my GPU, transform it to grayscale, and get it back. I’m using buffers because it’s only an exercise. Since the problem is most probably a logical mistake, I’m also writing down my reasoning. Here it is:

      In order to read the picture I first use OpenCV:

       

      IplImage* frame = cvLoadImage("D:\\Special Folders\\Pictures\\SamplePictures\\JellyfishS.jpg");       

      W = frame->width;

      H = frame->height;

       

      To access the image itself, the IplImage structure provides me with a char pointer, “imageData”. I can also access the width and height of the image in pixels.

      I also create an “empty image” with only one channel that will receive the result of the kernel.

       

      IplImage * op_data = cvCreateImage(cvSize(frame->width,frame->height),frame->depth,1); //last argument is the number of channels

       

      After creating the platform, context, queue, program and so on I create the buffers like that:

      size_t mem_size = (size_t)frame->imageSize; //returns the size of the image in bytes

      size_t outputSize = (size_t)op_data->imageSize;

      cl_mem data = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,mem_size, (void*)frame->imageData, &ciErrNum);

      cl_mem output = clCreateBuffer(context,CL_MEM_WRITE_ONLY,outputSize,NULL,&ciErrNum);

       

      My kernel is created like that:

      cl_kernel mykernel = clCreateKernel (myprog , "luminosityGrayscale",&status);

       

      I set the arguments:

      status = clSetKernelArg(mykernel, 0, sizeof(cl_mem), &data);

      status = clSetKernelArg(mykernel,1,sizeof(cl_mem), &output);

       

      And finally I execute like that:

      size_t globalws[2] = {W, H};

      cl_event evt;

      status = clEnqueueNDRangeKernel(queue, mykernel, 2, NULL, globalws, NULL, 0, NULL, &evt);

       

      I read back the data like that:

      status = clEnqueueReadBuffer(queue, output, CL_TRUE, 0, outputSize, (void*)op_data->imageData, 0, NULL, NULL);

       

      Now, because the image I manipulate is a 24-bit RGB image (so 8 bits per channel), and because OpenCV stores the image with the channels interleaved (RGBRGBRGB...), I decided to use a char3 for the kernel’s input argument, thinking that this way I could easily manipulate each pixel. For the output argument I used a char, since the resulting image should only have one channel (so only 8 bits per pixel). As shown in the code above, I also created a workspace with the size of my image in pixels (which means one work-item per pixel, if I understand the concept correctly). So my kernel looks like this:

       

      __kernel void luminosityGrayscale (__global uchar3 * data, __global uchar * output){

             const int i = get_global_id(0);

             const int j = get_global_id(1);

             const int sizei = get_global_size(0);

             int index = i+j*sizei;

             uint R = (int)data[index].s0;

             uint G = (int)data[index].s1;

             uint B =(int)data[index].s2;

             uint average = R*0.07+G*0.71+B*0.21;    

             output[index] = (uchar)average;  

      }

       

      The result of this is wrong. I get a grayscale image all right, but it looks like a duplicated (three or four times), superposed, shifted image.

      My guess is that I screwed up the index part in the kernel, but I can’t pinpoint the flaw. It’s also possible that I made a mistake somewhere else. So my question is: where is my logic flawed?

       


        • Re: Newbie question about workspace, buffers and char3
          antzrhere

           The reason it is wrong is that all 3-component data types (char3, int3, float3, etc.) are arranged as 4-component vectors IN MEMORY (i.e. your char3 is actually aligned to a 4-byte boundary). 3-component vectors are a special case that was added to OpenCL later on during its cycle. This allows the fourth component to be ignored in arithmetic.

           

           That's why your image is 'shifted': because of the different sizes, each read is incrementally misaligned by an additional byte.

           

           The only workaround is to input your data as an RGBA image. You can then choose to read it as a char4, or as a char3 and the alpha channel will be ignored. It is true you could also read each component as a scalar value from a char[] buffer, but this would be very inefficient and bad practice.
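
           A quick host-side sketch of that workaround (just an illustration, assuming an OpenCL 1.1+ cl_platform.h, where cl_char3 is defined, and the OpenCV C API used elsewhere in this thread; the file name is only a placeholder):

           #include <CL/cl.h>
           #include <opencv/cv.h>
           #include <opencv/highgui.h>
           #include <stdio.h>

           int main(void)
           {
               /* cl_char3 is a 4-byte type on the host as well, matching the in-kernel alignment */
               printf("sizeof(cl_char3) = %u, sizeof(cl_char4) = %u\n",
                      (unsigned)sizeof(cl_char3), (unsigned)sizeof(cl_char4));

               /* pad the 3-channel OpenCV image to 4 channels before uploading,
                  so each pixel lines up with a uchar4/char4 on the device */
               IplImage *frame = cvLoadImage("input.jpg", CV_LOAD_IMAGE_COLOR); /* placeholder path */
               IplImage *rgba  = cvCreateImage(cvGetSize(frame), frame->depth, 4);
               cvCvtColor(frame, rgba, CV_BGR2BGRA);  /* OpenCV keeps pixels in BGR order */

               /* rgba->imageData and rgba->imageSize can now be passed to clCreateBuffer as before */
               cvReleaseImage(&rgba);
               cvReleaseImage(&frame);
               return 0;
           }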

            • Re: Newbie question about workspace, buffers and char3
              Wheedle

               Great! Thank you very much, antzrhere.

               I've modified my kernel like this:

               

              __kernel void luminosityGrayscale (__global uchar * data, __global uchar * output, int widthStep){

               //widthStep: the length of a row in bytes

                        const int i = get_global_id(0);

                        const int j = get_global_id(1);

                        const int sizei = get_global_size(0);

                        int indexIn = 3*i+j*widthStep;

                        int indexOut = i+j*sizei;

                        int average = (int)data[indexIn]*0.07+(int)data[indexIn+1]*0.71+(int)data[indexIn+2]*0.21;

                        output[indexOut] = (uchar)average; 

              }

               Do you think it is also inefficient and bad practice like that? I modified it this way because adding a 4th byte to each pixel seems a bit heavy, especially since I have to do it from the host side.

               Anyhow, thanks again; I would have spent weeks figuring this out.

                • Re: Newbie question about workspace, buffers and char3
                  notzed

                  char3 is really char4 in memory?  weird.

                   

                   anyway: yes, doing it byte by byte will be relatively inefficient (as antzrhere has already stated), but outside of a micro-benchmark or a particularly large input image, it might not be all that important for you.

                   

                   About the best I could manage was taking the input argument as a uint (or uchar4) and converting 3 consecutive uints into 4 consecutive output pixels, but that requires the total data to be a multiple of 4 pixels (or handling one edge case somehow). Here every read and every write is a 4-byte int, which is more efficient for the GPU. In this case just use a 1-D work size of width*height/4, otherwise you need a 4-pixel alignment for each row. (It also lets you easily use 'persistent' kernels, which means you specify the work size to match the device characteristics and let the work-items handle the batching, rather than specifying the work size to match the problem size - it's a lot simpler than it sounds.)
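
                   A hedged sketch of what that could look like (my own illustration of the idea, not tested code): it assumes tightly packed 24-bit pixels with no row padding, a total pixel count that is a multiple of 4, a little-endian device, and it reuses the channel order and 0.07/0.71/0.21 weights of the kernels above in 16.16 fixed point. The host would enqueue it with a 1-D global size of width*height/4.

                   __kernel void luminosityGrayscalePacked(__global const uint * data, __global uint * output){
                       const int g = get_global_id(0);        // one work-item per group of 4 pixels

                       // 4 packed 3-byte pixels = 12 bytes = 3 uints
                       uint w0 = data[3*g];
                       uint w1 = data[3*g + 1];
                       uint w2 = data[3*g + 2];

                       // unpack the 12 bytes (little-endian word layout assumed)
                       uchar b[12];
                       for (int k = 0; k < 4; k++) {
                           b[k]     = (uchar)(w0 >> (8*k));
                           b[k + 4] = (uchar)(w1 >> (8*k));
                           b[k + 8] = (uchar)(w2 >> (8*k));
                       }

                       // convert each of the 4 pixels and pack the 4 grey bytes into one output uint
                       uint packed = 0;
                       for (int p = 0; p < 4; p++) {
                           uint c0 = b[3*p];
                           uint c1 = b[3*p + 1];
                           uint c2 = b[3*p + 2];
                           uint grey = (c0*4588 + c1*46531 + c2*13763) >> 16;  // 0.07, 0.71, 0.21 in 16.16 fixed point
                           packed |= (grey & 0xFF) << (8*p);
                       }
                       output[g] = packed;
                   }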

                   

                   If you have to convert the RGB image to RGBx on the CPU anyway, you might be better off just converting it to greyscale there, since you're already scanning the memory and the ALU will be effectively free, and you'll have less data to send to the GPU too.

                   

                  Also note that if you're going to then be taking the greyscale and 'doing stuff' with it, it might be more efficient to store the data in a format that matches the algorithms and the hardware: e.g. float.  For kernels that do 1 bit of work on a single pixel, the conversion overheads (and necessary clamping and so on) can be quite onerous, not to mention the loss of precision.  But obviously that depends on your application.

                  • Re: Newbie question about workspace, buffers and char3
                    antzrhere

                     If you're using VLIW, the compiler will probably pack the instructions pretty well regardless of whether you use vector operations... however, accessing individual bytes at a time isn't efficient (it will be cached and the compiler may do a good job, but it isn't guaranteed) - but that's the problem with working with unaligned data. Put it through KernelAnalyzer to get an idea of the performance. Anyway, whether it's efficient or not is probably a bit moot, as I imagine this is a test case and the whole thing could be performed faster on the host.

                     

                     One thing I noticed: in the last line of code you convert data[] to (int), then multiply it by a double-precision floating-point value (unless the compiler is converting 0.71 to 0.71f, in which case single precision), and then convert back again to an int. A better way would be to keep it as ints (on Cayman, for instance, float<->int conversions can only be performed on the T-vector afaik and thus reduce performance significantly):

                     

                     

                     int average = convert_int(data[indexIn])*4588 + convert_int(data[indexIn+1])*46531 + convert_int(data[indexIn+2])*13763; // 0.07, 0.71, 0.21 scaled by 65536 (16.16 fixed point)
                     output[indexOut] = convert_uchar(average >> 16);

                • Re: Newbie question about workspace, buffers and char3
                  antzrhere

                   ...just a note on my last post - I just read that on Southern Islands, integer multiplication runs at 1/4 the rate of SP multiplication, so depending on your hardware it may be quicker converting to floats (i.e. floats on Southern Islands based chips, but possibly ints on Cayman chips)... however, it then all depends on how quickly float<->int conversion is performed on Southern Islands - the best bet is to test it with KernelAnalyzer.

                    • Re: Newbie question about workspace, buffers and char3
                      notzed

                       What about using the AMD media-ops pack/unpack stuff? Is that accessing instructions not otherwise used in the compilation, or are they just helper macros?

                       

                       And you don't need a full 32-bit multiply for 8-bit values either; using the mul24 stuff would suffice, e.g. it could use fixed point and avoid the format conversion.
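
                       For instance, the fixed-point line above could be written with mul24 like this (a hedged sketch reusing antzrhere's constants; both operands comfortably fit in 24 bits):

                       int average = mul24((int)data[indexIn],     4588)
                                   + mul24((int)data[indexIn + 1], 46531)
                                   + mul24((int)data[indexIn + 2], 13763);
                       output[indexOut] = convert_uchar(average >> 16);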

                       

                       At one point I decided I was better off using images for pixel data, as it gives you the format conversion 'for free' (and also allows the same code to work with different data formats more easily, e.g. the storage might only be 8 bit), and just coding everything in floats (I'm a fan of fixed point, but it does get tedious). Although using images comes with its own performance costs and issues which affect algorithmic choice.

                        • Re: Newbie question about workspace, buffers and char3
                          Wheedle

                           Thank you both for your input. Your suggestions will very soon be quite helpful. Until now I was only concerned with making it work, whatever the efficiency. I have some more kernels to write because in the end I want to make a basic motion detection app (for a project at school), so later on efficiency will definitely be a concern, but until then I have to master some more of the basics.

                            • Re: Newbie question about workspace, buffers and char3
                              antzrhere

                               If you're planning to do some heavy image manipulation/detection stuff then I suggest you move to 2D images instead of buffers, as caching will be more efficient for 2D lookups, plus as notzed said you get free image conversion to floats. However, I'm quite sure 24-bit RGB is not supported for images, leaving you with either 16-bit RGB or 32-bit RGBA. I know you may have to do some conversion host side, but if the kernel is quite computationally intensive, you will see the performance benefits from doing it this way (as opposed to a simple greyscale conversion, which is probably PCIe transfer limited). That, or read blocks of the image into __local memory and read from that to limit global memory access - but this may not fit your problem, whereas the image approach is more flexible.
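
                               A rough host-side sketch of what creating those images could look like (hedged: it borrows the context, W, H and ciErrNum names from the earlier posts, assumes a 4-channel IplImage called rgba like the padding sketch above, uses the OpenCL 1.1-era clCreateImage2D call, and CL_R/CL_UNORM_INT8 support is device dependent):

                               cl_image_format inFmt  = { CL_RGBA, CL_UNORM_INT8 };  /* 32-bit RGBA, read back as floats in [0,1] */
                               cl_image_format outFmt = { CL_R,    CL_UNORM_INT8 };  /* single-channel 8-bit output */

                               cl_mem srcImg = clCreateImage2D(context,
                                                               CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                                               &inFmt, W, H,
                                                               rgba->widthStep,       /* row pitch in bytes */
                                                               (void*)rgba->imageData,
                                                               &ciErrNum);

                               cl_mem dstImg = clCreateImage2D(context, CL_MEM_WRITE_ONLY,
                                                               &outFmt, W, H, 0, NULL, &ciErrNum);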

                                • Re: Newbie question about workspace, buffers and char3
                                  antzrhere

                                   I was going to say... if you implement a simple RGB to RGBA conversion host side now (very simple) and start using 2D images early on (i.e. now), it will speed your coding/development up a lot, as the read_image / write_image functions are much friendlier and more efficient than buffers for image manipulation, and the code is more readable. It will also save you doing it later when you have a lot more code to change... just a thought.
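
                                   As an illustration, a minimal image2d_t version of the grayscale kernel might look like this (a hedged sketch that assumes the RGBA input image and single-channel output image created above, and keeps the weights and channel order of the earlier kernels):

                                   __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

                                   __kernel void luminosityGrayscaleImage(read_only image2d_t src, write_only image2d_t dst){
                                       int2 pos = (int2)(get_global_id(0), get_global_id(1));

                                       // read_imagef does the uchar -> float [0,1] conversion "for free"
                                       float4 px = read_imagef(src, smp, pos);

                                       // same 0.07 / 0.71 / 0.21 weights as the buffer kernels, same channel order
                                       float grey = px.x*0.07f + px.y*0.71f + px.z*0.21f;

                                       // for a single-channel (CL_R) output image only the first component is stored
                                       write_imagef(dst, pos, (float4)(grey, grey, grey, 1.0f));
                                   }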

                                    • Re: Newbie question about workspace, buffers and char3
                                      Wheedle

                                       I'm definitely going to switch to the 2D-image structure now. I was using buffers only to get a grasp on the technology... and it was quite useful, if only for this discussion and your input about data alignment.

                                      Thanks again!

                                        • Re: Newbie question about workspace, buffers and char3
                                          mpineyro

                                          Hello Wheedle,

                                           

                                           would you mind sharing your code? I'm trying to load images using OpenCV for processing them with OpenCL and I'm having some problems.

                                           

                                          Thanks in advance!

                                           

                                          Martin.

                                            • Re: Newbie question about workspace, buffers and char3
                                              Wheedle

                                              Hey mpineyro,

                                               

                                              The code is actually in my original post. Here it is in a clearer manner (I hope):

                                              Get the image:

                                              IplImage* frame = cvLoadImage("D:\\Special Folders\\Pictures\\SamplePictures\\JellyfishS.jpg");

                                               

                                              Create the output image (empty):

                                              IplImage * op_data = cvCreateImage(cvSize(frame->width,frame->height),frame->depth,1);

                                              Get the sizes of the two images:

                                               size_t mem_size = (size_t)frame->imageSize; //returns the size of the image in bytes

                                              size_t outputSize = (size_t)op_data->imageSize;

                                               

                                              Then create the input and output buffer:

                                               

                                              cl_mem data = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, mem_size, (void*)frame->imageData, &ciErrNum);


                                               Now my original image is copied into the buffer, thanks to the CL_MEM_COPY_HOST_PTR flag and because I passed frame's imageData pointer as a parameter.

                                              The output buffer (nothing special):

                                               

                                              cl_mem output = clCreateBuffer(context,CL_MEM_WRITE_ONLY,outputSize,NULL,&ciErrNum);

                                               

                                              I read back the result like that:

                                               status = clEnqueueReadBuffer(queue, output, CL_TRUE, 0, outputSize, (void*)op_data->imageData, 0, NULL, NULL);

                                               

                                                 I pass the imageData pointer of the output image to clEnqueueReadBuffer.

                                               

                                              Hope it'll help!