
Vector Read(int4) of an Image-help

Hi

I am using int4 to read from the input image and also to write to the output image.

I have a 256*256 image.

Initially I assigned the global work dimensions as 256*256, but that was a bad idea: I got 16 images of size 64*64, because an int4 reads 4 pixels per work-item, and with 64 work-items in the X and Y directions it can read the entire image.

Then I changed the global work dimensions to 64*256, since 64*4 is 256 (in the X direction the int4 reads 4 pixels per work-item). That was also a bad idea, as it read the image as 64*256.

I also thought about making the global work-item dimension look like this: 000100040008000120001600020... I thought a work-item would be executed only when it is a multiple of 4. Definitely wrong.

So is reading inputImage as int4* and writing back as int* not possible?

I want to read the image as an int4 and output it as an int4. Is that possible?

/************This code is for global work dimension 64*256*******************/

__kernel void convolution(
    __global int4 *inputImage,     // cl_image_format.channel datatype CL_SIGNED_INT32;  CL_R;
    __global int4 *outputImage,
    int inputWidth,                // 256
    int inputHeight,               // 256
    __constant int *filter,        // sobel filter
    int filterWidth)               // filterWidth=3
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int4 sum = 0;
    int kx, ky;
    int widthby4 = inputWidth/4;

    for (ky = -filterWidth/2; ky <= filterWidth/2; ++ky)
    {
        for (kx = -filterWidth/2; kx <= filterWidth/2; ++kx)
        {
            sum += inputImage[(y + ky) * widthby4 + (x + kx)] * filter[(ky + filterWidth/2) * filterWidth + (kx + filterWidth/2)];
        }
    }

    sum /= 9;
    outputImage[(y)*widthby4+(x)] = sum;
}

heman
Adept II

Hi sharath.naidu,

The idea of using a 64*256 global size instead of 256*256 while reading 4 pixels together is valid.

A few pointers on what the problem might be:

1. To get good performance, try a bigger image. You should aim at creating approximately 4 times as many work-groups as there are compute units in your GPU, to hide memory latency.

2. The snippet above confuses me a bit, because the comment showing cl_image_format as CL_R applies to cl_image objects and not to general int4 buffers. Also, if the input and output are cl_image types, there are special functions for reading from and writing to them. Refer to the SimpleImage sample for details.

3. If I assume that the comment is irrelevant and that you are working on a single-color-component (monochrome) image, you can read 4 pixel values that are located adjacent to each other. Your math operations involve multiplying a vector by a scalar, but that seems alright to me (refer to the OpenCL spec to cross-check). So the code above seems fine for the requirement: "...i want to read image as an int4 and output as an int4...is it possible?"
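
To make point 1 concrete, here is a rough sketch of the arithmetic in plain C. The function names are hypothetical, and the 16*16 local size and the compute-unit counts in the usage note are assumptions for illustration only, not values from this thread.

```c
#include <stddef.h>

/* Number of work-groups in a 2D launch.
   Assumes the local size divides the global size evenly. */
size_t work_group_count(size_t global_x, size_t global_y,
                        size_t local_x, size_t local_y)
{
    return (global_x / local_x) * (global_y / local_y);
}

/* Returns 1 if the launch reaches the rule of thumb of roughly
   4 work-groups per compute unit, 0 otherwise. */
int meets_occupancy_rule(size_t groups, size_t compute_units)
{
    return groups >= 4 * compute_units;
}
```

With a 64*256 global size and an assumed 16*16 local size this gives 64 work-groups, which satisfies the rule of thumb only on a device with 16 or fewer compute units. That is why a bigger image is suggested.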

Hope it helps


Hi heman

Thanks for the reply. The comment is irrelevant: I have not used images, I have used buffers.

The problem is that the input is a 256*256 image; the int4 reads 4 pixels together and writes back 4 pixels as well, and I am getting the output as 64*256. But I want the output as 256*256. Is there any way I could do this?


Hi sharath,

I hope point 3 is still valid for your case.

But from my understanding of convolution algorithms, the value of "sum" should be computed from adjacent pixels. The code above does int4 reads, so "sum" ends up being computed from pixels that are 4 pixels apart.

Since convolution uses the same data multiple times, I would recommend reading a block of the image into LDS (local memory) and then letting the threads read from LDS. Please share your host code too; the situation will be clearer then.
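
The "4 pixels apart" effect follows directly from the indexing. A tiny sketch in plain C, with hypothetical helper names, assuming a row of pixels viewed as consecutive int4 elements:

```c
/* First pixel column covered by element `idx` of an int4 view of a row. */
int int4_first_pixel(int idx)
{
    return 4 * idx;
}

/* Pixel distance between element x and its "neighbour" x + kx
   when the row is indexed in int4 elements instead of pixels. */
int neighbour_pixel_offset(int x, int kx)
{
    return int4_first_pixel(x + kx) - int4_first_pixel(x);
}
```

Stepping kx by 1 in the kernel therefore moves 4 pixels along the row, so each component of "sum" combines pixels at columns p-4, p and p+4 rather than p-1, p and p+1.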

Hi heman

Exactly, that is the problem. When I read the image from global memory, I am reading pixels that are 4 apart, which is not convolution; I just want the immediately neighbouring pixel. When I try to access just the neighbouring pels by masking or shifting 3 bits in an int4 read, it doesn't make sense, as work-items will be zero and I get blank spaces in the output...

Now I will try with local memory.

But is it not feasible using global memory?

Is there any other way? It would be very helpful to know.


Well, there is a very basic implementation in the AMD APP SDK. It is a sample called SimpleConvolution, and you can see how it can be done using only global memory. If you are using a single channel (just one color component), I suggest using non-vectorized reads and writes.

I recommended LDS for the sake of performance. Using LDS will complicate the implementation, but it can improve performance by orders of magnitude, because any given pixel is needed by many work-items, which can all be inside a single work-group. In this case you can load a section of the bigger image using vectorized loads and write it into LDS using vectorized stores. Non-vector reads/writes will be required from LDS.
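
The LDS idea can be sketched on the host in plain C, with loops standing in for work-items and a small array standing in for __local memory. This is only an illustration under assumptions: the 8*8 tile size, the clamped borders, the fixed 3*3 filter and all function names are hypothetical and are not taken from the SDK sample.

```c
#define TILE 8                      /* work-group edge length (assumed) */
#define HALO 1                      /* filterWidth / 2 for a 3x3 filter */
#define LDS_EDGE (TILE + 2 * HALO)  /* tile plus a one-pixel border */

/* Clamp a coordinate to [0, n-1]; border handling is an assumption. */
static int clamp_index(int v, int n)
{
    return v < 0 ? 0 : (v >= n ? n - 1 : v);
}

/* Emulates one work-group whose top-left output pixel is (gx0, gy0). */
void convolve_tile(const int *in, int *out, int width, int height,
                   const int *filter3x3, int gx0, int gy0)
{
    int lds[LDS_EDGE][LDS_EDGE];    /* stands in for __local memory */

    /* Phase 1: load the tile plus its halo from global memory once. */
    for (int ly = 0; ly < LDS_EDGE; ++ly)
        for (int lx = 0; lx < LDS_EDGE; ++lx) {
            int sx = clamp_index(gx0 + lx - HALO, width);
            int sy = clamp_index(gy0 + ly - HALO, height);
            lds[ly][lx] = in[sy * width + sx];
        }

    /* On a GPU, barrier(CLK_LOCAL_MEM_FENCE) would separate the phases. */

    /* Phase 2: each work-item reads its 3x3 neighbourhood with scalar
       reads from the tile instead of going back to global memory. */
    for (int ly = 0; ly < TILE; ++ly)
        for (int lx = 0; lx < TILE; ++lx) {
            int gx = gx0 + lx, gy = gy0 + ly;
            if (gx >= width || gy >= height)
                continue;
            int sum = 0;
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    sum += lds[ly + ky][lx + kx] * filter3x3[ky * 3 + kx];
            out[gy * width + gx] = sum;
        }
}

/* Emulates the whole launch, one work-group per tile. */
void convolve_tiled(const int *in, int *out, int width, int height,
                    const int *filter3x3)
{
    for (int gy0 = 0; gy0 < height; gy0 += TILE)
        for (int gx0 = 0; gx0 < width; gx0 += TILE)
            convolve_tile(in, out, width, height, filter3x3, gx0, gy0);
}
```

In a real kernel the phase 1 copy is where vectorized loads and stores would go, while phase 2 stays scalar so that each output pixel sees its true neighbours.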

Hi Himanshu

I was able to do the convolution using vector reads by accessing each component of a vector and doing the convolution for each component.

Now I will try with LDS memory: vector read of the image from global memory, vector store to local memory, but scalar reads during the convolution.
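
For readers who want to see the shape of that approach, here is a minimal host-side sketch in plain C. It is not the code from this thread: the function names and the clamped border handling are assumptions, and the normalization step from the original kernel is left out. One call to convolve4_work_item plays the role of one work-item in a (width/4)*height launch.

```c
/* Clamp a coordinate to [0, n-1]; border handling is an assumption. */
static int clamp_index(int v, int n)
{
    return v < 0 ? 0 : (v >= n ? n - 1 : v);
}

/* Emulates work-item (x, y): it produces the 4 adjacent output
   pixels 4x .. 4x+3 of row y, one vector component at a time. */
void convolve4_work_item(const int *in, int *out, int width, int height,
                         const int *filter, int filter_width, int x, int y)
{
    int half = filter_width / 2;
    for (int c = 0; c < 4; ++c) {       /* one pass per component */
        int px = 4 * x + c;             /* pixel column of component c */
        int sum = 0;
        for (int ky = -half; ky <= half; ++ky)
            for (int kx = -half; kx <= half; ++kx) {
                /* Neighbours are addressed in pixels, not int4 elements. */
                int sx = clamp_index(px + kx, width);
                int sy = clamp_index(y + ky, height);
                sum += in[sy * width + sx]
                     * filter[(ky + half) * filter_width + (kx + half)];
            }
        out[y * width + px] = sum;
    }
}

/* Emulates the whole (width/4) x height launch on the host. */
void convolve4(const int *in, int *out, int width, int height,
               const int *filter, int filter_width)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width / 4; ++x)
            convolve4_work_item(in, out, width, height,
                                filter, filter_width, x, y);
}
```

The key difference from the kernel in the question is that the neighbour offset kx is applied to the pixel column, so component 3 of one work-item correctly uses component 0 of the next int4 as its right-hand neighbour.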

Thanks a lot.

Regards

Sharath
