The idea of using a 64*256 global size instead of 256*256, while reading 4 pixels together, is valid.
A few pointers on what the problem might be:
1. To get good performance, try a bigger image. You should aim at creating approximately 4 times as many workgroups as there are compute units in your GPU, to hide memory latency.
2. The snippet above confuses me a bit, because the comment showing cl_image_format as CL_R is valid for cl_image objects and not for general int4 buffers. Also, if the input and output are cl_image types, there are special functions for reading/writing them. Refer to the SimpleImage sample for details.
3. If I assume that comment is irrelevant, and that you are working on a single-color-component (monochromatic) image, you can read 4 pixel values which are located adjacent to each other. Your math operations involve multiplying a vector by a scalar, but that seems alright to me (refer to the OpenCL spec to cross-check). So the code above seems fine for the requirement: " .i want to read image as an int4 and output as an int4...is it possible?"
Hope it helps
Thanks for the reply. The comment is irrelevant... I have not used Images; I have used buffers.
The problem is that the input is a 256*256 image; an int4 read fetches 4 pixels together, and writing back is also done 4 pixels at a time, so I am getting the output as 64*256... but I want the output as 256*256... is there any way I could do this?
I hope point 3 is still valid for your case.
But from my understanding of convolution algorithms, the value of "sum" should be computed from adjacent pixel points, while the above code seems to be doing int4 reads, which will compute "sum" from pixel points that are 4 pixels apart.
Since convolution uses the same data multiple times, I would recommend reading a block of the image into LDS memory and then letting the threads read from LDS. Please share your host code too; the situation will be clearer then.
Exactly, that is the problem: when I read the image from global memory, I am reading pixels 4 apart, which is exactly not convolution; I just want the neighbouring pixel. And when I try to access just the neighbouring pels by masking or shifting 3 bits in an int4 read, it doesn't make sense, as work-items will be zero and I get blank spaces in the output...
Now I will try with local memory.
But is it not feasible using global memory?
Is there any other way... it would be very helpful to know.
Well, there is a very basic implementation in the AMD APP SDK, a sample called SimpleConvolution, and you can see there how it can be done using only global memory. If you are using a single channel (just one color component), I suggest you use non-vectorized reads and writes.
I recommended LDS for performance's sake. Using LDS will complicate the implementation, but it can provide an order of magnitude of performance, since any given pixel is required by many work-items, which can all be inside a single workgroup. In this case you can load a section of the bigger image into LDS using vectorized loads and write the results out using vectorized stores. Non-vectorized reads/writes will still be required from LDS.
I was able to do the convolution using vector reads by accessing each component of the vector and doing the convolution for each component.
Now I will try with LDS memory: vector read of the image from global, vector store into local... but scalar reads while doing the convolution.
Thanks a lot.