cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

ivan
Adept I

read_image performance

Hi there,

Why does the read_imageui API always translated into 2 calls of sample_id()_sampler() with conditional branch? For example, this simple code

=========

__private int2 coords;

__read_only image2d_t img;

const sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_NONE|CLK_FILTER_NEAREST;

read_imageui(img,sampler,coords);

=========

is compiled into the following sequence in IL file:

=========

dcl_literal l42, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF; f128:i128 1844674407370955161518446744073709551615

...

if_logicalnz l42

round_neginf r83, r83

mov r84.x___, l16

sample_id(0)_sampler(0)_coordtype(unnormalized) r83, r83

else

mov r84, cb1[0]

itof r84, r84

mov r84, r84

mov r88, r84

mov r89, l43

mov r84, l11

cmov_logical r88, r84, r88, r89

mov r88, r88

mul_ieee r88, r88, r83

mov r83, r88

mov r83, r83

round_neginf r88, r88

mov r88, r88

mov r88, r88

cmov_logical r83, r84, r83, r88

mov r83, r83

mov r84, cb1[1]

mov r84, r84

mul_ieee r83, r83, r84

mov r84.x___, l16

sample_id(0)_sampler(0)_coordtype(normalized) r83, r83

endif

==========

I'm porting CUDA code and have multiple calls of read_imageui in my kernel. But it runs much slower compared to CUDA. I wonder where's the bottleneck?

0 Likes
8 Replies
himanshu_gautam
Grandmaster

Are you running on AMD GPU?  Can you tell which GPU you are running it on?

Hope you have created the image with one of the "unsigned int" types.

Please publish the full specs of your system:


1. Platform - win32 / win64 / lin32 / lin64 or some other?

    Win7 or win vista or Win8.. Similarly for linux, your distribution

2. Version of driver (13.1 is the latest)

3. CPU or GPU Target?

4. CPU/GPU details of your hardware

0 Likes

I'm running it on Win7 32-bit machine with 13.1 drivers. The target is Radeon 5670 although I've tried to compiled for other devices and it's all the same.

I'm creating a 2d image and using {CL_RGBA,CL_UNSIGNED_INT32} for cl_image_format. The kernel function is look like this:

=========

uint4 GetUi4(uint pos, __read_only image2d_t img)

{

__private int2 coords;

const sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_NONE|CLK_FILTER_NEAREST;

coords.x=pos&0x7ff;

coords.y=(pos>>11)&0x7ff;

return read_imageui(img,sampler,coords);

}

=========

The code works as expected but slow.

0 Likes

At this point, we need to look at the memory access pattern.

How adjacent workitems are accessing memory?

Can you throw some light on it?

0 Likes

I just cut off everything, created the following simple kernel (see below) and compiled it in kernel analyser (1.12.1288). It gives the same result.

============

uint4 GetUi4(uint pos, __read_only image2d_t img)

{

__private int2 coords;

const sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_NONE|CLK_FILTER_NEAREST;

coords.x=pos&0x7ff;

coords.y=(pos>>11)&0x7ff;

return read_imageui(img,sampler,coords);

}

__kernel

__attribute__((reqd_work_group_size(256, 1, 1)))

void mykernel(           __global unsigned int *pInput,

__read_only image2d_t gpu_tex,

__global unsigned int *pResult)

{

__private unsigned int idx=get_local_id(0);

uint4 ui4=GetUi4(pInput[idx],gpu_tex);

pResult[get_global_id(0)]=ui4.x;

}

============

0 Likes
german
Staff

GPU has some precision differences with the OpenCL spec. It occurs only on certain values, but a workaround in the compiler must be generic. Don't look to the generated IL code, it's very generic. The most of instructions will be removed by SC(shader compiler). Check the final ISA. You should find just a couple extra instructions or even none for some cases.

How did you create an image object?

5670 doesn't have scalar architecture. Don't pack/unpack the texel positions. Pass int2 type into the function.

Also which nvidia board do you use for a comparison with HD5670?

0 Likes

Yeah, thanks for clarifying it. The ISA code seems to have only one call to sample. I compare 5670 with GTX 550ti. Both give almost the same results on similar tasks where there are no calls to read_imageui.

Here's how the image is created:

=============

const cl_image_format formatRGBA={CL_RGBA,CL_UNSIGNED_INT32};

ocl cl;

//foo declarations

cl->CreateImage2D(CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,&formatRGBA,2048,(gc->iItems+2047)/2048,foo,&gc->ocl.gpu_pItems) ;

...

//ocl wrapper

BOOL ocl::CreateImage2D(cl_mem_flags flags, const cl_image_format *format, size_t w, size_t h, void *host_ptr, cl_mem *buf)

{

try {

if ( !m_context || !buf )

{

m_oclerr=CL_INVALID_VALUE;

return FALSE;

}

//     if ( host_ptr )

//          flags|=CL_MEM_COPY_HOST_PTR;

*buf=clCreateImage2D(m_context,flags,format,w,h,0,host_ptr,&m_oclerr);

return ( *buf && m_oclerr==CL_SUCCESS );

}

catch (...)

{

m_oclerr=CL_INVALID_OPERATION;

return FALSE;

}

=============

Thanks again for your suggestions

0 Likes

The creation code looks ok. How do you launch the kernel? What's your local size and dimensions?

For 2D images it's better to use 2 dimensions launch with local size 8x8=64 or even 16x16=256 threads. Linear launch may not be optimal and will depend from the look-up pattern. Also it will require extra ALU operations to derive x and y values. Here is an example of generic read. "size" is the real image size if it's not aligned. You can remove "if" statement, if you know for sure the image size is aligned with 8x8 or 16x16.

    int2  coords;

    coords.x = get_global_id(0);

    coords..y = get_global_id(1);

    if ((coords.x >= size.x) ||

            (coords.y >= size.y)) {

        return;

    }

    uint4  texel;

    texel = read_imageui(src, sampler, coords);

0 Likes

Thanks, looks like where I get a stall. The problem is that I use read_image on linear block of memory the size of which is dynamically set during runtime. In CUDA I use tex1Dfetch function. But since there's no such an API in OpenCL for reading big 1d blocks of memory, I have to convert it to 2d array of [2048 x N] size. The reason I choose 2048 is that the memory block may be really huge. I then launch 65536 grid (local size is 256). I'll try to play around with 2dsize as you suggested, thanks for handling it.

0 Likes