8 Replies Latest reply on Feb 13, 2013 10:48 PM by ivan

    read_image performance

    ivan

      Hi there,

       

      Why does the read_imageui API always translated into 2 calls of sample_id()_sampler() with conditional branch? For example, this simple code

      =========

      __private int2 coords;

      __read_only image2d_t img;

      const sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_NONE|CLK_FILTER_NEAREST;

      read_imageui(img,sampler,coords);

      =========

       

      is compiled into the following sequence in IL file:

       

      =========

      dcl_literal l42, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF; f128:i128 1844674407370955161518446744073709551615

      ...

      if_logicalnz l42

      round_neginf r83, r83

      mov r84.x___, l16

      sample_id(0)_sampler(0)_coordtype(unnormalized) r83, r83

      else

      mov r84, cb1[0]

      itof r84, r84

      mov r84, r84

      mov r88, r84

      mov r89, l43

      mov r84, l11

      cmov_logical r88, r84, r88, r89

      mov r88, r88

      mul_ieee r88, r88, r83

      mov r83, r88

      mov r83, r83

      round_neginf r88, r88

      mov r88, r88

      mov r88, r88

      cmov_logical r83, r84, r83, r88

      mov r83, r83

      mov r84, cb1[1]

      mov r84, r84

      mul_ieee r83, r83, r84

      mov r84.x___, l16

      sample_id(0)_sampler(0)_coordtype(normalized) r83, r83

      endif

      ==========

       

      I'm porting CUDA code and have multiple calls of read_imageui in my kernel. But it runs much slower compared to CUDA. I wonder where's the bottleneck?

        • Re: read_image performance
          himanshu.gautam

          Are you running on AMD GPU?  Can you tell which GPU you are running it on?

          Hope you have created the image with one of the "unsigned int" types.

           

          Please publish the full specs of your system:


          1. Platform - win32 / win64 / lin32 / lin64 or some other?

              Win7 or win vista or Win8.. Similarly for linux, your distribution

          2. Version of driver (13.1 is the latest)

          3. CPU or GPU Target?

          4. CPU/GPU details of your hardware

            • Re: read_image performance
              ivan

              I'm running it on Win7 32-bit machine with 13.1 drivers. The target is Radeon 5670 although I've tried to compiled for other devices and it's all the same.

              I'm creating a 2d image and using {CL_RGBA,CL_UNSIGNED_INT32} for cl_image_format. The kernel function is look like this:

              =========

              uint4 GetUi4(uint pos, __read_only image2d_t img)

              {

              __private int2 coords;

              const sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_NONE|CLK_FILTER_NEAREST;

               

              coords.x=pos&0x7ff;

              coords.y=(pos>>11)&0x7ff;

               

              return read_imageui(img,sampler,coords);

              }

              =========

              The code works as expected but slow.

                • Re: read_image performance
                  himanshu.gautam

                  At this point, we need to look at the memory access pattern.

                  How adjacent workitems are accessing memory?

                  Can you throw some light on it?

                    • Re: read_image performance
                      ivan

                      I just cut off everything, created the following simple kernel (see below) and compiled it in kernel analyser (1.12.1288). It gives the same result.

                       

                      ============

                      uint4 GetUi4(uint pos, __read_only image2d_t img)

                      {

                      __private int2 coords;

                      const sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_NONE|CLK_FILTER_NEAREST;

                       

                      coords.x=pos&0x7ff;

                      coords.y=(pos>>11)&0x7ff;

                       

                      return read_imageui(img,sampler,coords);

                      }

                       

                      __kernel

                      __attribute__((reqd_work_group_size(256, 1, 1)))

                      void mykernel(           __global unsigned int *pInput,

                      __read_only image2d_t gpu_tex,

                      __global unsigned int *pResult)

                      {

                      __private unsigned int idx=get_local_id(0);

                       

                      uint4 ui4=GetUi4(pInput[idx],gpu_tex);

                      pResult[get_global_id(0)]=ui4.x;

                      }

                      ============

                • Re: read_image performance
                  german

                  GPU has some precision differences with the OpenCL spec. It occurs only on certain values, but a workaround in the compiler must be generic. Don't look to the generated IL code, it's very generic. The most of instructions will be removed by SC(shader compiler). Check the final ISA. You should find just a couple extra instructions or even none for some cases.

                  How did you create an image object?

                  5670 doesn't have scalar architecture. Don't pack/unpack the texel positions. Pass int2 type into the function.

                  Also which nvidia board do you use for a comparison with HD5670?

                    • Re: read_image performance
                      ivan

                      Yeah, thanks for clarifying it. The ISA code seems to have only one call to sample. I compare 5670 with GTX 550ti. Both give almost the same results on similar tasks where there are no calls to read_imageui.

                       

                      Here's how the image is created:

                      =============

                      const cl_image_format formatRGBA={CL_RGBA,CL_UNSIGNED_INT32};

                      ocl cl;

                      //foo declarations

                      cl->CreateImage2D(CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,&formatRGBA,2048,(gc->iItems+2047)/2048,foo,&gc->ocl.gpu_pItems) ;

                      ...

                      //ocl wrapper

                      BOOL ocl::CreateImage2D(cl_mem_flags flags, const cl_image_format *format, size_t w, size_t h, void *host_ptr, cl_mem *buf)

                      {

                      try {

                      if ( !m_context || !buf )

                      {

                      m_oclerr=CL_INVALID_VALUE;

                      return FALSE;

                      }

                       

                      //     if ( host_ptr )

                      //          flags|=CL_MEM_COPY_HOST_PTR;

                       

                      *buf=clCreateImage2D(m_context,flags,format,w,h,0,host_ptr,&m_oclerr);

                       

                      return ( *buf && m_oclerr==CL_SUCCESS );

                      }

                      catch (...)

                      {

                      m_oclerr=CL_INVALID_OPERATION;

                      return FALSE;

                      }

                      =============

                       

                      Thanks again for your suggestions

                        • Re: read_image performance
                          german

                          The creation code looks ok. How do you launch the kernel? What's your local size and dimensions?

                          For 2D images it's better to use 2 dimensions launch with local size 8x8=64 or even 16x16=256 threads. Linear launch may not be optimal and will depend from the look-up pattern. Also it will require extra ALU operations to derive x and y values. Here is an example of generic read. "size" is the real image size if it's not aligned. You can remove "if" statement, if you know for sure the image size is aligned with 8x8 or 16x16.

                              int2  coords;

                              coords.x = get_global_id(0);

                              coords..y = get_global_id(1);

                              if ((coords.x >= size.x) ||

                                      (coords.y >= size.y)) {

                                  return;

                              }

                              uint4  texel;

                              texel = read_imageui(src, sampler, coords);

                            • Re: read_image performance
                              ivan

                              Thanks, looks like where I get a stall. The problem is that I use read_image on linear block of memory the size of which is dynamically set during runtime. In CUDA I use tex1Dfetch function. But since there's no such an API in OpenCL for reading big 1d blocks of memory, I have to convert it to 2d array of [2048 x N] size. The reason I choose 2048 is that the memory block may be really huge. I then launch 65536 grid (local size is 256). I'll try to play around with 2dsize as you suggested, thanks for handling it.