6 Replies Latest reply on Aug 18, 2011 12:41 PM by laobrasuca

    Very poor OpenCL performance

    petr.machacek

      Hello,

      I'm new to OpenCL, so I guess I'm making some silly mistake.

      I tried to create a simple program for image thresholding using OpenCL. A simplified version of the source code is attached to this post. The program works well, but the execution time is very poor.

      When I execute the program on an RGB image of 5760x3240 pixels, the program output is (times in milliseconds):

       

      clCreateBuffer: 25.649242
      clSetKernelArg: 0.001205
      clEnqueueNDRangeKernel: 0.536059
      clFinish: 66.903236
      clEnqueueReadBuffer: 17.060545

       

      When I use the threshold function from IPP (Intel Performance Primitives) on the same picture, the threshold takes 40 ms.

      Why is my OpenCL program running on the GPU so slow? I expected it to be much faster.

      My hardware:

      Win7 Home, 64bit
      CPU AMD Phenom II X4 965, 3.4 GHz (4 cores)
      6GB RAM
      GPU: GIGABYTE, ATI Radeon 5750 
      AMD APP SDK v2.4 

      Thanks for any hints.

       

      #include <stdio.h>
      #include <windows.h>
      #include <CL/opencl.h>

      // The ImageApi* functions belong to the poster's own image library;
      // their declarations are not part of this simplified listing.

      typedef struct _PICBUF
      {
        unsigned Width;
        unsigned Height;
        unsigned BytesPerPixel;
        unsigned Components;
        unsigned BitDepth;
        unsigned Stride;
        unsigned char *Data;
      } PICBUF, *PPICBUF;

      #define SIZEOF_PICBUF(picbuf) \
        ((picbuf).Stride * (picbuf).Height)

      // GetCounter() reports nanoseconds; convert to milliseconds.
      #define COUNTER_TO_MS(Count) ((Count)/1000000.0)

      const char *KernelSource =
        "__kernel void threshold(__global uchar *input, __global uchar *output, const uchar thres)\n"
        "{\n"
        "  size_t id = get_global_id(0);\n"
        "  output[id] = input[id] < thres ? input[id] : thres;\n"
        "}\n"
        "\n";

      unsigned GetCounter(double *pdTime);

      //------------------------------
      int main()
      {
        cl_context context;
        cl_context_properties properties[3];
        cl_kernel kernel;
        cl_command_queue command_queue;
        cl_program program;
        cl_int err;
        cl_uint num_of_platforms = 0;
        cl_platform_id platform_id[2];
        cl_device_id device_id;
        cl_uint num_of_devices = 0;
        cl_mem input, output;
        size_t global;
        cl_uchar thres;
        PICBUF picIn = {0,};
        PICBUF picOut = {0,};
        double now, prev;
        int i = 0;

        ImageApiReadImage(L"Image.jpg", &picIn);
        ImageApiAllocPicBufEx(&picOut, &picIn);

        if(clGetPlatformIDs(2, platform_id, &num_of_platforms) != CL_SUCCESS)
        {
          printf("Could not read the platform id\n");
          return 1;
        }

        //I have two platforms, the index 1 is AMD Accelerated parallel processing
        if(clGetDeviceIDs(platform_id[1], CL_DEVICE_TYPE_GPU, 1, &device_id, &num_of_devices) != CL_SUCCESS)
        {
          printf("Could not read the device id\n");
          return 1;
        }

        properties[0] = CL_CONTEXT_PLATFORM;
        properties[1] = (cl_context_properties) platform_id[1];
        properties[2] = 0;

        context = clCreateContext(properties, 1, &device_id, NULL, NULL, &err);
        command_queue = clCreateCommandQueue(context, device_id, 0, &err);
        program = clCreateProgramWithSource(context, 1, (const char **)&KernelSource, NULL, &err);

        // Assign first, then compare: "err = f() != CL_SUCCESS" stores the
        // result of the comparison in err, not the error code.
        err = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
        if(err != CL_SUCCESS)
        {
          printf("Could not compile the program.\n");
          return 1;
        }

        kernel = clCreateKernel(program, "threshold", &err);

        GetCounter(&prev);
        input = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, SIZEOF_PICBUF(picIn), picIn.Data, NULL);
        output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, SIZEOF_PICBUF(picOut), NULL, NULL);
        GetCounter(&now);
        // COUNTER_TO_MS yields a double, so the format is %f (not %Lf).
        printf("clCreateBuffer: %f\n", COUNTER_TO_MS(now - prev));

        GetCounter(&prev);
        thres = 128;
        err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
        err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
        err = clSetKernelArg(kernel, 2, sizeof(cl_uchar), &thres);
        GetCounter(&now);
        printf("clSetKernelArg: %f\n", COUNTER_TO_MS(now - prev));

        // One work-item per byte of the image.
        global = SIZEOF_PICBUF(picIn);

        GetCounter(&prev);
        err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        if(err != CL_SUCCESS)
        {
          printf("clEnqueueNDRangeKernel failed\n");
          return 1;
        }
        GetCounter(&now);
        printf("clEnqueueNDRangeKernel: %f\n", COUNTER_TO_MS(now - prev));

        GetCounter(&prev);
        if(clFinish(command_queue) != CL_SUCCESS)
        {
          printf("clFinish failed\n");
          return 1;
        }
        GetCounter(&now);
        printf("clFinish: %f\n", COUNTER_TO_MS(now - prev));

        GetCounter(&prev);
        if(clEnqueueReadBuffer(command_queue, output, CL_TRUE, 0, SIZEOF_PICBUF(picOut), picOut.Data, 0, NULL, NULL) != CL_SUCCESS)
        {
          printf("clEnqueueReadBuffer failed\n");
          return 1;
        }
        GetCounter(&now);
        printf("clEnqueueReadBuffer: %f\n", COUNTER_TO_MS(now - prev));

        // "par" is not declared anywhere in this simplified listing.
        ImageApiWriteImage(L"ImageOcl.jpg", &picOut, &par);

        ImageApiFinishPicBuf(&picIn);
        ImageApiFinishPicBuf(&picOut);

        clReleaseMemObject(input);
        clReleaseMemObject(output);
        clReleaseProgram(program);
        clReleaseKernel(kernel);
        clReleaseCommandQueue(command_queue);
        clReleaseContext(context);
        return 0;
      }

      //------------------------------
      // Stores the current time in nanoseconds in *pdTime.
      // Returns 0 on success, (unsigned)-1 on failure.
      unsigned GetCounter( double *pdTime )
      {
        unsigned res = 0;
        static BOOL perfSupported = TRUE;
        static LARGE_INTEGER freq;
        LARGE_INTEGER count;
        static BOOL first = TRUE;

        if(first)
        {
          perfSupported = QueryPerformanceFrequency(&freq);
          first = FALSE;
        }

        if(perfSupported)
        {
          BOOL ret = QueryPerformanceCounter(&count);
          if(ret)
          {
            *pdTime = (1000000000. * (count.QuadPart / (double)freq.QuadPart));
          }
          else
          {
            res = (unsigned)-1;
          }
        }
        else
        {
          // Convert before multiplying so the DWORD does not overflow.
          DWORD ticks = GetTickCount();
          *pdTime = (double)ticks * 1000000.0;
        }
        return res;
      }

        • Very poor OpenCL performance
          nou

          Try vectorizing that kernel so that you read and write a uchar4 at once, or maybe even a uchar16.
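
          A minimal sketch of what that suggestion could look like, not the poster's actual code. The kernel name threshold16 and the helpers vec_global_size and threshold_ref are hypothetical names introduced here. The sketch assumes the buffer size is a multiple of 16 bytes, which holds for a tightly packed 5760x3240 RGB image (55,987,200 bytes). The built-in min() applied to uchar16 does component-wise what the scalar kernel does per byte.

```c
#include <stddef.h>

/* Hypothetical vectorized variant of the thread's kernel: each work-item
 * handles 16 bytes. min() on uchar16 is the component-wise equivalent of
 * input[id] < thres ? input[id] : thres. */
static const char *KernelSourceVec16 =
    "__kernel void threshold16(__global const uchar16 *input,\n"
    "                          __global uchar16 *output,\n"
    "                          const uchar thres)\n"
    "{\n"
    "    size_t id = get_global_id(0);\n"
    "    output[id] = min(input[id], (uchar16)(thres));\n"
    "}\n";

/* Number of work-items when each one processes `width` bytes.
 * Returns 0 if the buffer is not a whole number of vectors; in that case
 * the tail would need padding or a separate scalar clean-up pass. */
static size_t vec_global_size(size_t bytes, size_t width)
{
    if (width == 0 || bytes % width != 0)
        return 0;
    return bytes / width;
}

/* CPU reference for a single byte, useful for checking the GPU output. */
static unsigned char threshold_ref(unsigned char v, unsigned char thres)
{
    return v < thres ? v : thres;
}
```

          On the host side, the only change to the original program would be to build KernelSourceVec16 instead of KernelSource, create the kernel as "threshold16", and pass vec_global_size(SIZEOF_PICBUF(picIn), 16) as the global work size instead of the byte count.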

          • Very poor OpenCL performance
            genaganna

             

            Please go through chapter 4 of the Programming Guide to understand all of this. If you have any questions about chapter 4, you can ask them here.

              • Very poor OpenCL performance
                LeeHowes

                Those times don't look so unreasonable. Especially if it's a single kernel run and you have compilation time to consider.

                With such a short kernel you're going to be a long way off peak for the GPU: just dispatching the wavefronts to do that work has a latency, and then you have the kernel transitions to enqueue work to consider (something we can't fix fully for a while yet). The buffer read and write are obvious, unavoidable overheads that the CPU does not incur.

                With such a trivial kernel the CPU is probably the better place to do it, unless it's going to be part of a long OpenCL pipeline that stays on the GPU from then on. I would expect that even if you'd written that kernel using TBB or OpenMP, but it's even more true if you use IPP. You're comparing hand-crafted CPU code to naive GPU code.

                Use this as a learning exercise and see if you can get it much faster by vectorising, by making sure you copy data in advance, and maybe by trying some zero-copy tricks to shift the overhead elsewhere. Finally, try the kernel as part of a longer sequence and see if the execution time goes down. That way you'll have a good idea of what works when you go on to implement larger operations that are more likely to work well on the GPU.
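
                To put the overheads described above into numbers, here is a small sketch that only adds up the timings from the original post. The function names are hypothetical. It also assumes that clCreateBuffer and clEnqueueReadBuffer account for all of the transfer cost, which may not hold: a runtime may defer the host-to-device copy until the kernel runs, in which case part of the clFinish time would be transfer as well.

```c
/* Timings in milliseconds, copied from the original post. */
static const double T_CREATE_BUFFER = 25.649242;
static const double T_SET_KERNEL_ARG = 0.001205;
static const double T_ENQUEUE_KERNEL = 0.536059;
static const double T_FINISH = 66.903236;
static const double T_READ_BUFFER = 17.060545;
static const double T_IPP = 40.0; /* CPU reference quoted in the post */

/* Everything the GPU path spends, from buffer creation to read-back. */
static double gpu_total_ms(void)
{
    return T_CREATE_BUFFER + T_SET_KERNEL_ARG + T_ENQUEUE_KERNEL
         + T_FINISH + T_READ_BUFFER;
}

/* Kernel launch plus wait (clEnqueueNDRangeKernel + clFinish). */
static double gpu_kernel_only_ms(void)
{
    return T_ENQUEUE_KERNEL + T_FINISH;
}

/* Buffer creation (with host copy) plus blocking read-back. */
static double gpu_transfer_ms(void)
{
    return T_CREATE_BUFFER + T_READ_BUFFER;
}

/* How many times slower than the IPP figure a given time is. */
static double slowdown_vs_ipp(double ms)
{
    return ms / T_IPP;
}
```

                With the posted numbers, the whole GPU path comes to about 110 ms and the kernel launch plus wait alone to about 67 ms, against 40 ms for IPP. Buffer creation and read-back together take about 43 ms, so the transfers alone already cost about as much as the complete CPU run.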

                  • Very poor OpenCL performance
                    petr.machacek

                    Thanks to all! 

                    I'll try all the advice.

                    The threshold is a learning example based on OpenCL's "hello world" program. I thought at first that it'd be a good task to parallelize, but as I see, it's not. Good experience.

                    Thanks again.

                      • Very poor OpenCL performance
                        LeeHowes

                        It's a good example to parallelise; it just has a very low level of arithmetic intensity, so it's going to be hard to get any benefit out of a hardware accelerator from it. It's good experience for learning the OpenCL interfaces. It's just that you're not likely to get exceptional performance increases from it :)
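
                        A rough sketch of what "low arithmetic intensity" means for this kernel, under stated assumptions: the min/select is counted as one operation per byte, the image is taken to be tightly packed at 3 bytes per pixel (the post gives neither the stride nor BytesPerPixel), and the function names are hypothetical.

```c
/* Operations per byte of memory traffic for one element. */
static double arithmetic_intensity(double ops_per_elem,
                                   double bytes_read,
                                   double bytes_written)
{
    return ops_per_elem / (bytes_read + bytes_written);
}

/* Effective throughput in GB/s for `bytes` of traffic in `ms` milliseconds. */
static double effective_gb_per_s(double bytes, double ms)
{
    return bytes / (ms * 1e-3) / 1e9;
}

/* Image size in bytes, assuming 5760x3240 pixels, 3 bytes each, no padding. */
static double image_bytes(void)
{
    return 5760.0 * 3240.0 * 3.0;
}

/* Traffic of the threshold kernel: every byte is read once and written once. */
static double kernel_traffic_bytes(void)
{
    return 2.0 * image_bytes();
}
```

                        Under these assumptions the kernel performs 0.5 operations per byte moved, and the posted clFinish time of 66.9 ms corresponds to roughly 1.7 GB/s of effective traffic. The work-items spend nearly all of their time on memory access rather than computation, which is why faster arithmetic hardware alone gives little benefit here.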

                        • Very poor OpenCL performance
                          laobrasuca

                           

                          Have you tried running the very same example on a recent GeForce card? As LeeHowes says, you have lots of memory accesses for very few arithmetic operations. I suspect that non-vectorized memory reads/writes run faster on NVIDIA cards than on AMD's. If you can't use a GeForce, try vectorizing your code, as nou suggests. Then we can talk about performance again.

                          The low level of arithmetic intensity applies only if you consider the whole process, meaning data transfer and kernel execution. If you consider only kernel execution (clEnqueueNDRangeKernel + clFinish), you should see a good improvement in performance even for this kind of simple operation. But you have to know that the current AMD architecture runs way faster with vectorized code than otherwise.