21 Replies Latest reply on Oct 19, 2011 9:03 AM by sourcery

    kernel overhead too high (again)

    sourcery

      Am running a modified templatec.cpp together with the APP profiler.

      Calling my own version of runkernels which writes a small data area, runs a kernel and reads back 8000 bytes.

      All the cl_mem addresses, data areas and arguments have been set up.

      Running on an HD6850 with APP

      Calling the runkernels code 1300 times gives me a write-buffer time of around 0.08 milliseconds, a kernel time of around 0.3 milliseconds and a read-buffer time of around 0.16 milliseconds (for each iteration).

      Yet 1300 calls take 54.0 seconds!

      Is it not possible to utilise OpenCL with quick kernels?

        • kernel overhead too high (again)
          antzrhere

          It's difficult to understand exactly which bits you're referring to without seeing your modified code...

          What you're experiencing is not kernel overhead, but overhead arising from calling OpenCL functions on the host.

          Firstly, it is very inefficient to call runKernels() 1300 times. The runKernels example only launches 256 work items by default, which is not even enough to keep most graphics cards busy (even processing several hundred thousand similar work items would take less than a millisecond).

          If you intend to do more work, you would simply increase the global worksize ("width" in this example) for your kernel code (say 1300*256), rather than calling runKernels() 1300 times. Overhead for each work item is very low (we're talking hundreds of millions of work items per second). runKernels() is probably spending more time retrieving device info on every call (which is more waste) than actually crunching numbers.

           

            • kernel overhead too high (again)
              sourcery

              Unfortunately the only way this can work for me is if I can call the kernel routine and get a quick turnaround. I was trying to use the GPU to speed up a heavily used "routine" in a much larger application; let's call it Myapp.

              The routine takes an input data point and compares it in a complex way against 1000 to 10000 other library data points (all preloaded into the GPU).

              Myapp's paths depend on the result of the kernel call: it will make some rather complicated decisions about the result, possibly retry with slightly different arguments, or move on to another input data point.

              Rewriting Myapp would be months if not years of work.

              It probably does not need too much detail to replicate the problem.

              How many times per second can OpenCL stuff a single small input buffer, run a kernel and return, say, 8 KB of results data?

              I seem to be seeing a 99% performance degradation over what the APP profiler session is reporting.

                • kernel overhead too high (again)
                  antzrhere

                  So one iteration (write, execute, read) takes 0.54 milliseconds in total, but 1300 of these takes 54 seconds (54000 milliseconds)? So that's about 76x less efficient per iteration than a single call? That seems odd, as you would expect this to scale linearly...

                  Would you be able to post *some* of your host OpenCL code and maybe *some* of your CL kernel code? There may be something amiss.

                   

                    • kernel overhead too high (again)
                      sourcery

                      I agree there may well be something amiss; the kernel overhead cannot be that bad, surely?

                      Ok, code much simplified and hopefully attached.

                      Thanks for the interest.

                       

// Very simple Kernel, replacement for templatec.cl

typedef float cl_float;
typedef unsigned short cl_ushort;
typedef signed short cl_short;
typedef signed int cl_int;
typedef unsigned int cl_uint;
typedef unsigned char cl_uchar;

typedef struct tagGPUMETRICS { cl_uint data[512*6]; } GPUMETRICS;
typedef struct tagGPUXSM { cl_uint data[32]; } GPUXSM;
typedef struct tagGPUCALCX { cl_uint data[32]; } GPUCALCX;

__kernel void templateKernel(__global const GPUMETRICS *gpumetrics,
                             __global GPUXSM *gpuxsm,
                             __global cl_ushort *gpuresults,
                             __global GPUCALCX *gpucalcx,
                             __global const cl_ushort *gpusmlist,
                             const cl_int listlen)
{
    // Do nothing - it demonstrates the problem just as well
    return;
}

// Replacement Templatec.cpp

#include <CL/cl.h>
#include <string.h>
#include <cstdlib>
#include <iostream>
#include <string>
#include <fstream>
#include <time.h>
#include <conio.h>

/*** GLOBALS ***/
#define tablim 256
#define kernel_count 256
#define mapydim 16

typedef float cl_float;
typedef unsigned short cl_ushort;
typedef signed short cl_short;
typedef signed int cl_int;
typedef unsigned int cl_uint;
typedef unsigned char cl_uchar;

typedef struct tagGPUMETRICS { cl_uint data[512*6]; } GPUMETRICS;
typedef struct tagGPUXSM { cl_uint data[32]; } GPUXSM;
typedef struct tagGPUCALCX { cl_uint data[32]; } GPUCALCX;

cl_int gpuxsmlen;             // Length of gpuxsm array
cl_int gpuresultslen;
cl_int gpusmlistlen;          // Lengths of gpusmlist/gpuxsmloc arrays

GPUMETRICS *gpumetrics;       // readonly by GPU, fixed for a given initgtab
GPUCALCX gpucalcx;            // updated by GPU
GPUXSM *gpuxsm;               // readonly by GPU, fixed for a given setupasm
cl_ushort *gpusmlist;         // readonly by GPU, fixed for a given setupasm
cl_ushort *gpuresults;        // data updated by GPU

cl_mem cl_gpumetrics;
cl_mem cl_gpuxsm;
cl_mem cl_gpusmlist;
cl_mem cl_gpuresults;
cl_mem cl_gpucalcx;

cl_context context;
cl_device_id *devices;
cl_command_queue commandQueue;
cl_program program;
cl_kernel kernel;             // This program uses only one kernel and this serves as a handle to it

cl_uint maxDims;
size_t maxWorkGroupSize;
size_t maxWorkItemSizes[3];

/*** FUNCTION DECLARATIONS ***/
int initializeCL(void);
int convertToString(const char *filename, std::string& str);
int cleanupCL(void);
int gpu_setupargs(void);
int gpu_runKernel(long);

/*
 * Converts the contents of a file into a string
 */
int convertToString(const char *filename, std::string& s)
{
    size_t size;
    char *str;
    std::fstream f(filename, (std::fstream::in | std::fstream::binary));
    if(f.is_open())
    {
        size_t fileSize;
        f.seekg(0, std::fstream::end);
        size = fileSize = (size_t)f.tellg();
        f.seekg(0, std::fstream::beg);
        str = new char[size+1];
        if(!str) { f.close(); return NULL; }
        f.read(str, fileSize);
        f.close();
        str[size] = '\0';
        s = str;
        delete[] str;
        return 0;
    }
    fprintf(stderr, "GPU Error: Failed to open file %s\n", filename);
    return 1;
}

/*
 * \brief OpenCL related initialization
 *        Create Context, Device list, Command Queue
 *        Create OpenCL memory buffer objects
 *        Load CL file, compile, link CL source
 *        Build program and kernel objects
 */
int initializeCL(void)
{
    cl_int status = 0;
    size_t deviceListSize;

    /*
     * Have a look at the available platforms and pick either
     * the AMD one if available or a reasonable default.
     */
    cl_uint numPlatforms;
    cl_platform_id platform = NULL;
    status = clGetPlatformIDs(0, NULL, &numPlatforms);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Getting Platforms. (clGetPlatformIDs)\n"); return 1; }

    if(numPlatforms > 0)
    {
        cl_platform_id *platforms = (cl_platform_id *)malloc(numPlatforms*sizeof(cl_platform_id));
        status = clGetPlatformIDs(numPlatforms, platforms, NULL);
        if(status != CL_SUCCESS) { fprintf(stderr, "Error: Getting Platform Ids. (clGetPlatformIDs)\n"); return 1; }
        for(unsigned int i=0; i < numPlatforms; ++i)
        {
            char pbuff[100];
            status = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(pbuff), pbuff, NULL);
            if(status != CL_SUCCESS) { fprintf(stderr, "Error: Getting Platform Info. (clGetPlatformInfo)\n"); return 1; }
            platform = platforms[i];
            //if(!strcmp(pbuff, "Advanced Micro Devices, Inc."))
            //{
            //    break;
            //}
        }
        delete platforms;
    }

    if(NULL == platform) { fprintf(stderr, "GPU Error: NULL platform found so Exiting Application.\n"); return 1; }

    /*
     * If we could find our platform, use it. Otherwise use just the available platform.
     */
    cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 };

    // Create an OpenCL context
    context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU,    //*** Only interested if GPU present
                                      NULL, NULL, &status);
    if(status != CL_SUCCESS)
    {
        context = clCreateContextFromType(cps, CL_DEVICE_TYPE_ALL, NULL, NULL, &status);
        if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Creating Context. (clCreateContextFromType)\n"); return 1; }
        fprintf(stderr, "Using CPU\n");
    }
    else
        fprintf(stderr, "Using GPU\n");

    /* First, get the size of the device list data */
    status = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Getting Context Info (device list size, clGetContextInfo)\n"); return 1; }

    // Detect OpenCL devices
    devices = (cl_device_id *)malloc(deviceListSize);
    if(devices == 0) { fprintf(stderr, "GPU Error: No devices found.\n"); return 1; }

    /* Now, get the device list data */
    status = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Getting Context Info (device list, clGetContextInfo)\n"); return 1; }

    char buffer[1024];
    cl_uint buf_uint;
    cl_uint buf_ulong;
    clGetDeviceInfo(devices[0], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
    fprintf(stderr, " DEVICE_NAME = %s\n", buffer);
    clGetDeviceInfo(devices[0], CL_DEVICE_VENDOR, sizeof(buffer), buffer, NULL);
    fprintf(stderr, " DEVICE_VENDOR = %s\n", buffer);
    clGetDeviceInfo(devices[0], CL_DEVICE_VERSION, sizeof(buffer), buffer, NULL);
    fprintf(stderr, " DEVICE_VERSION = %s\n", buffer);
    clGetDeviceInfo(devices[0], CL_DRIVER_VERSION, sizeof(buffer), buffer, NULL);
    fprintf(stderr, " DRIVER_VERSION = %s\n", buffer);
    clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(buf_uint), &buf_uint, NULL);
    fprintf(stderr, " DEVICE_MAX_COMPUTE_UNITS = %u\n", (unsigned int)buf_uint);
    clGetDeviceInfo(devices[0], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(buf_uint), &buf_uint, NULL);
    fprintf(stderr, " DEVICE_MAX_CLOCK_FREQUENCY = %u\n", (unsigned int)buf_uint);
    clGetDeviceInfo(devices[0], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(buf_ulong), &buf_ulong, NULL);
    fprintf(stderr, " DEVICE_GLOBAL_MEM_SIZE = %llu\n", (unsigned long long)buf_ulong);

    // Create an OpenCL command queue
    commandQueue = clCreateCommandQueue(context, devices[0], 0, &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Creating Command Queue. (clCreateCommandQueue)\n"); return 1; }

    // Load CL file, build CL program object, create CL kernel object
    const char *filename = "templatec_kernels.cl";
    std::string sourceStr;
    status = convertToString(filename, sourceStr);
    if(status != CL_SUCCESS)
        return 1;
    const char *source = sourceStr.c_str();
    size_t sourceSize[] = { strlen(source) };
    program = clCreateProgramWithSource(context, 1, &source, sourceSize, &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Loading source into cl_program (clCreateProgramWithSource)\n"); return 1; }

    /* Create a cl program executable for all the devices specified */
    status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Building Program (clBuildProgram)\n"); return 1; }
    else
        fprintf(stderr, "GPU Successfully Built Program (clBuildProgram)\n");

    /* Get a kernel object handle for a kernel with the given name */
    kernel = clCreateKernel(program, "templateKernel", &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Creating Kernel from program. (clCreateKernel)\n"); return 1; }
    return 0;
}

int gpu_setupargs(void)
{
    cl_int status;
    cl_int i;

    /*
     * Query device capabilities: maximum work item
     * dimensions and maximum work item sizes.
     */
    status = clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                             sizeof(size_t), (void*)&maxWorkGroupSize, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Getting Device Info. (clGetDeviceInfo)\n"); return 1; }
    fprintf(stderr, "maxworkgroupsize = %d\n", maxWorkGroupSize);

    status = clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                             sizeof(cl_uint), (void*)&maxDims, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Getting Device Info. (clGetDeviceInfo)\n"); return 1; }

    status = clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                             sizeof(size_t)*maxDims, (void*)maxWorkItemSizes, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Getting Device Info. (clGetDeviceInfo)\n"); return 1; }

    cl_long maxmem;
    status = clGetDeviceInfo(devices[0], CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                             sizeof(cl_long), (void*)&maxmem, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Getting Device Info. (clGetDeviceInfo)\n"); return 1; }
    fprintf(stderr, "device info %d, %d, %d, %d\n", maxWorkGroupSize, maxDims, maxWorkItemSizes, maxmem);

    // Release OpenCL memory buffers
    if (cl_gpuxsm)    { status = clReleaseMemObject(cl_gpuxsm);    if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpuxsm)\n"); return 1; } }
    if (cl_gpumetrics){ status = clReleaseMemObject(cl_gpumetrics);if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpumetrics)\n"); return 1; } }
    if (cl_gpuresults){ status = clReleaseMemObject(cl_gpuresults);if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpuresults)\n"); return 1; } }
    if (cl_gpusmlist) { status = clReleaseMemObject(cl_gpusmlist); if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpusmlist)\n"); return 1; } }

    cl_gpumetrics = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(GPUMETRICS), NULL, &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clCreateBuffer (gpumetrics)\n"); return 1; }
    status = clEnqueueWriteBuffer(commandQueue, cl_gpumetrics, CL_TRUE, 0,
                                  sizeof(GPUMETRICS), gpumetrics, 0, NULL, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clEnqueueWriteBuffer (gpumetrics)\n"); return 1; }

    cl_gpuxsm = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(GPUXSM) * gpuxsmlen,
                               NULL, //gpuxsm,
                               &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clCreateBuffer (gpuxsm)\n"); return 1; }
    status = clEnqueueWriteBuffer(commandQueue, cl_gpuxsm, CL_TRUE, 0,
                                  sizeof(GPUXSM) * gpuxsmlen, gpuxsm, 0, NULL, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clEnqueueWriteBuffer (gpuxsm)\n"); return 1; }

    cl_gpucalcx = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(GPUCALCX), NULL, &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clCreateBuffer (gpucalcx)\n"); return 1; }

    cl_gpusmlist = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_ushort) * gpusmlistlen, NULL, &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clCreateBuffer (gpusmlist)\n"); return 1; }
    status = clEnqueueWriteBuffer(commandQueue, cl_gpusmlist, CL_TRUE, 0,
                                  sizeof(cl_ushort) * gpusmlistlen, gpusmlist, 0, NULL, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clEnqueueWriteBuffer (gpusmlist)\n"); return 1; }

    cl_gpuresults = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_ushort) * gpuresultslen, NULL, &status);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clCreateBuffer (gpuresults)\n"); return 1; }

    /*** Set appropriate arguments to the kernel ***/
    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&cl_gpumetrics);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Setting kernel argument. (0)\n"); return 1; }
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&cl_gpuxsm);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Setting kernel argument. (1)\n"); return 1; }
    status = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&cl_gpuresults);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Setting kernel argument. (2)\n"); return 1; }
    status = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&cl_gpucalcx);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Setting kernel argument. (3)\n"); return 1; }
    status = clSetKernelArg(kernel, 4, sizeof(cl_mem), (void *)&cl_gpusmlist);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Setting kernel argument. (4) %d\n", 0); return 1; }
    status = clSetKernelArg(kernel, 5, sizeof(cl_uint), (void *)&gpusmlistlen);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Setting kernel argument. (5)\n"); return 1; }
    else
    {
        //fprintf(stderr, "GPU Error: Setting kernel argument. (5) = %d\n", gpusmlistlen[0]);
    }
    fprintf(stderr, "GPU setup args done\n");
    return 0;
}

/*
 * \brief Run OpenCL program
 *
 *        Bind host variables to kernel arguments
 *        Run the CL kernel
 */
int gpu_runKernel(cl_int ilist)
{
    cl_int status;
    size_t globalThreads[1];
    size_t localThreads[1];

    status = clEnqueueWriteBuffer(commandQueue, cl_gpucalcx, CL_TRUE, 0,
                                  sizeof(GPUCALCX), &gpucalcx, 0, NULL, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clEnqueueWriteBuffer (gpucalcx)\n"); return 1; }

    int globalworksize = ((gpusmlistlen+kernel_count-1)/kernel_count)*kernel_count;
    globalThreads[0] = globalworksize;
    localThreads[0]  = kernel_count;
    //fprintf(stderr, "maxgroups = %d\n", globalworksize/kernel_count);

    if(localThreads[0] > maxWorkGroupSize || localThreads[0] > maxWorkItemSizes[0])
    {
        fprintf(stderr, "GPU Unsupported: Device does not support requested number of work items.");
        return 1;
    }

    clFinish(commandQueue);

    /*
     * Enqueue a kernel run call.
     */
    status = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                                    globalThreads, localThreads, 0, NULL, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: Enqueueing kernel onto command queue. (clEnqueueNDRangeKernel)\n"); return 1; }
    clFinish(commandQueue);

    // Enqueue readBuffer
    status = clEnqueueReadBuffer(commandQueue, cl_gpucalcx, CL_TRUE, 0,
                                 sizeof(GPUCALCX), &gpucalcx, 0, NULL, NULL);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: clEnqueueReadBuffer (gpucalcx)\n"); return 1; }
    clFinish(commandQueue);
    return 0;
}

/*
 * \brief Release OpenCL resources (Context, Memory etc.)
 */
int cleanupCL(void)
{
    cl_int status, i;

    status = clReleaseMemObject(cl_gpucalcx);
    if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpucalcx)\n"); /*return 1;*/ }
    status = clReleaseMemObject(cl_gpuxsm);
    if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpuxsm)\n"); /*return 1;*/ }
    status = clReleaseMemObject(cl_gpumetrics);
    if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpumetrics)\n"); /*return 1;*/ }
    status = clReleaseMemObject(cl_gpuresults);
    if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpuresults)\n"); /*return 1;*/ }
    status = clReleaseMemObject(cl_gpusmlist);
    if(status != CL_SUCCESS) { printf("Error: In clReleaseMemObject (gpusmlist)\n"); /*return 1;*/ }
    status = clReleaseKernel(kernel);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: In clReleaseKernel\n"); /*return 1;*/ }
    status = clReleaseProgram(program);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: In clReleaseProgram\n"); /*return 1;*/ }
    status = clReleaseCommandQueue(commandQueue);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: In clReleaseCommandQueue\n"); /*return 1;*/ }
    status = clReleaseContext(context);
    if(status != CL_SUCCESS) { fprintf(stderr, "GPU Error: In clReleaseContext\n"); /*return 1;*/ }
    if(devices != NULL)
    {
        free(devices);
        devices = NULL;
    }
    return 0;
}

int gpu_open(void)
{
    // Initialize OpenCL resources
    if(initializeCL() == 1)
        return 1;
    //gpucalcx = (GPUCALCX *) malloc (sizeof(GPUCALCX));
    return 0;
}

int gpu_close(void)
{
    // Release OpenCL resources
    if(cleanupCL() == 1)
        return 1;
    return 0;
}

int gpu_setuptabs(void)
{
    if (gpumetrics)
        free(gpumetrics);
    gpumetrics = (GPUMETRICS *)malloc(sizeof(GPUMETRICS));
    // for test purposes just zero it
    memset(gpumetrics, 0, sizeof(GPUMETRICS));
    return 0;
}

int gpu_setupasm(void)
{
    long ic, is, kv;
    cl_ushort *pgpusmlist;
    GPUXSM *pgpuxsm;
    cl_ushort index;

    if (gpuxsm)    { free(gpuxsm); gpuxsmlen = 0; }
    if (gpuresults) free(gpuresults);
    if (gpusmlist)  free(gpusmlist);
    gpusmlistlen = 0;

    gpuxsmlen = 16000;
    gpuxsm = (GPUXSM *)malloc(gpuxsmlen * sizeof(GPUXSM));
    // for test purposes just zero it
    memset(gpuxsm, 0, gpuxsmlen * sizeof(GPUXSM));
    pgpuxsm = gpuxsm;
    fprintf(stderr, "gpuxsmlen = %d, %d\n", gpuxsmlen, gpuxsmlen * sizeof(GPUXSM));

    gpusmlistlen  = 4000;
    gpuresultslen = 4000;
    gpusmlist = (cl_ushort *)malloc(gpusmlistlen * sizeof(cl_ushort));
    fprintf(stderr, "gpusmlistlen = %d\n", gpusmlistlen);
    for (index = 0; index < gpusmlistlen; index++)
        gpusmlist[index] = index << 2;
    gpuresults = (cl_ushort *)malloc(gpuresultslen * sizeof(cl_ushort));
    return 0;
}

int main(int argc, char **argv)
{
    clock_t t0;

    gpu_open();
    gpu_setupasm();
    gpu_setuptabs();
    gpu_setupargs();

    fprintf(stderr, "sizeofs %d\n", sizeof(GPUMETRICS));
    fprintf(stderr, "sizeofs %d\n", sizeof(GPUXSM));
    fprintf(stderr, "sizeofs %d\n", sizeof(GPUCALCX));
    if (gpumetrics == 0) fprintf(stderr, "gpumetrics not set up\n");
    if (gpuxsm == 0)     fprintf(stderr, "gpuxsm not set up\n");
    if (gpusmlist == 0)  fprintf(stderr, "gpusmlist not set up\n");
    if (gpuresults == 0) fprintf(stderr, "gpuresults not set up\n");

    memset(&gpucalcx, 0, sizeof(GPUCALCX));

    t0 = clock();
    cl_uint i;
    for (i = 0; i < 1300; i++)
        gpu_runKernel(0);
    t0 = clock() - t0;
    fprintf(stderr, "time %d\n", t0);

    gpu_close();
    getch();
    exit(0);
}

                        • kernel overhead too high (again)
                          antzrhere

                          I *think* this may be related to your problem, but I'm not sure; have a try:

                          You're not calling clSetKernelArg() every time before executing your kernel code (clSetKernelArg() should be in your gpu_runKernel() code so you set the arguments each and every time you call the kernel; that's how I see it, but I haven't tested it). This may result in your 2nd, 3rd etc. kernel iterations not having the correct arguments, hence why these are taking much longer than just one iteration.

                          This is just a guess - your code may in fact be perfectly legal. I know clSetKernelArg() does not retain memory objects specified as arguments according to the docs, but I'm not sure if it retains the reference/pointer to memory objects?

                          [EDIT] I've tried not specifying the arguments on every kernel call on my system using AMD SDK 2.5 and a 5870 and it's fine, no difference, so maybe I'm wrong - but who knows, maybe on your setup things behave differently?

                            • kernel overhead too high (again)
                              antzrhere

                              According to the spec, "The argument data pointed to by arg_value is copied and the arg_value pointer can therefore be reused by the application after clSetKernelArg returns"

                              So what I said is incorrect and there should be no problem with your code. However, give it a try; it might be different on your setup.

                                • kernel overhead too high (again)
                                  sourcery

                                  Resetting clSetKernelArg every time the kernel is invoked doesn't help, unfortunately.

                                    • kernel overhead too high (again)
                                      sourcery

                                      Ok, my understanding has improved a little.

                                      The very long run times when profiling Templatec must be due to the profiling data collection.

                                      When not running the profiler I get an average runKernel execution time of 0.3 milliseconds. It always seems to take 0.3 milliseconds even if the kernel simply returns on entry.

                                      The original problem of too high overhead remains, though.

                                      If I bypass OpenCL entirely and write the equivalent kernel code as a subroutine, I get an execution time of 0.123 milliseconds (single threaded).

                                      In practice I would need the OpenCL kernel run time to come down to 0.01 milliseconds to be useful (my machine has 12 cores).

                                       

                                        • kernel overhead too high (again)
                                          antzrhere

                                          Yes, enabling profiling does adversely affect performance in my experience.

                                          Maybe that is the sort of overhead to expect; I always assumed it would be lower. Come to think of it, OpenCL on the CPU always executed faster than the GPU when I used a dummy function, so maybe it's a GPU thing.

                                          I guess it's because OpenCL is designed to handle large data sets, where execution times mask any overhead. Could you somehow port your decision-making code to run in parallel on the GPU (possibly by serialising output using global atomics), or let one thread on the GPU make the decision while the other threads sit idle, then carry on as usual?

                                            • kernel overhead too high (again)
                                              sourcery

                                              Notzed, I think some of the analogies you propose support my case! Assembly-line production requires, at each step, the previous one to be completed ;-)

                                              Let me try again to explain why I think the overhead time for the runKernel is excessively unreasonable. My test runKernel simply writes an input buffer (to the kernel) of around 100 bytes to the GPU, waits for completion, enqueues the kernel (which does nothing & simply returns on entry), waits for completion, reads a kernel output buffer of around 100 bytes and waits for completion. Average throughput: 2000 runKernels/sec.

                                              On the CPU, and outside OpenCL, the routine I wanted to run on the GPU can run 8000 times/sec single threaded. Each run of the routine (let's call it findnearestpoint) compares a point (described in around 80 bytes) to 4000 others. Let's say 20 * 20 * 4000 uint comparisons/lookups and 1.1 million lines of C code executed.

                                              How come I can do that on the CPU 4 times over in the time OpenCL takes to write 100 bytes, run an empty kernel and read 100 bytes? I think there is some very bad code responsible for the overhead.

                                                • kernel overhead too high (again)
                                                  Raistmer

                                                  How come I can do that on the CPU 4 times over in the time OpenCL takes to write 100 bytes, run an empty kernel and read 100 bytes? I think there is some very bad code responsible for the overhead.


                                                  Though bad code could be involved too, even with perfectly optimized runtime code you will face a big slowdown in your usage case.
                                                  The reason is: you need to communicate between two "devices", CPU + host (system) memory and GPU + device memory. This communication goes over very slow paths (compared with CPU<->system memory and GPU<->device memory). And this is unavoidable (unless you use something like AMD's APU).
                                                  You can treat it as unavoidable "hardware delays" and not as "bad programming".
                                                  To make good use of a discrete GPU you should supply work for it in big chunks. If your algorithm can't be restructured for this, then probably GPU processing is inappropriate for it.
                                                  Some (actually, quite a lot of) algorithms can't show good speedup on a GPU; it's quite a specialised accelerator...
                                                    • kernel overhead too high (again)
                                                      corry

                                                      Let me toss my $0.02 in here for your amusement.

                                                      First off, Raistmer is almost correct. Future APUs will likely be able to do this for you with no problem. Current ones, from my understanding, are still acting as two devices, and as such still involve memory transfers, mapping virtual pages, etc. In the future (the sooner the better, IMHO) they have on their roadmap to support coherent caches, coherent memory, the same virtual address space, GPU context switching, and preemption. When even most of that is in place, GPGPU will disappear and the only term we will use will be heterogeneous computing. I can't wait...

                                                      I think for your case, it sounds like you have a problem that just can't be accelerated by GPUs. Not all problems fit. Just because you can write C/C++-like code, it doesn't mean it is going to be faster. You have to see if you *can* map your problem to a massively parallel system. That said, you may have just missed a model that can be used. All threads on the GPU share global memory, so you can have a many-threads-working-on-a-single-problem model; there is the local data share, which allows groups of threads to work on the same problem; and the not-quite-as-often-thought-of model of all threads working on separate, unassociated data. There are others too, but they are mostly going to be derivatives of the above. See if you can fit more work on the GPU at once. You'll make far better use of it. Like any external hardware/software interface, once you have code executing in it, you want to stay there as long as possible, and do as much as possible!

                                                      That said, see if you can only slightly rearchitect your Myapp to be able to handle more data. It may be you can see a performance increase by modifying your massively parallel strategy AND your host app strategy. Oftentimes, the slightest tweak makes all the difference in the world!

                                                        • kernel overhead too high (again)
                                                          sourcery

                                                          Raistmer I checked the hardware possible slowdown by running on the CPU only, it ran even slower so I guess it has to be some software startup/exit overhead.

                                                          Thanks for all your comments and I do take on board the importance of gving the GPU as much to do as possible at once. Its just I know how difficult that is going to be in Myapp. The routine gets called in around 20 places, it was never faster on a CPU to do more than one at once so there was no consideration given to aggregating calls. Each CPU core was given an independent thread so no need to aggregate there either.

                                                          The next step (had I got the speedup), would have been to get it to work for multiple threads and if/when that was working put a graphics card into every PC on the company network (50-100).

                                                          That said, OpenCL should not have to behave this way; 0.3-0.5 milliseconds means a lot of overhead code on today's machines.

                                                          The routine is a perfect fit for parallelism (assuming quick startup & exit) and has already been written (I thought the hard part was over!). Around 75% of the execution time of Myapp is spent in the routine, and Myapp fully utilises all the cores available to the CPU.

                                                          I can't be the only one who wants to use a GPU as a co-processor rather than try to hand over the whole application to it.

                                                          I mused about the idea of having a permanently running kernel, blocked on semaphores which, when released, meant that the kernel could switch to a second input & output buffer (or back to the first). That sounds like a recipe, though, for a high electricity bill and a very hot GPU. Also there could be only one kernel, but if Corry's GPU context switching comes along maybe it's a possibility.

                                                          I was rather hoping an AMD person would drop in and say, yep, known problem, should be fixed in the next release. Maybe they still will ;-) Failing that, it's back to the drawing board.

                                                           

                                                            • kernel overhead too high (again)
                                                              corry

                                                               

                                                              Originally posted by: sourcery ...Also there could be only one kernel, but if Corry's GPU context switching comes along maybe it's a possibility.

                                                              I was rather hoping an AMD person would drop in and say, yep, known problem, should be fixed in the next release. Maybe they still will ;-) Failing that, it's back to the drawing board.

                                                               



                                                              Not my GPU context switching; google FSA, or APU, etc., and you'll see the roadmap.  This is AMD's plan according to the roadmap.  I think the GPU context switching is towards the end of the roadmap, though...

                                                              Yes, we all want to use the GPU as a coprocessor, that's what FSA seems to be about.  Right now though, it is just not possible.

                                                              As I said before, you might take a step back from Myapp and see if there isn't some tweak you can do to separate things out and make the work aggregatable.  Oftentimes, it's the little things that make the biggest difference.

                                                              If your architect was an overly OOP guy, yeah, you're probably out of luck... I inherited one such project once: a plotting library which included a class CPoint, with CXYPoint, CPolarPoint, C3DPoint, and a few others derived from it. The overhead of pulling the data out and stuffing it into a vertex buffer was so immense that unless I was plotting over 10,000,000 (yes, 10M) points, it simply wasn't worth it.  So trust me, I know there are situations where things just can't be re-architected easily to make them compatible with accelerators.  It's still worth a big-picture look!

                                                                • kernel overhead too high (again)
                                                                  LeeHowes

                                                                  Launching a single kernel is very slow. You need to wait for 2 or 3 context switches thanks to the way the driver architecture works, do multiple copies to move data into kernel space to pass parameters to the device, and so on.

                                                                  This will be fixed, but not by a driver update on current devices. At best we can hope for small progressive improvements there, many of which have already been made: batched launches on Windows are now pretty fast, single launches probably still carry some overhead, and Linux lags a little behind.

                                                                    • kernel overhead too high (again)
                                                                      sourcery

                                                                      LeeHowes, is the HD7xxx series likely to be any different?

                                                                      I don't mind waiting 2-3 months for a new card if it solves the problem.

                                                                        • kernel overhead too high (again)
                                                                          sourcery

                                                                          If the kernel startup time is a difficult fix, I have another possible solution.

                                                                          At the end of the kernel the kernel could optionally have a special barrier command, something like

                                                                          barrier  restart a, b, c

                                                                          a, b and c are global memory locations that, when changed from 0 to 1, are treated as events.

                                                                          a might be an event set by the kernel that tells the host the kernel run has finished, which allows the host to read results data and refill the input data.

                                                                          b might be an event that, when set by the host, means the whole kernel restarts across all compute units.

                                                                          c might be an event that, when set by the host, means the kernel should exit.

                                                                          This allows the kernel to run forever without consuming all the GPU resources.

                                                                          Such a kernel would only ever need to be Enqueued once.
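                                                                          For what it's worth, this proposal resembles what is sometimes called a "persistent kernel". Below is a heavily hedged sketch of the device side in OpenCL C: the flag names and protocol are invented for illustration, a real version would need atomic operations and memory fences rather than plain volatile loads, barrier() only synchronizes a single work-group (so the sketch assumes one work-group), and many drivers will reset or time out a kernel that spins like this.

                                                                          ```c
                                                                          // Device-side sketch only; not a working implementation.
                                                                          __kernel void persistent(__global volatile int *run,   // host 0->1: start a pass
                                                                                                   __global volatile int *done,  // kernel 0->1: pass finished
                                                                                                   __global volatile int *quit,  // host 0->1: exit the kernel
                                                                                                   __global const float *in,
                                                                                                   __global float *out)
                                                                          {
                                                                              size_t i = get_global_id(0);
                                                                              while (!*quit) {
                                                                                  if (*run) {
                                                                                      out[i] = in[i] * 2.0f;          // placeholder for the real work
                                                                                      barrier(CLK_GLOBAL_MEM_FENCE);  // wait for the whole work-group
                                                                                      if (i == 0) { *run = 0; *done = 1; }
                                                                                  }
                                                                              }
                                                                          }
                                                                          ```

                                                                          As noted in the thread, a kernel like this would indeed only need to be enqueued once, but it occupies the GPU continuously while it spins.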

                                                              • kernel overhead too high (again)
                                                                notzed

                                                                 

                                                                Originally posted by: sourcery Notzed, I think some of the analogies you propose support my case! Assembly-line production requires, at each step, the previous one to be completed ;-)

                                                                Ahh, but they only work efficiently if you also have many partially completed items in the pipeline.

                                                                To stretch the analogy a bit far, your case is akin to running a single item through an entire production line to completion, shipping it to a retailer, and selling it to a happy customer before even starting to buy the parts for the next one.

                                                                Obviously this can easily be many orders of magnitude slower.

                                                                 

                                                                Let me try again to explain why I think the overhead time for the runkernel is excessively unreasonable. My test runkernel simply writes an input buffer (to the kernel) of around 100 bytes to the GPU, waits for completion, enqueues the kernel (which does nothing & simply returns on entry), waits for completion, reads a kernel output buffer of around 100 bytes and waits for completion. Average throughput 2000 runkernels/sec.

                                                                Maybe a bit slow, but it sounds OK enough considering how you're doing it.  There are physical laws that determine how fast a signal can transition between 0 and 1 across a copper wire; it's not just down to software.

                                                                This is why the API is asynchronous: queuing multiple jobs amortizes this overhead and also allows (unavoidable) latencies to be hidden.

                                                                 

                                                                 

                                                                On the CPU, and outside OpenCL, the routine I wanted to run on the GPU can run 8000 times/sec single-threaded. Each run of the routine (let's call it findnearestpoint) compares a point (described in around 80 bytes) to 4000 others. Let's say 20 * 20 * 4000 uint comparisons/lookups and 1.1 million lines of C code executed.

                                                                 

                                                                How come I can do that on the CPU 4 times over in the time OpenCL takes to write 100 bytes, run an empty kernel and read 100 bytes?  I think some very bad code is responsible for the overhead.



                                                                Well, if the code is in the same file, the compiler can even remove the function call overhead and inline it completely.  I.e. it's all very well saying you can reduce the call overhead, but it doesn't mean you're comparing apples to apples.

                                                                Even if the overhead were much smaller, this data-flow design would still be inefficient.  E.g. even with zero overhead, you would still have the CPU idling while the co-processor executes, then the co-processor idling while the CPU executes.  That is a lot of wasted flops, assuming the CPU is doing any work (and if it isn't, why is it involved?).

                                                                If your problem can't be made to fit (or rather, it will simply cost too much to make it fit - as a co-worker of mine always says 'anything is possible: it's only software'), then you might be out of luck.

                                                                The only other alternative is to run lots of separate jobs concurrently, if there is any point at which the problems become independent.

                                                                The thing is, you have a system with specific parameters, and you have to align to those parameters if you want to get the best out of it.  Even if the call overhead is reduced, the basic system parameters are not really going to change; they are determined by physics.
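                                                                To make the asynchronous-API point concrete, here is a hedged host-side fragment.  It assumes an existing in-order queue q, kernel k, buffers in_buf/out_buf and host arrays host_in/host_out (all hypothetical names), and it needs an OpenCL runtime, so it is a sketch rather than a complete program.  Instead of a blocking write/launch/read per call, the three commands are enqueued without waiting and chained with events, letting the driver overlap transfers and execution across iterations:

                                                                ```c
                                                                /* All identifiers here are assumed to exist already. */
                                                                cl_event wrote, ran;
                                                                for (int i = 0; i < 1300; ++i) {
                                                                    /* non-blocking write: returns immediately */
                                                                    clEnqueueWriteBuffer(q, in_buf, CL_FALSE, 0, 100,
                                                                                         host_in[i], 0, NULL, &wrote);
                                                                    /* the launch waits only on the write having landed */
                                                                    size_t global = 256;
                                                                    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL,
                                                                                           1, &wrote, &ran);
                                                                    /* non-blocking read, ordered after the kernel */
                                                                    clEnqueueReadBuffer(q, out_buf, CL_FALSE, 0, 100,
                                                                                        host_out[i], 1, &ran, NULL);
                                                                    clReleaseEvent(wrote);
                                                                    clReleaseEvent(ran);
                                                                }
                                                                clFinish(q);  /* synchronize once, at the end */
                                                                ```

                                                                A real version would likely also double-buffer in_buf and out_buf so that successive iterations can genuinely overlap; with a single in-order queue the commands above still serialize on the device.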

                                                                 

                                                                 

                                                                 

                                                                  • kernel overhead too high (again)
                                                                    sourcery

                                                                    notzed, thanks for your comments.

                                                                    Myapp has completely independent threads apart from sharing some read-only data areas. The plan was for each thread to have its own OpenCL context and enqueue commands independently to the GPU. To make sure all the CPU cores were fully utilised, I would have adjusted the number of threads upward to suit.

                                                                    The real problem, it seems, is that my kernel is too short or OpenCL kernel latency too long. A loop inside the kernel of 100 iterations would give me a throughput 10 times as great as one outside it, using the real kernel and real data. A 0.025 millisecond average turnaround means the project has a future. The difficulty for me is giving the kernel anything like 100 input data items at once.

                                                                    Myapp is a recognition process that is designed to self-correct, retry already-tested items and explore lots of what-if scenarios to find better solutions, and it proceeds on an item-by-item basis. As you say, it's only software; it can be done, but it's an expensive major rewrite.

                                                                    I have not given up yet; I have plenty of time to think while Myapp performs its endless optimisations ;-)

                                                  • kernel overhead too high (again)
                                                    notzed

                                                    You're trying to do something that can't possibly be fast.  You need to either move your decision code to the GPU or re-arrange your host code so that the call overhead is hidden by overlapped processing.

                                                    Synchronous calls like this are a sure way to kill performance; both the CPU and GPU will be doing nothing most of the time.

                                                    Computer architecture, CPU architecture, network architecture, operating system design, assembly-line production, just-in-time warehousing and transport, hosting a dinner party... pretty much every process you care to think of that requires work by specialised equipment needs to pipeline the work, keeping the high-cost devices busy in order to make them efficient.

                                                     

                                                • kernel overhead too high (again)
                                                  nou

                                                  Most of the overhead comes from memory transfers over the PCIe bus and from synchronization.

                                                  Try experimenting with AMD persistent memory and moving that 8000-byte buffer into host-visible memory; refer to the AMD APP OpenCL Programming Guide.
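                                                  To make this concrete, here is a hedged fragment; it assumes an existing context ctx and queue q (hypothetical names) and needs an OpenCL runtime, so it is a sketch rather than a complete program.  The portable route to host-visible memory is CL_MEM_ALLOC_HOST_PTR; AMD's vendor-specific alternative is the CL_MEM_USE_PERSISTENT_MEM_AMD flag described in the programming guide:

                                                  ```c
                                                  /* Allocate the 8000-byte result buffer in host-visible memory and map
                                                     it, instead of copying it back with clEnqueueReadBuffer.  On suitable
                                                     hardware the map is zero-copy rather than a PCIe transfer. */
                                                  cl_int err;
                                                  cl_mem result = clCreateBuffer(ctx,
                                                          CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                                          8000, NULL, &err);
                                                  /* ... set as a kernel argument, run the kernel ... */
                                                  void *p = clEnqueueMapBuffer(q, result, CL_TRUE, CL_MAP_READ,
                                                                               0, 8000, 0, NULL, NULL, &err);
                                                  /* read the 8000 bytes through p, then unmap before the next run */
                                                  clEnqueueUnmapMemObject(q, result, p, 0, NULL, NULL);
                                                  ```

                                                  Whether the map is actually zero-copy depends on the device and driver, so it is worth profiling both paths.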