noxnet

Performance drops on iterative kernel calls

Discussion created by noxnet on Apr 15, 2010
Latest reply on Apr 22, 2010 by noxnet

I've written a function that executes a kernel iteratively. The function handles buffer creation/release,
sets the kernel args and calls clEnqueueNDRangeKernel in a loop (1000 iterations).

When I call this function several times for the same kernel,
the total execution time of all iterations rises with each function call (not each kernel call).

I'm using OpenCL profiling to measure the average kernel execution time, and that average
is constant across all function calls. So the kernel itself seems to behave correctly. I guess this is a memory/buffer
issue.
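
For reference, OpenCL profiling reports a kernel's execution time as the difference between two device timestamps in nanoseconds, which `clGetEventProfilingInfo` returns for `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`. The body of `get_event_exec_time` is not shown in this post, so the sketch below only illustrates the conversion and averaging such a helper presumably performs. `profiled_ms` and `average_ms` are hypothetical names, and the timestamps are passed in as plain integers so the snippet compiles without the OpenCL headers.

```c
#include <stddef.h>

/* Convert a pair of profiling timestamps (nanoseconds, as reported for
 * CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) to milliseconds.
 * Hypothetical helper: the real get_event_exec_time is not shown in the post. */
static double profiled_ms(unsigned long long start_ns, unsigned long long end_ns)
{
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Average kernel time in ms from an accumulated total
 * (the role time_kernel_exec plays in CODE 1). */
static double average_ms(double total_ms, size_t iterations)
{
    return iterations ? total_ms / (double)iterations : 0.0;
}
```

A time computed this way covers only the span in which the kernel runs on the device, so it would not show any time spent on the host between launches.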

I discovered this issue when running a kernel from the ATI Image Convolution sample (invariants).
Using a 512x512 matrix and a 3x3 filter, running 1000 iterations, the average kernel execution time on an HD5450
is about 6.22 ms, so 6.22 ms * 1000 = 6.22 sec of kernel time per function call. When executing the function 4 times in a row I got the following times:

1st call: 8.1 sec
2nd call: 9.2 sec
3rd call: 10.0 sec
4th call: 11.0 sec
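
To make the gap explicit, here is a small sketch of the arithmetic: the kernel time expected from the profiled average, and the part of each measured total that it does not explain. The function names `kernel_total_sec` and `overhead_sec` are made up for illustration and are not part of the original code.

```c
/* Kernel time in seconds for `iterations` launches averaging `avg_ms` each. */
static double kernel_total_sec(double avg_ms, int iterations)
{
    return avg_ms * (double)iterations / 1000.0;
}

/* Part of a measured wall-clock total (seconds) that is not explained
 * by the profiled kernel execution time. */
static double overhead_sec(double measured_sec, double avg_ms, int iterations)
{
    return measured_sec - kernel_total_sec(avg_ms, iterations);
}
```

With the figures above, `kernel_total_sec(6.22, 1000)` is 6.22 sec, and the unexplained part of the four calls comes out at roughly 1.9, 3.0, 3.8 and 4.8 sec, so it grows by about one second per function call.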

The recorded times were measured around CODE 1, so buffer creation and similar setup are not included.

As already mentioned, I guess it is a memory buffer issue.

Any ideas?


Another problem I'm facing is poor performance on an HD5750; I already posted that in another thread.
On an HD5750 the average kernel execution time is about 5.5 ms, which is just slightly faster than
the 6.25 ms of the HD5450.

I'm using Windows 7 64-bit, Catalyst 10.3.

CODE 1:

    //start total time measurement
    for(int x = 0; x < iterations; x++)
    {
        err = clEnqueueNDRangeKernel(cmd_queue, kernels[kernel_id], 2, NULL,
                                     global_work_size, local_work_size,
                                     0, NULL, &kernel_event);
        err |= clFlush(cmd_queue);
        err_chk(err);
        assert(err == CL_SUCCESS);
        time_kernel_exec += get_event_exec_time(kernel_event);
    }
    clFinish(cmd_queue);
    //end total time measurement

################################
MEMORY BUFFERS

    in_mem = clCreateBuffer(context,CL_MEM_READ_WRITE, buff_size_padded, NULL, &err);
    err = clEnqueueWriteBuffer(cmd_queue, in_mem, CL_FALSE, 0, buff_size_padded, input, 0, NULL, NULL);

    filter_mem = clCreateBuffer(context,CL_MEM_READ_ONLY, buff_size_filter, NULL, &err);
    err = clEnqueueWriteBuffer(cmd_queue, filter_mem, CL_FALSE, 0, buff_size_filter, filter, 0, NULL, NULL);

    out_mem = clCreateBuffer(context, CL_MEM_WRITE_ONLY, buff_size_unpadded, NULL, &err);

################################
KERNEL

    kernel void Convolve_Inv(__global float * pInput,
                             __constant float * pFilter,
                             __global float * pOutput,
                             const int input_width,
                             const int filter_width)
    {
        const int nWidth = get_global_size(0);

        const int xOut = get_global_id(0);
        const int yOut = get_global_id(1);

        const int xInTopLeft = xOut;
        const int yInTopLeft = yOut;

        float sum = 0;
        for (int r = 0; r < FILTER_WIDTH; r++)
        {
            const int idxFtmp = r * FILTER_WIDTH;                 //current row in pFilter
            const int yIn = yInTopLeft + r;                       //current row in pInput
            const int idxIntmp = yIn * input_width + xInTopLeft;  //current pos in pInput

            for (int c = 0; c < FILTER_WIDTH; c++)
            {
                const int idxF = idxFtmp + c;
                const int idxIn = idxIntmp + c;
                sum += pFilter[idxF]*pInput[idxIn];
            }
        } //for (int r = 0...

        const int idxOut = yOut * nWidth + xOut;
        pOutput[idxOut] = sum;
    }
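
For anyone who wants to check the kernel's output on the host, the following is a minimal CPU sketch of the same indexing scheme. It assumes the input is padded so that its width equals the output width plus the filter width minus one, which matches the separate padded and unpadded buffer sizes above. The function name `convolve_ref` and the parameters `out_width` and `out_height` are hypothetical and stand in for the global work size.

```c
/* Host-side reference for the convolution computed by Convolve_Inv.
 * input:  (out_height + filter_width - 1) rows of input_width floats (padded)
 * filter: filter_width * filter_width floats, row-major
 * output: out_height rows of out_width floats */
static void convolve_ref(const float *input, const float *filter, float *output,
                         int input_width, int filter_width,
                         int out_width, int out_height)
{
    for (int y = 0; y < out_height; y++) {
        for (int x = 0; x < out_width; x++) {
            float sum = 0.0f;
            for (int r = 0; r < filter_width; r++) {
                for (int c = 0; c < filter_width; c++) {
                    /* Same top-left anchored indexing as the kernel. */
                    sum += filter[r * filter_width + c]
                         * input[(y + r) * input_width + (x + c)];
                }
            }
            output[y * out_width + x] = sum;
        }
    }
}
```

Each loop iteration over `(x, y)` corresponds to one work-item of the kernel, with `out_width` in the role of `get_global_size(0)`.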
