Archives Discussions

mux85 · ‎11-02-2010

hi, i'm working on a simple background extraction algorithm. the input is formed by a series of frames (images composed by pixels, each pixel is in rgb format) and is passed to the kernel as a simple array. the bg extraction is done by computing the median of the pixels in corrisponding position in all the frames (eg. the pixel in position (0,0) of the bg is computed as median of all the pixels at position (0,0) of all the input frames). the result is also downscaled in a very simple way (dimension halved in both directions), every pixel of the result is computed as the average of 4 pixels (2x2). everything works fine and now i want to do some optimization to get better perfomance and maybe to eliminate the for loop which make the kernel only work on CPU (as discussed in this thread http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=133698&enterthread=y).

thanks in advance for all the help you can give me.

#define MAX_FRAMES 32 void swap(uchar3 * a, uchar3 * b) { uchar3 t=*a; *a=*b; *b=t; } float rgbToLum(uchar3 pix) { return 0.3f*pix.x+0.59f*pix.y+0.11f*pix.z; } void sort(uchar3 * v, uint n) { bool swapped=true; while(swapped) { swapped=false; for(uint i=0; i<n-1; i++) { if(rgbToLum(v)>rgbToLum(v[i+1])) { swap(&v,&v[i+1]); swapped=true; } } n=n-1; } } uchar3 median(uchar3 * v, uint n) { sort(v, n); return v[n/2]; } uchar3 average(uchar3 * m) { return m[0]/4+m[1]/4+m[2]/4+m[3]/4; } kernel void BackgroundKernel( global read_only uchar * buf_in, global write_only uchar * buf_bg, read_only uint n) { ushort i = get_global_id(0); ushort j = get_global_id(1); ushort h_bg = get_global_size(0); ushort w_bg = get_global_size(1); ushort h_in = h_bg*2; ushort w_in = w_bg*2; uint frame_size = w_in*h_in; uint pos_in = w_in*i*2+j*2; uint pos_bg = w_bg*i+j; //down-scaling of the frames uchar3 matr[4]; uchar3 temp[MAX_FRAMES]; for(uint k=0; k<n; k++) { matr[0] = vload3(pos_in, buf_in); matr[1] = vload3(pos_in+1, buf_in); matr[2] = vload3(pos_in+w_in, buf_in); matr[3] = vload3(pos_in+w_in+1, buf_in); temp = average(matr); pos_in += frame_size; } //median of the frames used as bg vstore3(median(temp, n), pos_bg, buf_bg); }

mux85 · ‎11-02-2010

i forgot to say that the parameter n is the number of frames and that the global size is the size of the image already halved so in each work item the kernel access the 2x2 pixels used for downscaling

cjang · ‎11-02-2010

One comment is that the arithmetic intensity of the kernel is low. Memory access and operations are both O(N). In my experience, optimization for such kernels has a limited effect as kernel overhead dominates. Optimization does have an effect but is more about avoiding the highly unoptimal that runs very slow rather than trying to make a kernel run fast (because that is not possible with such low arithmetic intensity).

IMO, the main reason for low arithmetic intensity kernels is to avoid bus data transfers around high arithmetic intensity kernels in a computational pipeline.

DTop · ‎11-02-2010

CJang,

Could you please elaborate a bit: I thought it is a right thing to place transfers clause around high computational parts of kernel, trying to hide latency.

cjang · ‎11-02-2010

> transfers clause around high computational parts of kernel

Yes, there is a natural pattern or gather/compute/scatter. My experience is that performance is relatively low unless the compute part is doing a very large number of operations relative to the amount of data. This is hiding memory latency but is also amortizing out any inherent kernel overhead costs. For example, a kernel that does no work is not infinitely fast because there is some amount of cost in running it. That overhead cost is not insignificant. So if the kernel is not really doing that much work in absolute terms, it's measured performance will be low. It's Amdahl's Law.

I think there are different points of view when it comes to performance. I tend to think in terms of FLOPS. So performance means high throughput in the sense of very many operations per unit time. For other algorithms, other measures may be more natural (i.e. frame rate).

There's actually more to this. I didn't look at the algorithm close enough. It's actually doing a bubble sort in the inner loop. It's nothing but data movement. Everything I know says this will be terrible (not only a GPU but also on a CPU). That's how it calculates the l1 average (median). I'm not familiar with fast calculation in l1. But one idea might be to try a trimmed mean. That means a l2 average (Euclidean, what we usually think of as the average) after throwing out the high and low values (or some other outlier trimming). Statistically, I believe this is a hack. But it often works well enough for it to be commonly used in production applications.

mux85 · ‎11-03-2010

hi, thanks for the responses. I'm measuring performance in term of time (needed to elaborate one frame, and then frame rate) so one possible way to get performance is having higher FLOPS but another way is also reducing the number of operations needed. i've no preferences about this and my objective is only to get something like a 20-40% improvement (performance are already good enough with resolution up to 640x480).

about the trimmed mean i'm not sure how can it help me. to eliminate higher and lower values i still need to sort the array. also the result wouldn't be the median. i've already tried to use average instead of median for that purpose but results are not very good.

i was mainly wondering if there was a way to use local memory and/or the third work dimension to get some performance improvement

cjang · ‎11-03-2010

You can calculate a trimmed mean with O(N) memory reads and no writes. Sum all values and maintain the highest and lowest values during the iteration. Then subtract these extreme values at the end and divide. There is no need to fully sort the values. That is wasted computation.

For a bubble sort, in the best case (already sorted) it is O(N) memory reads and no writes. In the worst case, it will be quadratic with reads and writes. Sometimes, for a small collection, simple sorting algorithms can win even though the big-O complexity is of a higher order as the constant factors are small. Bubble sort is simple so for a very small collection, it might be good.

However. calculating the trimmed mean does not require fully sorting a collection. It will be cheaper to calculate.

Another issue that I would worry about, even more than memory bandwidth, is control flow divergence. My understanding is that GPUs are SIMT devices which means conditional branching in kernels is best avoided. A sorting algorithm like this potentially means that every pixel might have a different flow of control depending on the data. That would serialize execution as IDs in the work group must execute different instructions.

One more observation. I believe that the implicitly private arrays

> uchar3 matr[4];
> uchar3 temp[MAX_FRAMES];

may actually be in global memory. They are not in registers as subscripted access is not supported. The private memory (correct me if I am wrong) looks like a set of registers rather than an addressable memory space. I seem to remember Micah Villmow saying that the compiler may change in the future to support arrays.

mux85 · ‎11-04-2010

the array that i'm sorting will always have less than 32 elements so it is very small.

i'm aware of the problem with control flow divergence but i don't know how to avoid it in this case.

i'm not sure i understand your point about private arrays. how do i know if they are in global memory? how can i avoid it?

mux85 · ‎11-06-2010

any suggestion?

thanks

tanq · ‎11-08-2010

Try merge sort. I will have about 2x overhead, it's not too much. And it is possible to vectorize it - sort 4 or 5 arrays in one vliw unit.

cjang · ‎11-08-2010

Here are some suggestions that do not change the algorithm.

Make sure registers are used for matr by declaring it as four variables instead of as an array. This does not change any logic as the average() function accesses the array using literal constants as subscripts (i.e. matr[0], matr[1], matr[2], matr[3]). The compiler may be smart enough to do this for you. But rewriting your code to use four variables is safer.

The temp array could be in local memory. This will almost certainly be faster than having it in global memory. As mentioned earlier, I believe that the compiler will put the temp[MAX_FRAMES] array in global memory as registers are not an addressable memory space. It does not support arrays.

Here are some ideas that involve changing the algorithm.

You can build a histogram to calculate the median. This avoids any sorting. The image has an 8 bit luminosity channel so an array of 256 buckets can accumulate a histogram. The median is the bucket that best divides the histogram into equal halves (i.e. .5 of the distribution).

The luminosity channel could also be quantized down to 128 or 64 values (perhaps less).

Your filter is a median in the time domain (up to MAX_FRAMES at each pixel taken in isolation). What about a spatial domain filter? A spatial domain filter facilitates re-use of data between neighboring pixels so will make more efficient use of any cache.

I will also mention the trimmed mean again. This is a reduction that avoids sorting and using lots of memory (as the histogram approach does).

Really, this all depends on the statistical qualities of your data. Perhaps the filter you have now, the median of MAX_FRAMES in the time domain, is only one of several approaches that can work. The only way to know is to experiment on the data.

tanq · ‎11-08-2010

Generally, that background detection algo can be better implemented using SSE vectored instructions, because SSE can directly operate with 8bit values without intermediate conversions. Algorithm is very simple:

1. Substract 2 consecutive frames.

2. Find absolute values of difference

3. Compare it with constant threshold

4. Set difference bytes to 1 or 0 depending on compare result

5. Make logical OR with prevous bitmask

Finally you'll get zeros at background pixels and 1-s at foreground.

The final result can be refined by despecle run on foreground pixels

As you can see algorithm is actually simple, so GPU can't accelerate much

mux85 · ‎11-08-2010

thanks, i'm trying some of the things you suggested

Archives Discussions

optimization of a background extraction algorithm