cadorino
Journeyman III

Sharing pointers between kernels

Hi to everybody.

I'm developing a computation that may require precomputing some pointers into a matrix, to avoid recomputing them for each element of an input stream.

While porting this computation to OpenCL I'm stuck on the correctness of the following OpenCL program.

In particular, I wonder whether a kernel can store inner pointers to a buffer A inside a buffer B (possibly masking them as int), and whether another kernel can subsequently recover these pointers to write to particular locations of buffer A.

kernel void preComputeKernel(global uint* data, global int* pointers) {
     int offset = // do some offset calculation
     pointers[get_global_id(0)] = (int)(data + offset);
}

kernel void usePointersKernel(global uint* data, global int* pointers) {
     global uint* my_pointer = (global uint*)pointers[get_global_id(0)];
     my_pointer[0] = // do some calculation of the value
}

I know there are some constraints on using pointers inside kernels, and that memory allocation doesn't behave as it does on the CPU.
Any suggestion?

Thank you!

LeeHowes
Staff

Store the offsets into buffer A inside buffer B. Don't store the actual pointers, because you have no control over what the base address of buffer A will be on the next kernel launch, and hence no idea whether the pointers will make sense - it will depend entirely on the representation of the objects that the compiler uses and on where the memory is allocated.

It's not clear from your little example why you would want to do this at all; just move the offset addition to the use kernel, like this:

kernel void preComputeKernel(global uint* data, global int* pointers) {
     int offset = // do some offset calculation
     pointers[get_global_id(0)] = offset;
}

kernel void usePointersKernel(global uint* data, global int* pointers) {
     global uint* my_pointer = data + pointers[get_global_id(0)];
     my_pointer[0] = // do some calculation of the value
}

You should really try to avoid the use of pointers as much as possible; they just hurt compiler optimisation. If data is obviously relative, it's easier for the compiler to track what's going on.
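For what it's worth, the host side just passes the same buffer objects to both kernels, so only the stored offsets have to stay meaningful between the two launches. A minimal sketch (context/queue/kernel setup and error handling omitted; names and sizes here are made up):

#include <CL/cl.h>

size_t n = 1024;   /* one work-item per precomputed offset */
cl_int err;
cl_mem data     = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(cl_uint), NULL, &err);
cl_mem pointers = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(cl_int),  NULL, &err);

/* The same cl_mem objects go to both kernels; the runtime may place 'data' at a
   different base address for each launch, but the offsets stored in 'pointers'
   stay valid. */
clSetKernelArg(preComputeKernel, 0, sizeof(cl_mem), &data);
clSetKernelArg(preComputeKernel, 1, sizeof(cl_mem), &pointers);
clEnqueueNDRangeKernel(queue, preComputeKernel, 1, NULL, &n, NULL, 0, NULL, NULL);

clSetKernelArg(usePointersKernel, 0, sizeof(cl_mem), &data);
clSetKernelArg(usePointersKernel, 1, sizeof(cl_mem), &pointers);
clEnqueueNDRangeKernel(queue, usePointersKernel, 1, NULL, &n, NULL, 0, NULL, NULL);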


Thank you for your answer.

Oh, I suspected that would be a problem.
From the example it is not clear why I should compute pointers instead of storing offsets, but I can explain the reason in a bit more detail.
The algorithm is actually a boosted Haar cascade object detector. On the CPU, the main performance limitation is computing which elements of the (integral) image have to be accessed. This calculation is generally done like this:

foreach(subwindow)
     foreach(stage)
          foreach(classifier)
               foreach(feature)
                    // compute elements to be accessed

The computation of the elements to be accessed is something like the following:

#define matp(matrix,stride,x,y) (matrix + ((stride) * (y)) + (x))
#define mate(matrix,stride,x,y) (*(matp(matrix,stride,x,y)))
/* rectangle sum over the integral image: inclusion-exclusion of the four corners */
#define mats(matrix,stride,x,y,w,h) \
    (mate(matrix,stride,x,y) - mate(matrix,stride,x+w,y) - mate(matrix,stride,x,y+h) + mate(matrix,stride,x+w,y+h))

mats(integral_image,
     integral_image_stride,
     subwindow_x + feature.x,
     subwindow_y + feature.y,
     feature.width,
     feature.height)

Since there are tons of subwindows, stages and, in particular, features (mainly in the later stages), avoiding an addition in the inner loop can make a difference.

Fortunately, the computation of a matrix element is something like "subwindow.xy + f(feature)", so we can precompute f(feature) and store the resulting pointer to the matrix element.

foreach(stage)
     foreach(classifier)
          foreach(feature)
               // precompute

foreach(subwindow)
     foreach(stage)
          foreach(classifier)
               foreach(feature)
                    // compute elements to be accessed using precomputation

In this case the element computation is simply:

#define matsp(lefttop,righttop,leftbottom,rightbottom) \
    (*(lefttop) - *(righttop) - *(leftbottom) + *(rightbottom))

matsp(feature.pointer_to_left_top + offset,
      feature.pointer_to_right_top + offset,
      feature.pointer_to_left_bottom + offset,
      feature.pointer_to_right_bottom + offset)

This roughly avoids 4 * (1 multiplication + 2 additions (matp)) per feature, which can improve performance by about a 3x factor.
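Concretely, the precomputation looks roughly like this (pseudocode in the same style as above, field names simplified):

foreach(feature) {
     feature.pointer_to_left_top     = matp(integral_image, stride, feature.x,                 feature.y);
     feature.pointer_to_right_top    = matp(integral_image, stride, feature.x + feature.width, feature.y);
     feature.pointer_to_left_bottom  = matp(integral_image, stride, feature.x,                 feature.y + feature.height);
     feature.pointer_to_right_bottom = matp(integral_image, stride, feature.x + feature.width, feature.y + feature.height);
}

// computed once per subwindow: offset = subwindow_x + subwindow_y * stride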

On the GPU, I thought to put the precomputation in one kernel and the "real" computation in another. The problem is that I cannot pass pointers into the image (e.g. feature.pointer_to_left_top) from the first kernel to the second.


I think the best you can do is precompute at the start of the kernel rather than in an inner loop. It's not great, and I realise that means you may end up recomputing across the workgroup and even the same value across the entire dispatch. It's an unfortunate consequence of the way OpenCL is defined.

This is a problem I've spent considerable time on, although I was looking at a different part of it: how to parallelise it was the main issue I focused on.

I've subsequently done a few CPU versions which provide some other ideas (the offset table in particular), but I'm not sure they'll have much effect on a GPU, because the GPU has such abundant ALU resources that I think the limiting factor is memory (and the branchy nature of the algorithm).  And I'm pretty much sick of it too!  I even came up with an algorithm more suited to modern hardware based on LBP codes, but despite some very promising early results it still needs more development.

There are some big impediments to the performance of Viola-Jones cascades on GPU hardware, and I am pretty sure the address calculations aren't the important ones:

  • Integral images. They might look pretty good on paper if your CPU model is from around 1980, but they're just not efficient on modern hardware. Even the simplest feature test needs 6 sparse lookups, and you're talking thousands per cascade.
  • AMD queueing overhead. I went from an NVIDIA GTX 480 to an AMD 7970 and the same tuned code ran at half the speed. Most of that seemed to be the overhead of running 10 kernels; I really couldn't work out what was going on, and by process of elimination that's what I ended up with - it was a lot, upward of 1 ms per pass. By doing some acrobatics so that I could run all windows at all scales in one go, I doubled performance on a simple test case (and much more for finer searches).
  • The cascade algorithm itself gets its performance from relying on an early exit from a very large number of tests. It's designed not to run well on a SIMD architecture.

As to your original question: just store them as offsets; a handful of adds will be effectively free compared to the memory lookups - and there's no other way to implement it anyway. In fact, the only way to implement it as you suggest - storing pointers - would require you to store a full set of region tests for every window location, which would take gobs and gobs of memory and not be at all fast. All you need to do is store the offset relative to the stride of the source array, and then your kernel will just be what you list above anyway.

i.e.

offset = location of 0,0 in the window for this CPU thread (i.e. image.address + windowx + windowy * image.stride) - calculated once per window

The pre-calculation just involves changing rect{left,top,w,h} into rect{left+top*stride, left+w+top*stride, left+(top+h)*stride, left+w+(top+h)*stride}, i.e. the location offsets relative to the image stride, but not including any base address (this is precisely what OpenCV does).
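For example, a rough sketch of that precomputation as a kernel (struct layout and field names are just for illustration):

typedef struct { int left, top, width, height; } HaarRect;
typedef struct { int lt, rt, lb, rb; } HaarRectOffsets;

// One work-item per feature rectangle: turn {left,top,w,h} into four linear
// offsets relative to the integral image stride - no base address involved.
kernel void precompute_offsets(global const HaarRect* rects,
                               global HaarRectOffsets* offs,
                               int stride)
{
    int i = get_global_id(0);
    HaarRect r = rects[i];
    offs[i].lt =  r.left            +  r.top             * stride;
    offs[i].rt = (r.left + r.width) +  r.top             * stride;
    offs[i].lb =  r.left            + (r.top + r.height) * stride;
    offs[i].rb = (r.left + r.width) + (r.top + r.height) * stride;
}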

And then your kernel code is the same, it's just that 'offset' contains the address summand rather than feature:

matsp(offset + feature.offset_to_left_top, ... etc)

But there's no way you'll get a 3x speedup from that on a GPU vs just doing

matsp(windowx + feature.left + (windowy + feature.top) * image.stride, ... etc)

because the 4 memory lookups will be much slower than 4 multiplies.

The 'test all scales at once' integral-image version was the fastest I ended up with on a GPU; because it had to run all scales at once it also needed to scale each region on the fly, and it was still faster (I implemented it using OpenCL images, so the stride multiply isn't required).  I suspect that scaling the image rather than the features might be faster in kernel time because of the better memory locality; but then you have to scale the image and generate the SAT tables at each scale too, and once you have to run even one kernel per scale I think the queue overheads will hurt performance too much.

I've lost interest in it somewhat, because the cascades seem to require far too much 'magic pixie dust' to train and I don't have the patience nor the data to create my own that are any good.

All my code and experiments from my last stint at this last summer are in socles on Google Code.

Wow, thank you for your suggestions!
It seems like you spent a lot of time porting the Viola-Jones algorithm to the GPU.
I actually know there are some limitations in running the entire algorithm on the GPU. That's why I chose it. In fact, I was looking for a non-trivial algorithm as a "killer app" for my Ph.D. research, which consists in demonstrating that in a heterogeneous CPU+GPU system (or, better, APU + discrete GPU), the devices can be specialized to run different sets of computations (and, hopefully, in implementing a scheduling strategy for them). For example, the integral image is nothing more than a separable convolution: few ops and many memory accesses. Is it more suitable to run on the CPU (also saving the CPU-GPU data transfer time)?
In the later stages of classification, we have few subwindows to analyze (few work-items). With 64 * 5 subwindows and work-items, 15 of the 7970's CUs are unused, while the entire hardware resources of a Llano integrated GPU are exploited. Is this situation perfect for running on the integrated GPU?
I think there are many more points we can think about.

In addition, I think that classification is an example of a fairly wide set of iterative spatial algorithms where space is analyzed using blocks that evolve over time: at the beginning I have many small blocks, at the end few huge blocks. This is a situation pretty similar, for example, to the KD-tree algorithm for ray tracing. I'm quite interested in analyzing the efficiency of the CPU, discrete and integrated GPU in running a particular iteration, depending on the number of blocks and their relative size.

I'll definitely try to apply your suggestions and look at your implementation. Thank you again!


I ended up with a pretty decent integral image implementation on a GPU - it benefits from the parallelism (i.e. do the rows independently, then the columns independently) and the high memory bandwidth, and if the data has to end up there anyway it's a smaller device copy.  It wouldn't be very efficient if it weren't for local memory though.  Having said that, a modern CPU should be able to create a SAT in a single pass at about the speed it takes to run memcpy().
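The basic two-pass shape (minus the local-memory work that actually makes it worth doing) is just something like:

// Pass 1: one work-item per row keeps a running sum along the row.
kernel void scan_rows(global const uint* src, global uint* dst,
                      int width, int height, int stride)
{
    int y = get_global_id(0);
    if (y >= height) return;
    uint sum = 0;
    for (int x = 0; x < width; x++) {
        sum += src[y * stride + x];
        dst[y * stride + x] = sum;
    }
}

// Pass 2: one work-item per column accumulates down the column, in place.
kernel void scan_cols(global uint* img, int width, int height, int stride)
{
    int x = get_global_id(0);
    if (x >= width) return;
    uint sum = 0;
    for (int y = 0; y < height; y++) {
        sum += img[y * stride + x];
        img[y * stride + x] = sum;
    }
}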

AMD have reported success moving a multi-stage cascade detector to a heterogeneous APU architecture, but as yet I haven't played with one myself.  I've been following the HSA developments with interest, but as usual all the interesting stuff always seems a year away ...

I utilised more than one work-item per window, so the wider later stages are executed in fewer iterations - it's a hard one to judge since not many work-groups get that far.  I found using 16 threads per window worked about the best.  The benefit of the HSA stuff is that you can shift the work to the processor most suited to it - very parallel, use a parallel one; very serial, use a serial one; etc.

VJ is a bit of a pain because the first stage is usually very small (like 3-4 features) with the last being hundreds - there's a ton of parallelism possible there, but fitting it to a simple topology is a trade-off.  And doing more complex stuff like dynamically rearranging the work to suit quickly gets expensive.


Dear notzed, did you use a per-stage implementation? I thought of using a per-stage implementation, which means I launch a kernel for each stage. The kernel executes the corresponding stage of the classification cascade, with each work-item running on a different subwindow. In this way I avoid under-utilized wavefronts caused by some threads exiting at early stages while others run through the whole classification cascade.

Moreover, why do you employ more than one work-item per window?


I can't remember everything I tried, but I might've tried per-stage.

It's all in the source; I actually checked in all my aborted attempts as well in the current version - I tried many different options.
