

Skysnake
Adept II

Re: AMD vs NVIDIA performance

I can't really believe this. You calculate more or less nothing, or at least not very much, as far as I can see.

I think you spend most of the time on the atomic accesses. You should be able to do most of the work inside the shared memory, together with the ifs.

Just try it. It is not that much work to do.

kcin
Adept I

Re: AMD vs NVIDIA performance

I assure you that the atomic access is not a big deal. 99.99% of all threads exit the kernel without ever using the atomic access. (I'm not talking here about the slow memory access path MicahVillmow mentioned. In any case, removing the atomics completely does not improve the time significantly.)

I can't say whether or not shared memory would help, as I have not tried it yet. Possibly I will. If so, I will certainly post the results of the experiment here.

notzed
Challenger

Re: AMD vs NVIDIA performance

I'm not up on the performance characteristics of those devices, but to me those numbers look a bit short of their potential, possibly by an order of magnitude, although it's been a while since I wrote similar code, so I could be off.

Apart from some basics:

a) Don't use RGB format. At least use RGBx so you can use uchar4 (or even int), and if you need more alignment for better performance (so e.g. you can use char16), then just require that of the API. If you must, at least use a uchar3 pointer, and pretty much never use uchar unless you're really dealing with chars (like text). Sometimes it's faster to copy a whole image and re-arrange it into a format that matches the algorithm; you can probably write more efficient re-arrangement code if that's all it does, and it removes all the special and edge cases from the algorithm.

b) Or use bit planes: this would, for example, let you triple your parallelism, since you can process R, G and B separately and combine the results afterwards.

c) On AMD, use an atomic counter. I know it's probably not an issue here, but from memory the function call is the same, so the code changes are minimal and clean.

d) Don't use vload either. Just use a vector type and access it as an array (when I first dabbled, vload was slow as a wet week, but that was a long time ago).

e) Don't use short-circuit logic in arithmetic expressions (i.e. use | rather than ||). Presumably the compiler won't add branches if the expressions have no side effects, but it's better to be safe than sorry.

f) For problems like this, specify values that don't change with the data but are critical to performance, such as the template size, as constants (i.e. defines). Then compile a new instance of the kernel for each size you need by passing -Dblah=blah and so on via the compiler command line (a short host-side sketch follows this list).
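
For example, something like this on the host side (just an illustrative sketch; TEMPLATE_W, TEMPLATE_H and the variable names are made up, not taken from the code in this thread):

/* Bake the template size into the kernel at build time instead of passing
   it as a kernel argument; rebuild with different -D values if other
   sizes are needed. */
const char *opts = "-DTEMPLATE_W=16 -DTEMPLATE_H=16";
cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);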

In my experience, as soon as you use a piece of data more than once, the main thing is to avoid the global memory accesses; and template matching needs to read the same locations many times. The reads you have are (mostly) coalesced, but you're still reading memory for every pixel test. For example, it should be possible to perform 32 tests of a row with only 1 (or 2) reads of the source image rather than 32.

Assuming you had 64x1 threads working together on one 32-pixel-wide section of source image and template (off the top of my head; I think the addressing below should work):

lx <- local.x

source.x <- global.x / 64 * 32 (64 threads work on 32 pixels of source at once)

local array 'source' is uchar4[64]

local array 'template' is uchar4[64]

1) Read 64 pixels from location source.x + lx into shared memory source[lx] (1 coalesced read; each thread loads 1 pixel). Use clamping or an if to make sure it doesn't go beyond the bounds of the image.

2) loop over 3 sets of two templates:

3)   Read a 64x1 row of template data (i.e. two templates, stored tiled together) into shared memory template[lx] (each thread loads 1 pixel).

barrier(CLK_LOCAL_MEM_FENCE);

4)   Loop x over (0:32) and perform a test between template(x) and source((lx & 31) + x) from shared memory (note that this loop is independent of external memory address calculations, so it might optimise better).

  (this means the first 32 threads test the source against template 0, and the second 32 test against template 1)

(if your data is RGB, just use pixel.xyz anywhere you access it)

barrier(CLK_LOCAL_MEM_FENCE);

5) end loop 2)

If you specify the reqd_work_group_size(64, 1, 1) attribute on the kernel (IIRC; manuals not handy), and when it is possible, the compiler will remove the barriers on AMD hardware. The inner loop can be simpler too, since you're only doing 1 test instead of 6, and it can just work on a register. Obviously, repeat the above to test a whole template.

So with only 1 + (ntemplates/2) = 4 global memory reads you've performed 32 location tests against 1 row of all 6 templates.

Compare this to the (6+1) * 32 = 224 global memory reads required for the non-LDS version.

The only caveat is that you cannot short-circuit execution: all work-items have to perform all operations, since they work together and must all execute the same barriers. But this is usually faster anyway, since the hardware is probably already doing it, and you remove the unnecessary exit tests.
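
Off the top of my head, a sketch of what such a kernel might look like (untested; the 64x1 work-group, the 32-pixel template row, the 6-template count and all the parameter names are assumptions made for illustration, not a drop-in replacement for the kernel later in this thread):

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void matchRowLDS( __global const uchar4* image,
                           int imgWidth,
                           int imgPitch,
                           __global const uchar4* templates, // 6 templates, rows tiled in pairs of 32 pixels
                           int tPitch,
                           uchar diff,
                           __global int* noise) {            // noise count per (row, position, template)
    __local uchar4 src[64];
    __local uchar4 tmpl[64];

    const int lx = (int)get_local_id(0);
    const int groupX = (int)get_group_id(0) * 32;  // 64 threads cover 32 test positions
    const int y = (int)get_global_id(1);

    // 1) one coalesced read: 64 consecutive source pixels into LDS
    int sx = min(groupX + lx, imgWidth - 1);       // clamp at the image edge
    src[lx] = image[y * imgPitch + sx];

    // 2) three passes, one pair of templates per pass
    for( int pair = 0; pair < 3; pair++) {
        // 3) each thread fetches one pixel of the tiled template pair
        tmpl[lx] = templates[y * tPitch + pair * 64 + lx];
        barrier(CLK_LOCAL_MEM_FENCE);

        const int pos = lx & 31;                   // test position 0..31
        const int t = lx >> 5;                     // 0 = first template of the pair, 1 = second
        int bad = 0;

        // 4) 32 comparisons per thread, done entirely out of LDS
        for( int x = 0; x < 32; x++) {
            uchar4 a = src[pos + x];
            uchar4 b = tmpl[t * 32 + x];
            uchar4 d = max(a, b) - min(a, b);
            bad += (d.x > diff) | (d.y > diff) | (d.z > diff);
        }
        noise[(y * imgWidth + groupX + pos) * 6 + pair * 2 + t] = bad;
        barrier(CLK_LOCAL_MEM_FENCE);
    }                                              // 5) end loop 2)
}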

kcin
Adept I

Re: AMD vs NVIDIA performance

Thank you for the time you spent writing such a detailed answer. Some of your recommendations were really helpful.

Indeed, vload4() is extremely slow compared to uchar4 pointers. I did not use AMD atomic counters in order to keep the OpenCL code compatible with NVIDIA as well.
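
For illustration, here are the two access styles being compared (hypothetical copy kernels, just to show the difference, not taken from my code):

// reading through vload4 from a plain uchar pointer
__kernel void copy_vload( __global const uchar* src, __global uchar4* dst) {
    size_t i = get_global_id(0);
    dst[i] = vload4(i, src);
}

// reading through a uchar4 pointer directly
__kernel void copy_direct( __global const uchar4* src, __global uchar4* dst) {
    size_t i = get_global_id(0);
    dst[i] = src[i];
}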

Some more observations on suggestions from this thread:

- using the any() function significantly slows the program down,

- I observed no difference between | and ||,

- I wrote a kernel that used shared memory, but performance actually decreased. Possibly that is down to my code quality; however, I think the latency of global memory reads is already well hidden by the large number of threads, so we spend more time on synchronizing the local memory writes and on the extra reads.

Below is the final version of the kernel, for whoever may be interested. It now takes 17 ms on the HD 5850 and 44 ms on the GT 240. It should be noted that a very significant improvement came from the if( noiseQt[i] > maxNoiseQt) condition in the innermost loop.

#define maxNoiseQt 5
#define tOffset 8196
#define MAX_RESULT_BYTES 1000

__kernel void matchImage6( __global uchar4* image,
                           int imgWidth,
                           int imgHeight,
                           int imgPitch,
                           __constant uchar4* templates,
                           int tWidth,
                           int tHeight,
                           int tPitch,
                           uchar diff,
                           __global int* result) {
    int x0 = (int)get_global_id(0);
    int y0 = (int)get_global_id(1);
    if( x0 + tWidth > imgWidth | y0 + tHeight > imgHeight)
        return;
    int noiseQt[] = {0, 0, 0, 0, 0, 0};           // per-template noise counters
    for( int y = 0; y < tHeight; y++) {
        int ind = y * tPitch;
        int baseImgInd = (y0 + y) * imgPitch + x0;
        for( int x = 0; x < tWidth; x++) {
            uchar4 imgPix = image[baseImgInd];
            #pragma unroll 6
            for( int i = 0; i < 6; i++) {
                if( noiseQt[i] > maxNoiseQt)      // this template is already rejected here
                    continue;
                uchar4 templPix = templates[(i * tOffset >> 2) + ind];
                uchar4 res = max( imgPix, templPix) - min( imgPix, templPix);
                if( res.x > diff | res.y > diff | res.z > diff)
                    noiseQt[i]++;
            }
            ind++;
            baseImgInd++;
        }
        if( noiseQt[0] > maxNoiseQt && noiseQt[1] > maxNoiseQt && noiseQt[2] > maxNoiseQt &&
            noiseQt[3] > maxNoiseQt && noiseQt[4] > maxNoiseQt && noiseQt[5] > maxNoiseQt)
            return;                               // all six templates rejected for this position
    }
    for( int i = 0; i < 6; i++)
        if( noiseQt[i] < maxNoiseQt) {
            if( result[0] < MAX_RESULT_BYTES / 2) {
                int pos = atom_inc( &result[0]) * 3;
                result[pos + 1] = i;
                result[pos + 2] = x0;
                result[pos + 3] = y0;
            }
        }
}

Bdot
Adept III

Re: AMD vs NVIDIA performance

Not sure what the compiler can optimize about it, but both pos and *result are of type int. atom_inc, however, operates on (u)long values. The way you use it, there should be two unnecessary conversions from int to long and back. atomic_inc is probably what you want.
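
I.e. roughly this for the result-writing block at the end of the kernel (only the atomic call changes, everything else is your code):

if( result[0] < MAX_RESULT_BYTES / 2) {
    int pos = atomic_inc( &result[0]) * 3;   // core 32-bit int atomic (OpenCL 1.1+)
    result[pos + 1] = i;
    result[pos + 2] = x0;
    result[pos + 3] = y0;
}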

BTW, I think it's weird that most of the stuff that involves different data types is handled by overloading. But here you have to use different functions ...

LeeHowes
Staff

Re: AMD vs NVIDIA performance

The reason it's not overloaded is that atom_inc is the extension version and atomic_inc is the core one. Before atomics were pulled into core, the cl_khr_global_int32_base_atomics extension contained atom_inc for int.

http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_global_int32_base_atomics.html
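
Roughly, the two spellings look like this (illustrative snippet; counter is just a __global int pointer, not anything from the code above):

// OpenCL 1.0: 32-bit global atomics are an extension, so enable the pragma
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void bump( __global int* counter) {
    int old_ext  = atom_inc( counter);    // extension function
    int old_core = atomic_inc( counter);  // core function since OpenCL 1.1
}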
