kcin
Adept I

AMD vs NVIDIA performance

First, I must say this isn't meant as a holy-war post.

I am a newbie with AMD GPUs (I previously worked with NVIDIA), so I need help understanding an unexpected lack of performance.

I need to implement on AMD GPUs a simple kernel which searches for template images in a base image. I was able to test the kernel on an NVIDIA GT 240, an AMD HD 6670 and an AMD HD 5850. The results were discouraging:

GT 240 - 100 ms,

HD 6670 - 200 ms,

HD 5850 - 69 ms.

It seems strange to me, as both AMD GPUs are more powerful than the GT 240. It looks like I am not getting the best out of the AMD cards.

Below is the kernel I'm talking about. It looks for 6 equally sized template images in the base one. Every pixel of the base image is considered a possible origin of one of the templates and is processed by a separate thread. The kernel compares the R, G, B values of each pixel of the base image and of the template image, and if the difference for at least one color exceeds the diff parameter, the corresponding pixel is counted as nonmatched. If the number of nonmatched pixels is less than maxNonmatchQt, the corresponding template is considered a hit.

__constant int tOffset = 8196; // one template size in memory (in bytes)
__kernel void matchImage6( __global unsigned char* image, // pointer to the base image
                           int imgWidth, // base image width
                           int imgHeight, // base image height
                           int imgPitch, // base image pitch (in bytes)
                           int imgBpp, // base image bytes per pixel
                           __constant unsigned char* templates, // pointer to the array of templates
                           int tWidth, // templates width (the same for all)
                           int tHeight, // templates height (the same for all)
                           int tPitch, // templates pitch (in bytes, the same for all)
                           int tBpp, // templates bytes per pixel (the same for all)
                           int diff, // max allowed difference of intensity
                           int maxNonmatchQt, // max number of nonmatched pixels
                           __global int* result // results
                           ) {
     int x0 = (int)get_global_id(0);
     int y0 = (int)get_global_id(1);
     if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
          return;
     int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
     for( int y = 0; y < tHeight; y++) {
          int ind = y * tPitch;
          int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
          for( int x = 0; x < tWidth; x++) {
               unsigned char c0 = image[baseImgInd];
               unsigned char c1 = image[baseImgInd + 1];
               unsigned char c2 = image[baseImgInd + 2];
               for( int i = 0; i < 6; i++)
                    if( abs( c0 - templates[i * tOffset + ind]) > diff ||
                        abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
                        abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                         nonmatchQt[i]++;
               ind += tBpp;
               baseImgInd += imgBpp;
          }
          if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt &&
              nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt &&
              nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
               return;
     }
     for( int i = 0; i < 6; i++)
          if( nonmatchQt[i] < maxNonmatchQt) {
               unsigned int pos = atom_inc( &result[0]) * 3;
               result[pos + 1] = i;
               result[pos + 2] = x0;
               result[pos + 3] = y0;
          }
}

Kernel run configuration: Global work size = (1900, 1200), Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.

Any remarks about my code are also highly appreciated.

15 Replies
MicahVillmow
Staff

Re: AMD vs NVIDIA performance

A couple of quick things:

1) Don't use 8-bit types; you are wasting memory bandwidth. Load at least an integer at a time. This is especially true with constant memory, which loads in 16-byte chunks on dynamic indexing.

2) Merge your sequential loads into a single load of a larger type.

3) Don't use global atomics on the same pointer that takes most of your memory writes; this forces all writes to that pointer down the slow path.

4) If you only increment atomics, use the atomic counter extension on AMD platforms; it is faster than global atomics.

5) Use vectors over scalars in calculations where possible.

6) Check your ISA to see if scratch memory is used because of nonMatchQt; if so, don't use private arrays (nonMatchQt), use a local array instead.

7) Check the ISA to see if loops are unrolled; if not, use #pragma unroll 6 to force unrolling of the inner loops.

kcin
Adept I

Re: AMD vs NVIDIA performance

Hello, I changed my kernel to:

__kernel void matchImage6( __global unsigned char* image,
                           int imgWidth,
                           int imgHeight,
                           int imgPitch,
                           int imgBpp,
                           __constant unsigned char* templates,
                           int tWidth,
                           int tHeight,
                           int tPitch,
                           int tBpp,
                           int diff,
                           int maxNonmatchQt,
                           __global int* result
                           ) {
     int x0 = (int)get_global_id(0);
     int y0 = (int)get_global_id(1);
     if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
          return;
     int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
     for( int y = 0; y < tHeight; y++) {
          int ind = y * tPitch;
          int baseImgInd = ((y0 + y) * imgPitch + x0 * imgBpp) >> 2;
          for( int x = 0; x < tWidth; x++) {
               uchar4 imgPix = vload4( baseImgInd, image);
               #pragma unroll 6
               for( int i = 0; i < 6; i++) {
                    uchar4 templPix = vload4( (i * tOffset + ind) >> 2, templates);
                    if( abs( imgPix.x - templPix.x) > diff ||
                        abs( imgPix.y - templPix.y) > diff ||
                        abs( imgPix.z - templPix.z) > diff)
                         nonmatchQt[i]++;
               }
               ind += tBpp;
               baseImgInd++;
          }
          if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt &&
              nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt &&
              nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
               return;
     }
     for( int i = 0; i < 6; i++)
          if( nonmatchQt[i] < maxNonmatchQt)
               if( result[0] < MAX_RESULT_BYTES - 2) {
                    int pos = atom_inc( &result[0]) * 3;
                    result[pos + 1] = i;
                    result[pos + 2] = x0;
                    result[pos + 3] = y0;
               }
}

However, it only became slower (73 ms instead of 69), as it required one more VGPR to store the vectors. Scratch memory is not used by the kernel, and atomic operations are not the issue since only a few threads reach them. The code change that gives a really big improvement is using vector operations in the innermost loop:

uchar4 res = imgPix - templPix;
if( res.x > diff || res.y > diff || res.z > diff)
     nonmatchQt[i]++;

The execution time was almost halved (to 45 ms). But I don't know how to efficiently implement the abs(uchar - uchar) operation for vectors, which is necessary here, so this change seems impossible. It looks like my problem can't be vectorized effectively.

tzachi_cohen
Staff

Re: AMD vs NVIDIA performance

If you'd like to vectorize abs(uchar - uchar), you can write:

uchar4 c = max(a, b);
uchar4 d = min(a, b);
uchar4 e = c - d;

Tzachi Cohen, AMD

--------------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

kcin
Adept I

Re: AMD vs NVIDIA performance

It works! Now 50 ms instead of 73.

MicahVillmow
Staff

Re: AMD vs NVIDIA performance

kcin,

Try abs_diff(a, b), which reproduces abs(a - b); see section 6.12.3 of the OpenCL spec.

For atomics, it doesn't matter how many threads reach them: atomic operations on a pointer cause all memory accesses through that pointer to go down the slow path. It is a compile-time decision, not a runtime decision. If you use atomic counters, or perform the atomic operation on a different pointer, your memory writes should be faster. However, if your program is not memory bound this might not make any difference in performance, but it is something to keep in mind.

This code:

if( abs( imgPix.x - templPix.x) > diff ||
    abs( imgPix.y - templPix.y) > diff ||
    abs( imgPix.z - templPix.z) > diff)
     nonmatchQt[i]++;

Might be written faster this way:

nonmatchQt[i] += (uint)(any(abs_diff(imgPix, templPix).xyz > (uchar3)(diff)));

kcin
Adept I

Re: AMD vs NVIDIA performance

Thank you, Micah, for the useful information. I tried abs_diff(a, b). Surprisingly, it was slower than max(a, b) - min(a, b), at least in my case. I will try your other suggestion later.

Another thing I can't understand is why the offset passed to the vloadn() function is multiplied by n inside the function. That seems strange for the char datatype, which should not need to be aligned.

MicahVillmow
Staff

Re: AMD vs NVIDIA performance

kcin,

It is weird that it is slower... abs_diff(a, b) is implemented as max(a, b) - min(a, b), and in a small test comparing the generated code for the two, abs_diff(a, b) on a uchar4 data type used 4 fewer instructions. What this might be doing is causing other parts of your code to become a bottleneck (i.e. your memory writes using the slow path), slowing down the entire program.

Skysnake
Adept II

Re: AMD vs NVIDIA performance

Ok,

I hope I can help you optimize your code.

As far as I can see, you do not use shared (local) memory. This very often has a strong performance impact. You also do a lot of atomic accesses, which is another good reason to use shared memory: each workgroup can then do more work without an atomic access to global memory.

To see whether you are memory bound, you should sum up the loads/writes to global memory and compare them with the runtime. I think you should be memory-bandwidth limited, because as far as I can see you more or less do not reuse the data you load from global memory.

So you can do the following.

Load a stripe of data into shared memory and then access it from there, saving a lot of global memory bandwidth.

Next, please do the atomic accesses in shared memory first as well, and then write only one value back to global memory. This should also give you a big speedup.

I think there are a lot of possibilities to speed up your program by a factor of 10 or something like that.

kcin
Adept I

Re: AMD vs NVIDIA performance

Thank you for your answer.

The APP SDK Profiler shows that I'm strongly limited by ALU operations. That seems right, because removing the innermost if statement drastically improves performance. Also, according to the Profiler, memory bandwidth utilization is high enough, I believe because the many workgroups hide latency well. So I'm not sure shared memory would help.

I tried deleting the last for loop. The time change was really small. Additionally, only a few threads reach that point, so it isn't a bottleneck, as far as I can see.
