
guenthernoack
Journeyman III

Performance measuring issues

Depending on the order of my performance measurements, the results differ

Hi!

I wrote a couple of different versions of the one-dimensional "minimal index" reduction kernel that was discussed earlier in this forum.

To measure the performance of my three versions, I wrote a C macro MEASURETIME2 (using the Linux clock_gettime() function) and the functions measure_minindex2(), measure_minindex3() and measure_minindex4() (see below).
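The macro itself is not shown in this thread, so the following is only a minimal sketch of what a clock_gettime()-based timing macro could look like. The names MEASURETIME_SKETCH and elapsed_ms are hypothetical, and the use of CLOCK_MONOTONIC is an assumption; the original MEASURETIME2 may differ in all of these.

```cpp
#include <cstdio>
#include <time.h>   // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Hypothetical helper: elapsed milliseconds between two timespec values.
static double elapsed_ms(const timespec &start, const timespec &end)
{
  return (end.tv_sec - start.tv_sec) * 1e3 +
         (end.tv_nsec - start.tv_nsec) / 1e6;
}

// Hypothetical stand-in for MEASURETIME2 (the original is not shown here).
// Runs the given code once and prints the wall-clock time it took.
// The code is taken as __VA_ARGS__ so that commas inside it do not
// split it into several macro arguments.
#define MEASURETIME_SKETCH(name, size, ...)                        \
  do {                                                             \
    timespec t0_, t1_;                                             \
    clock_gettime(CLOCK_MONOTONIC, &t0_);                          \
    __VA_ARGS__;                                                   \
    clock_gettime(CLOCK_MONOTONIC, &t1_);                          \
    printf("%s (size %u): %.3f ms\n", (name), (unsigned)(size),    \
           elapsed_ms(t0_, t1_));                                  \
  } while (0)
```

A macro like this only measures host-side wall-clock time, so anything the runtime does inside the timed block (including buffer allocation) ends up in the reported number.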

The problem: the results differ depending on the order in which the measure_minindex*() functions are run. Is that because of lazy buffer deallocation on the GPU when Stream objects are destroyed?

When executing the measure functions in order 2,3,4, minindex2 takes 4.7 msec for arrays of size 1024*64. When executing in the order 4,3,2, it takes 10.62 msec!

Best regards,

Günther

 


void measure_minindex2()
{
  puts(" - measuring minindex2");
 
  const unsigned maxsize = 1024*64;
  const unsigned minsize = 1024*2;
  const unsigned times = 100;
 
  float *arr = (float*) malloc(sizeof(float) * maxsize*2);
  srandom(99);
  for (unsigned i=0; i<maxsize*2; i++) arr[i] = random() % 1000;
  for (unsigned size=minsize; size<=maxsize; size*=2) {
    Stream<float> numbers(1, &size);
    Stream<float2> numbersWithIndices(1, &size);
    numbers.read(arr);
    MEASURETIME("minindex2", size, {
      for (unsigned i=0; i<times; i++) {
        float2 result(INFINITY, -1337.0f);
        create_indices(numbers, numbersWithIndices);
        minimal_index2(numbersWithIndices, result);
        /*numbersWithIndices.write(arr);*/
      }
    });
    if (numbersWithIndices.error()) {
      printf("ERROR: %s\n", numbersWithIndices.errorLog());
    }
  }
  free(arr);
}

gaurav_garg
Adept I

That is a huge performance difference. The Brook+ runtime allocates buffers lazily and also caches them to avoid repeated allocation and deallocation. It looks like these optimizations are going wrong in this particular case. Could you send your complete test case to streamdeveloper@amd.com for further investigation?
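If buffers are allocated lazily, the first kernel call on a new stream may include allocation cost that later calls do not. One way to check how much of a measurement is such one-time cost is to run a few untimed warm-up calls before the timed loop. The sketch below assumes this approach; time_steady_state is a hypothetical helper, not part of Brook+, and it would not work around a slowdown that affects every call.

```cpp
#include <cstdio>
#include <time.h>   // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Hypothetical helper: calls `work` `warmup` times without timing it,
// then times `times` further calls and returns the average
// milliseconds per timed call.
template <typename Work>
double time_steady_state(Work work, unsigned warmup, unsigned times)
{
  // One-time costs (e.g. lazy buffer allocation) are paid here.
  for (unsigned i = 0; i < warmup; i++) work();

  timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (unsigned i = 0; i < times; i++) work();
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double total_ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
  return total_ms / times;
}
```

In measure_minindex2(), the two kernel calls from the timed loop could be wrapped in a lambda and passed as `work`, with `warmup` set to 1 and `times` to 100. If the order dependence remains with warm-up in place, first-use allocation is unlikely to be the whole explanation.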


Could you reproduce the problem with the code I sent you?

Am I making a mistake when filling the stream just once and then telling the kernel 100 times to use it as input?


I also ran into the slowdown problem that you already discussed in this other thread, which, from my naive outsider's point of view, looks like the same issue:

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=105386

For normal kernels, I could fix the slowdown by calling error() on the output streams, as discussed in the other thread. (This behaviour is totally weird.)

However, the minindex reduction kernel outputs to a value, not a stream. Is there a way to "call error()" (or something similar) for the reduction kernel?

 


Yes, you are right, it is the same issue. Reduction is implemented in multiple passes, and because every pass causes the slowdown, it is not really possible for you to control it from the application. This will be fixed in the next release.
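To illustrate why a per-pass slowdown adds up, here is a CPU-only sketch of a multi-pass minimal-index reduction. It assumes each pass combines neighbouring pairs, which halves the element count; the actual reduction factor and pass structure used by the Brook+ runtime are not stated in this thread. All names (ValueIndex, reduce_pass, min_index_multipass) are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// (value, index) pair, playing the role of the float2 elements above.
typedef std::pair<float, float> ValueIndex;

// One pass: combine neighbouring pairs, keeping the smaller value
// (and the lower index on ties). Halves the element count, rounding up.
static std::vector<ValueIndex> reduce_pass(const std::vector<ValueIndex> &in)
{
  std::vector<ValueIndex> out;
  for (std::size_t i = 0; i < in.size(); i += 2) {
    ValueIndex best = in[i];
    if (i + 1 < in.size() && in[i + 1].first < best.first) best = in[i + 1];
    out.push_back(best);
  }
  return out;
}

// Repeats reduce_pass until one element is left and reports the number
// of passes in `passes`. Returns (0, -1) for empty input.
static ValueIndex min_index_multipass(const std::vector<float> &values,
                                      unsigned &passes)
{
  std::vector<ValueIndex> cur;
  for (std::size_t i = 0; i < values.size(); i++)
    cur.push_back(ValueIndex(values[i], (float)i));

  passes = 0;
  while (cur.size() > 1) {
    cur = reduce_pass(cur);
    passes++;
  }
  return cur.empty() ? ValueIndex(0.0f, -1.0f) : cur[0];
}
```

Under this pairwise assumption, an input of 1024*64 elements needs 16 passes, so a fixed overhead per pass is paid 16 times per reduction call, and the application sees only the final value.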

Thanks for your bug report.

gaurav_garg
Adept I

Sorry for repost.