Performance measuring issues

Discussion created by guenthernoack on Feb 3, 2009
Latest reply on Feb 6, 2009 by gaurav.garg
Depending on the order of my performance measurements, the results differ


I wrote a couple of different versions of the one-dimensional "minimal index" reduction kernel that was discussed earlier in this forum.

To measure the performance of my three versions, I wrote a C macro, MEASURETIME2 (based on the Linux clock_gettime() function), and the functions measure_minindex2(), measure_minindex3() and measure_minindex4() (see below).
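
For readers without the full listing, a timing macro of this kind could look roughly like the sketch below. This is not the actual MEASURETIME2 macro: the names MEASURE_MSEC and elapsed_msec, the output format, and the choice of CLOCK_MONOTONIC are assumptions made for illustration only.

```cpp
#include <stdio.h>
#include <time.h>   // clock_gettime(), CLOCK_MONOTONIC (POSIX)

// Hypothetical helper: difference between two timestamps in milliseconds.
static double elapsed_msec(const struct timespec &start,
                           const struct timespec &end) {
  return (end.tv_sec - start.tv_sec) * 1e3 +
         (end.tv_nsec - start.tv_nsec) / 1e6;
}

// Hypothetical stand-in for the timing macro: runs the given code once and
// prints the elapsed wall-clock time. The code is taken as a variadic
// argument so that commas inside the block do not split it.
#define MEASURE_MSEC(name, size, ...)                                   \
  do {                                                                  \
    struct timespec t0_, t1_;                                           \
    clock_gettime(CLOCK_MONOTONIC, &t0_);                               \
    __VA_ARGS__                                                         \
    clock_gettime(CLOCK_MONOTONIC, &t1_);                               \
    printf("%s, size %u: %.3f msec\n", (name), (unsigned)(size),        \
           elapsed_msec(t0_, t1_));                                     \
  } while (0)
```

It would be used as MEASURE_MSEC("minindex2", size, { /* code to time */ });. Note that a wall-clock measurement like this only covers the GPU work if the timed code waits for the results before the second clock_gettime() call.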

My question: the results differ depending on the order in which the measure_minindex*() functions are run. Is that because of lazy buffer deallocation on the GPU when Stream objects are destroyed?

When I run the measure functions in the order 2, 3, 4, minindex2 takes 4.7 msec for arrays of size 1024*64. When I run them in the order 4, 3, 2, it takes 10.62 msec!

Best regards,



void measure_minindex2() {
  puts(" - measuring minindex2");
  const unsigned maxsize = 1024*64;
  const unsigned minsize = 1024*2;
  const unsigned times = 100;
  float *arr = (float*) malloc(sizeof(float) * maxsize*2);
  for (unsigned i=0; i<maxsize*2; i++) arr[i] = random() % 1000;
  for (unsigned size=minsize; size<=maxsize; size*=2) {
    Stream<float> numbers(1, &size);
    Stream<float2> numbersWithIndices(1, &size);
    MEASURETIME("minindex2", size, {
    for (unsigned i=0; i<times; i++) {
      float2 result(INFINITY, -1337.0f);
      create_indices(numbers, numbersWithIndices);
      minimal_index2(numbersWithIndices, result);
    if (numbersWithIndices.error()) {
      printf("ERROR: %s\n", numbersWithIndices.errorLog());