Hi!
I wrote a couple of different versions of the one-dimensional "minimal index" reduction kernel that was discussed earlier in this forum.
To measure the performance of my three versions, I wrote a C macro MEASURETIME (using the Linux clock_gettime() function) and the functions measure_minindex2(), measure_minindex3() and measure_minindex4() (see below).
The question is: Depending on the order in which the measure_minindex*() functions are run, the results differ. Is that because of lazy buffer deallocation on the GPU when Stream objects are destroyed?
When executing the measure functions in order 2,3,4, minindex2 takes 4.7 msec for arrays of size 1024*64. When executing in the order 4,3,2, it takes 10.62 msec!
Best regards,
Günther
void measure_minindex2()
{
    puts(" - measuring minindex2");
    const unsigned maxsize = 1024*64;
    const unsigned minsize = 1024*2;
    const unsigned times = 100;
    float *arr = (float*) malloc(sizeof(float) * maxsize*2);
    srandom(99);
    for (unsigned i=0; i<maxsize*2; i++) arr[i] = (float)(random() % 1000);
    for (unsigned size=minsize; size<=maxsize; size*=2) {
        Stream<float> numbers(1, &size);
        Stream<float2> numbersWithIndices(1, &size);
        numbers.read(arr);
        MEASURETIME("minindex2", size, {
            for (unsigned i=0; i<times; i++) {
                float2 result(INFINITY, -1337.0f);
                create_indices(numbers, numbersWithIndices);
                minimal_index2(numbersWithIndices, result);
                /*numbersWithIndices.write(arr);*/
            }
        });
        if (numbersWithIndices.error()) {
            printf("ERROR: %s\n", numbersWithIndices.errorLog());
        }
    }
    free(arr);
}
That is a huge performance difference. The Brook+ runtime implements lazy buffer allocation, and it also uses caching techniques to avoid repeated buffer allocation and deallocation; it looks like these optimizations are going wrong in this particular case. Could you send your complete test case to streamdeveloper@amd.com for further investigation?
Could you reproduce the problem with the code I sent you?
Am I making a mistake by filling the stream just once and then running the kernel 100 times with it as input?
I also hit the slowdown problem already discussed in this thread, which -- from my naive outsider's point of view -- looks like the same issue:
http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=105386
For normal kernels, I could fix the slowdown by calling error() on the output streams, as discussed in the other thread. (This behaviour is totally weird.)
However, the minindex reduction kernel outputs to a value, not a stream. Is there a way to "call error()" (or something similar) for the reduction kernel?
Yes, you are right, it is the same issue. The reduction is implemented in multiple passes, and since every pass causes a slowdown, it is not really possible for you to control this slowdown from the application. This will be fixed in the next release.
Thanks for your bug report.
Sorry for repost.