I accidentally discovered while testing my application that printing my kernel source to the console was improving the performance of my HD 3870...
I was also able to reproduce this behavior with the simple_matmult sample from the CAL SDK. If you run the sample with the default values, the output should look something like this:
Matrix Size    GPU Only    Total
(0256x0256)    65.5152     1.1904
If I put some printf calls at the start of the main() function, I get the following output:
Matrix Size    GPU Only    Total
(0256x0256)    151.7902    0.4275
Is anyone else able to reproduce this behavior?
To achieve this result you actually have to put in quite a few printf calls... I print a string of about 2000 characters, 100 times:
for(int i=0; i<100; i++) printf("11111111111111111111.......11");
You don't actually have to count out 2000 characters; just look at the column count in your editor to get 2000 "1"s...
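If you don't want to paste a 2000-character literal by hand, here is a minimal sketch that produces the same volume of output. The counts (100 repetitions of 2000 characters) come from the post above; building the string with memset is just one way to do it:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[2001];
    memset(line, '1', 2000);   /* 2000 '1' characters, as described above */
    line[2000] = '\0';

    for (int i = 0; i < 100; i++)
        printf("%s", line);    /* no newline, matching the original snippet */

    /* ... the rest of main() (the CAL setup and timed matmul) follows here */
    return 0;
}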
OK, thanks, but can you tell whether this is a bug or whether it really improves performance?
I've noticed it doesn't change anything for big matrices (4096); the peak seems to be around 200 Gflops...
Nexis,
I am wondering how you can say that the peak performance is 200 Gflops. What was the input for it?
Well, just try the simple_matmult sample from the CAL SDK with a big matrix and you should get around 200 Gflops on an HD 3870.
To modify the matrix size, you have to pass an argument to the executable, like this:
simple_matmult.exe -m 4096
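For reference, assuming the sample scores the run with the usual 2*N^3 flop count for an N x N matrix multiply (I haven't checked its source, so treat the formula as an assumption), the Gflops figure works out like this:

#include <stdio.h>

/* Assumed scoring formula: 2*N^3 floating-point operations (one
   multiply plus one add per output term) divided by elapsed time. */
double gflops(int n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}

int main(void)
{
    /* Illustrative numbers only: a 4096x4096 multiply finishing in
       about 0.687 s would score roughly 200 Gflops. */
    printf("%.1f Gflops\n", gflops(4096, 0.687));
    return 0;
}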