Hi there,
I've been looking at the float and double matrix multiply examples and have noticed that brcc rarely seems to emit either mad or dmad IL instructions. This made me worry that these and similar programs in Brook+ might only reach 1/2 or 2/3 of peak performance.
However, with the help of the shader analyser it seems that the CAL compiler does optimize the float case into a MULADD assembly instruction, but won't do the same for the double case. Instead it emits two MUL_64s followed by an ADD_64. So for a pair of doubles this seems to take 3 instruction groups rather than the 2 it would need if it used MULADD_64 twice. From here we know that all instruction groups take one cycle, so we seem to be getting only 2/3 of theoretical performance. Will this be changed in time, or is there another reason I'm missing why MULADD_64s are not being used here?
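To make the arithmetic behind those fractions explicit, here is a rough sketch in Python. It assumes, as above, that every instruction group takes one cycle. The function name is made up for illustration, and the float line is a hypothetical case (a separate multiply and add instead of one MULADD), not something I observed.

```python
from fractions import Fraction

def fraction_of_peak(groups_used, groups_ideal):
    """Fraction of peak throughput, assuming one cycle per instruction group."""
    return Fraction(groups_ideal, groups_used)

# Double case from the disassembly: 2 x MUL_64 + 1 x ADD_64 = 3 groups,
# versus 2 groups if MULADD_64 were used twice.
double_case = fraction_of_peak(groups_used=3, groups_ideal=2)

# Hypothetical float case (assumption): separate multiply and add = 2 groups,
# versus 1 group for a single MULADD.
float_unfused = fraction_of_peak(groups_used=2, groups_ideal=1)

print("double, no MULADD_64:", double_case)
print("float, no MULADD:", float_unfused)
```

This only counts instruction groups, so it ignores memory fetches and anything else that might limit the real kernel.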
Best,
Steven.