I'm trying to understand the performance of some code and I'm wondering if it's limited by the texture sampling rate. While I have found some "graphics" information on the web, that is normally expressed in terms of "number of n-bit pixels per second" or similar, and I'm not quite sure how this converts to what is available in CAL/IL. So, on an HD3870, could someone tell me how many samples per second (or per cycle) I should expect (assuming everything is in L1 cache) using a sequence of instructions like
i.e. what is the peak bit rate for reading float4s in CAL/IL?
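To make clear what conversion I'm after, here is a rough Python sketch of the arithmetic. The function name is made up for illustration, and the numbers plugged in (16 fetches per clock, a 775 MHz engine clock, one full float4 returned per fetch) are my assumptions rather than measured or confirmed figures; finding the real values is the point of the question.

```python
def peak_bytes_per_second(fetches_per_clock, clock_hz, bytes_per_fetch):
    """Upper bound on fetch bandwidth if every fetch hits cache.

    All three inputs are assumptions supplied by the caller; this is
    just the unit conversion, not a model of the hardware.
    """
    return fetches_per_clock * clock_hz * bytes_per_fetch


# Assumed (unconfirmed) figures for illustration only.
FETCHES_PER_CLOCK = 16        # assumed texture fetches issued per clock
CLOCK_HZ = 775_000_000        # assumed engine clock
FLOAT4_BYTES = 4 * 4          # four 32-bit floats per sample

rate = peak_bytes_per_second(FETCHES_PER_CLOCK, CLOCK_HZ, FLOAT4_BYTES)
print(f"{rate / 1e9:.1f} GB/s under the assumed figures")
```

If float4 fetches actually run at a lower rate than narrower formats, the fetches-per-clock input would need scaling down accordingly, which is part of what I'm asking.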
And does this change under certain circumstances? e.g. without the unnorm declaration, or with an _aoffimmi addition? How about if I read from a float or float2 instead? (And how about for one of the newer cards?)
Then, should I gather sample_resource instructions in groups of 8 to maximize performance? And should I spread such groups throughout the ALU instructions or group them together (I seem to be finding the latter)? And if anyone could advise whether there is an optimal order for the _aoffimmi indices when reading in a small array (e.g. 8x1 or 4x4) of float4s, or whether it makes no difference, I'd be very grateful.