In the AMD Accelerated Parallel Processing OpenCL Programming Guide,
page 119, section 4.13.1, table 4.14 shows under the integer instruction rates
a total throughput of 1 mul per 5 PEs on Cypress.
Now I'm interested in this same table for Cayman, and most specifically in the aggregate throughput of mulhi + mul per cycle on a stream core.
I ask because I received contradictory information on this. It was my understanding that it is possible to schedule 2 muls per cycle per stream core on Cayman. Is that correct?
If not, is the aggregate number of mul + mulhi perhaps 2 for the 6900 series?
This makes a huge difference for multiplication code, namely 32 bits of output per cycle per stream core versus 64 bits of output per cycle per stream core.
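
To illustrate why I count mul and mulhi together, here is a minimal C sketch of the arithmetic only (it says nothing about the hardware rates I am asking about). A full 32x32-bit product is 64 bits wide: mul delivers the low 32 bits and mulhi the high 32 bits. The helper names mul_lo32, mul_hi32 and mul_full64 are made up for this post; in OpenCL C the high half would come from the built-in mul_hi().

```c
#include <stdint.h>

/* Low 32 bits of the 32x32-bit unsigned product (what a plain mul yields). */
static uint32_t mul_lo32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* High 32 bits of the same product (what mulhi yields). */
static uint32_t mul_hi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Full 64-bit product assembled from the two 32-bit halves. */
static uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return ((uint64_t)mul_hi32(a, b) << 32) | mul_lo32(a, b);
}
```

So every full-width multiply needs one mul and one mulhi, which is why the combined issue rate of the two determines whether 32 or 64 product bits come out per cycle.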