Hello, I just wrote a parallel program using OpenCL and ran it on my HD7990 card. While trying to optimize it, I read the GCN architecture whitepaper and came across this passage:
"Double precision and 32-bit integer instructions run at a reduced rate within a SIMD. The GCN Architecture is flexible and double precision performance varies from 1/2 to 1/16 of single precision performance, increasing the latency accordingly. The double precision and 32-bit integer performance can be configured for a specific GCN implementation, based on the target application."
My program is based on integer computation, so how can I configure my card or my program for integer computing?
They 'configure' it when they design a specific GCN chip, so it is not something you can set on your card. On the HD7990 the DP/SP ratio is 1/4. Every stream core has a 24-bit multiplier circuit for single-precision float math, so 32-bit integer multiplication is as slow as double-precision math.
You can optimize your program by using 24-bit integer mul/mad instead of 32-bit integer mul/mad wherever 24 bits of precision are enough. Try the mul24() or mad24() built-in functions!
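To illustrate the mul24()/mad24() suggestion, here is a minimal sketch, not a definitive implementation. The kernel name scale_add, its arguments, and the host-side helpers fits_u24 and mad24_ref are hypothetical names made up for this example. The kernel is embedded as a string, the way a host program would hand it to clCreateProgramWithSource(). The helpers only model the case where both unsigned operands are below 2^24; outside that range the OpenCL spec leaves the result of mul24/mad24 implementation-defined, so check your data first.

```c
#include <stdbool.h>
#include <stdint.h>

/* OpenCL C kernel source (hypothetical kernel "scale_add").
   mad24(x, y, z) computes x * y + z using a 24-bit multiply. */
const char *const kernel_src =
    "__kernel void scale_add(__global const uint *a,\n"
    "                        __global const uint *b,\n"
    "                        __global uint *out,\n"
    "                        uint offset)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    // Valid only if a[i] and b[i] are both below 2^24.\n"
    "    out[i] = mad24(a[i], b[i], offset);\n"
    "}\n";

/* Host-side check: do both unsigned operands fit in 24 bits? */
bool fits_u24(uint32_t x, uint32_t y)
{
    return x < (1u << 24) && y < (1u << 24);
}

/* Host-side reference for in-range unsigned operands:
   the low 32 bits of x * y + z. */
uint32_t mad24_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (uint32_t)((uint64_t)x * y + z);
}
```

If fits_u24() can fail for your inputs, keep the ordinary 32-bit multiply for those values rather than relying on what mad24() happens to return.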
I would add that the performance hit applies only to multiplication, division, and similar operations; all basic integer ops (bitwise ops, comparison, add, sub, etc.) complete in a single tick. Avoid division at all costs: there is no idiv instruction in GCN assembly, so the compiler converts to float, computes the float reciprocal, does a float multiply, then converts back. The ISA file is the best place to look. The optimum is to have the SIMDs fed with no memory latency (that is, all data in registers) and the SIMDs keeping the PEs fed (if possible, with the code in blocks of 16 of the same instruction operating on separate registers).
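One common way to follow the "avoid division" advice, sketched here under the assumption that the divisor is a power of two (for example a buffer width you chose yourself), is to replace the divide and modulo with a shift and a mask, which are among the single-tick basic ops. The function names div_pow2 and mod_pow2 are hypothetical. The code is plain C so it can be checked on the host, but the same expressions should work in an OpenCL C kernel with uint in place of uint32_t. Compilers often do this on their own when the divisor is a compile-time constant, so it mainly helps when the divisor is only known at run time.

```c
#include <stdint.h>

/* Quotient of x / d for unsigned x, where d == 1u << shift.
   Requires shift < 32. */
uint32_t div_pow2(uint32_t x, unsigned shift)
{
    return x >> shift;
}

/* Remainder of x % d for unsigned x, where d == 1u << shift.
   Requires shift < 32. */
uint32_t mod_pow2(uint32_t x, unsigned shift)
{
    return x & ((1u << shift) - 1u);
}
```

For instance, with a row width of 64 elements (shift 6), a linear index can be split into a row with div_pow2(index, 6) and a column with mod_pow2(index, 6). This does not apply to signed values or to divisors that are not powers of two.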