5 Replies Latest reply on Oct 11, 2012 5:47 AM by Bdot

    Integer operations in GCN

    Myrmecophagavir

      How does the GCN architecture handle integer operations? Do the vector ALUs have an integer path as well as float, or is it all done on that single scalar thing that each compute unit apparently has?  The reason I'm asking is I saw worse performance for an integer workload on a 7970 compared to a GeForce GTX 580 - the 7970 takes over twice as long to run the kernel. IIRC the Fermi architecture has integer & float paths in its main ALUs.

        • Re: Integer operations in GCN
          realhet

          Hi,

           

          On GCN most integer operations (and, or, shl, add, addc, cmp, ...) run as fast as single-precision float operations.

          32-bit integer multiply runs at the double-precision rate, which is 1/4 of the single-precision rate on the higher-end models (like 79xx).

          There's a special 24-bit MAD instruction which runs at the SP rate.
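          To make the 24-bit restriction concrete, here is a minimal host-side C sketch of what an unsigned 24-bit multiply-add computes next to a plain 32-bit one. The function names are made up for illustration, and masking the operands to their low 24 bits is an assumption about how the hardware treats out-of-range inputs (OpenCL itself leaves the result implementation-defined in that case).

```c
#include <stdint.h>

/* Emulates an unsigned 24-bit multiply-add: only the low 24 bits of x and y
   enter the multiply (assumed hardware behaviour), the add is a normal
   32-bit add, and the result wraps modulo 2^32. */
static uint32_t mad_u24(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t product = (uint64_t)(x & 0xFFFFFFu) * (uint64_t)(y & 0xFFFFFFu);
    return (uint32_t)product + z;
}

/* Plain 32-bit multiply-add for comparison (the slower path on GCN). */
static uint32_t mad_u32(uint32_t x, uint32_t y, uint32_t z)
{
    return x * y + z;
}
```

          As long as both operands fit in 24 bits the two functions agree, even when the product itself overflows 32 bits, because both keep only the low 32 bits. They diverge as soon as an operand needs a 25th bit.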

           

          IMO the performance difference you mentioned is not due to a lack of integer performance, but to GCN having some extra requirements compared to previous architectures:

          In general it needs 4x more threads, and the optimal register limit has dropped from 128 to 84, or even better 64 registers, in order to get close to nominal performance.
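          One way to see where 84 and 64 come from is a back-of-the-envelope wavefront count. The sketch below relies on figures from AMD's public GCN material rather than anything stated in this thread, so treat them as assumptions: 256 vector registers per lane on each SIMD, allocation in blocks of 4 registers, and at most 10 wavefronts resident per SIMD. The function name is hypothetical.

```c
#include <stdint.h>

/* Assumed GCN limits (not from this thread). */
#define VGPRS_PER_SIMD_LANE 256u
#define VGPR_ALLOC_GRANULE    4u
#define MAX_WAVES_PER_SIMD   10u

/* Estimates how many wavefronts can be resident on one SIMD when each
   work-item uses `vgprs` vector registers. */
static uint32_t waves_per_simd(uint32_t vgprs)
{
    if (vgprs == 0u)
        return MAX_WAVES_PER_SIMD;

    /* Round the request up to the allocation granule. */
    uint32_t allocated =
        (vgprs + VGPR_ALLOC_GRANULE - 1u) / VGPR_ALLOC_GRANULE * VGPR_ALLOC_GRANULE;

    uint32_t waves = VGPRS_PER_SIMD_LANE / allocated;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

          Under these assumptions 128 registers leave room for 2 wavefronts per SIMD, 84 for 3, and 64 for 4, while 85 already drops back to 2, which would explain why 84 and 64 are the natural thresholds.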

           

          What was the test you tried, btw?

            • Re: Integer operations in GCN
              ankhster

              This is good to know as I mainly work with integers.

               

              So providing I'm not using more than 12 bit unsigned integers, I'd be better off with

               

              z = mad24(x, y, 0);

               

              than I would with

               

              z = x * y;

               

              ?

                • Re: Integer operations in GCN
                  realhet

                  Here are some actual instructions:

                   

                  v_add_i32, v_sub_i32 : these produce a carry

                  v_addc_i32, v_subb_i32 : these produce a carry and also take an input carry

                  v_mad_i32_i24,  v_mad_u32_u24 : d:=s0*s1+s2    multiplication is 24bit, but addition is 32 bit

                  v_mul_i32_i24, v_mul_u32_u24 : d:=s0*s1     mul is 24bit

                  v_mul_hi_i32_i24, v_mul_hi_u32_u24 : high part of a 24bit mul, produces 16bit result

                  --------------------- those were the fast ones: (1 cycle/instruction)

                   

                  v_mul_lo_i32, v_mul_lo_u32 : 32bit mul

                  v_mul_hi_i32, v_mul_hi_u32 : 32bit mul, high part

                  --------------------- and those run at the DP rate, e.g. 4 cycles/instruction

                   

                  In OpenCL it's not yet possible to use the carry, but all the other instructions have their equivalents.
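                  For illustration, here is a host-side C sketch of what two of these instructions compute: an add with carry in and carry out, recovered by comparison since the language exposes no carry flag, and the high part of a 24-bit multiply. All names are hypothetical, the 24-bit masking is an assumption about out-of-range operands, and nothing in this thread confirms that a compiler maps these patterns onto the instructions above.

```c
#include <stdint.h>

typedef struct {
    uint32_t sum;    /* low 32 bits of a + b + carry_in */
    uint32_t carry;  /* 1 if the true sum did not fit in 32 bits */
} add_result;

/* Add with carry in and carry out, using unsigned wrap-around to detect
   overflow: a wrapped sum is always smaller than its first operand. */
static add_result add_with_carry(uint32_t a, uint32_t b, uint32_t carry_in)
{
    uint32_t partial = a + b;
    uint32_t c1 = partial < a;        /* overflow from a + b */
    uint32_t sum = partial + carry_in;
    uint32_t c2 = sum < partial;      /* overflow from adding the carry */
    add_result r = { sum, c1 | c2 };
    return r;
}

/* 64-bit add built from two 32-bit adds chained through the carry. */
static uint64_t add64_via_halves(uint64_t a, uint64_t b)
{
    add_result lo = add_with_carry((uint32_t)a, (uint32_t)b, 0u);
    add_result hi = add_with_carry((uint32_t)(a >> 32), (uint32_t)(b >> 32), lo.carry);
    return ((uint64_t)hi.sum << 32) | lo.sum;
}

/* High part of an unsigned 24-bit multiply: bits 32..47 of the 48-bit
   product, so the result never exceeds 16 bits. */
static uint32_t mul_hi_u24(uint32_t x, uint32_t y)
{
    uint64_t product = (uint64_t)(x & 0xFFFFFFu) * (uint64_t)(y & 0xFFFFFFu);
    return (uint32_t)(product >> 32);
}
```

                  The comparison trick costs extra instructions compared to a real carry chain, which is why direct access to the carry would matter for multi-word arithmetic.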

                    • Re: Integer operations in GCN
                      Bdot

                      v_mul_hi_i32_i24, v_mul_hi_u32_u24 : high part of a 24bit mul, produces 16bit result

                       

                      In OpenCL it's not yet possible to use carry, but all the other instructions have its equivalents.

                      Oh, cool, I did not know that! Could you let me know how to use v_mul_hi_u32_u24 from OpenCL? (I searched the forum for these instructions and found nothing, but when I posted the question, it showed this "similar" thread.)

                  • Re: Integer operations in GCN
                    Myrmecophagavir

                    Thanks. The program in question is a C++ AMP sample from Microsoft: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/10/01/string-search-sample-with-c-amp.aspx  Obviously as sample code it has other priorities than being super-optimised, but the performance difference between these cards is interesting.