
Integer operations in GCN

How does the GCN architecture handle integer operations? Do the vector ALUs have an integer path as well as float, or is it all done on that single scalar thing that each compute unit apparently has?  The reason I'm asking is I saw worse performance for an integer workload on a 7970 compared to a GeForce GTX 580 - the 7970 takes over twice as long to run the kernel. IIRC the Fermi architecture has integer & float paths in its main ALUs.

realhet

Hi,

On GCN, most integer operations (and, or, shl, add, addc, cmp, ...) run as fast as single-precision float operations.

32-bit integer multiply runs at the double-precision rate, which is 1/4 of the single-precision rate on the higher-end models (like the 79xx).

There's a special 24-bit MAD instruction which runs at the SP rate.

IMO the performance difference you mentioned is not due to a lack of integer performance, but because GCN has some extra requirements compared to previous architectures:

In general it needs 4x more threads, and the optimal register limit has dropped from 128 to 84, or better still 64 regs, in order to get close to nominal performance.
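The register limits above can be tied to how many wavefronts fit on a SIMD at once. Here is a rough sketch in C, assuming each GCN SIMD offers 256 vector registers per lane and holds at most 10 wavefronts. Those two figures are assumptions that do not come from this thread, and `waves_per_simd` is a hypothetical helper, not a real API.

```c
#include <stdint.h>

/* Assumed GCN figures (not stated in the thread): 256 VGPRs per lane on each
   SIMD, and at most 10 resident wavefronts per SIMD. */
enum { VGPRS_PER_SIMD_LANE = 256, MAX_WAVES_PER_SIMD = 10 };

/* Hypothetical helper: how many wavefronts fit on one SIMD when the kernel
   uses `vgprs_per_thread` vector registers. */
static unsigned waves_per_simd(unsigned vgprs_per_thread)
{
    if (vgprs_per_thread == 0)
        return MAX_WAVES_PER_SIMD;      /* registers are not the limit */

    unsigned waves = VGPRS_PER_SIMD_LANE / vgprs_per_thread;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

Under those assumptions, 128 regs leaves room for 2 wavefronts per SIMD, 84 regs for 3, and 64 regs for 4, which is one way to read the 128 / 84 / 64 steps mentioned above.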

What was the test you tried, btw?

This is good to know as I mainly work with integers.

So provided I'm not using more than 12-bit unsigned integers, I'd be better off with

z = mad24(x, y, 0);

than I would with

z = x * y;

?
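For reference, here is a small C model of what the unsigned mad24 computes, to show when it agrees with a plain 32-bit multiply. It is only a sketch: `mad24_model` and `mul32` are hypothetical names, and masking the operands to 24 bits is an assumption of this model (OpenCL leaves operands outside the 24-bit range implementation-defined).

```c
#include <stdint.h>

/* Model of unsigned mad24: multiply the low 24 bits of x and y, keep the low
   32 bits of the product, then do a full 32-bit add of z.
   Masking out-of-range operands is an assumption of this sketch. */
static uint32_t mad24_model(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t product = (uint64_t)(x & 0xFFFFFFu) * (y & 0xFFFFFFu);
    return (uint32_t)product + z;
}

/* Plain 32-bit multiply, i.e. z = x * y on unsigned ints. */
static uint32_t mul32(uint32_t x, uint32_t y)
{
    return x * y;
}
```

In this model the two agree whenever both operands fit in 24 bits, so the operands would not need to be limited to 12 bits; they only start to differ once an operand has bits set above bit 23.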


Here are some actual instructions:

v_add_i32, v_sub_i32 : these produce a carry

v_addc_i32, v_subb_i32 : these produce a carry, and also take an input carry

v_mad_i32_i24,  v_mad_u32_u24 : d:=s0*s1+s2    multiplication is 24bit, but addition is 32 bit

v_mul_i32_i24, v_mul_u32_u24 : d:=s0*s1     mul is 24bit

v_mul_hi_i32_i24, v_mul_hi_u32_u24 : high part of a 24bit mul, produces 16bit result

--------------------- those were the fast ones: (1 cycle/instruction)

v_mul_lo_i32, v_mul_lo_u32 : 32bit mul

v_mul_hi_i32, v_mul_hi_u32 : 32bit mul, high part

--------------------- and those run at the DP rate, e.g. 4 cycles/instruction

In OpenCL it's not yet possible to use the carry, but all the other instructions have their equivalents.
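Without access to the carry flag, the usual workaround is to recover the carry with a comparison. Below is a minimal C sketch of a 64-bit add built from 32-bit halves, mirroring the add / add-with-carry pair listed above. `add64_via_32` is a hypothetical helper, and whether a given compiler actually turns this pattern into v_add_i32 + v_addc_i32 is an assumption to check in the generated ISA.

```c
#include <stdint.h>

/* 64-bit add done with 32-bit operations only.
   The carry out of the low word is recovered by comparing the wrapped sum
   with one of its inputs, since no carry flag is visible at this level. */
static uint64_t add64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo    = a_lo + b_lo;        /* low word, may wrap around   */
    uint32_t carry = lo < a_lo;          /* 1 if the low add wrapped    */
    uint32_t hi    = a_hi + b_hi + carry; /* high word plus carry in    */

    return ((uint64_t)hi << 32) | lo;
}
```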


v_mul_hi_i32_i24, v_mul_hi_u32_u24 : high part of a 24bit mul, produces 16bit result

In OpenCL it's not yet possible to use the carry, but all the other instructions have their equivalents.

Oh, cool, I did not know that! Could you let me know how to use v_mul_hi_u32_u24 from OpenCL? (I searched the forum for these instructions and it returned nil, but when I posted the question, it showed this "similar" thread )
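The thread does not name an OpenCL built-in for this instruction, so the following is only a C model of what it computes according to the description quoted above (the high part of a 24-bit multiply, i.e. the 16 bits above bit 31 of the 48-bit product). `mul_hi_u24_model` is a hypothetical name, and whether a compiler selects v_mul_hi_u32_u24 for the equivalent expression in a kernel is an assumption that would need checking in the ISA dump.

```c
#include <stdint.h>

/* Model of the unsigned 24-bit "mul hi": multiply the low 24 bits of each
   operand into a 48-bit product and return bits 47..32 (a 16-bit value). */
static uint32_t mul_hi_u24_model(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)(a & 0xFFFFFFu) * (b & 0xFFFFFFu);
    return (uint32_t)(product >> 32);
}
```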


Thanks. The program in question is a C++ AMP sample from Microsoft: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/10/01/string-search-sample-with-c-amp.aspx  Obviously as sample code it has other priorities than being super-optimised, but the performance difference between these cards is interesting.
