Archives Discussions

Biaowang · ‎07-21-2014

Hi,

I am working on a video codec using GPUs. Since most of the operations in video encoding/decoding are integer operations, I would like to know the peak integer performance of AMD GPUs.

I am hoping for a table like the one in the NVIDIA Programming Guide :: CUDA Toolkit Documentation.

For AMD I failed to find such a table, and benchmarking each GPU is time-consuming and tedious work.

Any ideas?

Best regards

dipak · ‎07-22-2014

Please see the following section in AMD APP OpenCL programming guide.

7.8 Instruction Selection Optimizations

7.8.1 Instruction Bandwidths

Table 7.10 lists the throughput of instructions for GPUs.

Regards,

nou · ‎07-21-2014

Search for the AMD OpenCL programming guide. It contains a table with instruction throughput.

Biaowang · ‎07-22-2014

Just some feedback for the AMD documentation maintenance group.

I found the web version of the instruction throughput table at http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/amd-accelerated-parallel-proce...

It differs from Table 7.10 in section 7.8.1 of the AMD OpenCL programming guide.

I tested the peak performance on a GCN device and can confirm that the table in the programming guide is correct:

integer peak performance is 1/5 of floating point, not 1/4 as shown at the link.

Please correct it.

Thanks to all of you.

pinform · ‎07-23-2014

Dear Biaowang,

There are two instruction throughput tables in the link (and in the guide). One is for Evergreen/NI devices; the other is for GCN devices. You were comparing the Evergreen/NI table in one to the GCN table in the other.

Table 7.10 in section 7.8.1 of the AMD OpenCL programming guide (rev 2.7) corresponds to Table 3.10 at the link you have provided:

3.10 Instruction Throughput (Operations/Cycle for Each Stream Processor)
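
Since the table is given in operations per cycle for each stream processor, a device-level peak can be estimated by scaling that rate by the number of stream processors and the clock. The following is only a rough sketch: the function name `peak_ops_per_second` and its parameters are hypothetical, and any device figures you plug in should come from the vendor's specifications.

```c
#include <stdint.h>

/* Hypothetical helper: estimate peak operations per second from a
 * per-stream-processor rate as listed in an instruction throughput table.
 * ops_per_cycle is, for example, 1.0 for a full-rate instruction and
 * 0.25 for a quarter-rate instruction. */
double peak_ops_per_second(uint32_t stream_processors,
                           double clock_hz,
                           double ops_per_cycle)
{
    return (double)stream_processors * clock_hz * ops_per_cycle;
}
```

For an imaginary device with 1000 stream processors at 1 GHz (made-up round numbers, not a real GPU), `peak_ops_per_second(1000, 1e9, 0.25)` gives 2.5e11 operations per second for a quarter-rate instruction. This is an upper bound that assumes the ALU is never idle.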

--Prasad

realhet · ‎07-23-2014

These 1/1, 1/2 and 1/4 throughputs are only measurable under ideal conditions for the vector ALU, where it never has to stall for a single cycle.

For example, ADD on GCN takes only 1 cycle, but it writes not only into the vector registers but also the 64-bit carry into a scalar register. So if a scalar instruction comes right after the ADD, it will take one more cycle. How many scalar operations need to be interleaved into the vector instruction stream depends on your particular code.

The same ADD instruction on Evergreen does not have this issue, but if your program has to work with the carry (e.g. multiprecision integer addition), then you must use another instruction to get the carry bits, and yet another to accumulate them. However, on GCN this big-integer carry-adding can be done in 1 cycle per 32 bits of data.
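
To make the extra steps concrete, here is a minimal sketch in portable C of multiprecision addition without a hardware carry flag. It is not GPU code and says nothing about actual instruction counts; the name `mp_add` and the little-endian limb layout are assumptions chosen for illustration. Each limb needs the plain add, a comparison to recover the carry, and a further add to fold in the carry from the previous limb.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds two n-limb integers a and b (least significant limb first)
 * into sum. Returns the carry out of the top limb (0 or 1). */
uint32_t mp_add(uint32_t *sum, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t t  = a[i] + b[i];   /* plain 32-bit add, wraps on overflow */
        uint32_t c1 = t < a[i];      /* recover the carry of that add */
        uint32_t s  = t + carry;     /* accumulate the incoming carry */
        uint32_t c2 = s < t;         /* carry produced by the accumulation */
        sum[i] = s;
        carry = c1 | c2;             /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

Adding the two-limb values {0xFFFFFFFF, 1} and {1, 0} with this function yields {0, 2} with no carry out.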

realhet · ‎07-22-2014

Hi,

The only problematic int op is the 32-bit multiply (_hi and _low). Those are 4x slower as they run on the double-precision units. Everything else (add, mul24, mad24, bitwise, bitfind, bitinsert, sad) runs as fast as simple float operations. You can effectively use mad24, which computes 24bit*24bit+32bit, and you can get the upper or lower 32 bits of the result.
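
As a reference for what these operations compute, the sketch below emulates the unsigned 24-bit multiply and multiply-add on the host in plain C, next to the full 32-bit high multiply. The function names are hypothetical, and masking the inputs to 24 bits is an assumption made here for illustration: OpenCL leaves the result of `mul24` implementation-defined when the operands do not fit in 24 bits.

```c
#include <stdint.h>

/* 48-bit product of the low 24 bits of x and y (inputs masked by assumption). */
static uint64_t mul24_full(uint32_t x, uint32_t y)
{
    return (uint64_t)(x & 0xFFFFFFu) * (uint64_t)(y & 0xFFFFFFu);
}

/* Lower 32 bits of the 24-bit x 24-bit product. */
uint32_t mul24_lo(uint32_t x, uint32_t y)
{
    return (uint32_t)mul24_full(x, y);
}

/* Upper bits (bits 32..47) of the 24-bit x 24-bit product. */
uint32_t mul24_hi(uint32_t x, uint32_t y)
{
    return (uint32_t)(mul24_full(x, y) >> 32);
}

/* 24bit*24bit+32bit, keeping the lower 32 bits (wraps on overflow). */
uint32_t mad24_lo(uint32_t x, uint32_t y, uint32_t z)
{
    return mul24_lo(x, y) + z;
}

/* Upper 32 bits of a full 32-bit x 32-bit product, i.e. the slow case. */
uint32_t mul32_hi(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x * (uint64_t)y) >> 32);
}
```

For typical video data (8- to 16-bit samples and small coefficients) the operands fit comfortably in 24 bits, so `mad24_lo(3, 4, 5)` style multiply-accumulates give the same result as a full 32-bit multiply-add.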

The peak performance of integer operation