04-10-2013
04:55 PM

HD7970ghz Peak TFLOPS calculation

In the document 'AMD Accelerated Parallel Processing OpenCL Programming Guide' provided here

Table 5.3 gives the instructions per cycle (IPC) ratings for various instructions from which we may calculate the peak FLOPS for both single and double precision calculations. Using the table I calculate the double precision peak FLOPS as

dp_add_flops = total_alu_count * clock_rate * dp_add_ipc

= 2048 * 1.05 GHz * 0.5

= 1.0752 TFLOPS

which is roughly in line with the advertised performance. For single precision, however, I have

sp_add_flops = total_alu_count * clock_rate * sp_add_ipc

= 2048 * 1.05 GHz * 4

= 8.6016 TFLOPS

which is exactly double the advertised performance. What am I missing here? If the single-precision add IPC is reduced to 2, the numbers are spot on; however, that does not agree with the specs in the document identified above. Also, is there a place where I can find very detailed hardware specifications for my card specifically?

Thanks

~ry
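
For readers who want to check the arithmetic in the question, here is a minimal Python sketch that reproduces the three figures quoted above. The helper name `peak_flops` and the constants are illustrative only; the ALU count, clock rate, and IPC values are the ones stated in the post, read the way the poster read them (IPC taken as "per clock").

```python
# Reproduces the arithmetic from the question; all names are hypothetical.
ALU_COUNT = 2048      # total ALU count used in the post
CLOCK_HZ = 1.05e9     # 1.05 GHz, as used in the post


def peak_flops(ipc, alu_count=ALU_COUNT, clock_hz=CLOCK_HZ):
    """Peak FLOPS under the poster's reading: IPC is instructions per clock."""
    return alu_count * clock_hz * ipc


dp_add = peak_flops(0.5)       # double precision ADD, IPC = 0.5
sp_add = peak_flops(4)         # single precision ADD, IPC = 4
sp_add_ipc2 = peak_flops(2)    # the "what if IPC were 2" case from the post

for label, value in [("DP ADD", dp_add), ("SP ADD", sp_add), ("SP ADD, IPC=2", sp_add_ipc2)]:
    print(f"{label}: {value / 1e12:.4f} TFLOPS")
```

The single-precision figure with IPC = 4 is twice the figure with IPC = 2, which is the factor of two the question asks about.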

1 Solution

04-11-2013
04:03 AM

I think the table is probably correct, but a little interpretation is needed.

Cards like Tahiti 7970, 7950 are "Full Speed Double Precision devices", so they are in the right column.

Full Speed only means the best the architecture can do, no specific speed.

The word "cycle" here means 4 clocks. Stream processors issue a wave in 4 parts with a minimum 4 clock latency, which is considered 1 cycle. However, instructions can be issued on each clock of a cycle, thus 4 insns/cycle.

Most **FP** instructions are 4/cycle (1/clock), which is impressive for FMA and the like.

The transcendentals (**rcp, sin, log, sqrt, rsqrt**) are 1/cycle.

Almost all **DP** is 1/cycle except **ADD**, where they manage to squeak out 2.

(Note they choose **ADD** to calculate a performance for **DP**.)

Also, "peak" performance is almost always based on multiply + add insns (**MAD**), which count as 2 *floating-point operations per instruction*, which gives a FACTOR of 2.

Using clocks, not cycles, peak performance would be:

(1 insn/clock)*(2048)*(1.0e9)*FACTOR = 4.096 TFLOPS for most FP and Int.

(1/4 insn/clock)*(2048)*(1.0e9)*FACTOR = 1.024 TFLOPS DP.

Using cycles would be 4 or 1 insn/cycle and a cycle speed of 0.25e9, which comes out the same.
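
The equivalence of the two conventions can be checked with a short Python sketch. It assumes, as this post does, that one "cycle" is 4 clocks, a 1.0e9 Hz clock, 2048 ALUs, and a factor of 2 for MAD; the function and constant names are made up for illustration.

```python
# Sketch of the peak calculation under both conventions described above.
# Assumption (from the post): one "cycle" in the table equals 4 clocks.
CLOCKS_PER_CYCLE = 4
ALU_COUNT = 2048
CLOCK_HZ = 1.0e9      # the post rounds the clock to 1 GHz
FLOPS_PER_MAD = 2     # a multiply-add counts as 2 operations (the FACTOR)


def peak_from_clocks(insn_per_clock):
    """Peak ops/sec when throughput is given per clock."""
    return insn_per_clock * ALU_COUNT * CLOCK_HZ * FLOPS_PER_MAD


def peak_from_cycles(insn_per_cycle):
    """Peak ops/sec when throughput is given per 4-clock cycle."""
    cycle_hz = CLOCK_HZ / CLOCKS_PER_CYCLE   # 0.25e9 cycles per second
    return insn_per_cycle * ALU_COUNT * cycle_hz * FLOPS_PER_MAD


sp_clock = peak_from_clocks(1)       # most FP: 1 insn/clock
sp_cycle = peak_from_cycles(4)       # same rate expressed as 4 insn/cycle
dp_clock = peak_from_clocks(0.25)    # DP: 1/4 insn/clock
dp_cycle = peak_from_cycles(1)       # same rate expressed as 1 insn/cycle

print(f"SP peak: {sp_clock / 1e12:.3f} TFLOPS (clocks), {sp_cycle / 1e12:.3f} TFLOPS (cycles)")
print(f"DP peak: {dp_clock / 1e12:.3f} TFLOPS (clocks), {dp_cycle / 1e12:.3f} TFLOPS (cycles)")
```

Substituting 1.05e9 for `CLOCK_HZ` scales both results by 1.05, which is how the single-precision figure lands on roughly 4.3 TFLOPS for the GHz Edition clock used in the question.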

Basic **int** operations are 4/cycle, with the big exception that the 32 bit mul and mad reduce to 1/cycle.

However, there are 24 bit accuracy versions of **mad** and **mul** that run at 4 insns/cycle.

Presumably the reason is that the 24 bit insns use the fast **FP** multipliers, which only have to multiply 24 bit mantissas. At least that was always my guess.

Edit: fixed the ambiguous phrase "many instructions can be issued on each clock thus 4 insns/cycle".

9 Replies

04-11-2013
01:47 AM

I went through the table and I am equally confused. I will ask around.

Here are some of my thoughts:

4 operations per cycle per stream processor looks like a huge task. Usually this can be done only if the instruction is operating on a vector.

Also, the guide calculates DP Peak assuming that the Tahiti is a "one-quarter double precision speed device". No idea why it is so....

Will ask around and get back to you.

04-11-2013
10:19 AM

Nice explanation of this confusing topic.

04-11-2013
10:57 AM

Noting of course that ADD alone and MAD with the double multiplier give the same quarter-clock-rate throughput of DP ops.

If your explanation is correct (and it seems sound as you write it, though I'd have to read thoroughly to find where else 'cycle' is used in that way), then I think the problem is that the parts of the programming guide I rewrote use "cycle" to mean "clock cycle", and like most people here I'm naturally reading the table the same way. Some clarification is in order to make that table more consistent with the rest of the chapter; I will make a note.

04-11-2013
12:34 PM

Thanks for the feedback.

Yes, I only *assume* that meaning for cycles from looking at the table and already knowing the answers.

It is the only meaning that makes sense. But just a few lines below the table, the guide states

... Table 5.3, a Tahiti device can perform one double-precision ADD operations/2 cycles in each stream core.

whereas the table clearly says 2 DP ADDs per "cycle" for each stream core.
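
One way to compare the two statements is to convert both to a per-clock rate. The Python sketch below does that with exact fractions. It rests on two assumptions that the guide itself does not confirm: that the table's "cycle" is 4 clocks (the reading proposed in this thread), and that the sentence below the table might instead use "cycle" to mean a single clock. All variable names are illustrative.

```python
from fractions import Fraction

# Assumption from this thread: a table "cycle" is 4 clocks.
CLOCKS_PER_CYCLE = 4

# Table entry: 2 DP ADDs per "cycle", read as per 4-clock cycle.
table_rate_per_clock = Fraction(2, CLOCKS_PER_CYCLE)

# Sentence below the table: 1 DP ADD per 2 cycles.
# Reading A (assumed): "cycle" means one clock.
text_rate_if_clock = Fraction(1, 2)
# Reading B (assumed): "cycle" means the same 4-clock cycle as the table.
text_rate_if_4_clocks = Fraction(1, 2 * CLOCKS_PER_CYCLE)

print("table, per clock:          ", table_rate_per_clock)
print("sentence, cycle = 1 clock: ", text_rate_if_clock)
print("sentence, cycle = 4 clocks:", text_rate_if_4_clocks)
print("consistent under reading A:", table_rate_per_clock == text_rate_if_clock)
print("consistent under reading B:", table_rate_per_clock == text_rate_if_4_clocks)
```

Under reading A the two statements describe the same rate, and under reading B they differ by a factor of 4, so the apparent conflict may come down to the word "cycle" being used in two senses.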

04-11-2013
12:36 PM

Hi drallan,

Thank you very much for the explanation, it clears things up quite a bit. A few points of clarification though. When you say that many instructions can be issued on each clock thus 4 insns/cycle, did you mean 1ins/clock and thus 4ins/cycle or is there actually something superscalar in nature going on here? Also, in your calculations I think it is TFLOPS and not GFLOPS, (1)*(2048)*(1E9)*(2)/(10**12) = 4.096.

thanks

~ry

04-11-2013
09:57 PM

A few points of clarification though. When you say that many instructions can be issued on each clock thus 4 insns/cycle, did you mean 1ins/clock and thus 4ins/cycle or is there actually something superscalar in nature going on here? [...] I think it is TFLOPS and not GFLOPS, (1)*(2048)*(1E9)*(2)/(10**12) = 4.096.

Oops, fixed both, thanks.

It should be "most instruction types can be issued on each clock" or just "instructions can be issued on...."

(I can see an award for ambiguity here)

04-12-2013
03:37 AM

Thank you for sharing.