Originally posted by: LeeHowes *up to 10* The current design allows up to some high number per macro sequencer (i.e. per 10 SIMDs: 128, 256, something like that, I forget). This is saying up to 10 per SIMD, or up to 40 per CU. Another way to look at that is that each micro sequencer tracks 40 program counters. If you use too many registers then, just like now, the number you actually have state for can be lower.
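A rough sketch of the register point above, in Python. Only the cap of 10 wavefronts per SIMD and the 40 per CU (so 4 SIMDs per CU) come from the post. The function name and both of its parameters are hypothetical, and no real GCN register-file size is assumed.

```python
MAX_WAVES_PER_SIMD = 10  # from the post: up to 10 per SIMD
SIMDS_PER_CU = 4         # implied by the post: 40 per CU / 10 per SIMD


def resident_wavefronts(regs_per_simd, regs_per_wavefront):
    """Return (wavefronts per SIMD, wavefronts per CU) that fit.

    Both arguments are illustrative placeholders in arbitrary units;
    they are not figures for any actual chip.
    """
    if regs_per_simd < 0 or regs_per_wavefront <= 0:
        raise ValueError("register counts must be positive")
    # How many wavefronts' worth of register state fits on one SIMD.
    limit_from_registers = regs_per_simd // regs_per_wavefront
    # The sequencer cap applies even when registers are plentiful.
    per_simd = min(MAX_WAVES_PER_SIMD, limit_from_registers)
    return per_simd, per_simd * SIMDS_PER_CU


# Light register use: the cap of 10 per SIMD is the limit.
print(resident_wavefronts(1000, 50))
# Heavy register use: registers become the limit instead.
print(resident_wavefronts(1000, 250))
```

With these made-up numbers the first call is limited by the sequencer cap and the second by register use, which is the "can be lower" case.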
Thank you for the response. One more question: what is the minimum number of wavefronts needed to fill up the compute resources on the new chip (2, in the case of Cayman)? Or, to ask it another way, what is the latency of each instruction in cycles (8 in the case of Cayman)?
As I understood it from comments made by the guy who presented the GCN session (the first one in the parallel sessions, not the keynote speaker), GCN will require four wavefronts per CU to keep it fully occupied (not considering memory latencies, of course). The question I asked was how many more work-items I would need to feed the CU with to keep it fully occupied, and the answer was that it doubled from Cayman, so I don't think I misunderstood, but another confirmation here would be nice.
What I'm still wondering though, is how this affects global memory latencies? Basically my question is: If we feed a Cayman CU and a GCN CU with four wavefronts, will the GCN be more strangled by global memory latencies than Cayman? With Cayman only a single wavefront is actually executing at any one time, so it does have others to switch to when waiting for global memory. With GCN all four wavefronts are actually executing at the same time, and so there is nothing to switch to (other than within the wavefronts). Would this lead to us needing more wavefronts per GCN CU to hide global memory latencies than we do on Cayman? I find this interesting since needing more wavefronts per CU in practice increases pressure on both LDS and registers. The LDS has doubled so that's fine, but the registers have stayed the same size per CU.
It would be very interesting if someone from AMD could clear this up, as it matters a great deal when designing kernels how many registers I can use without being totally screwed by global memory latencies 🙂
Edit: BTW, was the move away from VLIW4 generally known before the GCN parallel session? It was actually mentioned in an earlier parallel session on the JIT compiler (a session with far fewer attendees). It wasn't given much attention, just an "oh, by the way, the next architecture is no longer VLIW". My jaw literally dropped when I saw that slide :-P.
I would be very much interested in what the DP throughput of this architecture is. It sometimes crosses my mind... "Maybe on the new 28nm somebody pulls off a native 64-bit ALU."
Or will it link 2 processors on the same SIMD to perform a DP operation, similar to Cayman? Will DP performance yet again be 1/4 of SP, or 1/2?
Well, the statement was that DP (double precision) performance was 1/2, 1/4 or 1/16 depending on product (and all GCN products will have DP support). It wasn't entirely clear to me whether they meant that DP would be a mix of 1/2 and 1/4 (like today), or 1/2 on some products and 1/4 on others. However, AnandTech's article states 1/2, but of course they could have misunderstood, I don't know.
Originally posted by: dravisher Well, the statement was that DP (double precision) performance was 1/2, 1/4 or 1/16 depending on product (and all GCN products will have DP support). It wasn't entirely clear to me whether they meant that DP would be a mix of 1/2 and 1/4 (like today), or 1/2 on some products and 1/4 on others. However, AnandTech's article states 1/2, but of course they could have misunderstood, I don't know.
What do you mean by "mix of 1/2 and 1/4 (like today)"? How is it a mix of these today? As far as I know, on VLIW4 the processors link two-by-two to perform a DP operation, and since in linked mode they cannot perform FMAD (by which GFLOPS is measured), performance is divided again, for a total of 1/4 = 1 / 2 (link) / 2 (FMAD inability). But it is not a mix of 1/2 and 1/4. Cayman has 1/4, period.
1/2 on the new architecture would ROCK, but I would be curious how it is achieved. 🙂
Meteorhead: There's some confusion on this point, but see for example table 4.14 in the AMD APP OpenCL Programming Guide 1.2d. For Cypress (it basically stays the same for Cayman, except that we have one fewer unit, as far as I know) we have the following capabilities per processing element per clock (DP in parentheses):
FMA: 4 (1)
MAD: 5 (1)
ADD: 5 (2)
MUL: 5 (1)
So the DP performance for Cypress is 1/5 for MAD and MUL, 2/5 for ADD. For Cayman the equivalent numbers are 1/4 and 1/2, QED 😛
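The ratios quoted above can be checked with a few lines of Python. This is just a sketch of the arithmetic: the dictionary repeats the Cypress figures as listed in this post (SP first, DP second), and the names `CYPRESS_OPS_PER_CLOCK` and `dp_ratios` are made up for the illustration.

```python
from fractions import Fraction

# Operations per processing element per clock as (SP, DP),
# copied from the list in the post above.
CYPRESS_OPS_PER_CLOCK = {
    "FMA": (4, 1),
    "MAD": (5, 1),
    "ADD": (5, 2),
    "MUL": (5, 1),
}


def dp_ratios(table):
    """Return the DP/SP throughput ratio for each operation."""
    return {op: Fraction(dp, sp) for op, (sp, dp) in table.items()}


ratios = dp_ratios(CYPRESS_OPS_PER_CLOCK)
for op, ratio in ratios.items():
    print(f"{op}: DP/SP = {ratio}")
```

This gives 1/5 for MAD and MUL and 2/5 for ADD, matching the figures above. FMA comes out at 1/4 only because its SP figure in the list is 4 rather than 5.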
As I understand it: 1/2 in the professional version (FirePro), 1/4 in the high-end gaming version, and 1/16 in the others.
Is the restriction likely to be artificial, as with Nvidia (imposed in the driver software)?