Meteorhead · ‎09-06-2010

Questions about upcoming tech

Hi, I have opened this topic so everyone has a place to post questions about the capabilities and properties of whichever HW and SDK releases are currently upcoming.

Meteorhead · ‎09-06-2010

My first questions would be:

- What CL_DEVICE_TYPE will be the GPU inside the upcoming Llano APUs?

I ask because I like the idea of writing applications (and seeing games) that run their regular code on the CPU, calculate physics, AI and other highly parallel parts on the IGP inside the CPU, and use the GPU solely for graphics. Since APU stands for Accelerated Processing Unit, will the GPU inside Llano be a CL_DEVICE_TYPE_ACCELERATOR? It would be wise to distinguish devices that physically share their __global memory with the host (as Llano will).

- Will either radeon 6xxx cards or the new APUs support out-of-order exec?

Out-of-order execution on GPUs is useful, although hard to harness. Inside an APU it would be most useful: with smart use of OpenCL events, one could build highly optimized game engines in which memory handling, window management, AI, physics, etc. run wickedly fast.

- How much effort would it take to have higher DP capacity and/or support for QP?

I read somewhere how Radeon cards deal with DP operations: 2 stream cores are linked inside a vector processor for the duration of the operation, and the remaining 3 are idle in the meantime. This is the reason DP capacity is 1/5 of SP. I do not know how NVIDIA implements DP, but since each CUDA core has a single INT and FP unit, I suspect there are 2 possibilities: some CUDA cores are native 64-bit while others are not, OR 32-bit INT and FP units do 64-bit operations at the cost of hidden register use. Since OpenCL is inherently able to query preferred vector widths at given precisions, and Radeon SIMD engines are inherently capable of doing 64-bit (or even 128-bit) operations with 32-bit shader processors via this linking, the question is the following: I know linking stream cores to do 64-bit operations takes up space on the die, but how much more would it take to have 4*1 for SP, 2*2 for DP, and 1*4 for QP operations? Quadruple precision might be a lot harder to implement on NVIDIA cards, which use single execution units, and AMD could win quite a few customers in the GPGPU segment by being first to support a healthy QP capacity on GPUs. The same goes for merely being able to link 2*2 stream cores to reach double the DP capacity. The Radeon 6xxx series might not, but future 28nm GPUs might have the space on the SIMD engines to do the extra linking.

nou · ‎09-06-2010

Maybe they will make a new type, CL_DEVICE_TYPE_APU.

IMHO out-of-order is just a SW implementation of the queue. Concurrent running of multiple kernels is another story.

Each 5D unit can do a MADD instruction, which is counted as two FLOPs. With DP, the units are linked two and two to perform two DP +, -, * operations. So one 5D unit can do 10 SP ops/clock and 2 DP ops/clock.
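The counting in the post above can be written out as a small calculation. This is only a sketch of that arithmetic; the function names are illustrative, and it encodes nothing beyond the lane count and the "MADD counts as two FLOPs" convention stated in the post.

```python
# Per-clock operation counts for one 5-wide (VLIW5) stream core,
# following the counting in the post above. Names are illustrative.

LANES = 5            # processing elements per stream core
FLOPS_PER_MAD = 2    # a multiply-add is counted as two FLOPs


def sp_ops_per_clock(lanes=LANES):
    # Every lane can issue one single-precision MAD per clock.
    return lanes * FLOPS_PER_MAD


def dp_ops_per_clock(lanes=LANES):
    # The transcendental lane is left out; the remaining lanes are
    # linked in pairs, and each pair performs one DP add/sub/mul.
    return (lanes - 1) // 2


sp = sp_ops_per_clock()
dp = dp_ops_per_clock()
print(sp, dp, dp / sp)  # 10 2 0.2
```

The 1/5 DP-to-SP ratio mentioned earlier in the thread falls out of these two numbers.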

bubu · ‎09-06-2010

Originally posted by: Meteorhead

- What CL_DEVICE_TYPE will be the GPU inside the upcoming Llano APUs?

I bet Llano will expose 2 OpenCL devices, one typed as CPU and the other typed as a DX11 GPU.

- Will either radeon 6xxx cards or the new APUs support out-of-order exec?

I hope, as well as DMA transfers...
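For the device type question, it may help to recall that cl_device_type is a bitfield, so a device can in principle report more than one type bit. The sketch below decodes that bitfield and picks a GPU that shares memory with the host. The bit values are the ones defined in the Khronos CL/cl.h header; the Device class and the three example devices are hypothetical stand-ins, not real query results. The host_unified_memory field mirrors the CL_DEVICE_HOST_UNIFIED_MEMORY query that OpenCL 1.1 provides for exactly this distinction.

```python
from dataclasses import dataclass

# Bit values as defined in the Khronos OpenCL header (CL/cl.h).
CL_DEVICE_TYPE_DEFAULT = 1 << 0
CL_DEVICE_TYPE_CPU = 1 << 1
CL_DEVICE_TYPE_GPU = 1 << 2
CL_DEVICE_TYPE_ACCELERATOR = 1 << 3


@dataclass
class Device:
    """Hypothetical stand-in for the result of clGetDeviceInfo calls."""
    name: str
    type_bits: int
    host_unified_memory: bool


def type_names(bits):
    # Return the names of all type bits that are set.
    table = [
        (CL_DEVICE_TYPE_CPU, "CPU"),
        (CL_DEVICE_TYPE_GPU, "GPU"),
        (CL_DEVICE_TYPE_ACCELERATOR, "ACCELERATOR"),
        (CL_DEVICE_TYPE_DEFAULT, "DEFAULT"),
    ]
    return [name for bit, name in table if bits & bit]


def pick_shared_memory_gpu(devices):
    # Prefer a GPU that reports unified memory with the host.
    for dev in devices:
        if dev.type_bits & CL_DEVICE_TYPE_GPU and dev.host_unified_memory:
            return dev
    return None


# Assumed device list for an APU system with an extra discrete card.
devices = [
    Device("cpu cores", CL_DEVICE_TYPE_CPU | CL_DEVICE_TYPE_DEFAULT, True),
    Device("integrated gpu", CL_DEVICE_TYPE_GPU, True),
    Device("discrete gpu", CL_DEVICE_TYPE_GPU, False),
]

for dev in devices:
    print(dev.name, type_names(dev.type_bits))
print("compute on:", pick_shared_memory_gpu(devices).name)
```

Under this model an application would not need a new device type to find the integrated GPU; it would filter on the GPU bit and the unified-memory flag.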

nou · ‎09-07-2010

IMHO again, the lack of DMA transfers is just a limitation of the current implementation. Even 4xxx can do DMA transfers under CAL. IIRC someone from AMD stated that they are working on it.

Meteorhead · ‎09-07-2010

The part of the ATI OpenCL Computing Guide I mentioned is quoted at the end of this post. So do I have it right that when linking is done, no MADD operations are available, so one operation cannot be counted as 2 FLOPs? The quote is misleading in some way: it says "two or four are linked... to perform a SINGLE DP operation". Shouldn't it be 1 DP FLOP when linking two, and 2 DP FLOPs when linking four?

But if the latter is true, then DP capacity could only be increased by adding MADD capability in linked Processing Element mode. QP needs a little more linking, and perhaps also the ability to deal with MADD operations.

If it is true, though, that 2 DP operations can be dealt with at once, why does OpenCL report the preferred DP vector width as 1 on the 5970?

A stream core is arranged as a five-way very long instruction word (VLIW) processor. Up to five scalar operations can be coissued in a VLIW instruction, each of which are executed on one of the corresponding five processing elements. Processing elements can execute single-precision floating point or integer operations. One of the five processing elements also can perform transcendental operations (sine, cosine, logarithm, etc.) Double-precision floating point operations are processed by connecting two or four of the processing elements (excluding the transcendental core) to perform a single double-precision operation. The stream core also contains one branch execution unit to handle branch instructions.

himanshu_gautam · ‎09-08-2010

hi all,

Nice to hear your thoughts.

meteorhead,

I confirm the bug in the document. But I hope nou has clarified the issue very well.

malcolm3141 · ‎09-08-2010

I believe this is referred to in the Optimisation Guide - a DP add or sub requires two pipes (in other words two can be scheduled in one bundle), but a DP mul or fma takes all four pipes (and hence only one can be scheduled in each bundle).
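The pipe requirements described above can be checked with a small packing model. It is a sketch only: the pipe counts per operation come from the post, while the function names and the idea of summing pipe demand per bundle are simplifying assumptions.

```python
# Model of how many double-precision operations fit into one VLIW
# bundle, using the pipe counts given in the post above.

PIPES = 4  # the x, y, z, w pipes; the transcendental pipe is left out

PIPES_NEEDED = {
    "dp_add": 2,
    "dp_sub": 2,
    "dp_mul": 4,
    "dp_fma": 4,
}


def max_per_bundle(op):
    # How many copies of one operation can be scheduled in a bundle.
    return PIPES // PIPES_NEEDED[op]


def fits_in_one_bundle(ops):
    # Assumption: operations can share a bundle as long as their
    # combined pipe demand does not exceed the four pipes.
    return sum(PIPES_NEEDED[op] for op in ops) <= PIPES


print(max_per_bundle("dp_add"))                  # 2
print(max_per_bundle("dp_fma"))                  # 1
print(fits_in_one_bundle(["dp_add", "dp_mul"]))  # False
```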

Talking of future hardware, I would love to see AMD include 32bit multipliers in each of the xyzw pipes, and I could also see them provide enough hardware between two pipes to perform at least a DP mad or even better a full precision DP fma. To be able to claim >1TFlops DP performance from a single GPU would be amazing!

Malcolm

Meteorhead · ‎09-10-2010

If I'm not mistaken, I recall AMD stating that it wishes to follow the APU approach on the Opteron front-line beside desktop solutions. It would be nice to hear some bits (or even more) of information about these products. Is the plan only to integrate the IGP into the CPU to reduce energy consumption, or will there be processors with higher SIMD capacity?

I am very much interested in every way parallel computing hardware can be neatly integrated into HPC clusters. I think all supercomputer owners (as well as those looking for HPC solutions) would welcome a way to have upgradeable HW, meaning an Opteron would include a maximum of 4 cores, and the rest of the die would be SIMD engines (and some cache). This way existing 1U racks could be reused for major upgrade in computing power.

Right now the neatest and most compact way of creating a GPU cluster would be the solutions offered by *beep*, where a 1U rackmount can hold 2 double-width GPUs. The only problem is that the half-width motherboard offered holds 2 processor slots. GPU clusters (in my opinion) don't need very powerful processors, only fast RAM access and mediocre computing power. Having 1 quad-, hexa- or octa-core processor per GPU is a waste of money and computing power.

If anyone has anything to add, or correct me at points, please do.

Meteorhead · ‎06-17-2011

Instead of opening a new topic, let me post to a previous one:

I know AMD employees will not speak about unreleased HW, so let me ask a theoretical question purely based on news, or information publicly available:

Some future GPU of AMD (most likely top Southern Islands) will feature a brand new architecture designed from scratch, having kept in mind the needs of APU integration.

http://wccftech.com/2011/06/15/amd-slides-detail-upcoming-radeon-hd-79-series-gpu-architecture/

There is one thing I do not understand. How come they advertise this architecture as another step toward GPGPU applications? I really cannot see how SIMD vector processing is "general". The VLIW architecture excelled at being the sweet spot between graphics and GPGPU. Graphics used the VLIW architecture as a vector processor, and GPGPU applications leveraged the compiler to vectorize scalar code. Having a 16-wide SIMD which 4 threads may share suggests to me that one thread gets a minimum of 4-wide SIMD. One thread simply cannot utilize a 4-wide SIMD unless its code is vectorized.

As it seems to me:

1) Say goodbye to cross-vendor OCL code. Scalar OCL code will utilize 25% of the card (35% max). Hail to HPC and scientific use, where we'll have to develop two separate sets of host- and kernel-side code.

2) Applications where vectorization cannot be done efficiently will simply fall far short of expectations on AMD HW.

The new architecture seems awesome, I really like all the new stuff packed into this and big gratz to AMD for that. However, VLIW seemed like the strength of AMD to me, and I thought that as soon as superscalar architecture, or VLIW is left behind, all that will remain is an architecturally inferior Tesla. Architecture greatly developed, superscalar design remains, but SIMD is far inferior to VLIW.

Please, someone tell me that I am wrong at some point. How will this be GPGPU?

maximmoroz · ‎06-17-2011

Meteorhead, what's the problem with new architecture? From "instruction point of view":

- Current architecture (VLIW): 16 stream cores, each contains 4 processing elements.

- New one: 4x16 stream cores, each contains 1 processing element.

You will no longer need to be frustrated about low ALU Packing number 🙂 New architecture is similar to NVidia one, but with more processing elements per compute unit (and more compute units I guess).

Looking at the pictures you linked to... I have other concerns. What if a single wavefront can be executed only on a single 16-wide SIMD? Would that mean that, to be efficient, the kernel should provide 4 or even 8 wavefronts per compute unit?

Meteorhead · ‎06-17-2011

My concern is that on Fermi there are 32 Processing Elements (CUDA cores) inside a Compute Unit, each PE has scalar FP and INT units, and one PE processes one thread only.

Cayman has 16 Processing Elements (Stream cores), all of them 4-wide VLIW. Each PE runs a single thread and can co-issue different operations down each VLIW lane!! 1MUL-2ADD-1CMP, for example. If scalar code was written, this packing was done by the compiler.

New architecture has inside one CU (I do not know if this CU is identical with OpenCL CU terminology, but I suspect not) a 16-wide SIMD, which can be utilized by different threads, or a single thread on the CU, but at most by 4 with the most granular 4 (threads) * 4 (wide SIMD), and different threads can co-issue different ALU operations down the SIMD lane, BUT inside one thread, the 4-wide SIMD must have all identical operations, e.g. 4ADD or 4MUL.

From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

The claim that the new architecture is GPGPU capable only holds if the CU is the same in terminology as the OpenCL CU, in which case there would have to be at least 96 Compute Units to give at least the same number of real processing elements (Stream Processors) as there are on a Cayman. If this is the case, and one Compute Unit looks like this, with the 16-wide SIMD merely implementing the rule that all threads inside a workgroup MUST do identical operations at all times, then it is OK. If this is true, then the 4*4-wide SIMD approach is an extension, namely that one Compute Unit can process different kernels at the same time, which would be very useful for the asynchronous thread dispatch processor. If this is true, then this architecture is really close to being black magic.

Wavefront size will still remain 64 threads, (if my speculation is true) because it will still be 4 cycles to reach a register, and with 16 ALUs (not counting the 17th scalar) the hardware will create 64-wide threadgroups to hide register latency.

If my speculation is true, then it is true that this approach is closer to NV, but it might be even more flexible, with the ability to multitask on a single Compute Unit.

maximmoroz · ‎06-17-2011

Originally posted by: Meteorhead

New architecture has inside one CU (I do not know if this CU is identical with OpenCL CU terminology, but I suspect not) a 16-wide SIMD, which can be utilized by different threads, or a single thread on the CU, but at most by 4 with the most granular 4 (threads) * 4 (wide SIMD), and different threads can co-issue different ALU operations down the SIMD lane, BUT inside one thread, the 4-wide SIMD must have all identical operations, e.g. 4ADD or 4MUL.
From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

Meteorhead, I am sure that "16-wide SIMD" doesn't mean that 4 work-items at max will be executed at any given moment at this "16-wide SIMD"... block. No way. Instead one wavefront at any given time will occupy this block entirely executing 16 workitems out of its 64.

It seems you think that VLIW4 stream core from Cayman architecture is widened to SIMD-16? Well, I think the opposite is true: This stream core is narrowed to single scalar processing element.

Meteorhead · ‎06-17-2011

Originally posted by: maximmoroz
Originally posted by: Meteorhead

New architecture has inside one CU (I do not know if this CU is identical with OpenCL CU terminology, but I suspect not) a 16-wide SIMD, which can be utilized by different threads, or a single thread on the CU, but at most by 4 with the most granular 4 (threads) * 4 (wide SIMD), and different threads can co-issue different ALU operations down the SIMD lane, BUT inside one thread, the 4-wide SIMD must have all identical operations, e.g. 4ADD or 4MUL.

From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

Meteorhead, I am sure that "16-wide SIMD" doesn't mean that 4 work-items at max will be executed at any given moment at this "16-wide SIMD"... block. No way. Instead one wavefront at any given time will occupy this block entirely executing 16 workitems out of its 64.

It seems you think that VLIW4 stream core from Cayman architecture is widened to SIMD-16? Well, I think the opposite is true: This stream core is narrowed to single scalar processing element.

There were two possibilities:

(CU != OCL_CU) && (4VLIW >> 16SIMD)

OR

(CU == OCL_CU) && (4VLIW >> 1SISD)

If the first is true, it will be a big mess. If the second is true, then it will be very capable, but there must be significantly more CUs in a GPU than there are now. (Roughly ~100)

One wavefront occupying the 16-wide SIMD is logical enough, and will most likely be the default case. But since a wavefront is ALWAYS 64 wide, even if your workgroup is only 16 threads, the thread dispatch processor will create dummy threads to fill it up to 64, with their operations masked from producing output. Therefore there would be no sense in allowing a 4*4 breakup of the SIMD array if it were not for different kernels being able to run on the same compute unit.
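The padding argument above is easy to quantify. The following sketch assumes only the fixed wavefront size of 64 stated in the post; the function names are illustrative.

```python
import math

WAVEFRONT_SIZE = 64  # fixed wavefront width stated in the post


def wavefronts_for(workgroup_size):
    # A workgroup is split into whole wavefronts, rounding up.
    return math.ceil(workgroup_size / WAVEFRONT_SIZE)


def masked_lanes(workgroup_size):
    # Dummy work-items created to fill the last wavefront.
    return wavefronts_for(workgroup_size) * WAVEFRONT_SIZE - workgroup_size


def lane_utilization(workgroup_size):
    # Fraction of lanes doing useful work.
    return workgroup_size / (wavefronts_for(workgroup_size) * WAVEFRONT_SIZE)


for size in (16, 64, 65, 256):
    print(size, wavefronts_for(size), masked_lanes(size),
          lane_utilization(size))
```

A 16-thread workgroup therefore fills one wavefront with 48 masked lanes, which is a quarter of the available lanes doing useful work.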

Can you think of any other scenario where it is useful?

maximmoroz · ‎06-17-2011

Meteorhead, sorry, I have completely lost track of the discussion.

Let me state what I got from the slides you linked to: AMD leaves VLIW architecture behind.

- Advantage: no more ALU Packing issue

- Possible disadvantage: While in the VLIW architecture 2 wavefronts are enough to hide register access and ALU latency, the new architecture MIGHT require more wavefronts (4 or even 8) to hide that latency.

LeeHowes · ‎06-17-2011

Might, however you have to remember that if you take a VLIW-5 packet and flatten it you get 5 issue slots in time instead of space. That's 5 (instruction) cycles worth of latency hiding 🙂

The architecture described in the talks had four 16-wide SIMD units per CU. It issues 4 waves over four cycles per CU - that's the same number of instructions as Cayman but laid out in time rather than space for each vector instruction.

Cayman has a 16 wide SIMD unit. The discussed architecture has a 16 wide SIMD unit. I'm not sure where the confusion is coming from?

Remember, Cayman, the discussed architecture, and Fermi are all vector architectures - there are just subtle differences in how they issue vector instructions and how wide the vectors are.

ETA: remember that a thread as far as Fermi is concerned is a 32-wide vector, as far as Cayman is concerned it is a 64-wide vector. This is slightly different from the width of the hardware SIMD unit and also different from the way the word thread is used in CUDA. For the purposes of discussion not using the word thread at all might be clearer 😉

eduardoschardong · ‎06-17-2011

Lee, I'm a bit confused too by those blocks and arrows... Can you help?

There is a diagram of the SIMDs with arrows between them but only in one direction, what do those arrows mean?

Also, it's now 10 wavefronts per SIMD and still 4 cycles per wavefront per instruction, and scalar. Was single-threaded performance on compute-bound kernels sacrificed a LOT?

Meteorhead · ‎06-18-2011

OK, I think I got it. Basically, exactly the same number of processors is found inside a CU, but instead of vectorizing scalar code in VLIW manner, all vector code is being "serialized".

The drawback is that more wavefronts are needed to keep the ALUs busy. If I'm not mistaken, computation is done in the following manner (the same amount of work is done in the given time):

Cayman:
Tick01: 00-15 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick02: 16-31 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick03: 32-47 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick04: 48-63 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick05: 00-15 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick06: 16-31 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick07: 32-47 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick08: 48-63 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick09: 00-15 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick10: 16-31 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick11: 32-47 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick12: 48-63 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick13: 00-15 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
Tick14: 16-31 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
Tick15: 32-47 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
Tick16: 48-63 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.

Southern Islands:
Tick01: 00-15 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick02: 16-31 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick03: 32-47 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick04: 48-63 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick05: 00-15 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick06: 16-31 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick07: 32-47 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick08: 48-63 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick09: 00-15 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick10: 16-31 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick11: 32-47 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick12: 48-63 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick13: 00-15 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick14: 16-31 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick15: 32-47 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick16: 48-63 threads of Wavefront 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
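The two tick tables above can be generated and compared with a short simulation. This is a sketch of the poster's mental model only, not of confirmed hardware behaviour: it assumes 16 lanes per SIMD, 64-wide wavefronts, perfectly packed 4-wide VLIW bundles on Cayman, and four SIMDs per compute unit on the new design. All function names are made up for the illustration.

```python
LANES = 16                         # lanes per SIMD
WAVE = 64                          # work-items per wavefront
TICKS_PER_INSTR = WAVE // LANES    # 4 ticks to push one wavefront through


def cayman_schedule(ticks=16, vliw_width=4):
    # One SIMD. Each tick handles a quarter of one wavefront and issues
    # a full VLIW bundle (vliw_width operations) per work-item.
    rows = []
    for t in range(ticks):
        wave = t // TICKS_PER_INSTR
        first = (t % TICKS_PER_INSTR) * LANES
        rows.append((t + 1, [wave], (first, first + LANES - 1), vliw_width))
    return rows


def southern_islands_schedule(ticks=16, simds=4):
    # Four SIMDs. Each tick handles a quarter of four wavefronts (one
    # per SIMD) and issues a single operation per work-item.
    rows = []
    for t in range(ticks):
        first = (t % TICKS_PER_INSTR) * LANES
        rows.append((t + 1, list(range(simds)),
                     (first, first + LANES - 1), 1))
    return rows


def lane_ops(rows):
    # Total operations executed across all lanes in the schedule.
    return sum(len(waves) * LANES * ops for _, waves, _, ops in rows)


print(lane_ops(cayman_schedule()))            # 1024
print(lane_ops(southern_islands_schedule()))  # 1024
```

Both schedules perform the same number of lane operations in 16 ticks. The difference is that in the Cayman model each wavefront receives its four operations in one 4-tick window, while in the Southern Islands model each wavefront receives four operations spread over all 16 ticks.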

LeeHowes · ‎06-18-2011

There is a diagram of the SIMDs with arrows between them but only in one direction, what do those arrows mean?

Not a clue 🙂

Also, it's now 10 wavefronts per SIMD and still 4 cycles per wavefront per instruction, and scalar. Was single-threaded performance on compute-bound kernels sacrificed a LOT?

*up to 10* The current design allows up to some high number per macro sequencer (i.e. per 10 SIMDs... 128, 256, something like that. I forget). This is saying up to 10 per SIMD, or up to 40 per CU. Another way to look at that is that each micro sequencer tracks 40 program counters - if you use too many registers, then just like now, the number you actually have state for can be lower.
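The interaction between the wavefront cap and register usage can be sketched as follows. The cap of 10 wavefronts per SIMD and the 4 SIMDs per CU come from the thread; the register budget is a purely hypothetical placeholder, since the thread gives no register file size, and the function names are illustrative.

```python
MAX_WAVES_PER_SIMD = 10   # "up to 10 per SIMD" from the post
SIMDS_PER_CU = 4          # four 16-wide SIMDs per CU, per the talks

# ASSUMPTION: hypothetical number of vector registers available to each
# lane of a SIMD. The thread does not state the real figure.
REGISTER_BUDGET_PER_LANE = 256


def waves_per_simd(registers_per_work_item,
                   budget=REGISTER_BUDGET_PER_LANE):
    # Each resident wavefront needs its own registers, so the number
    # with live state is limited by both the cap and the budget.
    if registers_per_work_item <= 0:
        raise ValueError("register count must be positive")
    return min(MAX_WAVES_PER_SIMD, budget // registers_per_work_item)


def waves_per_cu(registers_per_work_item):
    return waves_per_simd(registers_per_work_item) * SIMDS_PER_CU


for regs in (16, 24, 64, 128):
    print(regs, waves_per_simd(regs), waves_per_cu(regs))
```

With this placeholder budget, a kernel using 24 registers per work-item still reaches the full 40 wavefronts per CU, while one using 64 registers drops to 16.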

Lee, I am able to efficiently load Cayman compute unit with just 2 ALU-intensive wavefronts. Would it be possible in new architecture? Only if new compute unit is able to execute single wavefront at several 16-wide SIMD blocks at the same time and the ALU and register access latency is 2 cycles. I doubt it. My guess is that it would require 4 or 8 ALU-intensive wavefronts to efficiently load single compute unit in new architecture (by the way, it is similar to NVidia's 6 wavefronts).

Right. But remember that when you load a Cayman unit with 2 wavefronts you are giving each wavefront 4 instructions - so in that same time with this architecture you can issue 4 arbitrary instructions. And you'd never reach peak that way because every thread switch would leave a 40 cycle bubble in the pipeline. You'd need at least a third to cover those gaps. So yes, to fully specify the machine, assuming the same interleaving as Cayman (which I haven't asked anyone about so may or may not be the case) you would need 4x the number of wavefronts to keep it busy - but of course the arithmetic density in terms of time would go up as you spread the instructions out.

GPUs are throughput architectures. Over 24 cores Cayman tends to need a couple of hundred threads to keep it busy - you can imagine needing more with this design, but in either case you're getting no efficiency if you run single threaded scalar code anyway so the same rough programming rules apply. Think of it as nothing but a bonus.

maximmoroz · ‎06-18-2011

Lee, I see no problem with new architecture targeting large tasks 🙂

eduardoschardong · ‎06-19-2011

Originally posted by: LeeHowes *up to 10* The current design allows up to some high number per macro sequencer (i.e. per 10 SIMDs... 128, 256, something like that. I forget). This is saying up to 10 per SIMD, or up to 40 per CU. Another way to look at that is that each micro sequencer tracks 40 program counters - if you use too many registers, then just like now, the number you actually have state for can be lower.

Thank you for the response. One more: what is the minimum number of wavefronts needed to fill up the compute resources on the new chip (2 in the case of Cayman)? Or, asking another way, what's the latency of each instruction in cycles (8 in the case of Cayman)?

dravisher · ‎06-19-2011

As I understood it from comments made by the guy who presented the GCN session (the first one in the parallel sessions, not the keynote speaker), GCN will require four wavefronts per CU to keep it fully occupied (not considering memory latencies of course). The question I asked was how many more work-items I would need to feed the CU with to keep it fully occupied, and the answer was that it doubled from Cayman, so I don't think I misunderstood, but another confirmation here would be nice.

What I'm still wondering though, is how this affects global memory latencies? Basically my question is: If we feed a Cayman CU and a GCN CU with four wavefronts, will the GCN be more strangled by global memory latencies than Cayman? With Cayman only a single wavefront is actually executing at any one time, so it does have others to switch to when waiting for global memory. With GCN all four wavefronts are actually executing at the same time, and so there is nothing to switch to (other than within the wavefronts). Would this lead to us needing more wavefronts per GCN CU to hide global memory latencies than we do on Cayman? I find this interesting since needing more wavefronts per CU in practice increases pressure on both LDS and registers. The LDS has doubled so that's fine, but the registers have stayed the same size per CU.

It would be very interesting if someone from AMD could clear this up, as it matters a great deal when designing kernels how many registers I can use without being totally screwed by global memory latencies 🙂

Edit: BTW was the move away from VLIW4 generally known before the GCN parallel session? It was actually mentioned in an earlier parallel session on the JIT compiler (a session with far fewer attendees). It wasn't given much attention, just an "oh, by the way, the next architecture is no longer VLIW". My jaw literally dropped when I saw that slide :-P.

bubu · ‎06-20-2011

So you're killing the VLIW and SIMD approach and adopting a scalar SMT arch finally?

Jawed · ‎07-09-2011

Originally posted by: dravisher What I'm still wondering though, is how this affects global memory latencies? Basically my question is: If we feed a Cayman CU and a GCN CU with four wavefronts, will the GCN be more strangled by global memory latencies than Cayman? With Cayman only a single wavefront is actually executing at any one time, so it does have others to switch to when waiting for global memory. With GCN all four wavefronts are actually executing at the same time, and so there is nothing to switch to (other than within the wavefronts). Would this lead to us needing more wavefronts per GCN CU to hide global memory latencies than we do on Cayman? I find this interesting since needing more wavefronts per CU in practice increases pressure on both LDS and registers. The LDS has doubled so that's fine, but the registers have stayed the same size per CU.

GCN will no longer waste registers like the VLIW chips do. Register allocation on the current chips is terrible, hence all the complaints about register spill.

So GPRs will prove to be less of a constraint on the number of hardware threads per SIMD as the compiler won't be so profligate (fingers-crossed).

Of course if your algorithm wants to use a small number of hardware threads per SIMD due to a large workgroup size or large local memory allocation per work item, then you're stuck.

settle · ‎09-07-2011

Originally posted by: LeeHowes Might, however you have to remember that if you take a VLIW-5 packet and flatten it you get 5 issue slots in time instead of space. That's 5 (instruction) cycles worth of latency hiding 🙂

In current AMD GPUs each SIMD unit has 4 ALUs (plus possibly 1 SFU depending on the model). I still can't understand how work-items, vector types, etc. get mapped to the ALUs in AMD GPUs (and CPUs).

Does AMD APP SDK perform implicit vectorization for the GPU? How about the CPU? If not, any plans of providing it in the near future?
How are VLIW-4 (or VLIW-5) packets formed, from 4 independent operations within a single work-item or 4 independent operations among 4 contiguous work-items? What happens in a kernel that doesn't have 4 independent operations but only has one operation like an fma or mad in saxpy with scalar float--one float saxpy per work-item? Will only 1/4 of ALUs be utilized?
How are current VLIW-4 packets executed within a SIMD, using all 4 ALUs at once (issue slots in space), or using 1 of the 4 ALUs over several cycles (issue slots in time)? Or do I have that reversed?

I guess I'm looking for a simple and clear statement (I do scientific computing but don't have a formal CS background) from AMD similar to the following from "Writing Optimal OpenCL Code with Intel OpenCL SDK" in section 2.5 Benefiting from Implicit Vectorization:

"Vectorization module transforms scalar operations on adjacent work-items into an
equivalent vector operation. When vector operations already exist in the kernel source
code, they are scalarized (broken down into component operations) and re-vectored."

Thanks for your help clarifying these issues for me.

himanshu_gautam · ‎09-07-2011

Question: Does AMD APP SDK perform implicit vectorization for the GPU? How about the CPU? If not, any plans of providing it in the near future?

Answer: AMD APP SDK binds 4/5 independent instructions into a VLIW4/VLIW5 packet if it is able to find them. On the CPU, vectorization is done in similar situations.

Question: How are VLIW-4 (or VLIW-5) packets formed, from 4 independent operations within a single work-item or 4 independent operations among 4 contiguous work-items?

Answer: 4 independent instructions within a work-item.

Question: What happens in a kernel that doesn't have 4 independent operations but only has one operation like an fma or mad in saxpy with scalar float--one float saxpy per work-item? Will only 1/4 of ALUs be utilized?

Answer: Yes.

Question: How are current VLIW-4 packets executed within a SIMD, using all 4 ALUs at once (issue slots in space), or using 1 of the 4 ALUs over several cycles (issue slots in time)? Or do I have that reversed?

Answer: Instructions inside VLIW4/5 packets are executed simultaneously on a SIMD. VLIW packets cannot be created from multiple work-items. All instructions in a VLIW packet must be from the same work-item.
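The answers above imply that ALU utilization depends on how many independent operations the compiler can find inside one work-item. The sketch below uses a simple greedy list scheduler to illustrate this. It is not the AMD compiler's algorithm; it ignores loads, stores and per-lane restrictions, and all names are made up for the example.

```python
def pack_bundles(deps, width=4):
    """Greedily pack operations of ONE work-item into VLIW bundles.

    deps maps each operation name to the set of operation names whose
    results it needs. An operation may only be placed in a bundle after
    all of its inputs were produced by earlier bundles.
    """
    done = set()
    remaining = dict(deps)
    bundles = []
    while remaining:
        ready = [op for op, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("dependency cycle")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        for op in bundle:
            del remaining[op]
    return bundles


def utilization(bundles, width=4):
    # Fraction of issue slots that hold a real operation.
    if not bundles:
        return 0.0
    return sum(len(b) for b in bundles) / (width * len(bundles))


# One scalar saxpy per work-item: a single mad, nothing to pair it with.
scalar_saxpy = {"mad": set()}

# A float4 saxpy per work-item: four independent mads.
float4_saxpy = {"mad.x": set(), "mad.y": set(),
                "mad.z": set(), "mad.w": set()}

# A dependent chain: each operation needs the previous result.
chain = {"a": set(), "b": {"a"}, "c": {"b"}}

print(utilization(pack_bundles(scalar_saxpy)))  # 0.25
print(utilization(pack_bundles(float4_saxpy)))  # 1.0
print(utilization(pack_bundles(chain)))         # 0.25
```

The scalar saxpy case reproduces the "only 1/4 of ALUs" answer, and the float4 variant shows why explicit vector types help on VLIW hardware.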

settle · ‎09-08-2011

Himanshu,

Nice answers, thank you!

Meteorhead · ‎12-05-2011

Does anybody know anything official about GCN? There was supposed to be a press release on Dec. 5th in London, but there's absolutely no news from it.

It would be nice to know if all those neat numbers on the web are actually true, or just some troll publishes them saying "leaked", and then everybody copies, so they don't "fall behind".

maximmoroz · ‎06-18-2011

Originally posted by: LeeHowes Might, however you have to remember that if you take a VLIW-5 packet and flatten it you get 5 issue slots in time instead of space. That's 5 (instruction) cycles worth of latency hiding 🙂

The architecture described in the talks had four 16-wide SIMD units per CU. It issues 4 waves over four cycles per CU - that's the same number of instructions as Cayman but laid out in time rather than space for each vector instruction.
Cayman has a 16 wide SIMD unit. The discussed architecture has a 16 wide SIMD unit. I'm not sure where the confusion is coming from?

Lee, I am able to efficiently load Cayman compute unit with just 2 ALU-intensive wavefronts. Would it be possible in new architecture? Only if new compute unit is able to execute single wavefront at several 16-wide SIMD blocks at the same time and the ALU and register access latency is 2 cycles. I doubt it. My guess is that it would require 4 or 8 ALU-intensive wavefronts to efficiently load single compute unit in new architecture (by the way, it is similar to NVidia's 6 wavefronts).

Well, it is a problem only when there is a small number of wavefronts, that is, when the task is relatively small.

Meteorhead · ‎06-20-2011

I would be very much interested in what the DP throughput of this architecture is. It sometimes crosses my mind... "Maybe on the new 28nm somebody pulls off a native 64-bit ALU."

Or will it link 2 processors on the same SIMD to perform a DP operation similar to Cayman? Will DP performance be yet again 1/4 of SP, 1/2?

dravisher · ‎06-20-2011

Well the statement was that DP (double precision) performance was 1/2, 1/4 or 1/16 depending on product (and all GCN products will have DP support). It wasn't entirely clear to me whether they meant that DP would be a mix of 1/2 and 1/4 (like today), or 1/2 on some products and 1/4 on others. However, AnandTech's article states 1/2, but of course they could have misunderstood, I don't know.

Meteorhead · ‎06-20-2011

Originally posted by: dravisher Well, the statement was that DP (double precision) performance was 1/2, 1/4 or 1/16 depending on product (and all GCN products will have DP support). It wasn't entirely clear to me whether they meant that DP would be a mix of 1/2 and 1/4 (like today), or 1/2 on some products and 1/4 on others. However, AnandTech's article states 1/2, but of course they could have misunderstood; I don't know.

What do you mean by "mix of 1/2 and 1/4 (like today)"? How is it a mix of these today? As far as I know, on VLIW4 the processors link in pairs (2+2) to perform a DP operation, and since in linked mode they cannot perform FMAD (by which GFLOPS is measured), performance is halved again, for a total of 1/4 = 1 / 2 (link) / 2 (FMAD inability). But it is not a mix of 1/2 and 1/4. Cayman has 1/4, period.

1/2 on new architecture would ROCK, but I would be curious how it is achieved. 🙂

dravisher · ‎06-20-2011

Meteorhead: There's some confusion on this point, but see for example table 4.14 in the AMD APP OpenCL Programming Guide 1.2d. For Cypress (it basically stays the same for Cayman, except that there is one unit fewer, as far as I know) we have the following capabilities per processing element per clock (DP in parentheses):

FMA: 4 (1)

MAD: 5 (1)

ADD: 5 (2)

MUL: 5 (1)

So the DP performance for Cypress is 1/5 for MAD and MUL, 2/5 for ADD. For Cayman the equivalent numbers are 1/4 and 1/2, QED 😛
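The ratios above follow directly from the table; a minimal Python sketch of the arithmetic, using only the Cypress figures quoted in this post (the dictionary and function names are made up for illustration):

```python
from fractions import Fraction

# Operations per clock on Cypress as quoted above: (single precision, double precision).
CYPRESS_OPS_PER_CLOCK = {
    "FMA": (4, 1),
    "MAD": (5, 1),
    "ADD": (5, 2),
    "MUL": (5, 1),
}


def dp_to_sp_ratios(table):
    """Return the DP/SP throughput ratio for each operation as an exact fraction."""
    return {op: Fraction(dp, sp) for op, (sp, dp) in table.items()}


if __name__ == "__main__":
    for op, ratio in dp_to_sp_ratios(CYPRESS_OPS_PER_CLOCK).items():
        print(f"{op}: DP runs at {ratio} of the SP rate")
```

Using exact fractions keeps 2/5 from being rounded, so the result can be compared directly with the 1/5 and 2/5 figures in the post.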

ED1980 · ‎06-20-2011

As I understand it: 1/2 in the professional version (FirePro), 1/4 in the high-end gaming version, and 1/16 in the rest.

The restriction is likely to be artificial, as with NVIDIA (imposed in the driver software?).

nou · ‎06-20-2011

IMHO high-end chips like Cayman will have 1/2 in DP, mid-range will have 1/4, and low-end only 1/16.
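For a feel of what such a tiered split would mean in practice, here is a small Python sketch. The tier-to-rate mapping is only the speculation from this thread, and the SP peak is a made-up round number, not a real product spec.

```python
from fractions import Fraction

# DP:SP rate per product tier, as speculated in this thread (not an official figure).
DP_RATE_BY_TIER = {
    "high-end": Fraction(1, 2),
    "mid-range": Fraction(1, 4),
    "low-end": Fraction(1, 16),
}


def peak_dp_gflops(peak_sp_gflops, tier):
    """Scale a single-precision peak by the tier's assumed DP:SP rate."""
    return float(peak_sp_gflops * DP_RATE_BY_TIER[tier])


if __name__ == "__main__":
    hypothetical_sp_peak = 2000  # GFLOPS; invented value for illustration only
    for tier in DP_RATE_BY_TIER:
        print(tier, peak_dp_gflops(hypothetical_sp_peak, tier), "GFLOPS DP")
```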

Meteorhead · ‎06-20-2011

I highly doubt AMD wishes to build restrictions into their HW as NV does, for the following reason: AMD does not follow a monolithic chip design (although GF116 is a viable chip). For AMD to compete in the top gaming and HPC segments, they have to create dual-GPU solutions, and up until now there have been no dual-GPU professional cards; only gaming cards are dual-GPU. FirePro is optimized for CAD programs, and those are not optimized for multi-GPU use, so most likely there will be no multi-GPU FirePros in the future either.

If they were to build restrictions into gaming HW for the sole purpose of forcing people to buy FirePros, they would cut themselves off from the high-end HPC segment completely.

My guess goes with nou: how the chips perform in DP will be class-dependent.

laobrasuca · ‎06-20-2011

Originally posted by: Meteorhead FirePro is optimized for CAD programs, and they are not optimized for multi-GPU applications, so most likely there will be no multi-GPU FirePros in the future also.

Unless they create a new lineup to compete directly with Teslas. Since AMD wants to make a name in GPGPU, it would not be so surprising. NVIDIA may not have had all the success they expected on the software side with CUDA (pushing every big software company to make plugins using CUDA), but they certainly sell lots of Teslas for the new supercomputers out there. I don't know how many dollars that represents, but I can guess that AMD would like to compete in this segment too.

All this makes me think of the people who said things like: "VLIW is the strength of AMD, it will never disappear". Well, it seems not... I'm really interested in this new architecture and how it will improve things on the GPGPU side. I only wonder whether it will be in the HD8000 or HD9000 series (or maybe under another name, why not!)

ryta1203 · ‎06-20-2011

All this makes me think of the people who said things like: "VLIW is the strength of AMD, it will never disappear". Well, it seems not...

I don't think that's necessarily the case. What it does mean is that AMD feels there is a market in which they can compete better by moving to this new architecture (i.e. being more similar to NVIDIA). Like I said in another thread, I'd be really surprised if the first generation of these new cards can compete with the previous generation of VLIW cards on algorithms like MM, for example. My guess is that the peak is going to go down, and since the most optimized MM is getting over 90% of peak...

bubu · ‎06-20-2011

APUs are very interesting for the HPC world: they are small enough to fit in a 1U rack and they consume low power. However, without DP support the product has a big handicap.

I hope AMD can make a Fusion APU version with full DP support soon (Opteron APU?)

laobrasuca · ‎06-20-2011

I'd be really surprised if the 1st generation of these new cards can compete with a generation back of the VLIW cards when it comes to algorithms like MM

The problem is that MM is the perfectly parallel use case, but most of the algorithms we all use today are very far from this perfect fit. An architecture that is better on average than the current one will sell better. How much is an architecture worth whose peak performance is the highest but can be reached only in a very small number of cases? Sure, games are one of those cases, for now. But even there shaders have become more and more complex, and the upcoming compute pipeline of OpenGL will make shaders even more flexible and compute-friendly, closer to OpenCL in some ways. AMD's architects have seen this coming for some time now. It's time to move on.

I would be surprised if the new cards were slower than current ones for games. The new architecture will be built on a smaller process, so it will give us more fps, AMD's marketing guys will sell it as the fastest architecture ever, and gamers will be happy. Plus, it will be faster for GPGPU, compilers will fit it better, and OpenCL will get closer and closer to CUDA. Better yet, the same kernel will show a smaller performance difference between AMD and NVIDIA than today, making code development easier and more general. On top of that, we will have a price war in almost all segments. What else could people ask for!
