
maximmoroz
Journeyman III

Future HW and SDK

Meteorhead, what's the problem with the new architecture? From an "instruction point of view":

- Current architecture (VLIW): 16 stream cores, each containing 4 processing elements.

- New one: 4x16 stream cores, each containing 1 processing element.

You will no longer need to be frustrated about a low ALU Packing number 🙂 The new architecture is similar to NVidia's, but with more processing elements per compute unit (and more compute units, I guess).

Looking at the pictures you linked to... I have other concerns. What if a single wavefront can only execute on a single 16-wide SIMD? It would mean that, to be efficient, the kernel would have to provide 4 or even 8 wavefronts per compute unit.
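A back-of-the-envelope reading of that question (the four-SIMDs-per-CU figure is an assumption taken from the slides, not a confirmed spec):

    /* Speculated occupancy arithmetic for the rumored CU layout. */
    #include <stdio.h>

    int main(void) {
        const int simds_per_cu    = 4;                 /* assumed: four 16-wide SIMDs per CU */
        const int waves_to_occupy = simds_per_cu;      /* one wavefront per SIMD             */
        const int waves_to_hide   = simds_per_cu * 2;  /* double-buffer each SIMD            */
        printf("wavefronts just to occupy every SIMD: %d\n", waves_to_occupy);
        printf("wavefronts to also hide latency:      %d\n", waves_to_hide);
        return 0;
    }

which is where the "4 or even 8" figure would come from.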

Meteorhead
Challenger

Future HW and SDK

My concern is that on Fermi there are 32 Processing Elements (CUDA cores) inside a Compute Unit; each PE has scalar FP and INT units, and one PE processes one thread only.

Cayman has 16 Processing Elements (Stream cores), each of them 4-wide VLIW. Each PE runs a single thread and can co-issue different operations down each VLIW lane!! 1 MUL + 2 ADD + 1 CMP, for example. If scalar code was written, this packing was done by the compiler.

The new architecture has, inside one CU (I do not know if this CU is identical to the OpenCL CU terminology, but I suspect not), a 16-wide SIMD which can be utilized by different threads, or by a single thread on the CU, but by at most 4 threads at the most granular 4 (threads) * 4 (wide SIMD) split. Different threads can co-issue different ALU operations down the SIMD lanes, BUT inside one thread the 4-wide SIMD must have all identical operations: 4 ADD or 4 MUL, for example.

From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.
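To make the packing point concrete, here is a minimal OpenCL C sketch (hypothetical kernels, invented for illustration): on Cayman the compiler can pack neighbouring scalar instructions into VLIW slots, while under the 4*4 reading above only the float4 version would keep all 4 lanes of a thread busy.

    // Hypothetical kernels for illustration only.
    __kernel void scale_scalar(__global float *x, float a) {
        size_t i = get_global_id(0);
        x[i] = a * x[i];   // one scalar MUL per work-item
    }

    __kernel void scale_vec4(__global float4 *x, float a) {
        size_t i = get_global_id(0);
        x[i] = a * x[i];   // one 4-wide MUL: 4 identical operations per work-item
    }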

This contradiction of the new architecture being GPGPU capable only disappears if this CU is the same as the OpenCL CU in terminology; then there would have to be at least 96 Compute Units to result in at least the same number of real processing elements (Stream Processors) as there are on a Cayman. If this is the case, and one Compute Unit looks like this, with the 16-wide SIMD being only the implementation of the rule that all threads inside a workgroup MUST do identical operations at all times, then it is OK. If this is true, then the 4*4-wide SIMD approach is an extension, namely that one Compute Unit can process different kernels at the same time, which would be very useful for the asynchronous thread dispatch processor. If this is true, then this architecture is really close to being black magic.

Wavefront size will still remain 64 threads (if my speculation is true), because it will still take 4 cycles to reach a register, and with 16 ALUs (not counting the 17th scalar unit) the hardware will create 64-wide thread groups to hide register latency.
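Both counts check out with simple arithmetic (Cayman's 24 CUs and 1536 stream processors are public figures; the rest is the speculation above):

    /* Sanity check of the CU-count and wavefront-size claims. */
    #include <stdio.h>

    int main(void) {
        /* Cayman: 24 compute units * 16 stream cores * 4 VLIW lanes */
        const int cayman_pes = 24 * 16 * 4;                           /* 1536 */
        /* Scalar CUs with one 16-wide SIMD needed to match that count */
        printf("CUs to match Cayman's PE count: %d\n", cayman_pes / 16);  /* 96 */
        /* 16 lanes * 4 cycles of register latency */
        printf("implied wavefront width: %d\n", 16 * 4);                  /* 64 */
        return 0;
    }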

If my speculation is true, then this approach is indeed closer to NV's, but it might be even more flexible, with the ability to multitask on a single Compute Unit.

maximmoroz
Journeyman III

Future HW and SDK

Originally posted by: Meteorhead

The new architecture has, inside one CU (I do not know if this CU is identical to the OpenCL CU terminology, but I suspect not), a 16-wide SIMD which can be utilized by different threads, or by a single thread on the CU, but by at most 4 threads at the most granular 4 (threads) * 4 (wide SIMD) split. Different threads can co-issue different ALU operations down the SIMD lanes, BUT inside one thread the 4-wide SIMD must have all identical operations: 4 ADD or 4 MUL, for example.

From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

Meteorhead, I am sure that "16-wide SIMD" doesn't mean that at most 4 work-items will be executed at any given moment on this "16-wide SIMD" block. No way. Instead, one wavefront at any given time will occupy this block entirely, executing 16 work-items out of its 64.

It seems you think that the VLIW4 stream core from the Cayman architecture is widened to SIMD-16? Well, I think the opposite is true: this stream core is narrowed to a single scalar processing element.

Meteorhead
Challenger

Future HW and SDK

Originally posted by: maximmoroz

Originally posted by: Meteorhead

The new architecture has, inside one CU (I do not know if this CU is identical to the OpenCL CU terminology, but I suspect not), a 16-wide SIMD which can be utilized by different threads, or by a single thread on the CU, but by at most 4 threads at the most granular 4 (threads) * 4 (wide SIMD) split. Different threads can co-issue different ALU operations down the SIMD lanes, BUT inside one thread the 4-wide SIMD must have all identical operations: 4 ADD or 4 MUL, for example.

From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

Meteorhead, I am sure that "16-wide SIMD" doesn't mean that at most 4 work-items will be executed at any given moment on this "16-wide SIMD" block. No way. Instead, one wavefront at any given time will occupy this block entirely, executing 16 work-items out of its 64.

It seems you think that the VLIW4 stream core from the Cayman architecture is widened to SIMD-16? Well, I think the opposite is true: this stream core is narrowed to a single scalar processing element.
There were two possibilities:

(CU != OCL_CU) && (4VLIW >> 16SIMD)

OR

(CU == OCL_CU) && (4VLIW >> 1SISD)

If the first is true, it will be a big mess. If the second is true, then it will be very capable, but there must be significantly more CUs in a GPU than there are now (roughly ~100).

One wavefront occupying the 16-wide SIMD is logical enough, and will most likely be the default case. But since a wavefront is ALWAYS 64 wide, even if your workgroup is only 16 threads, the thread dispatch processor will create dummy threads for you, filling it up to 64, with their operations masked from producing output. Therefore there would be no sense in allowing a 4*4 breakup of the SIMD array if it were not for different kernels being able to run on the same compute unit.
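For illustration, a minimal OpenCL C sketch of the usual idiom (hypothetical kernel; the hardware-side padding to 64 is exactly the masking described above):

    // Launched with a local size of 16, the dispatcher still schedules a
    // full 64-wide wavefront; the padding lanes are masked from writing.
    __kernel void small_group(__global float *out, uint n) {
        size_t i = get_global_id(0);
        if (i < n)                  // real work-items write; padding lanes do not
            out[i] = (float)i;
    }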

Can you think of any other scenario where it is useful?

maximmoroz
Journeyman III

Future HW and SDK

Meteorhead, sorry, I have completely lost track of the discussion.

Let me state what I got from the slides you linked to: AMD is leaving the VLIW architecture behind.

- Advantage: no more ALU Packing issue

- Possible disadvantage: while in the VLIW architecture 2 wavefronts are enough to hide register access and ALU latency, the new architecture MIGHT require more wavefronts (4 or even 8) to hide that latency.

LeeHowes
Staff

Future HW and SDK

It might; however, you have to remember that if you take a VLIW-5 packet and flatten it, you get 5 issue slots in time instead of space. That's 5 (instruction) cycles' worth of latency hiding 🙂

The architecture described in the talks had four 16-wide SIMD units per CU. It issues 4 waves over four cycles per CU - that's the same number of instructions as Cayman but laid out in time rather than space for each vector instruction.
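A rough sketch of that equivalence (my numbers, inferred from this thread rather than any spec):

    /* Lane-operations issued per cycle by one CU, space vs. time layout. */
    #include <stdio.h>

    int main(void) {
        const int lanes  = 16;           /* SIMD width in both designs          */
        const int cayman = lanes * 4;    /* 1 wavefront, 4 VLIW ops per thread  */
        const int new_cu = 4 * lanes;    /* 4 SIMDs, 1 scalar op per wavefront  */
        printf("Cayman: %d lane-ops/cycle (laid out in space)\n", cayman);
        printf("New CU: %d lane-ops/cycle (laid out in time)\n", new_cu);
        return 0;
    }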

Cayman has a 16-wide SIMD unit. The discussed architecture has a 16-wide SIMD unit. I'm not sure where the confusion is coming from?

Remember, Cayman, the discussed architecture, and Fermi are all vector architectures - there are just subtle differences in how they issue vector instructions and how wide the vectors are.

ETA: remember that a thread as far as Fermi is concerned is a 32-wide vector, as far as Cayman is concerned it is a 64-wide vector. This is slightly different from the width of the hardware SIMD unit and also different from the way the word thread is used in CUDA. For the purposes of discussion not using the word thread at all might be clearer 😉

eduardoschardong
Journeyman III

Future HW and SDK

Lee, I'm a bit confused too by those blocks and arrows... Can you help?

There is a diagram of the SIMDs with arrows between them, but only in one direction; what do those arrows mean?

Also, it's now 10 wavefronts per SIMD, still 4 cycles per wavefront per instruction, and scalar execution: was single-threaded performance on compute-bound kernels sacrificed a LOT?

Meteorhead
Challenger

Future HW and SDK

OK, I think I got it. Basically exactly the same number of processors is found inside a CU, but instead of vectorizing scalar code in a VLIW manner, all vector code is being "serialized".

The drawback is that more wavefronts are needed to keep the ALU busy. If I'm not mistaken, computation is done in the following manner (same amount of work done in the same time):

 

Cayman:

Tick01: 00-15 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick02: 16-31 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick03: 32-47 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick04: 48-63 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
Tick05: 00-15 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick06: 16-31 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick07: 32-47 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick08: 48-63 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
Tick09: 00-15 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick10: 16-31 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick11: 32-47 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick12: 48-63 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
Tick13: 00-15 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
Tick14: 16-31 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
Tick15: 32-47 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
Tick16: 48-63 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.

Southern Islands:

Tick01: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick02: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick03: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick04: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick05: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick06: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick07: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick08: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick09: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick10: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick11: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick12: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick13: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick14: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick15: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
Tick16: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
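A tiny C sketch that prints both schedules, in case the tick lists above are easier to read generated (my own illustration; the wavefront-to-SIMD assignment is speculative):

    /* Prints the two speculated issue schedules.                        */
    /* Purely illustrative; the SIMD assignment is guesswork, not spec.  */
    #include <stdio.h>

    int main(void) {
        for (int tick = 0; tick < 16; ++tick) {
            int wave    = tick / 4;   /* Cayman rotates wavefronts every 4 ticks */
            int quarter = tick % 4;   /* which 16 of the 64 threads issue now    */
            printf("Cayman Tick%02d: threads %02d-%02d of Wavefront %d, 4 VLIW ops\n",
                   tick + 1, quarter * 16, quarter * 16 + 15, wave);
        }
        for (int tick = 0; tick < 16; ++tick) {
            int quarter = tick % 4;   /* all 4 SIMDs advance in lockstep quarters */
            printf("New CU Tick%02d: threads %02d-%02d of Wavefronts 0-3, 1 op each\n",
                   tick + 1, quarter * 16, quarter * 16 + 15);
        }
        return 0;
    }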

maximmoroz
Journeyman III

Future HW and SDK

Originally posted by: LeeHowes

It might; however, you have to remember that if you take a VLIW-5 packet and flatten it, you get 5 issue slots in time instead of space. That's 5 (instruction) cycles' worth of latency hiding 🙂

The architecture described in the talks had four 16-wide SIMD units per CU. It issues 4 waves over four cycles per CU - that's the same number of instructions as Cayman but laid out in time rather than space for each vector instruction.

Cayman has a 16-wide SIMD unit. The discussed architecture has a 16-wide SIMD unit. I'm not sure where the confusion is coming from?

Lee, I am able to efficiently load a Cayman compute unit with just 2 ALU-intensive wavefronts. Would that be possible in the new architecture? Only if the new compute unit is able to execute a single wavefront on several 16-wide SIMD blocks at the same time and the ALU and register access latency is 2 cycles. I doubt it. My guess is that it would require 4 or 8 ALU-intensive wavefronts to efficiently load a single compute unit in the new architecture (by the way, that is similar to NVidia's 6 wavefronts).

Well, it is a problem only when there is a small number of wavefronts, that is, when the task is relatively small.

LeeHowes
Staff
Staff

Future HW and SDK

There is a diagram of the SIMDs with arrows between them, but only in one direction; what do those arrows mean?


Not a clue 🙂

Also, it's now 10 wavefronts per SIMD, still 4 cycles per wavefront per instruction, and scalar execution: was single-threaded performance on compute-bound kernels sacrificed a LOT?


*up to 10*. The current design allows up to some high number per macro sequencer (i.e. per 10 SIMDs... 128, 256, something like that. I forget). This is saying up to 10 per SIMD, or up to 40 per CU. Another way to look at it is that each micro sequencer tracks 40 program counters; just like now, if you use too many registers, the number of wavefronts you actually have state for can be lower.
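The per-CU arithmetic, plus a made-up register-pressure example (the register-file numbers below are invented to show the effect Lee describes, not real figures):

    /* Wavefront slots per CU, and how register pressure can shrink them. */
    #include <stdio.h>

    int main(void) {
        const int simds_per_cu   = 4;
        const int waves_per_simd = 10;
        printf("max wavefront slots per CU: %d\n", simds_per_cu * waves_per_simd);

        /* Hypothetical: 256 registers per lane, shared among resident waves */
        const int regs_per_lane = 256;
        const int regs_per_wave = 40;    /* made-up kernel register usage     */
        printf("waves limited by registers: %d per SIMD\n",
               regs_per_lane / regs_per_wave);   /* 6, below the cap of 10    */
        return 0;
    }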


Lee, I am able to efficiently load a Cayman compute unit with just 2 ALU-intensive wavefronts. Would that be possible in the new architecture? Only if the new compute unit is able to execute a single wavefront on several 16-wide SIMD blocks at the same time and the ALU and register access latency is 2 cycles. I doubt it. My guess is that it would require 4 or 8 ALU-intensive wavefronts to efficiently load a single compute unit in the new architecture (by the way, that is similar to NVidia's 6 wavefronts).


Right. But remember that when you load a Cayman unit with 2 wavefronts you are giving each wavefront 4 instructions - so in that same time with this architecture you can issue 4 arbitrary instructions. And you'd never reach peak that way, because every thread switch would leave a 40-cycle bubble in the pipeline. You'd need at least a third wavefront to cover those gaps. So yes, to fully occupy the machine, assuming the same interleaving as Cayman (which I haven't asked anyone about, so may or may not be the case), you would need 4x the number of wavefronts to keep it busy - but of course the arithmetic density in terms of time would go up as you spread the instructions out.
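The 4x figure in numbers (a sketch of the scaling argument only; the interleaving is, as Lee says, unconfirmed):

    /* Scaling Cayman's 2-wavefront load to the flattened design. */
    #include <stdio.h>

    int main(void) {
        const int cayman_waves = 2;  /* enough to load an ALU-bound Cayman CU */
        const int vliw_width   = 4;  /* issue slots now laid out in time      */
        printf("wavefronts needed on the new CU: ~%d\n",
               cayman_waves * vliw_width);   /* 4x as many, i.e. 8 */
        return 0;
    }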

GPUs are throughput architectures. Over its 24 cores, Cayman tends to need a couple of hundred threads to keep it busy - you can imagine needing more with this design, but in either case you're getting no efficiency if you run single-threaded scalar code anyway, so the same rough programming rules apply. Think of it as nothing but a bonus.
