14 Replies Latest reply on May 26, 2010 3:18 PM by LeeHowes

    Max threads per SIMD engine

    mojj

      Hi!

      What is the maximum number of threads in flight per SIMD engine on the Cypress architecture? I've been trying to find this number but can't seem to find it anywhere.

      For the 4870 I've seen the value 1024 threads / SIMD engine. Is that correct? Is it the same value for Cypress?

      Thanks!

       

       

        • Max threads per SIMD engine
          hazeman

          I think nothing has changed in this regard. But remember that thread count is limited by register usage - and in practice you can't run more than 8 real threads*.

          * Saying that a SIMD engine can run 1024 threads is like saying that a single-core x86 CPU can run 4 threads simultaneously (4 slots in SSE instructions). This is a lie fed to us by nvidia/ati for marketing purposes and has nothing to do with reality.

          Each thread in a SIMD engine can execute a 320-wide (64*5) instruction. This is exposed by the compiler as 64 "threads" with 5-wide operations.

          The same trick could be used for x86. You could write code for each slot of an SSE instruction as an independent thread and the compiler would later merge them. Or there could be a split: 2 "threads" with 2-wide operations.

          But for x86 no one would believe that this means a single core can run 4 threads simultaneously!

          The real problem is that those "threads" exposed by the compiler trick have none of the properties of a thread (independent scheduling). This leads to many problems for people starting to use GPUs (like here).

          I understand that using this approach in the compiler simplifies code development. I really think it's a good solution. But we shouldn't be fed lies. NVidia/ATI should have used some other name from the beginning - IMHO 'fiber' would be really good.

          So a SIMD can run at most 16 threads, and each thread is exposed by the compiler as 64 fibers (fake threads).
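The arithmetic behind this post can be written out as a small sketch. It only restates figures quoted in the thread (1024 advertised threads, 64 fibers per thread, 5-wide operations); the constant names are made up for illustration.

```python
# Figures quoted in the thread; the names are illustrative only.
ADVERTISED_THREADS_PER_SIMD = 1024  # value cited for the 4870 in the question
FIBERS_PER_THREAD = 64              # "threads" exposed by the compiler per real thread
VLIW_WIDTH = 5                      # operation slots per fiber

# Width of one real instruction: 64 fibers, each with 5 operation slots.
ops_per_instruction = FIBERS_PER_THREAD * VLIW_WIDTH

# Real threads hidden behind the advertised count.
real_threads = ADVERTISED_THREADS_PER_SIMD // FIBERS_PER_THREAD

print(ops_per_instruction)  # 320
print(real_threads)         # 16
```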

           

            • Max threads per SIMD engine
              cjang

              While I do agree that the GPU manufacturers characterize their products in the most favorable marketing light possible, in this case I think the use of "threads" is probably the least confusing. It's just that there needs to be some explanation about what threads mean in this context.

              (By the way, thank you for your explanation. I was completely ignorant of how application/user threads are related to native device threads. Now I am less ignorant.)

              Here is my analogy.

              Before there were threads, there were processes. Threads were often first implemented in user space in the standard C library as operating systems did not have a concept of a thread. In this case, threads were an illusion to the application to increase concurrency without a complex event loop. Only later did threads become native in kernel space. And then there was Solaris with light weight processes with a thread concept in both user and kernel space.

              So although we knew the CPU might be a uniprocessor, we also thought of multi-threading. It's just that we made a distinction between user/green threads and kernel threads.

              But I hear you. It's like Erlang being able to run millions of concurrent processes. No one should believe that there are really so many processes running at the same time. It's clear because the Erlang developers are very clear about this point as well as the design goals and performance implications.

                • Max threads per SIMD engine
                  hazeman

                   

                  Originally posted by: cjang While I do agree that the GPU manufacturers characterize their products in the most favorable marketing light possible, in this case I think the use of "threads" is probably the least confusing. It's just that there needs to be some explanation about what threads mean in this context.


                  In a way you are right. This wouldn't be a problem if we were reminded at every possible step that it's only a compiler trick. But that isn't the case.

                  And here is my reasoning for why some other name should be used. A thread has one important property: it is scheduled independently of other threads. Even software threads are scheduled independently. This isn't the case with "gpu threads". And I could link to quite a few posts where people are misled by this.

                  It's true that if you used a new word like 'fiber', people wouldn't know at first what it is. And this is a good thing. It would force them to learn what they are dealing with. For a small price (reading a few sentences of explanation) we could have much less confusion later (which can lead to heavy debugging or inefficient code).

                  Now, why do I think work-item is a bad choice? It's true that for ATI cards there is a 1-1 mapping between 'fibre' and work-item. But this doesn't have to be true for some other architecture. Even for ATI it's easy to write an OpenCL recompiler which would map 5 work-items to 1 'fibre' (and sometimes this trick is used to improve operation slot scheduling).

                   

                • Max threads per SIMD engine
                  LeeHowes

                  Slide 56:

                  http://developer.amd.com/gpu_assets/Heterogeneous_Computing_OpenCL_and_the_ATI_Radeon_HD_5870_Architecture_201003.pdf

                  Though as a rule I don't use the word threads in the way the slide suggests because it doesn't mean very much. OpenCL has a good reason for using the word "work item" and I tend to go for that these days.

                  For the entire Cypress device you have: 31744 "threads", if you count Phenom 2 as running 4 (or maybe 8 if they can dual issue SSE, which I think they can) or 496 threads if you count Phenom 2 as running 1.

                  Each dispatcher can manage 248 waves in flight (hence the Juniper numbers), with two dispatchers on the Cypress die. I don't know if there is a limit on how many waves will run on a SIMD short of register space, but there is a maximum of 8 work groups per SIMD for other reasons. Note that on Redwood and Cedar (and in realistic cases Juniper and Cypress too) the total is lower than the dispatcher can handle because it will always be register limited on the SIMDs. The actual number will of course be lower depending on what waves and work groups are allocated by the dispatcher to each SIMD.

                  The short answer is 24 waves per SIMD.
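As a rough check, the numbers in this post can be tied together in a few lines. The 248 waves per dispatcher, two dispatchers and 64 work-items per wave come from the thread; the count of 20 SIMD engines on a full Cypress is an assumption added here, and the post does not say that this is how the 24 figure was derived.

```python
WAVES_PER_DISPATCHER = 248  # from the post
DISPATCHERS = 2             # Cypress, from the post
WAVEFRONT_SIZE = 64         # work-items per wave, from the thread
SIMD_COUNT = 20             # assumption: SIMD engines on a full Cypress die

total_waves = WAVES_PER_DISPATCHER * DISPATCHERS     # waves in flight, whole device
total_work_items = total_waves * WAVEFRONT_SIZE      # the marketing "threads" count
waves_per_simd = total_waves // SIMD_COUNT           # whole waves per SIMD

print(total_waves)       # 496
print(total_work_items)  # 31744
print(waves_per_simd)    # 24
```

Under these assumptions, 496 waves spread over 20 SIMDs gives 24.8, which rounds down to the 24 waves per SIMD quoted above.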

                  Incidentally, Hazeman, I don't think your 64 fibers==16 thread numbers are any better. Each of those 16 is still not a thread in any way. 64-fibres==1 thread is more realistic. Or just use CL terms, 64 work-items == 1 thread.

                    • Max threads per SIMD engine
                      hazeman

                       

                      Originally posted by: LeeHowes Incidentally, Hazeman, I don't think your 64 fibers==16 thread numbers are any better. Each of those 16 is still not a thread in any way. 64-fibres==1 thread is more realistic. Or just use CL terms, 64 work-items == 1 thread.

                       

                      Please read more carefully. I think that 'each thread as 64 fibres' is the same as 1 thread == 64-fibres == 64-workitems. But maybe I'm wrong oO.

                        • Max threads per SIMD engine
                          LeeHowes

                          No, you're right. It was just your last sentence that I was disagreeing with. "SIMD can run max 16 threads...". The rest of your post I agree with.

                          You're right that people are easily misled by the word thread. It's a bit of a bad habit that I ever use it, and partly thanks to Microsoft's use of the term in DX11.

                          I see your point that work-items doesn't really describe the hardware but rather the programming model.

                          Having said that, describing the hardware is a challenge anyway. If we really want to get down to it an Evergreen SIMD is simultaneously executing 10 instructions in two VLIW packets from 128 SIMD elements from 2 waves over 8 cycles through a pipelined ALU. At the same time it is executing a control flow stream in the sequencer/dispatcher and also is passing data through texture units. That's a confusing amount of simultaneous and hardware interleaved threading when you really get down to it. No simple hardware description works in all cases. At the moment OpenCL is a reasonable programming model that maps to such architectures using the term work-item.

                          I don't like 'fibre' because it's already used for 'user mode cooperative thread' fairly commonly and is basically what the AMD OpenCL CPU implementation uses to execute work-items (with an OS thread for the work-group).
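The Evergreen figures in this post (10 instructions, 128 SIMD elements, 8 cycles) can be reproduced with a little arithmetic. This is only a sketch: the 16 physical lanes per SIMD is an assumption not stated in the thread, and the constant names are invented for illustration.

```python
WAVEFRONT_SIZE = 64    # work-items per wave, from the thread
VLIW_WIDTH = 5         # instruction slots per VLIW packet, from the thread
WAVES_INTERLEAVED = 2  # waves the SIMD alternates between, from the post
LANES_PER_SIMD = 16    # assumption: physical VLIW lanes in one Evergreen SIMD

# One VLIW packet per wave, five instruction slots each.
instructions_in_flight = WAVES_INTERLEAVED * VLIW_WIDTH

# Every work-item of both waves is a SIMD element.
simd_elements = WAVES_INTERLEAVED * WAVEFRONT_SIZE

# A 64-wide wave needs several passes over the physical lanes.
cycles_per_wave = WAVEFRONT_SIZE // LANES_PER_SIMD
total_cycles = WAVES_INTERLEAVED * cycles_per_wave

print(instructions_in_flight)  # 10
print(simd_elements)           # 128
print(total_cycles)            # 8
```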

                            • Max threads per SIMD engine
                              hazeman

                               

                              Originally posted by: LeeHowes I don't like 'fibre' because it's already used for 'user mode cooperative thread' fairly commonly.


                              This I have to agree with. Nonetheless, the 'gpu thread' is much more fitting to the fibre-thread analogy than 'cooperative threads'.

                              • Max threads per SIMD engine
                                ryta1203

                                 

                                Originally posted by: LeeHowes

                                Having said that, describing the hardware is a challenge anyway. If we really want to get down to it an Evergreen SIMD is simultaneously executing 10 instructions in two VLIW packets from 128 SIMD elements from 2 waves over 8 cycles through a pipelined ALU. At the same time it is executing a control flow stream in the sequencer/dispatcher and also is passing data through texture units.

                                This is the best description and matches the hardware. I really don't think it's a big deal as long as you keep this in mind.

                        • Max threads per SIMD engine
                          Raistmer
                          Why can't a wavefront be perceived as a thread?
                          From what I've read, each wavefront takes its own path independently of the others, right?
                          When a control flow statement is executed, the path is chosen for the whole wavefront.
                          If execution starts or stops, it starts or stops for the whole wavefront. So wavefront = thread looks like a good enough analogy once scheduling is taken into account. Nobody actually perceives an SSE SIMD instruction as 4 threads. Everyone knows that 4 numbers will be added (for example), not 2 added and 2 subtracted at the same time depending on some control flow statement. Also, no one considers a multi-ALU CPU to be truly capable of simultaneous multithreading. Hyperthreading is the lowest current level of threading; everything else uses 1 core = 1 simultaneous thread (as a scheduling entity), no matter that the CPU is superscalar with many ALU ops issued per clock.
                          The hardware description seems to relate more to the superscalar level of the processor than to its scheduling abilities.
                          8 wavefronts per SIMD = 8 threads per core, number of SIMDs ~ number of cores: isn't that a good enough analogy?

                            • Max threads per SIMD engine
                              LeeHowes

                              Yes, I view a wavefront as a thread.

                                • Max threads per SIMD engine
                                  hazeman

                                   

                                  Originally posted by: LeeHowes Yes, I view a wavefront as a thread.


                                  I hope someday ATI will correct the docs and this sentence will be there (OK, I don't believe it myself).

                                  By the way, I wonder if someone could file a lawsuit for false advertising based on your comment.

                                    • Max threads per SIMD engine
                                      LeeHowes

                                      Read the footnote of the post

                                      Anyway, it isn't false advertising. The word "thread" is flexible. The marketing people here and at nvidia define it as a programming construct, coming out of a hardware and compiler background I tend to view it as an execution concept (and given the varied views in this discussion being clear about what we really mean is what matters, not the term used).

                                      Remember that if you program your app to use 64 work-items this could execute as 1 wave on the higher end devices, two waves on Cedar and possibly 64 waves on the CPU (depending on the runtime implementation - in our case a thread and 64 fibres - is a fibre a thread? is a thread a process?). As the programmer, how many threads did you write?
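The point that the same 64 work-items turn into a different number of waves on each device can be sketched as a ceiling division. The wavefront size of 64 for the higher-end devices comes from the thread; 32 for Cedar is inferred from "two waves on Cedar", and a width of 1 for the CPU is just a way of modelling "possibly 64 waves". The function name is hypothetical.

```python
def waves_needed(work_items, wavefront_size):
    """Number of waves required to cover work_items (ceiling division)."""
    return (work_items + wavefront_size - 1) // wavefront_size

WORK_ITEMS = 64

high_end = waves_needed(WORK_ITEMS, 64)  # wavefront size 64, from the thread
cedar = waves_needed(WORK_ITEMS, 32)     # assumed size 32, implied by "two waves"
cpu = waves_needed(WORK_ITEMS, 1)        # modelling one work-item per "wave"

print(high_end, cedar, cpu)  # 1 2 64
```

The program is the same in all three cases; only the mapping chosen by the runtime changes.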

                                        • Max threads per SIMD engine
                                          hazeman

                                           

                                          Originally posted by: LeeHowes Anyway, it isn't false advertising. The word "thread" is flexible. The marketing people here and at nvidia define it as a programming construct, coming out of a hardware and compiler background I tend to view it as an execution concept (and given the varied views in this discussion being clear about what we really mean is what matters, not the term used).


                                          Let's assume for a moment that there is no problem with the "flexible thread" approach. So Intel can now create a compiler which exposes SSE instructions as 4 "threads". And they start to advertise 4-core CPUs as "capable of running 32 (with HT) threads simultaneously". What do you think people's reaction to that would be? I think it's a safe bet that they would end up in court (for sure in the US).

                                            • Max threads per SIMD engine
                                              LeeHowes

                                              Is the important feature of a thread that it maintains a program counter and state for each? There's no reason they can't map the SSE ALUs that way and store a vector program counter, as the GPUs (implicitly via mask stacks) do - though where the GPUs have every register stored in a vector, the x86 CPUs do not so there would have to be a single copy of state for much of the thread in the ALU-as-thread description. Thread is too vague a term if you really want to get to the bottom of what the hardware is doing in all cases.

                                              If you're working on the basis of what they could do before, then Intel doing that would be dishonest because they'd be acting as if there was a big upgrade when there wasn't, but had they tweaked the architecture a little and designed SSE in that way from the beginning then that would be far less of an issue.
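To make the execution-mask idea mentioned above more concrete, here is a toy model in Python. It is not how any real GPU or the AMD toolchain is implemented: it handles a single, non-nested if/else (real hardware uses mask stacks for nesting), and the function and variable names are invented for illustration. It shows why lanes of one wave are not independently scheduled: when lanes disagree on a branch, the whole wave steps through both paths.

```python
def run_branch_lockstep(values, condition, then_op, else_op):
    """Run an if/else over all lanes of one wave using an execution mask.

    Returns the per-lane results and the number of passes the wave made.
    """
    mask = [condition(v) for v in values]  # per-lane predicate
    result = list(values)
    passes = 0

    # Pass 1: the whole wave executes the 'then' path; masked-off lanes idle.
    if any(mask):
        result = [then_op(v) if m else r for v, m, r in zip(values, mask, result)]
        passes += 1

    # Pass 2: the whole wave executes the 'else' path with the mask inverted.
    if not all(mask):
        result = [else_op(v) if not m else r for v, m, r in zip(values, mask, result)]
        passes += 1

    return result, passes


def is_even(v):
    return v % 2 == 0

def add_ten(v):
    return v + 10

def sub_ten(v):
    return v - 10

# Lanes disagree: both paths are executed by the whole wave.
divergent_out, divergent_passes = run_branch_lockstep([1, 2, 3, 4], is_even, add_ten, sub_ten)
print(divergent_out, divergent_passes)  # [-9, 12, -7, 14] 2

# Lanes agree: only one path is executed.
uniform_out, uniform_passes = run_branch_lockstep([2, 4, 6, 8], is_even, add_ten, sub_ten)
print(uniform_out, uniform_passes)  # [12, 14, 16, 18] 1
```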