2 Replies Latest reply on Jan 4, 2011 7:52 PM by diepchess

    Architecture of HD 5970

    diepchess
Designing new concepts for prime number calculations based on the given architecture

      hi,

I have searched through the documentation for the latest GPUs, since OpenCL defines quite a lot. Basically everything here:

      http://developer.amd.com/gpu/ATIStreamSDK/documentation/Pages/default.aspx

I have indexed and read the majority of it. I am very impressed by how much the documentation has improved compared with a few years ago (my last pass through it was early 2007).

Yet a lot of questions remain, especially because nearly everything that has to do with prime numbers is inherently sequential. Building a parallel framework for it is not exactly easy.

On the Nvidia side there is, for example, now a program that can do the exponent squaring for trial factorisation, while the hard work of sieving gets done on CPUs. On a 3200-core AMD card that approach would in practice eat the entire 8 GB/s PCI-e link, as it can process far more than a small Nvidia card.
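To make the "exponent squaring" concrete: in GIMPS-style trial factoring (assuming the Mersenne case here), a candidate factor q divides the Mersenne number 2^p - 1 exactly when 2^p ≡ 1 (mod q), which square-and-multiply evaluates in O(log p) modular squarings. A minimal Python sketch of the per-candidate test; the candidate generation, i.e. the sieving, is the part that stays on the CPU in that setup:

```python
def divides_mersenne(q: int, p: int) -> bool:
    """True iff candidate q divides the Mersenne number 2**p - 1.
    Python's three-argument pow() does square-and-multiply:
    O(log p) modular squarings per candidate."""
    return pow(2, p, q) == 1

# Candidate factors of 2**p - 1 (p prime) have the form 2*k*p + 1;
# sieving strikes out the composite candidates before this test runs.
p = 11
candidates = [2 * k * p + 1 for k in range(1, 10)]
print([q for q in candidates if divides_mersenne(q, p)])
# -> [23, 89]; indeed 2**11 - 1 = 2047 = 23 * 89
```

The modular exponentiation per candidate is embarrassingly parallel, which is exactly why it maps well onto a GPU while the sieve does not.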

In short, the huge progress in GPU processing power simply no longer allows simple forms of parallelism. An 8 GB/s link to a card that can handle teraflops is not really useful.

Time, therefore, to start designing a model on paper and, if it seems to work on paper, implement it. In my case we search for Wagstaff primes (Jeff Gilchrist, Paul Underwood, Tony Reix, Vincent Diepeveen; read in reverse, the DRUG team).
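For context: Wagstaff numbers are (2^p + 1)/3 for odd prime p, and the search is for exponents p that make this value prime. A toy Python illustration with naive trial division; the real search runs probable-prime tests on numbers millions of bits long:

```python
def wagstaff(p: int) -> int:
    """The Wagstaff number for odd prime exponent p."""
    return (2 ** p + 1) // 3

def is_prime(n: int) -> bool:
    """Naive trial division -- fine only for these toy sizes."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# The first Wagstaff prime exponents are 3, 5, 7, 11, 13, 17, 19, 23, ...
print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 29) if is_prime(wagstaff(p))])
# -> [3, 5, 7, 11, 13, 17, 19, 23]
# 29 drops out: (2**29 + 1) // 3 = 178956971 = 59 * 3033169
```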

A perennial problem with all that prime number software is that it combines a number of different stages which could basically be separate programs. But the bandwidth between those stages is so huge that there is, of course, a big need to be able to run different program counters. So I have a number of architecture questions.

a) Is it correct that the 5970 card (so all 5970 models) has 2 GPUs, and that each GPU's architecture has 16 SIMDs, where each SIMD consists of 20 cores of 5 stream cores each?

16 * 20 * 5 = 1600

       

b) In the manual I see that the 5 stream cores forming one compute unit execute exactly the same instruction at the same time. Is that correct?

       

c) Now the most important question: is it possible for 2 or more programs to run in the GPU at the same time, executing different instructions simultaneously? I understand that a core concept of OpenCL is workgroups. How many workgroups can execute DIFFERENT code at the same time, and how many stream cores should ideally form a workgroup, if I want the full 3200 cores busy?

I see the number 256 mentioned; for me that is not needed at all. As it looks now I would go for 4, but they HAVE to execute different code at the same time, as one workgroup generates small primes that get fed into the factoring workgroup.

The speed at which small primes get generated (say, up to 96 bits or so) is so high, and needs so few cores, that it is impossible to decouple the problem the way the CUDA project does it. 3000+ factoring cores eat more primes than any PCI-e link can deliver, even if you got them in a prefetched manner.
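The imbalance between the two stages can be sketched as a producer/consumer pair (a CPU-side Python sketch with toy sizes; on the GPU both stages would have to live on the device to avoid the PCI-e bottleneck described above). A candidate q > 3 divides the Wagstaff number (2^p + 1)/3 exactly when 2^p ≡ -1 (mod q):

```python
def small_primes(limit: int) -> list[int]:
    """Producer stage: sieve of Eratosthenes. Cheap -- a few cores saturate it."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i, is_p in enumerate(sieve) if is_p]

def factors_of_wagstaff(p: int, candidates: list[int]) -> list[int]:
    """Consumer stage: keep candidates q with 2**p == -1 (mod q), i.e. the q
    that divide the Wagstaff number (2**p + 1) // 3. One modular
    exponentiation per candidate -- this is where the 3000+ cores go."""
    return [q for q in candidates if q > 3 and pow(2, p, q) == q - 1]

# (2**29 + 1) // 3 = 178956971 = 59 * 3033169
primes = small_primes(100)
print(factors_of_wagstaff(29, primes))
# -> [59]
```

The producer is a few lines of cheap memory writes; the consumer does a full modular exponentiation per candidate, so keeping the batch hand-off on-device is what the multiple-kernel question below is really about.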

So, in short: can workgroups that execute at the same time run different code, so that the instruction stream reaching each workgroup is a different program?

I know that historically ATI could not do this, but I do not know the current status. With so many cores it is becoming hard to have the same code executing on all cores at the same time, and basically OpenCL demands this capability.

It would of course make no sense to run a program on 32 or so cores while 3100 cores sit idle waiting until the next batch is generated, and only then have all 3200 cores process that batch.

       

Meanwhile I cannot really buffer the batch in RAM either, as the full RAM is already used as a cache for other purposes and there is only a gigabyte or two per GPU.

       

It is good that the manual says you can access the entire DEVICE RAM. Very nice; I really like that.

       

       

Thanks for answering,

      Vincent

        • Architecture of HD 5970
          lrog

           

Originally posted by diepchess:

a) Is it correct that the 5970 card (so all 5970 models) has 2 GPUs, and that each GPU's architecture has 16 SIMDs, where each SIMD consists of 20 cores of 5 stream cores each?

16 * 20 * 5 = 1600

The 5970 chipset consists of 2 SIMD arrays (one per GPU), each of which consists of 20 SIMD engines (multiprocessors), each of which consists of 16 thread processors, each of which consists of 5 compute cores (4 general-purpose and 1 for special functions).
Which gives in total:

          2 (SIMD Array) * 20 (SIMD Engines) * 16 (TP) = 640 thread processors

          640 (TP) * 5 = 3200 compute cores

           

          More info + GPU block diagram:

          http://www.techpowerup.com/reviews/ASUS/EAH5970/3.html

           

b) In the manual I see that the 5 stream cores forming one compute unit execute exactly the same instruction at the same time. Is that correct?

           

AFAIK the thread processors / compute units execute one VLIW instruction per cycle; that is why writing kernel code using vectorized instructions is more effective on ATI's GPUs.

           

As for the main question c): there is a discussion here regarding running multiple kernels on 5xxx cards:

          http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=120378&enterthread=y

           

          Hope it helps a bit

            • Architecture of HD 5970
              diepchess

Wow, many thanks for your crystal clear answers, lrog!

               

So basically, if I understand correctly, in theory it would be possible to run multiple kernels on the hardware, but the software support for it is not ready yet?

               

The next question for AMD is: will it get made, and if so, will it only work for OpenCL or also for AMD CAL?

               

Obviously an expected time of completion would be nice as well, since I see from that thread that this feature request has already been around for a while (and probably from before it as well).

               

Of course the reason for this request is obvious: the GPUs have such overwhelming success at number crunching, and such potential there, that the CPUs simply cannot keep up feeding them fast enough over the PCI-e link.

               

So multiple kernels running concurrently is a very important issue for serious crunchers.

               

              Thanks, Vincent