9 Replies Latest reply on Jan 15, 2011 7:12 AM by himanshu.gautam

    Cacheline size RAM, Latency of RAM and speed of L2 from 5970

      Cycles, Latency, Blocked-Reads and the obstacles to avoid to use the full raw processing power


      Suppose I have one big array of a gigabyte that all cores must somehow read from regularly. I can arrange it in such a manner that not all cores read from the same cacheline. What is the ideal layout for an array to be read at full bandwidth? How many bytes should I keep 'in between' consecutive reads from the 1 gigabyte of RAM?

      What bandwidth can I achieve in this manner?

      What is the cacheline size that the L2 uses?

      Can cores align themselves onto the start of a cacheline?

      How many bytes must I add 'in between' to get full-bandwidth read speed from the RAM?

      How many cycles does it take to get something out of the RAM with a blocked read if all cores read from the RAM at the same time, in the above manner (we assume a full L2 miss in all cases, of course)?

      Secondly, caching. With luck, reads will regularly come out of L2. How many cycles is the full read latency out of L2?


      An "L1 cache" question, if there is one, which I do not assume for now (but then I didn't know it had an L2 read cache either, which I see in the diagram). If I do a memory read, again with all cores, is it possible that the L1 happens to have the data, or some register file (I saw there are plenty of registers), so that the L2 latency is hidden? My current assumption is that when I read from RAM, only the L2 protects me from suffering the full latency. Is that correct?


      Now the important question:

      Let's suppose 3199 cores are lucky enough to get their data out of L2, as L2 still had the cacheline, and 1 core needs to get it out of the RAM. Do all those 3199 cores need to wait for core 3200, for half a century, or can there be differences between cores in the execution speed of the instruction stream?

      In summary: do all cores need to wait for core 3200 when they get their data from L2, or can they continue their job (running the SAME thread, of course)?


      Oh, by the way, I'm making big progress on paper. The model for how to get things done is already there, but the above details matter a lot for what type of code gets written.


      Thanks in advance for answering any of the above questions,


        • Cacheline size RAM, Latency of RAM and speed of L2 from 5970


          Let me try answering the last question.

          The answer is no. The 3199 cores do not wait for the last core in that case. The work-items in a GPU are executed independently at work-group level. A wavefront, which is generally 64 work-items, executes together, and only that wavefront may have to wait until its global memory access completes. Other compute units are not affected by what is happening inside this compute unit.

          Moreover, global access stalls are handled by scheduling many wavefronts on a compute unit, so global access is not very problematic as long as the ALU:Fetch ratio is high enough.
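To make the ALU:Fetch argument above concrete, here is a minimal back-of-the-envelope sketch. It assumes a deliberately simplified model (not an AMD formula): while one wavefront waits for memory, every other resident wavefront contributes a fixed amount of ALU work before it stalls as well. The function name and the cycle counts in the usage note are hypothetical and only illustrate the trend.

```python
import math


def wavefronts_to_hide_latency(fetch_latency_cycles, alu_cycles_per_fetch):
    """Estimate how many wavefronts a compute unit needs to keep its ALUs busy.

    Simplified model (an assumption, not vendor data): one wavefront is
    stalled for fetch_latency_cycles, and each additional wavefront covers
    alu_cycles_per_fetch of that wait with useful ALU work.
    """
    if alu_cycles_per_fetch <= 0:
        raise ValueError("alu_cycles_per_fetch must be positive")
    return 1 + math.ceil(fetch_latency_cycles / alu_cycles_per_fetch)
```

With a hypothetical 400-cycle fetch, `wavefronts_to_hide_latency(400, 100)` gives 5 while `wavefronts_to_hide_latency(400, 40)` gives 11: the lower the ALU:Fetch ratio, the more wavefronts are needed before global access stops being a bottleneck.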

            • Cacheline size RAM, Latency of RAM and speed of L2 from 5970

              Many thanks for your quick answer!


              Do I understand correctly that at most 5 stream cores stall waiting for the RAM, and that the other 3200 - 5 = 3195 will continue their work?


              Now of course it is interesting to know when the worst case of that ALU:Fetch ratio happens.


              With prime numbers you can of course expect a distribution where a few small primes will in general trigger more often than the bigger small primes. So do I understand correctly that if I make sure the load balancing of the work is reasonably good, with some hope that the next core to possibly block is not the same core again, then it will all go OK?


              In short, my job for the load balancing is to make sure that a possible expected worst case hits a different 5-stream-core compute unit each time, and then it will go OK.

              Is that correct?

              If so, how many instructions at most can a compute unit fall behind?



            • Cacheline size RAM, Latency of RAM and speed of L2 from 5970
              The way you are thinking about this is incorrect. You need to think in wavefronts, i.e. hardware threads, not in streaming cores. The reason is this: a single wavefront executes on a single SIMD in parallel with other wavefronts on that SIMD, and unless you are using barriers or local atomics, there is no synchronization between wavefronts on a SIMD. If you are not using global atomics, then there is no synchronization between the memory accesses of wavefronts on different SIMDs. A maximum of 1 wavefront per SIMD can run on the device in any given cycle, but multiple wavefronts can run in parallel on a SIMD, alternating execution every 4 cycles. Each wavefront can hold between 16 and 64 software threads (work-items), and each work-item can execute 4 or 5 instructions in parallel depending on the hardware. To fully load the chip for a memory-bound application you want 6-8 wavefronts per SIMD; for an ALU-bound application, it is 3-4.

              So if you think in wavefronts, then deciding where the dependencies are is easy.
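As a rough illustration of what those wavefront counts mean in work-items, the sketch below turns "wavefronts per SIMD" into a total number of work-items to launch. The SIMD count (20 per GPU) and the dual-GPU layout of the 5970 are assumptions taken from the card's published specifications, not from this thread, and the function name is hypothetical. Note that the two GPUs are normally driven as separate devices, so the per-GPU figure is usually the relevant one.

```python
WAVEFRONT_SIZE = 64   # work-items per wavefront, as stated in the thread
SIMDS_PER_GPU = 20    # assumption: one Cypress GPU has 20 SIMDs
GPUS_ON_CARD = 2      # assumption: the HD 5970 carries two such GPUs


def work_items_needed(wavefronts_per_simd, simds=SIMDS_PER_GPU):
    """Work-items required to keep `wavefronts_per_simd` resident on every SIMD."""
    return wavefronts_per_simd * WAVEFRONT_SIZE * simds


if __name__ == "__main__":
    for label, wavefronts in (("ALU-bound", 3), ("ALU-bound", 4),
                              ("memory-bound", 6), ("memory-bound", 8)):
        per_gpu = work_items_needed(wavefronts)
        per_card = work_items_needed(wavefronts, SIMDS_PER_GPU * GPUS_ON_CARD)
        print(f"{label:12s} {wavefronts} wavefronts/SIMD: "
              f"{per_gpu} work-items per GPU, {per_card} per card")
```

Register and local-memory usage of the kernel can lower the number of wavefronts that actually fit on a SIMD, so these figures are only the launch sizes needed, not a guarantee of occupancy.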
                • Cacheline size RAM, Latency of RAM and speed of L2 from 5970


                  Thanks for your explanation and indicating the correct terminology.

                  Of course I don't want to use anything that slows things down massively (barriers, atomics). Maximum throughput is what matters here.

                  This is not childish software; it competes with the most highly optimized low-level code that the best coders on planet Earth could achieve on x64 CPUs.

                  Most likely my parallel framework will be used by others for further optimizations of anything that has to do with sieving and trial factoring.

                  What I hadn't realized is that AMD allows 32 x 32 multiplications with both the most significant and the least significant bits available, so you bet this is just the start of the project.

                  This alone will be a massive blow to Nvidia: to sieve in the 90-bit range they need 4 integers of 24 bits, whereas AMD can make do with 3 integers, and they're using it to move from 90 to 91 bits now.
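The limb arithmetic behind that claim can be sketched as follows. This is only an illustration in Python, not GPU code: `mul_lo` and `mul_hi` emulate the low and high 32 bits of a 32 x 32 product (OpenCL C exposes the high half as the built-in `mul_hi`), and the helper names `to_limbs`, `from_limbs` and `mul_by_word` are hypothetical. It shows that a 90-bit value fits in 3 limbs of 32 bits but needs 4 limbs of 24 bits, and how the high half carries into the next limb.

```python
MASK32 = 0xFFFFFFFF


def mul_lo(a, b):
    """Low 32 bits of the product of two 32-bit values."""
    return (a * b) & MASK32


def mul_hi(a, b):
    """High 32 bits of the product of two 32-bit values."""
    return (a * b) >> 32


def to_limbs(n, limb_bits, count):
    """Split n into `count` little-endian limbs of `limb_bits` bits each."""
    mask = (1 << limb_bits) - 1
    return [(n >> (i * limb_bits)) & mask for i in range(count)]


def from_limbs(limbs, limb_bits):
    """Reassemble a little-endian limb list into one integer."""
    return sum(limb << (i * limb_bits) for i, limb in enumerate(limbs))


def mul_by_word(limbs, w):
    """Multiply a little-endian list of 32-bit limbs by a 32-bit word."""
    out, carry = [], 0
    for limb in limbs:
        lo = mul_lo(limb, w) + carry          # may overflow 32 bits by one bit
        out.append(lo & MASK32)
        carry = mul_hi(limb, w) + (lo >> 32)  # high half feeds the next limb
    out.append(carry)
    return out
```

For example, `to_limbs((1 << 90) - 12345, 32, 3)` holds the whole value, while three 24-bit limbs cover only 72 bits and a fourth is required.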

                  (For your information: a single-core AMD 2.3 GHz Barcelona needs roughly 25 minutes to trial factor to 61 bits, and each additional bit is roughly exponentially slower; 62 bits = +1 hour single core.)

                  This multiplication also makes it possible to write a fast FFT later on. Not to be confused with the type of FFTs you see in the math libraries from AMD, Intel and Nvidia; that's kindergarten. Dedicated FFTs such as the DWT are up to a factor of 8 faster (on x64) than default FFTs, and the default FFTs lose too much precision and would give incorrect results (prime number work is very sensitive to round-off errors).

                  So I am trying to tackle the simplest problem first: trial factoring inside the CPU. After that comes the factorisation step, which will happen inside the GPU, and after that comes the DWT.

                  It is very important to get numbers attached to operations such as RAM accesses and the speed of the caches. That isn't very secret, I hope, as the few competitors will figure this out anyway, whereas for coders writing low-level code it is very important to know all this!

                  So I'm very happy you mention that it takes 4 cycles to execute the next wavefront. Realize that the code is really going to be merciless towards the RAM.

                  It will flip bits there with all threads at the same time. And yes, I know that means that roughly once in every 200 billion operations on the RAM there will be a write error, possibly leaving a bit somewhere not set to 0 that should have been set to 0. Throughput matters; an overhead of 1/x billionth is not a problem.

                  This will run until the RAM buffers are full. I am already trying to order the GPU with the maximum amount of RAM for this. I saw a 5870 with 2GB somewhere. I hope the 68xx series will have that as well.

                  Directly after the pass that generates small primes, I intend to trigger the next wavefront. Knowing it is just 4 cycles to switch is great; it allows smaller buffers and a bigger prime base in the RAM.

                  So telling me the time it takes to context switch to the next wavefront is really important, my great thanks for that!

                  Yet what I still need to know is the latency to the RAM and from the L2 caches when all cores are busy (I deliberately avoid the word wavefronts here, as I guess a wavefront can also use less than 100% of the computing resources). So basically, with the chip running at maximum power, *then* you want to know the latencies of everything.

                  This was very useful info from your side; I hope to start coding something soon. I suppose I'll probably need to write a RAM test that models how the software works.



                • Cacheline size RAM, Latency of RAM and speed of L2 from 5970
                  Please refer to our programming guide about our hardware; there are still some misconceptions. Also, to clear up something you misunderstood from my previous post: the context switch is not 4 cycles. Two wavefronts execute in parallel on a single SIMD, with 1/4 of a wavefront executing every cycle. It takes 4 cycles for all work-items in a wavefront to execute the ALU bundle, and then the odd wavefront executes its ALU bundle. Context switching latency itself is a different thing, and the number of cycles it takes depends on the device.
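The issue pattern described in that reply can be written out as a small schedule. This is only a toy model built from the numbers given in the reply (64 work-items per wavefront, a quarter of a wavefront per cycle, two wavefronts alternating); the function name and the "even"/"odd" labels are illustrative, and real hardware has more going on than this.

```python
WAVEFRONT_SIZE = 64    # work-items per wavefront
LANES_PER_CYCLE = 16   # a quarter of the wavefront issues each cycle


def issue_schedule(num_bundles):
    """List (cycle, wavefront, bundle, first_item, last_item) tuples.

    Toy model: the even wavefront needs 4 cycles to push all 64 work-items
    through one ALU bundle, then the odd wavefront does the same, and so on.
    """
    schedule = []
    cycle = 0
    for bundle in range(num_bundles):
        for wavefront in ("even", "odd"):
            for quarter in range(WAVEFRONT_SIZE // LANES_PER_CYCLE):
                first = quarter * LANES_PER_CYCLE
                last = first + LANES_PER_CYCLE - 1
                schedule.append((cycle, wavefront, bundle, first, last))
                cycle += 1
    return schedule


if __name__ == "__main__":
    for entry in issue_schedule(1):
        print("cycle %d: %s wavefront, bundle %d, work-items %d-%d" % entry)
```

In this model the 4 cycles are simply the time one ALU bundle occupies the SIMD for a whole wavefront; they say nothing about how long a switch to a different, waiting wavefront takes.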
                    • Cacheline size RAM, Latency of RAM and speed of L2 from 5970

                      Thank you for your answer. Which manual and page number do you refer to?

                      I ask for the exact latency descriptions. Can we publicly download those?


                      What I downloaded mostly comes from this page: http://developer.amd.com/gpu/ATIStreamSDK/documentation/Pages/default.aspx


                      I don't see specific manuals on that page describing Cypress or Cayman with the latencies of each item. That is very common for CPUs and for the HPC chips used in supercomputers, and completely crucial for low-level programming on them. Both are interesting chips (obviously Cayman will have improved a tad here and there). IBM especially documents its supercomputer chips very well; you can easily download it all, with the latencies given in every detail.


                      It is very important to have rough estimates of the latencies in order to write the model. If you had to test every detail yourself you would be busy for years; test programs take a lot of time to produce.


                      Many thanks for answering, your quick responses are much appreciated,


                          • Cacheline size RAM, Latency of RAM and speed of L2 from 5970


                            Originally posted by: himanshu.gautam



                            The word L2 occurs 25 times in the manual. Not once is the latency of the L2 mentioned, nor that of the RAM. Now of course there is just 512 KB of L2 in total for the 5000 series, which makes it even more crucial to know how it works.


                            Note that the card I actually own, since a few hours ago, is a 6970, which is what I am going to program for, but I haven't seen a manual for it yet, so I am trying to figure things out for the 5000 series.


                            For the 5000 series there is not a single word on latencies. Only on bandwidth, but bandwidth is a very theoretical figure that is not really usable for low-level code. For example, it doesn't say how many of those bytes are overhead, nor when you start receiving bytes.


                            So again the question: can you shed some light on LATENCIES?


                            Please don't refer me to a manual that has basically nothing useful in it.




                              • Cacheline size RAM, Latency of RAM and speed of L2 from 5970


                                The latency to memory varies depending on the access patterns, device setup, instruction and other factors. Some common latencies to memory are tens of cycles to L1, low hundreds of cycles to L2, and from high hundreds to 10k+ cycles for memory, depending on how memory is used. That being said, even these numbers don't give an accurate idea for a specific kernel, as there are cases where latencies to L1 can be in the hundreds of cycles. The GPU is designed to hide latency by being massively parallel, and the memory system is designed to be throughput oriented. As long as you have enough work for the GPU to process, latencies to memory should be a non-issue. To optimize a kernel around its memory latencies, experimentation is required to find the point where adding more wavefronts per SIMD does not improve kernel performance. For example, if you have 4 wavefronts executing a memory-bound kernel, and adding a 5th wavefront improves performance, then memory latency most likely was not completely hidden. If adding a 6th wavefront does not improve performance, then all memory latencies are hidden by the 5 wavefronts. A presentation that was given on optimizing SGEMM on the RV670 shows an example of how this can be done. The presentation, titled ACML-GPU - SGEMM Optimization Illustration, can be found here: