6 Replies Latest reply on May 29, 2011 7:33 PM by edisonying1984

    fastpath vs completepath

    edisonying1984

      Hello

           I have a silly question about the fastpath and completepath hardware on ATI GPUs. I checked the programming guide, which says the two paths can do both "load and store" operations. However, the profiler description (and the paper about the OpenCL profiler published in SIGGRAPH) mentions these two paths only for "data written to the global memory". This confused me, and I think there are two possibilities:

      (1) these two paths support both load and store, but the profiler only counts the amount of data "written to the global memory"

      (2) these two paths only support store (write) operations, so the programming guide has made a mistake on this issue.

       

      I know I might be misunderstanding some key point here, so any clarification on this is welcome.

        • fastpath vs completepath
          himanshu.gautam

           

          As I understand it, fast path means there is no coherency issue with global memory, so reads from buffers should always be fast.

          Coherency issues can only arise on writes, when you need global atomics or are working on data smaller than 32 bits. So I think only writes go through either the fast path or the complete path, and that is why writes are what the profiler reports.

          But I am not sure and it would be better if someone can verify this.
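
          To make the point about data smaller than 32 bits concrete, here is a minimal sketch, written in Python rather than OpenCL C. It assumes that a byte-sized store has to be merged into its containing 32-bit word (a read-modify-write), while four bytes packed in advance can go out as one plain 32-bit store. The function names are hypothetical, and this models the idea only, not the actual hardware behaviour.

```python
def store_byte_rmw(memory, byte_index, value):
    """Write one byte into a list of 32-bit words via read-modify-write.

    Returns (loads, stores) to show the extra load per byte store.
    """
    word_index, lane = divmod(byte_index, 4)
    shift = 8 * lane
    old = memory[word_index]                      # the extra load
    merged = (old & ~(0xFF << shift)) | ((value & 0xFF) << shift)
    memory[word_index] = merged & 0xFFFFFFFF      # the store
    return 1, 1


def pack4(b0, b1, b2, b3):
    """Pack four bytes into one 32-bit word, lowest byte first."""
    return ((b0 & 0xFF)
            | ((b1 & 0xFF) << 8)
            | ((b2 & 0xFF) << 16)
            | ((b3 & 0xFF) << 24))


def store_word(memory, word_index, word):
    """Write a full 32-bit word; no load is needed."""
    memory[word_index] = word & 0xFFFFFFFF
    return 0, 1
```

          In this model, writing four separate bytes costs four loads and four stores, whereas `store_word(memory, 0, pack4(...))` costs a single store and leaves the same word in memory.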

            • fastpath vs completepath
              hduregger

              I believe the terms fastpath and complete path are mostly used to simplify discussion. There doesn't seem to be a single piece of hardware responsible for the fastpath, nor one for the complete path. If you look at the diagrams in AMD_HD_6900_Series_Instruction_Set_Architecture.pdf and the APP Programming Guide, there are special caches involved on both the read and the write side. On the write side there is also the hardware responsible for atomic operations (some of which also return/read values from memory, which could have led to the confusion about whether it only writes or also reads).

              Section "7.2 Dataflow in Memory Hierarchy" also tells us that "The CB [“complete-path” (color buffer or CB)] is used for format conversion and atomics," and that the depth buffer (DB) path is termed the fastpath.

              There is a lot of interesting information in both documents. Unfortunately for newbies like me, it seems to be spread out, but if you scan over it again and again it starts to make more sense.

                • fastpath vs completepath
                  edisonying1984

                  Thanks for your clarification on this topic, himanshu.gautam and hduregger, but I am still not quite clear...

                  I checked the APP programming guide. In figure 4.1 which shows the memory system architecture, I see the completepath and the fastpath are located between the write cache and the memory channel, so I guess these two paths are only responsible for write operations, but again I'm not sure here. hduregger, do these two paths shown in the figure refer to "the special paths on the write side" you mentioned?

                    • fastpath vs completepath
                      hduregger

                      Section "7.2 Dataflow in Memory Hierarchy" in the "HD 6900 Series Instruction Set Architecture" manual (the original link seems to be down, but there is a mirror) shows a diagram and tells us that:

                      Similarly, writes are executed through the “fast-path” (depth buffer or DB) or “complete-path” (color buffer or CB), which have write-only caches that are invalidated, and all update bits are sent to memory at the end of a clause. The DB is the raw, high-speed, 32-bit only data write path. The CB is used for format conversion and atomics.
                      Global atomic operations are executed through the complete-path; the CB caches perform the atomic. Atomic operations in which the return value is not used (“fire-and-forget”) can be pipelined, and the work-item does not have to wait for the atomic to complete before continuing. If the return value is used, the work-item must wait for the atomic to complete, the line to be flushed, and a read from global memory.

                      Image objects are limited to read-only or write-only (no concurrent r/w). Thus, on reads, the data is cached through the L2 and L1 data caches; on writes, the data is cached through the CB/DB buffers.

                      So it really seems that the terms fast-path and complete-path are only used for writes, with the exception that the complete-path can also return values in atomic operations.

                      • fastpath vs completepath
                        genaganna

                         

                        Originally posted by: edisonying1984 Thanks for your clarification on this topic, himanshu.gautam and hduregger, but I am still not quite clear...

                         

                        I checked the APP programming guide. In figure 4.1 which shows the memory system architecture, I see the completepath and the fastpath are located between the write cache and the memory channel, so I guess these two paths are only responsible for write operations, but again I'm not sure here. hduregger, do these two paths shown in the figure refer to "the special paths on the write side" you mentioned?

                         

                        Fast path Vs Complete path

                        In the complete path, there is an extra load for each store, and this hurts performance.

                        Both fast path and complete path refer only to stores.

                        If you use the complete path, bus utilization is only 25%.

                        If you use the fast path, bus utilization is 100%.

                         

                        The following go through the complete path:

                                 1. atomic operations

                                 2. operations on data smaller than 32 bits

                                 3. Image writes
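
                        The rules above can be summarized in a small sketch. It is written in Python purely as an illustration: the `Store` type, the function names, and the peak bandwidth figure in the usage note are hypothetical, and the 25% / 100% factors are simply the bus utilization numbers quoted in this reply. Classifying each store on its own is a simplification; the real hardware and compiler may select the path at a coarser granularity.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Hypothetical description of one global-memory write."""
    bits: int = 32
    atomic: bool = False
    image_write: bool = False


# Bus utilization figures as quoted in this thread.
BUS_UTILIZATION = {"fast": 1.00, "complete": 0.25}


def write_path(store):
    """Return "complete" for atomics, sub-32-bit data, or image writes."""
    if store.atomic or store.image_write or store.bits < 32:
        return "complete"
    return "fast"


def effective_bandwidth(peak_gb_s, path):
    """Scale a peak bandwidth by the quoted utilization of the path."""
    return peak_gb_s * BUS_UTILIZATION[path]
```

                        For example, with an assumed peak of 100 GB/s, `effective_bandwidth(100.0, write_path(Store(bits=8)))` gives a quarter of the peak, while a plain 32-bit store keeps the full figure.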