
    Stream Profiler interpretation

    philips
      Can you see where my problem is?

      Hi.

      The code I ported from CUDA is very slow on an ATI 5870, slower even than on an NVIDIA GTX 280.

      Unfortunately I'm not familiar with ATI hardware, so I am not sure how to improve it.

      I have now done a test run with the Stream Profiler. I have attached two lines from the profiler output.

      The algorithm works in two passes; the first kernel launch is less complex than the second one. The two lines are the two passes from a representative iteration of the algorithm.

      I was hoping you might be able to explain where the problem is.

       

      Method             renderKernel_07F0BF28   renderKernel_07F0BF68
      ExecutionOrder     9795                    9800
      GlobalWorkSize     {24832 2 1}             {393216 2 1}
      GroupWorkSize      {32 2 1}                {32 2 1}
      Time               1.88615                 16.70252
      LDSSize            3072                    3072
      DataTransferSize   -                       -
      GPRs               52                      59
      ScratchRegs        0                       0
      FCStacks           5                       5
      Wavefronts         776                     12288
      ALUInsts           1680.89                 1434.53
      FetchInsts         38.32                   39.37
      WriteInsts         1                       1
      LDSFetchInsts      24.99                   33.94
      LDSWriteInsts      7.15                    8
      ALUBusy            16.35                   24.84
      ALUFetchRatio      43.86                   36.44
      ALUPacking         38.81                   41.62
      FetchSize          1304.52                 37614.1
      CacheHit           71                      91.07
      FetchUnitBusy      1.3                     2.98
      FetchUnitStalled   0.02                    0.12
      WriteUnitStalled   0                       0
      FastPath           363.75                  3084.25
      CompletePath       0                       0
      PathUtilization    100                     100
      ALUStalledByLDS    2.44                    3.67
      LDSBankConflict    0.53                    1.3



      Thank you for reading.


        • Stream Profiler interpretation
          n0thing

          Is your code vectorized?

          Your ALU packing efficiency is low, which hints that vectorizing your code will improve performance, if it isn't vectorized already.
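
          For illustration, here is a made-up pair of kernels (names and buffers are mine, not from your code) showing what vectorization means on this hardware; the float4 version hands the compiler four operations to pack into each VLIW bundle:

            // Scalar: one element per work-item; the compiler has little
            // to pack, so most of the 5 VLIW slots per bundle go unused.
            __kernel void scale_scalar(__global const float* in,
                                       __global float* out,
                                       const float k)
            {
                size_t i = get_global_id(0);
                out[i] = in[i] * k;
            }

            // Vectorized: four elements per work-item; the float4
            // arithmetic packs naturally into the VLIW slots,
            // which should raise the ALUPacking figure.
            __kernel void scale_vec4(__global const float4* in,
                                     __global float4* out,
                                     const float k)
            {
                size_t i = get_global_id(0);
                out[i] = in[i] * k;
            }

          (The global work size for the vec4 version shrinks by 4, of course.)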

           

            • Stream Profiler interpretation
              Jawed

              In the first kernel, an ALUBusy of 16% implies to me that you have lots of IF statements and/or loops where each work item follows a different path from its neighbours.

              The GPR count of 52 means only 4 hardware threads can be supported per SIMD core.
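
              (If I have the register budget right, that 4 falls out of simple arithmetic: each lane has roughly 256 GPRs to share among resident wavefronts, and floor(256 / 52) = 4, since a fifth wavefront would already need 5 x 52 = 260.)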

              Together these two things imply to me that the ALUs are mostly idle because the GPU spends most time working out which control flow path to take.

              On ATI hardware, control flow incurs additional latency: about 40 extra cycles for each branch point (so an if-then-else has 3 branch points and a basic for loop has 2). This latency cannot be hidden when there are only 4 hardware threads on the SIMD, because the maximum control flow latency that 4 hardware threads can hide is only 32 cycles. Also, if your kernel has lots of branching, there are likely only, say, 5 cycles of latency hiding per hardware thread between branch points.
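
              To make the counting concrete, here is a hypothetical fragment (mine, not from your kernel) annotated the way I count branch points:

                __kernel void walk(__global const float* t, __global float* out)
                {
                    size_t i = get_global_id(0);
                    float acc = 0.0f;

                    for (int s = 0; s < 16; ++s)    // for loop: 2 branch points, ~80 cycles
                    {
                        if (t[i] > 0.5f)            // if-then-else: 3 branch points, ~120 cycles
                            acc += t[i];
                        else
                            acc -= t[i];
                    }
                    out[i] = acc;
                }
                // With 52 GPRs only 4 wavefronts are resident, so at most ~32 of the
                // ~40 cycles per branch point can be hidden; the rest is idle ALU time.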

                • Stream Profiler interpretation
                  dschwen

                  Where is all this documented? Is there a write-up on how to interpret the Stream KernelAnalyzer results? Or do I have to dig through hardware specs to learn that 52 GPRs mean 4 hardware threads per SIMD core (are there 128 registers per core?)? Where can I read up on the whole latency-hiding business?

                    • Stream Profiler interpretation
                      Jawed

                      Chapter 4 of the Stream SDK OpenCL Programming Guide rev 1.05, the most recently published version.

                        • Stream Profiler interpretation
                          philips

                          Thank you.

                          I don't quite understand all that yet, but it helps.

                           

                          Unfortunately the code is not vectorized, and I have neither the time nor the skill to do so. How much speed can you gain by vectorizing on the GPU?

                          Since it's a raycaster, I think it would be rather complicated to vectorize: you would have to handle a lot of those SIMD things manually (e.g. when one ray is finished and the other three are not).


                          Is there anything specific to the ATI architecture that would make it less suited for this kernel (just looking at the profiler info)? I mean besides the 64-wide SIMD and the vectorization issue.


                            • Stream Profiler interpretation
                              Jawed

                              Since a SIMD core is, in itself, a vectorised processing unit, you are already dealing with the issues of control flow incoherence.

                              There are loads of projects, published papers and discussions of GPU-based ray-tracing techniques out there.

                              Since you don't have time, I suggest you abandon the ATI implementation and just focus on the NVidia version.

                                • Stream Profiler interpretation
                                  philips


                                  Originally posted by Jawed: "Since a SIMD core is, in itself, a vectorised processing unit, you are already dealing with the issues of control flow incoherence."


                                   

                                  Sure, it's already SIMD, but if I wanted to vectorize it, I would have to do all this SIMD handling manually, wouldn't I? (e.g. branch manually when one of the four components doesn't follow the same path)
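
                                  Something like this made-up sketch is what I mean (names and the fixed step are invented; the constant stands in for the real volume fetch):

                                    // Four rays per work-item, marched in lock-step; a per-component
                                    // mask keeps finished rays from accumulating further.
                                    float4 trace4(float4 t, float4 tmax)
                                    {
                                        float4 colour = (float4)(0.0f);
                                        int4 active = t < tmax;               // -1 where a ray is still live
                                        while (any(active))
                                        {
                                            float4 contrib = (float4)(0.25f); // stand-in for a volume sample
                                            colour = select(colour, colour + contrib, active);
                                            t += (float4)(0.1f);              // all four rays step, finished or not
                                            active = t < tmax;                // rays drop out one by one
                                        }
                                        return colour;
                                    }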

                                   

                                  You are probably right about focusing on something else. The goal was to assess how well the algorithm is suited to different architectures (NVIDIA, ATI, CPU). It's unfortunate that I can't really say much about the ATI and CPU performance without vectorization.

                                   

                                  Maybe those two profiler lines above are enough to get a picture of how well the algorithm works on an ATI GPU. I would imagine vectorization would not really help with the latency hiding and the ALUs being mostly idle.


                                    • Stream Profiler interpretation
                                      Jawed

                                      Agreed, it's unlikely vectorisation of the existing kernel would help.

                                      The problem is too much control flow.

                                      Perhaps you might try with only 16 work items per work group. For your amusement you might also want to try 1 and 4 work items per work group.
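
                                      On the host side that is just a different local work size at enqueue time; something like this (the queue and kernel handle names are assumed, the global size is taken from your first profiler line):

                                        // 24832 and 2 are both divisible by the new local sizes.
                                        size_t global[2] = { 24832, 2 };
                                        size_t local16[2] = { 16, 1 };   // also try {4, 1} and {1, 1}
                                        cl_int err = clEnqueueNDRangeKernel(queue, renderKernel, 2, NULL,
                                                                            global, local16, 0, NULL, NULL);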