11 Replies Latest reply on Mar 26, 2013 11:17 AM by realhet

    Questions about performance

    qqchose

      Hi

       

      I have questions about optimization in OpenCL. I have a kernel. What this kernel does is not interesting for my questions. Then take this small example

       

      kernel void myKernel(...)

      {

             [CODE]

       

                   output[i] = result;

      }

       

       

      My kernel needs 89.032 ms to complete. I need to call my kernel many times. Then, I thought it should be faster to loop X times in my kernel instead to call my kernel X times. Then, I tried this

       

      kernel void myKernel(...)

      {

             for(int i = 0; i < param.m_nbSample; ++i)

      {

                    [CODE]

      output[i] += result;

      }

      }

       

      I Tried to set param.m_nbSample egal to “1” to be sur evertything is fine. Everything is fine, but, my kernel need 224.633 ms to complete. 252% slower. Not what I expected. I tried to change m_nbSampe to see the result.

      If m_nbSample egal 2, each sample need 224.719 ms by sample (loop)

      If m_nbSample egal 4, each sample need 244.885 ms by sample (loop)

      If m_nbSample egal 50, each sample need 220.442 ms by sample (loop)

      I try to harcode the loop and use unroll

       

      kernel void myKernel(...)

      {

      #pragma unroll 1

             for(int i = 0; i < 8; ++i)

      {

                    [CODE]

      output[i] += result;

      }

                  

      }

       

      Each sample needs 243.236 ms by sample (loop).

      It’s is normal to have this huge difference ? it’s a lot better to call my kernel X times instead to loop X times. What can explain this?

      I tried something else.

      I move my entire kernel in a function like this

       

      void myKernelFunction(...)

      {

      [CODE]

      output[i] += result;

                     }

       

      kernel void myKernel(...)

      {

      myKernelFunction(...)

                  

      }

       

      Now my kernel need 93.7073 ms. I thought all function was “inlined” in kernel. Then I don’t understand why I already lost 4 ms just by adding a function. I tested and everything was fine. Then now I can test what happen if I call myKernelFunction X times. I tried 2 times.

       

      kernel void myKernel(...)

      {

      myKernelFunction(...)

      myKernelFunction(...)

                  

      }

       

      The result: ... 972.904 ms by sample ( then 1945.81 ms for the entire kernel). 10 times slower! Worst then before with the “for”. What can cause this? What is supposed to be the best if I want to call my kernel X times?

       

      Thanks

        • Re: Questions about performance
          realhet

          Hi!

           

          How many instructions are in your [CODE]? Are there mostly ALU insts, or lots of memory IO?

          How many workitems per kernel you launch each time?

          On what hardware?

           

          2x slower -> maybe the workitem count is not that good (for the given hardware) when you make an inner loop and decrease the workitemcnt.

          10x slower ->

            - did you accidentally forced the compiler to run out of registers(scratch regs or memory is terrible slow)?

             - Does the code fit into the instruction cache?

          You can check those last two in the disasm.

          1 of 1 people found this helpful
            • Re: Questions about performance
              qqchose

              Thank you for this information.

               

              My instruction is IO and ALU. There are maybe 50-50.

               

              my global size is (1006, 806) and my local size is (6, 32)

               

              My hardware is AMD Radeon HD 6900 Series
              Catalyst 13.3

               

              I will try to change my local size to see if I can see a difference, thanks

               

              I don't know if I run out of register,

              But in AMD APP KernelAnalyser2, I see 559 ScracthRegs. Does it mean I run out of register?

               

              I don't know if my code fit in the instruction cache, my kernel need 21442 line in the dissembly and the .bin is 1242Ko. Then it's probably the case.

               

              I will check this thanks.

                • Re: Questions about performance
                  realhet

                  - 50-50 IO/ALU ratio is really bad. 1-30 would be ok. (in dwords)

                  - Global size (1006, 806) is 810k workitems, it seems ok, but (local size 6, 32) sounds a bit weird to me (maybe just to me, I always use 64 or 128 or 256 for local, they're 100% safe). Hope the hardware can make that out from 3 wavefronts.

                  - 559 ScratchRegs -> It must be terribly slow On a 69xx you should make your kernel to use at most 128 registers. But this uses 128+559 in total, you have to reorganize your [CODE] somehow. ScratchReg usage means really slow and concurrent RAM access instead of fast registers...

                  - 21442 lines of VLIW disasm is 21442*8=160K bytes (without literal constants), the problem is that the GPU only has 48KB of instruction cache. This one adds to the Scratch Reg slowness.

                   

                  I can't imagine what can be your [CODE] but it's way too big in many aspects for a single stream processor to handle. Try to split the job to fit in above limits. It's not bad if you have 100million workitems, but stay under 128 dword regs and 40KB of code, and watch out for ALU/MEM instruction ratio.

                    • Re: Questions about performance
                      qqchose

                      I asked a colleague why we has this local size. He answer those value was optimize for NVIDIA. I did not check if he is right, but on ATI, those value are the wrost. I check with CodeXL and it's hard to find a wrost case. I change it to 256 and it's a lot better.

                       

                      I found my VGPRs are the bottleneck. Thanks, II will try to optimise them and I will try to split this kernel into smaller kernel.

                       

                      I'm not allow to tell what is inside [CODE] but thanks, I understand cleared what happen and I can optimise our kernel

                        • Re: Questions about performance
                          realhet

                          One more thing: I saw your exec times are a bit like random.

                           

                          You can have more precise exec times if you first warm up the GPU with a little dummy kernel, and then immediatelly launch your kernel.

                          For example launch your kernel twice and only take the second time!

                          It could be ok, when the whole process is completed in 0.5s.

                          And when you optimizing only for the last 1-2% then it will need 10..30 seconds long tests, to have more precise times.

                          1 of 1 people found this helpful
                            • Re: Questions about performance
                              qqchose

                              Thanks,

                               

                              Right now, I run my kernel 50 times and I do an average. But I notice the firsts iterations are always longer. I didn't know why. Now, I will ignore some iteration when I compute the average, thanks.

                                • Re: Questions about performance
                                  realhet

                                  ...50 times. Oh that's perfect then.

                                   

                                  "I didn't know why."

                                  When the card does nothing serious for a short period if time, it will reduce clock speed down to 150..300 MHz to save power.

                                  When you give it enough work to do, it will raise the clock cycles automatically.

                                  This behaviour is controlled only a few times per every second, so the first kernel will be always laggy.

                                   

                                  (On HD6970 and HD6990 there is a "Power Control Setting" slider in  Catalyst Control Center/Overclk. If you overclock the card, this could also modify the clock rate when it thinks it uses too much power.)

                      • Re: Questions about performance
                        dmeiser

                        You might also consider accumulating your results in registers, i.e. introduce a __private variable resultP that you increment each time through your loop and then write the result back to global memory once your done with your loop. Right now it seems to me that you're reading from and writing to global memory each time through your loop.

                        • Re: Questions about performance
                          himanshu.gautam

                          qqchose,

                           

                          Since all threads are reading and writing output[i] and since "i" is not a funtion of get_global_id(0) -- All threads are reading and writing into same variable...

                          This is bad on the GPU buddy....

                          Cache trash......, Inconsistent results... etc...

                          That is probably why you are seeing performance numbers that you dont want to see..

                            • Re: Questions about performance
                              qqchose

                              I did not writen, but

                               

                              i = get_global_id(0) [...] ;

                               

                              what happen to "i" was include in [CODE]. But thanks, I will check to be sure nobody to this kind of mistake. Seam obious, but we are not invulnerable.

                               

                              --EDIT--

                               

                              I undestand the confusion now. There is a mistake in my example, but not in my real code. I wrote

                               

                                   [CODE]

                                   output[i] = value;

                               

                              "i" is something like

                                   i = get_global_id(0)

                               

                              After, I added

                               

                                   for(int i ...) 

                               

                              In my example, but there is a mistake there because "i" is already use. In my real code, it look like

                               

                                   [CODE INIT]

                                   int index = get_global_id(0) [...]

                                   for(int iSample = 0; iSample < [...]; ++iSample )

                                   {

                                        [CODE LOOP1]

                                        value = [...]

                                         [CODE LOOP2]

                                        output[index] = value

                                   }

                               

                              sorry for the confusion