3 Replies Latest reply on Feb 6, 2014 9:18 AM by jaidotsh

    Bolt performance questionable

    Meteorhead

      Hello!

       

      I thought I'd give Bolt a spin to see how neat it looks in code and how it performs. Its elegance in code is satisfactory and there are some nice features, although the use of macro magic always gives me the creeps, even though I know that some things are easiest to achieve with macros.

       

      I ran the MonteCarloPI sample, racing my Mobility HD5870 against my Core-i5 430, and the GPU is only marginally faster than the STL implementation. Using TBB and running the MultiCoreCpu backend is even more disappointing. Here are my results:

       

      C:\Kellekek\TBB\bin\intel64\vc11>"C:\Users\MátéFerenc\Documents\AMD APP\samples\bolt\bin\x86_64\MonteCarloPI_TBB.exe" --timing --samples 1000000 --iterations 10 --device MultiCoreCpu
      **********************************************
      MonteCarloPI using BOLT
      **********************************************


      Running in multi-core cpu mode(TBB)
      Completed setup() of MonteCarloPI sample
      Completed Warm up run of Bolt code
      Executing MonteCarloPI sample over 10 iteration(s).

       

       

      Completed Run() of MonteCarloPI sample

      1. Bolt implementation using transform() and reduce()

      | Points  | Avg. Time(sec) | Points/sec |
      |---------|----------------|------------|
      | 1000000 | 2.24038        | 446353     |


      2. Bolt implementation using fused transform_reduce()

      | Points  | Avg. Time(sec) | Points/sec |
      |---------|----------------|------------|
      | 1000000 | 2.22865        | 448703     |


      3. Bolt implementation using count_if()

      | Points  | Avg. Time(sec) | Points/sec |
      |---------|----------------|------------|
      | 1000000 | 2.22959        | 448514     |


      4. STL implementation using transform() and reduce()

      | Points  | Avg. Time(sec) | Points/sec   |
      |---------|----------------|--------------|
      | 1000000 | 0.0107181      | 9.33002e+007 |


      5. STL implementation using count_if()

      | Points  | Avg. Time(sec) | Points/sec   |
      |---------|----------------|--------------|
      | 1000000 | 0.015485       | 6.45786e+007 |

       

      My question is: is there something wrong with the generated kernel? Does this sample really favor the CPU this much? How can it be that the STL implementation clearly uses only 25% of the CPU (2 cores, 4 threads), while TBB, using 100% of the CPU, is roughly 200 times slower (2.24 s versus 0.0107 s per run), and the GPU is only about 10% faster than the STL implementation?

       

      I understand that VS2013 has an incredible auto-vectorizer and parallelizer, but the STL build does not seem to make use of the parallelizer (otherwise the STL implementation would use 100% of the CPU as well), and TBB being slower than STL is odd in its own right. Is Bolt really this screwed?
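For readers who have not opened the sample, the two serial variants being timed (transform followed by reduce, and count_if) can be sketched in plain STL as below. This is a minimal sketch and not the code shipped with the Bolt samples: the `Point` struct, the function names, and the use of `std::mt19937` as the random source are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sample point in the unit square [0, 1) x [0, 1).
struct Point {
    float x;
    float y;
};

// True when the point falls inside the quarter circle of radius 1.
inline bool inside_unit_circle(const Point& p) {
    return p.x * p.x + p.y * p.y <= 1.0f;
}

// Generate n pseudo-random points; a fixed seed keeps runs repeatable.
std::vector<Point> make_points(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<Point> pts(n);
    for (auto& p : pts) {
        p.x = dist(gen);
        p.y = dist(gen);
    }
    return pts;
}

// Variant 1: transform every point into a 0/1 flag, then sum the flags.
double estimate_pi_transform_reduce(const std::vector<Point>& pts) {
    assert(!pts.empty());
    std::vector<int> flags(pts.size());
    std::transform(pts.begin(), pts.end(), flags.begin(),
                   [](const Point& p) { return inside_unit_circle(p) ? 1 : 0; });
    const long long hits = std::accumulate(flags.begin(), flags.end(), 0LL);
    return 4.0 * static_cast<double>(hits) / static_cast<double>(pts.size());
}

// Variant 2: count the hits directly, with no intermediate buffer.
double estimate_pi_count_if(const std::vector<Point>& pts) {
    assert(!pts.empty());
    const auto hits = std::count_if(pts.begin(), pts.end(), inside_unit_circle);
    return 4.0 * static_cast<double>(hits) / static_cast<double>(pts.size());
}
```

Both functions count the same hits over the same input, so they return the same estimate. Any timing difference between them comes from the extra pass and the temporary `flags` buffer in the first variant.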

        • Re: Bolt performance questionable
          jaidotsh

          Hi Meteorhead,

           

          The MonteCarloPI sample demonstrates the usage of Bolt (APIs, macro magic, etc.) and its performance gain over serial implementations. You're right about the TBB vs. serial CPU results in that sample. The low performance comes from the TBB calls being made on bolt::cl::device_vector, a container that is primarily used for GPU computations. We replaced it with std::vector and got the following numbers on an AMD A6-3410MX quad-core APU:

           

          >MonteCarloPI.exe -t -i 1000 --device MultiCoreCpu

          **********************************************
          MonteCarloPI using BOLT
          **********************************************

          Running in multi-core cpu mode(TBB)
          Completed setup() of MonteCarloPI sample
          Completed Warm up run of Bolt code
          Executing MonteCarloPI sample over 1000 iteration(s).
          Completed Run() of MonteCarloPI sample

          1. Bolt implementation using transform() and reduce()

          | Points  | Avg. Time(sec) | Points/sec  |
          |---------|----------------|-------------|
          | 1000000 | 0.00854484     | 1.1703e+008 |


          2. Bolt implementation using fused transform_reduce()

          | Points  | Avg. Time(sec) | Points/sec   |
          |---------|----------------|--------------|
          | 1000000 | 0.00414969     | 2.40982e+008 |


          3. Bolt implementation using count_if()

          | Points  | Avg. Time(sec) | Points/sec   |
          |---------|----------------|--------------|
          | 1000000 | 0.00591267     | 1.69128e+008 |


          4. STL implementation using transform() and reduce()

          | Points  | Avg. Time(sec) | Points/sec   |
          |---------|----------------|--------------|
          | 1000000 | 0.0198957      | 5.02621e+007 |


          5. STL implementation using count_if()

          | Points  | Avg. Time(sec) | Points/sec   |
          |---------|----------------|--------------|
          | 1000000 | 0.0213733      | 4.67874e+007 |

           

          In this case, you can see that TBB is clearly faster than the serial version. We'll work on fixing the sample so that the performance benefits are apparent.
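To put numbers on "clearly faster", the speedups can be computed directly from the average times in the tables above. The snippet below is only a small arithmetic sketch: the constant names and the `speedup` helper are made up for illustration, and the values are copied from this run (1,000,000 points, 1000 iterations).

```cpp
#include <cassert>

// Average run times in seconds, copied from the MultiCoreCpu tables above.
constexpr double kBoltTransformReduce = 0.00854484;
constexpr double kBoltFused           = 0.00414969;
constexpr double kBoltCountIf         = 0.00591267;
constexpr double kStlTransformReduce  = 0.0198957;
constexpr double kStlCountIf          = 0.0213733;

// Speedup of a candidate over a baseline; values above 1 mean the candidate is faster.
double speedup(double baseline_sec, double candidate_sec) {
    assert(baseline_sec > 0.0 && candidate_sec > 0.0);
    return baseline_sec / candidate_sec;
}
```

With these inputs, `speedup(kStlTransformReduce, kBoltTransformReduce)` is about 2.3, `speedup(kStlTransformReduce, kBoltFused)` is about 4.8, and `speedup(kStlCountIf, kBoltCountIf)` is about 3.6.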

           

          As for the GPU performance, Bolt is optimized for GCN-based GPUs and you can see a significant improvement in performance over serial and TBB with larger buffer sizes.

           

          Thanks,

          Jay

            • Re: Bolt performance questionable
              Meteorhead

              I don't see how such a simple kernel as the one in the PI sample could be tuned for GCN (any kernel compiler should be able to make some decent VLIW5 code from it), but maybe it's true. As for larger buffer sizes, I cannot go any larger: 100M elements fails to launch, since 100M ints are 400 MB in size and, for whatever reason, my GPU doesn't want to allocate that much. For 50M elements, the GPU still shows the same marginal speed-up as it does for fewer elements.
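The buffer-size arithmetic can be checked with a few lines of code. One possible explanation for the failed launch, offered here as an assumption and not something confirmed in this thread, is that OpenCL reports a per-allocation limit (`CL_DEVICE_MAX_MEM_ALLOC_SIZE`, queried through `clGetDeviceInfo`) that can be well below the total device memory. The helpers below are hypothetical, and the 256 MiB limit mentioned afterwards is an example value, not one measured on this GPU.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a buffer of `elements` items of `element_size` bytes each.
std::uint64_t buffer_bytes(std::uint64_t elements, std::uint64_t element_size) {
    return elements * element_size;
}

// Convert bytes to MiB (1 MiB = 1024 * 1024 bytes).
double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// True when the buffer fits in one allocation of at most max_alloc_bytes.
// max_alloc_bytes stands in for whatever limit the device actually reports.
bool fits_single_allocation(std::uint64_t bytes, std::uint64_t max_alloc_bytes) {
    return bytes <= max_alloc_bytes;
}
```

For 100M 4-byte ints, `buffer_bytes(100000000, sizeof(std::int32_t))` gives 400,000,000 bytes, or roughly 381 MiB. Under an assumed 256 MiB per-allocation limit that buffer would not fit, while the 50M-element buffer (200,000,000 bytes) would.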

               

              The only thing I can imagine is that the implicit data movement dominates the computation. Such a simple sample must be able to provide a speed-up on the VLIW5 architecture as well, assuming it generates OpenCL kernel code in the background (which I think it does). The generated code cannot be so bad that it fails to provide at least a factor of 5 in speed-up.

                • Re: Bolt performance questionable
                  jaidotsh

                  Hi meteorhead,

                   

                  Yes, one can't tune simple kernels like that for GCN. What I actually meant to say is that there are no VLIW5-specific optimizations in the code. As for the large allocations, I run into the same failures, and it looks like the GPU is the cause.

                   

                  We found the problem that was contributing to the low performance of MonteCarloPI on the GPU. Apparently, the bug was introduced in the new version of the code. An old version of MonteCarloPI is still available here. It was, on average, a little over 5X faster than MultiCore and about 17X faster than the serial code on my A6 quad-core APU. Let me know if you're able to get good performance on your GPU with the code on the Bolt landing page.

                   

                  Thanks,

                  Jay