
Bolt performance questionable

Question asked by Meteorhead on Jan 24, 2014
Latest reply on Feb 6, 2014 by jaidotsh

Hello!

 

I thought I'd give Bolt a spin to see how neat it looks in code and how it performs. As for its elegance in code, I'm satisfied: there are some nice features, although the use of macro magic always gives me the creeps, even though I know that some things are easiest to achieve with macros.

 

I ran the MonteCarloPI sample, racing my Mobility HD5870 against my Core-i5 430, and the GPU is only marginally faster than the STL implementation. Using TBB with the MultiCoreCpu backend is even more disappointing. Here are my results:

 

C:\Kellekek\TBB\bin\intel64\vc11>"C:\Users\MátéFerenc\Documents\AMD APP\samples\bolt\bin\x86_64\MonteCarloPI_TBB.exe" --timing --samples 1000000 --iterations 10 --device MultiCoreCpu
**********************************************
MonteCarloPI using BOLT
**********************************************


Running in multi-core cpu mode(TBB)
Completed setup() of MonteCarloPI sample
Completed Warm up run of Bolt code
Executing MonteCarloPI sample over 10 iteration(s).

 

 

Completed Run() of MonteCarloPI sample

1. Bolt implementation using transform() and reduce()

| Points  | Avg. Time(sec) | Points/sec |
|---------|----------------|------------|
| 1000000 | 2.24038        | 446353     |


2. Bolt implementation using fused transform_reduce()

| Points  | Avg. Time(sec) | Points/sec |
|---------|----------------|------------|
| 1000000 | 2.22865        | 448703     |


3. Bolt implementation using count_if()

| Points  | Avg. Time(sec) | Points/sec |
|---------|----------------|------------|
| 1000000 | 2.22959        | 448514     |


4. STL implementation using transform() and reduce()

| Points  | Avg. Time(sec) | Points/sec   |
|---------|----------------|--------------|
| 1000000 | 0.0107181      | 9.33002e+007 |


5. STL implementation using count_if()

| Points  | Avg. Time(sec) | Points/sec   |
|---------|----------------|--------------|
| 1000000 | 0.015485       | 6.45786e+007 |
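
For readers who don't have the sample at hand, variants 4 and 5 amount to roughly the following. This is only a minimal sketch under my own assumptions, not the actual sample source: `Point`, `make_points`, `in_circle` and the two `pi_*` functions are hypothetical names, and `std::accumulate` stands in for the "reduce" step.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical point type; the real sample's data layout may differ.
struct Point { float x; float y; };

// Generate n pseudo-random points in the unit square.
std::vector<Point> make_points(std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<Point> pts(n);
    for (auto& p : pts) {
        p.x = dist(gen);
        p.y = dist(gen);
    }
    return pts;
}

// True if the point lies inside the unit quarter circle.
bool in_circle(const Point& p) { return p.x * p.x + p.y * p.y <= 1.0f; }

// Variant "transform() and reduce()": map each point to a 0/1 flag,
// then sum the flags. Assumes pts is non-empty.
double pi_transform_reduce(const std::vector<Point>& pts) {
    std::vector<int> flags(pts.size());
    std::transform(pts.begin(), pts.end(), flags.begin(),
                   [](const Point& p) { return in_circle(p) ? 1 : 0; });
    long long hits = std::accumulate(flags.begin(), flags.end(), 0LL);
    return 4.0 * static_cast<double>(hits) / static_cast<double>(pts.size());
}

// Variant "count_if()": count the hits in a single pass, no temporary vector.
// Assumes pts is non-empty.
double pi_count_if(const std::vector<Point>& pts) {
    auto hits = std::count_if(pts.begin(), pts.end(), in_circle);
    return 4.0 * static_cast<double>(hits) / static_cast<double>(pts.size());
}
```

Calling `pi_count_if(make_points(1000000))` gives an estimate of pi. The point is that the per-element work is tiny (two multiplies, an add and a compare), so any fixed overhead per call or per element tends to dominate the measured time.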

 

My question is: is there something wrong with the generated kernel? Does this sample really favor the CPU this much? How can it be that the STL implementation, which clearly uses only 25% of the CPU (2 cores, 4 threads), wins, while TBB, using 100% of the CPU, is about 200 times slower (2.24 s vs. 0.0107 s per run), and the GPU is only roughly 10% faster than the STL implementation?
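
To make the gap concrete, here is the arithmetic on the numbers from the tables above, as a small sketch. The constant and function names are made up for illustration; only the timings come from the output.

```cpp
// Average run times taken from the tables above (seconds).
constexpr double kPoints     = 1000000.0;
constexpr double kBoltTbbSec = 2.24038;    // 1. Bolt transform() + reduce(), MultiCoreCpu
constexpr double kStlSec     = 0.0107181;  // 4. STL transform() + reduce()

// How many times slower a run taking slow_sec is than one taking fast_sec.
double slowdown(double slow_sec, double fast_sec) { return slow_sec / fast_sec; }

// Throughput in points per second, as printed in the last table column.
double points_per_sec(double points, double avg_sec) { return points / avg_sec; }
```

`slowdown(kBoltTbbSec, kStlSec)` comes out at a little over 200, and `points_per_sec(kPoints, kBoltTbbSec)` reproduces the Points/sec column of the first table.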

 

I understand that VS2013 has an incredible auto-vectorizer and auto-parallelizer, but the STL build does not seem to make use of the latter (otherwise the STL implementation would use 100% of the CPU as well). On top of that, TBB being slower than plain STL is just as strange. Is Bolt really this screwed?
