Do you have an example of an application where multithreaded L1 and L2 BLAS functions will provide a significant performance benefit?
One question is where the threading should occur in an application. For the L1 and L2 functions, are threading choices better left to the application?
If we can demonstrate the benefit, this is something that could be added to the list of future enhancements.
I do not have any off the top of my head. My only example was the STREAM benchmark quoted above.
Actually, I was surprised. I expect this effect comes from where the data is located in memory on the NUMA system. Maybe this is more a case of memory location mattering.
For large data it would be nice to have threaded L1 routines (mostly xCOPY and xSCAL), so I would not thread when the sizes are small. The problem is: how does ACML know that the data is spread across memory controllers, and which threads should work on which data, so that the performance of the multiple memory controllers can be exploited?
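To make the idea concrete, here is a rough sketch of what application-side threading of an xSCAL-style operation could look like with OpenMP. This is not ACML code. The names first_touch_fill, scal_threaded and THREAD_MIN_N are made up for illustration, and the cutoff value is a guess rather than a measurement. It also assumes the OS uses a first-touch page placement policy, so that the thread that first writes a page gets it on its local memory controller.

```c
#include <stddef.h>

/* Hypothetical cutoff: below this length the loop runs on one thread.
   The value is an assumption and would need tuning per machine. */
#define THREAD_MIN_N ((size_t)1 << 20)

/* Initialise x in parallel. Under a first-touch policy (assumed), each
   thread's chunk ends up in memory local to that thread. */
void first_touch_fill(double *x, size_t n, double value)
{
    long i;
    #pragma omp parallel for schedule(static)
    for (i = 0; i < (long)n; i++)
        x[i] = value;
}

/* x := alpha * x, the same operation as DSCAL with unit stride.
   The static schedule gives each thread the same chunk it touched in
   first_touch_fill, provided the thread count has not changed. */
void scal_threaded(size_t n, double alpha, double *x)
{
    long i;
    #pragma omp parallel for schedule(static) if (n >= THREAD_MIN_N)
    for (i = 0; i < (long)n; i++)
        x[i] *= alpha;
}
```

The point of the sketch is that the application knows how the data was first touched, whereas a library routine called later only sees a pointer and a length. Built without OpenMP support the pragmas are ignored and both loops simply run serially.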