Hey AMD, why not use NUMA as a workaround for the incorrect thread scheduling on Bulldozer cores? You just need to expose an 8C Bulldozer CPU as a 2Cx4P system. Then, by properly assigning NUMA nodes, you would tell the OS how to assign threads to the CPU blocks that share an L2 cache and an FPU.
I believe this would solve the problems with Turbo Core and L2, but the FPU is still an open question. Perhaps the NUMA specification could be extended to cover other shared resources, such as the FPU.
Unfortunately, I don't have a Bulldozer CPU at hand, so guys, if you have one, could you please play with numactl on Linux and run some benchmarks to prove the concept?
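
To make the experiment concrete, here is a rough Python sketch that only builds the two numactl command lines worth comparing: one that packs the threads onto as few shared blocks as possible, and one that spreads them so each thread gets its own block. Everything specific in it is an assumption on my side: that Linux numbers the two cores of a block as adjacent CPUs (0-1, 2-3, ...), which you should verify with lscpu or /sys/devices/system/cpu/cpu0/topology/ first, and `./benchmark`, which is just a placeholder for whatever multithreaded test you want to run.

```python
CORES_PER_MODULE = 2  # assumption: two cores share one L2 + FPU block
N_CPUS = 8            # assumption: 8C Bulldozer part


def module_of(cpu, cores_per_module=CORES_PER_MODULE):
    """Index of the shared L2/FPU block a CPU belongs to (assumes adjacent numbering)."""
    return cpu // cores_per_module


def placements(n_threads, n_cpus=N_CPUS, cores_per_module=CORES_PER_MODULE):
    """Return (packed, spread) CPU lists for n_threads threads.

    packed: fill both cores of a block before moving to the next block.
    spread: use one core per block first, and double up only when blocks run out.
    """
    if not 0 < n_threads <= n_cpus:
        raise ValueError("n_threads must be between 1 and n_cpus")
    packed = list(range(n_threads))
    spread = [
        (i * cores_per_module) % n_cpus + (i * cores_per_module) // n_cpus
        for i in range(n_threads)
    ]
    return packed, spread


def numactl_command(cpus, benchmark=("./benchmark",)):
    """Build a numactl invocation pinned to the given CPUs.

    "./benchmark" is a hypothetical placeholder, not a real tool.
    """
    cpu_list = ",".join(str(c) for c in cpus)
    return ["numactl", "--physcpubind=" + cpu_list, *benchmark]


if __name__ == "__main__":
    packed, spread = placements(4)
    print("packed:", " ".join(numactl_command(packed)))
    print("spread:", " ".join(numactl_command(spread)))
```

If the spread run is clearly faster than the packed run with 4 threads, that would suggest the scheduler should prefer separate blocks, which is what the NUMA trick is meant to tell it.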