What is the best strategy to improve multithreaded scaling under main memory contention?

I am trying to improve the scaling of an application on a 4-processor machine (48 cores). Currently the maximum speed is reached with around 32-36 active cores; adding more cores beyond that does not help. The application performs no I/O other than main memory accesses, and the task itself is well suited to parallel scaling.
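To make "scaling" concrete, here is a minimal sketch of how speedup and parallel efficiency could be tabulated from wall-clock timings taken at different thread counts. The function names and the timings in the usage lines are hypothetical placeholders, not measurements from my application.

```python
def scaling_table(runtimes):
    """Compute speedup and parallel efficiency per thread count.

    runtimes: dict mapping thread count -> wall-clock seconds (hypothetical input).
    Returns a list of (threads, speedup, efficiency) tuples, measured
    relative to the run with the smallest thread count.
    """
    base_threads = min(runtimes)
    base_time = runtimes[base_threads]
    rows = []
    for n in sorted(runtimes):
        speedup = base_time / runtimes[n]   # how much faster than the baseline run
        ideal = n / base_threads            # speedup expected under perfect scaling
        rows.append((n, speedup, speedup / ideal))
    return rows


def best_thread_count(runtimes):
    """Return the thread count with the shortest runtime."""
    return min(runtimes, key=runtimes.get)


if __name__ == "__main__":
    # Placeholder timings for illustration only, not real measurements.
    example = {1: 48.0, 24: 2.4, 36: 2.0, 48: 2.2}
    for threads, speedup, efficiency in scaling_table(example):
        print(f"{threads:3d} threads: speedup {speedup:6.2f}, efficiency {efficiency:5.2f}")
    print("fastest at", best_thread_count(example), "threads")
```

An efficiency that keeps falling while the runtime stops improving is the pattern I am describing: the fastest run uses fewer threads than the machine has cores.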

The only plausible reason for the limited scaling is contention for main memory accesses.

My question is: how can I best use CodeAnalyst to find the locations in the application that are responsible for the contention?