Well... First, you are not running a system with 8 times more cores; it's only 2 times more, going from two quad-cores to four quad-cores, so don't expect huge gains.
Second, synthetic benchmarks (Sandra, SciMark, etc.) won't help; they don't behave like your app and won't tell you why your app isn't faster on the new system. Profile it.
Third, before profiling... have you checked in Task Manager that all cores are being used? Not all apps use all 16 cores (SciMark seems to be one of them), and if yours doesn't, it's unlikely to see any improvement from adding processors.
Unsure where I said 8x, but I understand we are going from 8 cores to 16 cores. Our app is multithreaded and we throttle our threads to the number of cores, so I was expecting much more than 30%. Agreed, the final answer is to profile the app, but I was trying to be systematic in understanding whether the hardware/OS/VM (.NET) needed any tweaking before broaching the application.
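As a sanity check on the "throttle threads to the core count" part, here is a minimal sketch (in Python for illustration only; the actual app is .NET, where you would read `Environment.ProcessorCount` and watch the per-core graphs in Task Manager instead):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Size the worker pool to the logical core count, as described above.
# os.cpu_count() returns logical cores (16 on the new box).
num_workers = os.cpu_count() or 1

def work(i):
    # Placeholder for a CPU-bound task; the real workload goes here.
    return i * i

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    results = list(pool.map(work, range(num_workers)))

print("workers:", num_workers)
```

If the pool is sized correctly but some cores stay idle under load, the bottleneck is elsewhere (locks, memory bandwidth, I/O), which is what profiling should reveal.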
I am curious about the bandwidth and latency reported by Sandra in the multi-core testing.
Inter-Core Bandwidth 13.45 GB/s
Inter-Core Latency 88 ns
Inter-Core Bandwidth 9.56 GB/s
Inter-Core Latency 141 ns
Does this necessarily imply setting thread affinity to a particular CPU will help?
I don't know how Sandra gets that data, nor do I much care, but it doesn't say much; the two platforms have very different topologies.
On the Intel platform, each pair of cores sits together on a die and shares an L2 cache, each pair of dies sits in a socket and shares an FSB, and the two processors access memory uniformly through the chipset.
On the AMD platform, each group of four cores sits together on a die with dedicated L2 caches and a shared L3. Data transfer through the L3 is slower than through the L2 but faster than over the FSB. There is a single die per socket and dedicated memory per socket, and each socket is directly connected to two of the others, so the remaining path takes two hops.
All of that affects what Sandra measures: is it the bandwidth and latency between cores 0 and 1? 0 and 2? 0 and 5? Etc.
One point you should look at, if your app is memory-intensive: the Opteron platform accesses memory in a non-uniform way (NUMA). Each socket owns part of the memory, and if, for example, socket 0 accesses data attached to socket 2, it takes two hops to deliver that data.
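One NUMA-friendly pattern for an app with big in-memory caches is to shard the cache so each worker mostly touches its own partition; with first-touch page placement, the OS then tends to keep each shard on the node whose worker initialized it. A rough sketch (Python for illustration; all names and sizes here are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical numbers: one shard per socket on the 4-socket Opteron box.
NUM_NODES = 4
ITEMS = 1_000_000

def build_shard(node):
    # Each worker allocates and first-touches its own shard, instead of
    # one thread building a single shared structure that every socket
    # then reads across the interconnect.
    return [node] * (ITEMS // NUM_NODES)

with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    shards = list(pool.map(build_shard, range(NUM_NODES)))

total = sum(len(s) for s in shards)
```

This only illustrates the access pattern; the OS still decides actual page placement, and whether it pays off depends on how often workers cross shard boundaries.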
I'm not sure whether .NET is NUMA-aware; I don't think so. In that case, one workaround is to create one AppDomain per socket (in your case, 4).
If it's cheap to try, you may set thread affinity, though I don't think it will improve much; Windows should already do this on NUMA systems.
If you can evaluate it, Windows Server 2008 R2 should scale a bit better and perform much better when there is lock contention.
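For reference, pinning looks like this at the OS level. The sketch below uses the Linux call (`os.sched_setaffinity`) purely for illustration; on Windows/.NET you would use `Process.ProcessorAffinity` or the Win32 `SetThreadAffinityMask` instead:

```python
import os

# Cores this process may currently run on (the full mask by default).
allowed = os.sched_getaffinity(0)

# Pin the process to a single core, e.g. the lowest-numbered one.
one_core = {min(allowed)}
os.sched_setaffinity(0, one_core)
assert os.sched_getaffinity(0) == one_core

# Restore the original mask so the scheduler is free again.
os.sched_setaffinity(0, allowed)
```

Note that hard pinning can backfire: it prevents the scheduler from migrating threads away from a busy core, which is partly why Windows' own NUMA-aware scheduling is usually good enough.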
Thanks very much for your help. We are memory-intensive, with lots of in-memory caches.
Can you elaborate on w2k8 a little? What in particular does it do differently with locking? Is this AMD-specific? thx