We run a multithreaded .NET system at our company. Our production systems are:
2x Intel(R) Xeon(R) CPU X5355 @ 2.66GHz (4C; 2.66GHz; 2x 4MB L2; 1.33GHz FSB), 16GB
And we're testing:
4x Quad-Core AMD Opteron(tm) Processor 8378 (4C; 2.4GHz; 2GHz IMC; 4x 512kB L2; 6MB L3), 128GB
We're noticing that our software runs only about 30% faster on the Opteron system even though it has 8 more cores than the current system. While I realize there's a lot at play here, we were expecting better performance.
Running Sandra on both systems shows that the inter-core test on the Opteron system is actually slower than its Intel counterpart in both bandwidth and latency. The .NET benchmarks are on the order of 20-40% faster on AMD (the results keep varying). Everything else seems materially better on the Opteron (CPU/cache/memory tests).
We are running Windows 2003 64-bit on the Intel-based system
We are running Windows 2003 Enterprise 64-bit on the AMD-based system
SciMark C# implementation shows:
Composite Score: 198.24 MFlops
FFT : 114.84 - (1024)
SOR : 354.37 - (100x100)
Monte Carlo : 23.99
Sparse MatMult : 208.67 - (N=1000, nz=5000)
LU : 289.33 - (100x100)
Composite Score: 174.33 MFlops
FFT : 110.32 - (1024)
SOR : 309.89 - (100x100)
Monte Carlo : 31.13
Sparse MatMult : 191.54 - (N=1000, nz=5000)
LU : 228.77 - (100x100)
A few questions:
1) Is anyone experiencing the same with their .NET application?
2) We're not ruling out the need to profile our software, but we want to rule out hardware and/or hardware + OS issues first. Are there Windows Server and/or .NET versions built to run optimally on AMD?
3) Is there better benchmark software than Sandra, CPU-Z and SciMark for comparing the systems and helping identify potential hardware issues?
Well... First, you are not running a system with 8 times more cores. Going from 2 quad-cores to 4 quad-cores gives you just twice as many cores, so do not expect huge gains.
Second, synthetic benchmarks (Sandra, SciMark, etc.) won't help. They do not behave like your app and won't tell you why your app isn't faster on the new system. Profile it.
Third, before profiling... Have you checked in Task Manager that all cores are being used? Not every app uses all 16 cores (SciMark seems to be one that doesn't). If yours doesn't use them, you will hardly get any improvement by adding processors.
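To put a rough number on the expectation, here is a minimal Python sketch based on Amdahl's law. It assumes (my simplification, not something measured) that a single core is equally fast on both machines, so the 1.3x gain comes only from going from 8 to 16 cores. The function names `serial_fraction` and `amdahl_speedup` are made up for illustration.

```python
def serial_fraction(speedup, cores_before, cores_after):
    """Serial fraction s implied by Amdahl's law, T(n) = s + (1 - s) / n,
    given the observed speedup T(cores_before) / T(cores_after).

    Assumes identical per-core speed on both machines (a simplification).
    """
    a = 1.0 / cores_before
    b = 1.0 / cores_after
    # Solve (s + (1 - s) * a) / (s + (1 - s) * b) = speedup for s.
    return (speedup * b - a) / (1.0 - a - speedup + speedup * b)


def amdahl_speedup(s, cores_before, cores_after):
    """Speedup predicted by Amdahl's law for a given serial fraction s."""
    t_before = s + (1.0 - s) / cores_before
    t_after = s + (1.0 - s) / cores_after
    return t_before / t_after


if __name__ == "__main__":
    s = serial_fraction(1.30, 8, 16)
    print(f"implied serial fraction: {s:.3f}")
    print(f"best case with no serial part: {amdahl_speedup(0.0, 8, 16):.2f}x")
```

Under that assumption, a 30% gain corresponds to roughly 13% of the run time not scaling with the core count. That part could be serial code, lock contention or memory stalls, and only a profiler will tell which.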
I'm unsure where I said 8x, but I understand we are going from 8 cores to 16 cores. Our app is multithreaded and we throttle our threads to the number of cores, so we were expecting much more than 30%. Agreed, the final answer is to profile the app, but I was trying to be systematic in understanding whether the hardware/OS/VM (.NET) needed any tweaking before broaching the application.
I am curious about the bandwidth and latency Sandra reports in the multi-core test.
Intel:
Inter-Core Bandwidth 13.45GB/s
Inter-Core Latency 88ns
Opteron:
Inter-Core Bandwidth 9.56GB/s
Inter-Core Latency 141ns
Does this necessarily imply that setting thread affinity to a particular CPU will help?
I don't know how Sandra gets that data, nor do I much care, but it doesn't say much: the two platforms have very different topologies.
On the Intel platform, each pair of cores sits together on a die and shares an L2 cache, each pair of dies sits in a socket and shares an FSB, and the two processors access memory uniformly through the chipset.
On the AMD platform, the four cores sit together on one die; each has a dedicated L2 and they all share an L3. Data transfer through L3 is slower than through L2 but faster than over an FSB. There is a single die per socket and dedicated memory per socket. Each socket is directly connected to two others, so two of the socket-to-socket paths are longer.
All of that affects what Sandra measures: is it the bandwidth and latency between cores 0 and 1? 0 and 2? 0 and 5? And so on.
One point you should look at if your app is memory intensive: Opteron (and all other current processors) accesses memory in a non-uniform way (NUMA). Each socket owns part of the memory, so if, for example, socket 0 accesses data held by socket 2, it takes two hops to deliver the data.
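To illustrate the hop counts, here is a small Python sketch. The ring wiring in `LINKS` is a hypothetical layout that matches the "each socket connects to two others" description; the real HyperTransport links depend on the motherboard, and the `hops` helper is only for illustration.

```python
from collections import deque

# Hypothetical wiring: four sockets in a ring, each linked to two neighbours.
# The real HyperTransport layout depends on the motherboard.
LINKS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}


def hops(links, src, dst):
    """Breadth-first search: number of links crossed from src to dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("sockets are not connected")


if __name__ == "__main__":
    for dst in range(4):
        print(f"socket 0 -> memory on socket {dst}: {hops(LINKS, 0, dst)} hop(s)")
```

With this layout, local memory costs zero hops, the two neighbouring sockets cost one hop each, and the opposite socket costs two, which is why where a thread's data lives can matter for a cache-heavy app.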
I'm not sure whether .NET is NUMA-aware. I don't think it is, in which case one solution is to create one app domain per socket (in your case 4).
If it costs you nothing, you may try setting thread affinity, but I don't think it will improve much; Windows should already do that for NUMA systems.
If you can evaluate it, Windows Server 2008 R2 should scale a bit better and perform much better when there is lock contention.
Thanks very much for your help. We are memory intensive, with lots of in-memory caches.
Can you elaborate on Windows Server 2008 a little? What in particular does it do differently with locking? Is this AMD-specific? Thanks.