Originally posted by: devcentral
Thanks, avk, AMD is always evaluating ways to improve performance and we appreciate input from developers like you.
Thanks for sharing all the information here!
My wishlist:
- Only 2-way L1 associativity is a PITA. I know, I know, the laws of physics are a bit*h, but god*amn...
- All PrefetchX variants currently prefetch only into L1, which is BAD, especially since L1 is just 2-way associative.
- All that L3 cache is nice, but what is the use of it if it is so hard to use for intercore communication? Was it so hard to make a mechanism to lock some L3 cache line as unswappable and prevent it from being spilled into RAM (i.e. use it as fast internal shared RAM)?
Also, it would be nice to be able to declare some RAM area uncacheable for L1/L2 but write-back for L3, with the option of it not being actually backed by RAM.
As it is now, fast intercore synchronisation seems highly awkward...
- Is there any possibility of being able to switch off the carcinogenic x86/x86_64 ISA and use something nearer the internal native RISC uOPs?
- SSE code tends to run in small loops, but conditional jumps based on XMM register content are not cheap. Would it be expensive to reserve some smallish memory area (say 1 to maybe 4 cache lines = 64-256 bytes) as an internal SSE instruction buffer, where once-decoded and executed SSE instructions would be stored in fully decoded form and then re-executed until the loop exit conditions were met?
That way tight-loop SSE code wouldn't clog the instruction decoders, and general instructions could be executed in parallel. Also, there wouldn't be any need for frequent, expensive GPR-XMM register communication...
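As a concrete illustration of the prefetch-ahead pattern discussed in the wishlist, here is a minimal C sketch using SSE intrinsics. The function name and the one-cache-line-ahead distance are made up for illustration; `_MM_HINT_T1` asks for an outer-level fill on CPUs that honour it, while the complaint above is that on K10 every PREFETCHx variant fills the 2-way L1 regardless of the hint:

```c
#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>

/* Hypothetical helper: sum an array while prefetching one batch
 * (16 floats = one 64-byte cache line) ahead of the current work. */
float sum_with_prefetch(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* At the start of each 16-float batch, touch the next one. */
        if ((i % 16) == 0 && i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T1);
        s += a[i];
    }
    return s;
}
```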
One more:
The decoder unit seems to be linked to L1 through 2 x 128-bit buses each way, but the L1-L2 link is only half that in the write direction.
Which means there is no big harm if data that is to be burst-read resides in L2 (except the initial latency, of course), but there could be substantial delays when burst-writing some area...
Hi, Shanghai, like Barcelona, is an AMD Family 10h processor. AMD Family 10h processors support SSE4a (EXTRQ/INSERTQ and MOVNTSD/MOVNTSS), and not the SSE4.1 subset of SSE4 instructions. In addition, AMD Family 10h processors implement the ABM (Advanced Bit Manipulation) instructions, LZCNT and POPCNT.
The Wikipedia pages (http://en.wikipedia.org/wiki/SSE4) have a reasonable summary of this. I must say we do not map AMD Family 10h to "K10". I can't comment on your other questions.
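For readers unfamiliar with ABM, here is a small C sketch of what LZCNT and POPCNT compute, using GCC/Clang builtins as stand-ins (the wrapper names are illustrative). Compiled with the appropriate flags (`-mpopcnt`/`-mlzcnt` or `-mabm`), these builtins lower to the single instructions on Family 10h and later parts; otherwise the compiler emits a fallback sequence:

```c
#include <stdint.h>

/* Population count: number of set bits in x. */
static inline int popcount32(uint32_t x)
{
    return __builtin_popcount(x);
}

/* Leading-zero count. __builtin_clz is undefined for 0, while the
 * LZCNT instruction itself returns the operand width for a zero
 * input, so handle that case explicitly. */
static inline int lzcnt32(uint32_t x)
{
    return x ? __builtin_clz(x) : 32;
}
```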
The microarchitecture is the same as in Barcelona. The main improvement in IPC therefore comes from the larger L3 cache, which is 6 MB versus the 2 MB on Barcelona. With Shanghai, the total on-die cache increases to 8 MB (4x 512K L2 + 6 MB L3), where Barcelona was 4 MB (4x 512K L2 + 2 MB L3). That's a doubling of on-die cache. Larger cache helps improve data re-use, and therefore IPC can increase depending on the workload.
Regarding your first post, I can just echo devcentral's thanks for the feedback. I am curious about your application of these instructions: specifically, what sort of code are you writing, say games, multimedia, or library routines for such? That would be very interesting to understand.
1) L1 associativity: Very doubtful. Maybe in Bulldozer.
Why is that? C2D has had it for quite some time.
2) PrefetchX into L1 is bad: Why? Because of the low associativity of L1?
Amongst other things. You usually prefetch the first location of the next batch of data while you are still working with the current batch. If that data arrives in your L1 while you are still working on the current batch, your performance tanks:
1. because you usually NEED both ways of L1
2. because even without the previous point, there is usually some time penalty for filling L1
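The conflict-miss argument can be made concrete. Assuming K10's published L1D geometry (64 KB, 2-way, 64-byte lines), the way size is 32 KB, so any two addresses 32 KB apart compete for the same 2-entry set; three such streams are guaranteed to thrash. A small sketch of the set-index arithmetic (constants taken from those assumed figures):

```c
#include <stdint.h>

/* Assumed K10 L1D geometry: 64 KB total, 2 ways, 64-byte lines.
 * Way size = 64 KB / 2 = 32 KB, giving 32768 / 64 = 512 sets. */
#define LINE_SIZE 64u
#define WAY_SIZE  (32u * 1024u)
#define NUM_SETS  (WAY_SIZE / LINE_SIZE)

/* Which L1 set an address maps to under this geometry. Addresses
 * whose difference is a multiple of WAY_SIZE collide in one set,
 * and a 2-way set can hold only two of them at once. */
static unsigned l1_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}
```

For example, addresses 0, 32768, and 65536 all map to set 0: three live lines, two ways, so at least one is evicted on every round.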
3) Lock L3 areas: Good idea, but its implementation will require additional L3-control instructions. Maybe if you suggest some pseudo-code here to describe how it would work, then somebody at AMD will answer.
I don't have a clue about the optimal implementation. For that, I'd have to monkey around with a few test cases, for which I don't have the time. It just seems odd to me that Motorola (now Freescale) had the wisdom to do such a thing in their ColdFire microcontrollers, while AMD didn't see it as necessary to make such arrangements, especially after all that hype about K10 being a "true quad-core" with high intercore bandwidth. What's the use of that, if one has to train on a Rubik's cube for a month in order to be able to write the code?
4) Native uOps: Forget it. If you want to write extremely optimized assembler code, optimize it yourself or do it with a good compiler.
Such code would probably very substantially unload the burden from the instruction decoder, which could mean increased instruction bandwidth, lower power consumption, etc.
Also, a sane, orthogonal ISA would make the compiler's job much easier. The x86_64 ISA is insane and mostly hollow. Many instructions are leftovers from 8086/186/286/386/486/etc. times and effectively carry a "don't use" tag as it is now.
5) Small internal memory for loops: I hope that Bulldozer will have something like that, because Intel's Core 2 and Core i7 already have it (do you mean the "Loop Stream Detector"?).
AFAIK Intel uses it just as a handy small separate cache for keeping the few most recently executed instructions in decoded form.
I had in mind one step further: having such a buffer plus extra circuitry for executing small loops from it.
SSE units as they are now are good for number crunching, but not for making decisions based on flags or register contents. You can do it, but the latencies are big.
Since GPR-SSE register communication is also expensive, and since decoder bandwidth is 3 instr/cycle while you have 6 units (3 ALU/AGU + 3 SSE/FPU), it would make sense to have an SSE unit capable of autonomously executing short loops. For that, one would probably need an extra register as a loop counter (if an existing SSE reg couldn't play that role) and a few extra SSE opcodes.
There are a few possible implementations. One could be similar to Intel's: the last few instructions end up in a "loop buffer" until the CPU reaches an SSE LOOPcc instruction, at which point the SSE unit continues execution on its own until the loop counter expires or cc breaks...
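To make the pain point concrete, here is a C-intrinsics sketch of the kind of loop being discussed: the exit condition depends on XMM contents, so every iteration pays for a MOVMSKPS trip from the SSE domain to a GPR before it can branch. The function name is made up for illustration:

```c
#include <xmmintrin.h>
#include <stddef.h>

/* Return the index of the first negative element, or n if none.
 * Each iteration does an XMM->GPR transfer (_mm_movemask_ps) just
 * to test the loop's exit condition -- exactly the round-trip a
 * dedicated SSE loop buffer with its own LOOPcc could avoid. */
size_t first_negative(const float *a, size_t n)
{
    const __m128 zero = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(&a[i]);
        int mask = _mm_movemask_ps(_mm_cmplt_ps(v, zero));
        if (mask) {
            /* Find the first set lane within this group of four. */
            while (!(mask & 1)) {
                mask >>= 1;
                i++;
            }
            return i;
        }
    }
    for (; i < n; i++)          /* scalar tail */
        if (a[i] < 0.0f)
            return i;
    return n;
}
```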
What about Phenom II X6? I know that Istanbul is a server/workstation chip, but I think that an X6 desktop CPU could be attractive enough too. You know, some rich customers do like to buy an expensive PC, and I can assure you that not all of them are Intel fans. Just imagine this slogan:
Dragons are very strong and very different beasts at the same time... We've grown a six-headed one recently. Dare you tame him?
AMD, PLEASE RELEASE A CPU WITH A SHARED MULTICHANNEL L1 DATA CACHE AND SETS OF SHARABLE SIMD UNITS.
It is obvious that from now on a CPU must have virtual cores(!), an ARRAY OF execution units, a sharable data CACHE, and as many pipelines as you can build to keep all this stuff busy!
Memory bandwidth on K10.5 is very, very weak (testing Core i7 against Phenom II 940 results in a 2x defeat for the PII). The test code uses all cores and is very SSE-intensive, resulting in 100% CPU load. Operations like fetch from A - do some work - store to B - do some work - store to C are a bullet for AMD's head! Yes, AMD was the first to introduce an on-chip memory controller on x86, but it is time to modify it: FBDIMM or 4-way DDR3 🙂
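One conventional way to ease the store-bandwidth pressure of the fetch-A/store-B/store-C pattern is non-temporal stores (MOVNTPS), which bypass the caches and skip the read-for-ownership traffic that ordinary stores to B and C would generate. A minimal sketch, with a made-up function name and scale factors; it assumes `b` and `c` are 16-byte aligned and `n` is a multiple of 4:

```c
#include <xmmintrin.h>
#include <stddef.h>

/* Read one input stream, write two scaled output streams with
 * non-temporal stores so the outputs don't pollute the caches or
 * trigger read-for-ownership on their lines. */
void scale_two_ways(const float *a, float *b, float *c, size_t n)
{
    const __m128 two  = _mm_set1_ps(2.0f);
    const __m128 half = _mm_set1_ps(0.5f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&a[i]);
        _mm_stream_ps(&b[i], _mm_mul_ps(v, two));   /* B = 2*A   */
        _mm_stream_ps(&c[i], _mm_mul_ps(v, half));  /* C = 0.5*A */
    }
    _mm_sfence();  /* order the streamed stores before returning */
}
```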
Please, somebody tell me: what exactly has been improved in Shanghai on IPC-level?
Originally posted by: avk What about Phenom II X6? I know that Istanbul is a server/workstation chip, but I think that X6 desktop CPU could be attractive enough too.
Same trap as with the early Opterons. Since the main market for those is the server market, customers who would find them competitive and attractive at some moment in the WS market might get themselves into a hopeless upgrade situation.
Namely, the WS-server market parity situation can (and will) change. And since Opterons have made a much more solid name in the server market, AMD can command higher prices there. If one uses them in WS machines, one might find some later generation of Optys simply ridiculously expensive.
Been there, done that with Opterons 2xx and Tyan dual socket boards...
Originally posted by: avk I'm curious, what if Deneb core would have a slightly different cache formula: not "4x0.5 MB of L2 + 6 MB of L3", but "4x1.0 MB of L2 + 4 MB of L3" instead?
You can't do such apples-for-bicycles comparisons here.
Each byte of L1 is probably much more demanding than a byte in L2, and that one is a league ahead of a byte in L3, regarding speed, latency, power consumption, die area, and interconnect complexity.
The problem is not so much the size of the cache as its quality, foremost associativity and latency.
Maybe you're right, maybe not. But I believe that most desktop applications would benefit from my hypothetical formula "4x1 MB L2 + 4 MB L3" against the real Deneb's "4x0.5 MB L2 + 6 MB L3", because of the larger, faster L2 cache.
Originally posted by: Brane2
You can't do such apples-for-bicycles comparisons here.
Each byte of L1 is probably much more demanding than a byte in L2, and that one is a league ahead of a byte in L3, regarding speed, latency, power consumption, die area, and interconnect complexity.
The problem is not so much the size of the cache as its quality, foremost associativity and latency.
About Phenom II X6: they are on AMD's roadmap for Q2 2010.