
avk
Adept III

Some suggestions to improve IPC on the K10+

Perhaps some of them already exist in the 45nm K10...

Well, of course, I'm not one of those guys who architect chips. But reading the K10 optimization manual (#40546), especially Appendix C, "Instruction Latencies", it occurred to me that several instructions could gain throughput if AMD were to (slightly?) improve the FSTORE unit:

1) Almost all of the "MOVxxx xmmreg1, xmmreg2" forms, like MOVSS/D, MOVLHPS/D, MOVHLPS/D, MOVSLDUP, MOVSHDUP. The most important instructions here are MOVSS/D, which appear frequently in MSVC-generated code (see the sketch after this list).

2) Next target is the data-shuffling instructions (PACKxxxx, UNPCKxxxx, xSHUFxxx). I'm not quite sure how difficult it would be to implement these instructions in the FSTORE unit, but I think it would be somewhat easier than duplicating the whole FADD.

3) Last target is the 128-bit logical operations (xANDx, xORx, etc.). The argument is the same as in 2).
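A minimal sketch of point 1) (my own example, not from the post): the kind of scalar float code where SSE codegen produces MOVSS register copies. Whether MOVSS actually appears depends on the compiler version and flags.

    /* Scalar SSE code of the kind discussed above: on compilers that
       target SSE for scalar floats (e.g. MSVC with /arch:SSE2), the
       float temporaries below typically become MOVSS xmmreg,xmmreg
       copies around SUBSS/MULSS/ADDSS. */
    float lerp(float a, float b, float t)
    {
        float d = b - a;      /* SUBSS; result copied with MOVSS */
        return a + d * t;     /* MULSS + ADDSS on xmm registers  */
    }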
0 Likes
24 Replies

Thanks, avk. AMD is always evaluating ways to improve performance, and we appreciate input from developers like you. Thanks also for being such an active contributor to the forums!
0 Likes

Originally posted by: devcentral

Thanks, avk. AMD is always evaluating ways to improve performance, and we appreciate input from developers like you.

Could this topic be a wish list?

How about lower latency for simple SSE instructions? Some integer algorithms are worth vectorizing on other processors but not on K10 due to latency penalties; it's worse for instructions like PMOVMSKB and PACKxxxx, which are likely "serializing" instructions, but 2 cycles for a PADDx is painful too.
And don't forget about the 4 cycles of ADDPS.
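To illustrate why PMOVMSKB latency hurts (my own sketch, not from the post): in a scan loop, the branch depends on PMOVMSKB's result, so its latency sits on the critical path of every iteration.

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>

    /* memchr-style scan: PCMPEQB + PMOVMSKB per 16 bytes. The loop
       branch consumes PMOVMSKB's result, so every cycle of its latency
       is paid per iteration. (__builtin_ctz is a GCC/Clang builtin.) */
    const char *find_byte(const char *p, char c, size_t n)
    {
        __m128i needle = _mm_set1_epi8(c);
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
            int m = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle)); /* PMOVMSKB */
            if (m)
                return p + i + __builtin_ctz(m);
        }
        return NULL;
    }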
0 Likes

Thanks for sharing all the information here!

0 Likes

My wishlist:

- Only 2-way L1 associativity is a PITA. I know, I know, the laws of physics are a bit*h, but god*amn...

- All the PrefetchX instructions currently prefetch only into L1, which is BAD, especially since L1 is just 2-way associative.

- All that L3 cache is nice, but what is the use of it if it is so hard to use for intercore communication? Would it have been so hard to make a mechanism to lock some L3 cache lines as unswappable and prevent them from being spilled into RAM (i.e., use them as fast internal shared RAM)?

Also, it would be nice to be able to declare some RAM area uncacheable for L1/L2 but write-back for L3, with the option of not being actually backed by RAM.

As it is now, fast intercore synchronisation seems highly awkward...

- Is there any possibility of being able to switch off the carcinogenic x86/x86_64 ISA and use something closer to the internal native RISC uOPs?

- SSE code tends to run in small loops, but conditional jumps based on xmm register contents are not cheap. Would it be expensive to reserve some smallish memory area (say 1 to maybe 4 cache lines = 64-256 bytes) as an internal SSE instruction buffer, where once-decoded SSE instructions would be stored in fully decoded form and then re-executed until the loop exit conditions were met?

That way tight-loop SSE code wouldn't clog the instruction decoders, and general instructions could be executed in parallel. Also, there wouldn't be any need for frequent, expensive GPR-XMM register communication (a sketch of such a loop follows)...
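A sketch of the kind of loop being described (my own example; function and names are illustrative): the exit test must cross from XMM to GPR via MOVMSKPS and end in a scalar branch on every pass, which is exactly the traffic the proposed buffer would keep inside the SSE unit.

    #include <xmmintrin.h>   /* SSE */

    /* Iterate x *= a until all four lanes fall below eps. Assumes
       0 < a < 1 in every lane so the loop terminates. */
    __m128 decay(__m128 x, __m128 a, __m128 eps)
    {
        for (;;) {
            x = _mm_mul_ps(x, a);
            int done = _mm_movemask_ps(_mm_cmplt_ps(x, eps)); /* MOVMSKPS: XMM -> GPR */
            if (done == 0xF)        /* scalar branch on xmm-derived data */
                return x;
        }
    }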

0 Likes

One more:

 

The decoder unit seems to be linked to L1 through 2 x 128-bit buses each way, but the L1-L2 link is only half that width in the write direction.

 

This means there is no big harm if data to be burst-read resides in L2 (except the initial latency, of course), but there could be substantial delays when burst-writing some area...

0 Likes
avk
Adept III

eduardoschardong: Like I said, I'm not a chip architect, but IMHO improving latency is (much?) more difficult than improving throughput. If that's true, it is unlikely to happen within one CPU generation (K10). Maybe in Bulldozer... But I must admit that I agree with you about the nasty behaviour of the PMOVMSKB and PACKxxxx instructions.
0 Likes
avk
Adept III

Could someone answer me: is it true that Shanghai doesn't support SSE4.1? What about SSSE3? How difficult would it be to implement SSSE3, SSE4.1, and SSE4.2 in the K10+? Is it possible?
0 Likes

Hi! Shanghai, like Barcelona, is an AMD Family 10h processor. AMD Family 10h processors support SSE4a (EXTRQ/INSERTQ and MOVNTSD/MOVNTSS), not the SSE4.1 subset of SSE4 instructions. In addition, AMD Family 10h processors implement the ABM (Advanced Bit Manipulation) instructions LZCNT and POPCNT.

The Wikipedia pages (http://en.wikipedia.org/wiki/SSE4) have a reasonable summary of this. I must say we do not map AMD Family 10h to "K10". I can't comment on your other questions.
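For illustration (my own sketch, not from the post): the ABM instructions named above, reached via GCC/Clang builtins. With -march=amdfam10 these compile to single POPCNT/LZCNT instructions; on other targets they fall back to generic sequences.

    #include <stdint.h>

    /* POPCNT and LZCNT via builtins. Note __builtin_clz is undefined
       for 0 (it maps to BSR on non-LZCNT hardware), so the zero case
       is handled explicitly. */
    int set_bits(uint32_t x)      { return __builtin_popcount(x); }     /* POPCNT */
    int leading_zeros(uint32_t x) { return x ? __builtin_clz(x) : 32; } /* LZCNT  */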

0 Likes
avk
Adept III

tracy_carver: Thank you for the answer! It's a pity to read about the absence of SSE4.1 in Shanghai. Although I don't think these extensions are critical to support, the fact itself is alarming. I hope AMD will implement them sooner or later.

BTW, could you please give me your own (not AMD's) short comment on my first post, about improving the throughput of several SSE instructions by implementing them in the FSTORE unit?

P.S. "Family 10h" is too long, and "K10" is much shorter .
0 Likes
avk
Adept III

Please, somebody tell me: what exactly has been improved in Shanghai at the IPC level?
0 Likes
avk
Adept III

A wise old man once said: "No news is good news!" Alas, in this case I wouldn't agree with him. Is the information about Shanghai's improvements really such a big secret?
0 Likes

The microarchitecture is the same as in Barcelona. The main improvement in IPC therefore comes from the larger L3 cache, which is 6 MB versus the 2 MB on Barcelona. With Shanghai, the total on-die cache increases to 8 MB (4x 512 KB L2 + 6 MB L3), where Barcelona had 4 MB (4x 512 KB L2 + 2 MB L3). That's a doubling of on-die cache. A larger cache helps improve data re-use, and therefore IPC can increase depending on the workload.

Regarding your first post, I can just echo devcentral's thanks for the feedback. I am curious about your application of these instructions; specifically, what sort of code are you writing: games, multimedia, or library routines for such? That would be very interesting to understand.

0 Likes
avk
Adept III

Thanks for the answer! Of course, the larger the cache, the more data it can hold, but I had hoped that Shanghai's IPC would be improved at the core level, not only at the cache level.

About the code I write: it's a game.
0 Likes
avk
Adept III

Most of your suggestions, IMHO, are not applicable to the 45nm K10. The most that can be done is minor architectural improvements, not major ones:

1) L1 associativity: Very doubtful. Maybe in Bulldozer.
2) PrefetchX into L1 is bad: Why? Because of the low associativity of L1?
3) Lock L3 areas: Interesting idea, but its implementation would require additional L3-control instructions. Maybe if you suggest some pseudo-code here to describe how it would work, somebody at AMD will answer.
4) Native uOps: Forget it. If you want to write extremely optimized assembler code, optimize it yourself or do it with a good compiler.
5) Small internal memory for loops: I hope that Bulldozer will have something like that, because Intel's Core 2 and Core i7 already have it (do you mean the "Loop Stream Detector"?).
0 Likes

1) L1 associativity: Very doubtful. Maybe in Bulldozer.


Why is that? The C2D has had it for quite some time.

 

2) PrefetchX into L1 is bad: Why? Because of the low associativity of L1?


 

Amongst other things. You usually prefetch the first location of the next batch of data you intend to use while you are still working on the current batch. If that data arrives in your L1 while you are still working on the current batch, your performance tanks (see the sketch after this list):

1. because you usually NEED both ways of L1

2. because even without the previous point, there is usually some time penalty for filling L1
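A minimal sketch of the batch-prefetch pattern described above (my own example; names are illustrative):

    #include <xmmintrin.h>
    #include <stddef.h>

    /* While summing the current batch, prefetch the start of the next
       one. On a core where PREFETCH fills L1 directly, that arriving
       line competes for one of only two ways that the current batch
       may still need. */
    float sum_batches(const float *data, size_t nbatch, size_t batch)
    {
        float s = 0.0f;
        for (size_t b = 0; b < nbatch; b++) {
            const float *cur = data + b * batch;
            if (b + 1 < nbatch)
                _mm_prefetch((const char *)(cur + batch), _MM_HINT_T0);
            for (size_t i = 0; i < batch; i++)
                s += cur[i];
        }
        return s;
    }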

 

3) Lock L3 areas: Interesting idea, but its implementation would require additional L3-control instructions. Maybe if you suggest some pseudo-code here to describe how it would work, somebody at AMD will answer.


I don't have a clue about the optimal implementation. For that, I'd have to monkey around with a few test cases, for which I don't have the time. It just seems odd to me that Motorola (= Freescale now) had the wisdom to do such a thing in their ColdFire microcontrollers, while AMD didn't see it as necessary to make such arrangements, especially after all the hype about K10 being a "true quadcore" with high intercore bandwidth. What's the use of that if one has to train on a Rubik's cube for a month in order to be able to write the code?
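For what it's worth, a purely hypothetical sketch of what such an interface might look like: none of these intrinsics exist, the names are invented, and they are stubbed as no-ops so the sketch compiles. A real version would need new instructions or control registers.

    /* Hypothetical L3-pinning primitives (invented names, no-op stubs). */
    static inline void l3_pin(void *line)   { (void)line; /* would pin the line in L3, never spilled to RAM */ }
    static inline void l3_unpin(void *line) { (void)line; /* would release the line */ }

    /* One 64-byte line used as a pinned mailbox for core-to-core hand-off. */
    struct mailbox { volatile long seq; long payload[7]; };

    void mailbox_open(struct mailbox *m)  { l3_pin(m); m->seq = 0; }
    void mailbox_close(struct mailbox *m) { l3_unpin(m); }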

4) Native uOps: Forget it. If you want to write extremely optimized assembler code, optimize it yourself or do it with a good compiler.


Such code would probably very substantially unload the instruction decoder, which could mean increased instruction bandwidth, lower power consumption, etc.

Also, a sane, orthogonal ISA would make the compiler's job much easier. The x86_64 ISA is insane and mostly hollow. Many instructions are leftovers from 8086/186/286/386/486/etc. times and effectively come with a "don't use" tag these days.

5) Small internal memory for loops: I hope that Bulldozer will have something like that, because Intel's Core 2 and Core i7 already have it (do you mean the "Loop Stream Detector"?).


AFAIK Intel uses it just as a handy small separate cache for keeping the few most recently executed instructions in decoded form.

I had in mind going one step further: having such a buffer plus extra circuitry for executing small loops within it.

SSE units as they are now are good for number crunching, but not for making decisions based on flags or register contents. You can do it, but the latencies are big.

Since GPR-SSE register communication is also expensive, and since decoder bandwidth is 3 instructions/cycle while you have 6 units (3 ALU/AGU + 3 SSE/FPU), it would make sense to have an SSE unit capable of autonomously executing short loops. For that, one would probably need an extra register as a loop counter (if an existing SSE register couldn't play that role) and a few extra SSE opcodes.

There are a few possible implementations. One could be similar to Intel's: the last few instructions end up in a "loop buffer" until the CPU reaches an SSE LOOPcc instruction, at which point the SSE unit continues execution on its own until the loop counter expires or the condition code breaks...
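To make the proposal concrete (my own sketch; SSE LOOPcc is hypothetical, not a real instruction), here is the shape of loop it would target, annotated with where the new instruction would take over:

    #include <emmintrin.h>   /* SSE2 */

    /* The body would sit fully decoded in the small buffer and re-issue
       inside the SSE unit until the condition breaks. Assumes `limit`
       is reachable (e.g. all lanes 0xFFFF, where PADDUSW saturates). */
    __m128i count_up(__m128i acc, __m128i step, __m128i limit)
    {
        for (;;) {                           /* body captured in loop buffer */
            acc = _mm_adds_epu16(acc, step);
            __m128i hit = _mm_cmpeq_epi16(acc, limit);
            if (_mm_movemask_epi8(hit))      /* today: PMOVMSKB + scalar Jcc */
                break;                       /* proposal: SSE LOOPcc instead */
        }
        return acc;
    }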


0 Likes
avk
Adept III

About the L1 associativity: Yes, Core 2 has good L1 associativity, but that was implemented during the design of that CPU. K10 is already designed, so its L1 associativity is unlikely to be improved, IMHO.

About PrefetchX into L1: The previous AMD CPU generation had the ability to prefetch into L2. Do you think that was better than the current behaviour?

About locking L3 areas: Well, your point of view is interesting. Let's hope that someone at AMD will look at this topic.

About native uOps: I agree with your words about the x86 ISA. Yes, it's old, lame, insane, whatever. But it's an established ISA. The market doesn't like revolutions (IA64, a.k.a. "Itanium"); it likes evolutions (AMD64, a.k.a. x86-64).

About small internal memory for loops: AFAIK, you can improve decision making by using SSE4.1's PTEST instruction. Alas, current AMD CPUs support neither SSE4.1 nor SSE4.2.
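For illustration (my own sketch): PTEST sets ZF directly from an xmm value, so the branch no longer needs a MOVMSK trip through a general-purpose register.

    #include <smmintrin.h>   /* SSE4.1 -- exactly what Family 10h lacks */

    /* Returns nonzero if any 32-bit lane of a equals the same lane of b. */
    int any_equal(__m128i a, __m128i b)
    {
        __m128i eq = _mm_cmpeq_epi32(a, b);
        return !_mm_testz_si128(eq, eq);     /* PTEST + SETcc */
    }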
0 Likes
avk
Adept III

I think that the Propus core will not be a commercial success. A quad-core with no L3 cache is a defective formula, IMHO. I think AMD should design something different: a dual-core with 4 MB of L3. That formula is much more effective (for desktops and especially notebooks) than Propus', although the die could be slightly bigger. Look at Intel's Wolfdale: this CPU is excellent, so I think AMD should have something similar, because four cores are too much for a casual user.
0 Likes
avk
Adept III

I'm curious: what if the Deneb core had a slightly different cache formula, not "4x0.5 MB of L2 + 6 MB of L3" but "4x1.0 MB of L2 + 4 MB of L3" instead? Would this variant be more effective in the desktop market? If so, by how many percent? There are very hard times ahead for AMD, at least two years (I mean before Bulldozer), and I think it would be smart to make a little (I hope) redesign of the Deneb core in order to improve its performance per clock.
0 Likes

What about a Phenom II X6? I know that Istanbul is a server/workstation chip, but I think an X6 desktop CPU could be attractive enough too. You know, some rich customers do like to buy expensive PCs, and I can assure you that not all of them are Intel fans. Just imagine this slogan:

Dragons are very strong and very different beasts at the same time... We've grown a six-headed one recently. Dare you tame him?

0 Likes
godsic
Journeyman III

AMD, PLEASE RELEASE A CPU WITH A SHARED MULTICHANNEL L1 DATA CACHE AND SETS OF SHAREABLE SIMD UNITS.

It is obvious that from now on CPUs must have virtual cores(!), an ARRAY of execution units, a shareable data cache, and as many pipelines as you can build to feed all this stuff!

Memory bandwidth on the K10.5 is very, very weak (testing a Core i7 against a Phenom II 940 results in a 2x defeat of the PII). The test code uses all cores and is very SSE-intensive, resulting in 100% CPU load. Operation patterns like fetch from A, do some work, store to B, do some more, store to C are a bullet to AMD's head! Yes, AMD was first to introduce an on-chip memory controller on x86, but it is time to modify it: FB-DIMM or 4-way DDR3 🙂
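A sketch of the access pattern being described (my own example; names and the use of streaming stores are illustrative, not from the post): one read stream feeding two write streams, with MOVNTPS at least keeping B and C from polluting the caches.

    #include <xmmintrin.h>   /* SSE */
    #include <stddef.h>

    /* Fetch from A, do some work, store to B; do more, store to C.
       Assumes 16-byte-aligned pointers and n a multiple of 4. */
    void fan_out(const float *A, float *B, float *C, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(A + i);                          /* fetch from A */
            _mm_stream_ps(B + i, _mm_mul_ps(v, v));                 /* store to B   */
            _mm_stream_ps(C + i, _mm_add_ps(v, _mm_set1_ps(1.0f))); /* store to C   */
        }
        _mm_sfence();   /* order the non-temporal stores */
    }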

0 Likes


Originally posted by: avk What about a Phenom II X6? I know that Istanbul is a server/workstation chip, but I think an X6 desktop CPU could be attractive enough too.


Same trap as with the early Opterons. Since the main market for those is the server market, customers who find them competitive and attractive on the WS market at some moment might get themselves into a hopeless upgrade situation.

Namely, the WS-server market parity situation can (and will) change. And since Opterons have made a much more solid name in the server market, AMD can command higher prices there. If one uses them in WS machines, one might find some later generation of Optys simply ridiculously expensive.

Been there, done that with Opteron 2xx and Tyan dual-socket boards...


0 Likes

Originally posted by: avk I'm curious: what if the Deneb core had a slightly different cache formula, not "4x0.5 MB of L2 + 6 MB of L3" but "4x1.0 MB of L2 + 4 MB of L3" instead?


You can't do such apples-for-bicycles comparisons here.

Each byte of L1 is probably much more demanding than a byte in L2, and that one is a league ahead of a byte in L3, regarding speed, latency, power consumption, die area, and interconnect complexity.

The problem is not so much the size of the cache as its quality: foremost, associativity and latency.


0 Likes

Originally posted by: Brane2

You can't do such apples-for-bicycles comparisons here.

Each byte of L1 is probably much more demanding than a byte in L2, and that one is a league ahead of a byte in L3, regarding speed, latency, power consumption, die area, and interconnect complexity.

The problem is not so much the size of the cache as its quality: foremost, associativity and latency.

Maybe you're right, maybe not. But I believe that most desktop applications would benefit from my hypothetical formula "4x1 MB L2 + 4 MB L3" over the real Deneb's "4x0.5 MB L2 + 6 MB L3" because of the larger, fast L2 cache.

 

About the Phenom II X6: they are on AMD's roadmap for Q2'2010.

0 Likes