16 Replies Latest reply on Dec 5, 2011 3:07 PM by Brane2

    Bulldozer: modules, cores, threads

    avk

      Hey AMD, there is some confusion in people's heads. Of course, it would be nice for you to call your own 4-module / 8-thread Bulldozer an 8-core CPU, but I believe this will confuse a lot of people. They are already in the habit of comparing CPUs core-to-core, so if you keep calling your 4-module / 8-thread Bulldozer an 8-core CPU, there will be a lot of disappointment like this:

      "Heck, my 8 core Bulldozer is slower than 6 core Gulftown. WTF?!"

      My point is: please stop calling the Bulldozer cores "cores". Yes, they are almost real cores, so yes, they deserve the "core" term, but this would be a very bad marketing mistake on your part, AMD. Instead, in the case of Bulldozer, replace the term "core" with "thread", like this:

      "4 modules / 8 threads" instead of "4 modules / 8 cores"

      If you do this, AMD, people will begin to compare Bulldozer with other CPUs using their usual "apples-to-apples" (i.e. "cores-to-cores" or, more precisely, "threads-to-threads") method, and from this point of view, your Bulldozer will be more successful.

      BTW, if you want to invent a name for your "1 module = 2 cor.. er.. threads" technology, then what about "Double Threading"?

        • Bulldozer: modules, cores, threads
          avk

          OK, here is another suggestion: if you don't want to rename "cores" to "threads", why not use "integer cores" or even "ALUs"? Check it out:

          "AMD FX-8xxx CPU. 8 integer cores (ALUs) and 4 flex floating point cores (FPUs)."

          Let me remind you, AMD, that the marketing name you want to use ("cores" instead of "threads" or "integer cores") can easily disappoint your customers if Bulldozer turns out slower than Sandy Bridge in even one test. I repeat: doing that would be a big mistake! If you don't believe me, then ask the people on your own forums before it's too late.

            • Bulldozer: modules, cores, threads
              Meteorhead

              I think this issue is already confusing enough without AMD having this naming convention. Since a dual-core, Hyper-Threaded processor advertises itself as a full-fledged quad-core CPU to the operating system (and that is what most users see), the boundary between "core" and "real core" is very thin.

              Also, many IT-related sites test and compare Intel Core-iX processors and Phenom II YX processors on a thread-core basis, which is very misleading, but somewhat rational. If a 4-threaded Intel is on par with a 4-threaded Phenom II, why not compare them and say "hey, Intel also has an IGP inside."

              Performance is all that matters, and the underlying HW really does not (if you truly think about it). Naming conventions will always serve marketing purposes, and it is always up to the experts to know what the words really mean. As long as competition exists, companies will always tend to shift the meaning of words in the direction that makes them look better, while not lying too much.

              Bulldozer is very similar to Hyper-Threading (although it works differently); in naming it differs only in that it doesn't distinguish real cores from virtual cores. You might say that is unfair, but it is the same with Stream Processors vs. Stream Cores. For shaders, it is always the Stream Processors that are advertised, when in reality Stream Processors are NOT independent of each other; they rely on data independence in the shader binary to operate. Thus, in reality, everyone knows that CUDA Cores are more flexible, and their number reflects actual performance more reliably than the Stream Processor count.

              These are things we have to get used to. Competition is tough.

                • Bulldozer: modules, cores, threads
                  avk

                  Yes, I'm aware of the difference between the Radeon and GeForce graphics architectures. But I still believe that AMD should be as modest as possible in the situation with Bulldozer. People don't like to be deceived.

                    • Bulldozer: modules, cores, threads
                      avk

                      What did I say, AMD?! You've made a big mistake by naming Bulldozer CPUs 4/6/8-"core" instead of "thread," as I suggested. Look at the reviews around the Net: almost all of them are saying "Bulldozer is crap." Is that what you wanted for your new CPU architecture?

                        • Bulldozer: modules, cores, threads
                          Brane2

                          There are many more questions about this BD stuff than just this module/core stuff.

                          Like the number of modules, for example. While they managed to cram as many as six Thuban cores onto one 346mm2 die at 45nm, you get only 4 BD modules at 32nm.

                          At 32nm one should have 2x the logic budget in the same area, and with one BD module (per AMD's statement) requiring only 18% more logic than one K10 core, they should easily be able to put 2X AS MANY MODULES ON BD THAN CORES ON THUBAN.
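The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope calculation, not die data: it assumes ideal area scaling with the square of the feature size, and it reads the 18% figure as "one module needs 18% more logic than one K10 core". The function and constant names are made up for illustration.

```python
# Back-of-the-envelope estimate. All inputs come from the post or are
# labeled assumptions; none of this is measured die data.

def area_scaling(old_nm: float, new_nm: float) -> float:
    """Ideal density gain for a shrink, assuming area scales with length squared."""
    return (old_nm / new_nm) ** 2

THUBAN_CORES = 6            # six K10 cores on the 45nm die (from the post)
MODULE_VS_K10_LOGIC = 1.18  # ASSUMPTION: one module = 18% more logic than one K10 core

density_gain = area_scaling(45, 32)
modules_in_same_area = THUBAN_CORES * density_gain / MODULE_VS_K10_LOGIC

print(f"ideal density gain 45nm -> 32nm: {density_gain:.2f}x")
print(f"modules fitting in the same logic area: {modules_in_same_area:.1f}")
```

Under these assumptions the shrink gives a bit under 2x the logic budget, and about 10 modules would fit where six K10 cores did, so closer to 1.7x than a full 2x. Real processes never scale ideally, and cache and I/O area are ignored here.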

                          Even more, had they sacrificed some cache, which for a core-intensive architecture would make perfect sense.

                          I would LOVE to see a chip with 12 or more modules, even if lean on cache, say 512K of L2 per module and perhaps 1 or 2MB of L3...

                          BD, as it is, makes no sense. First they go for multithreaded performance, and then halfway through they change their mind and put on a shitload of cache to catch up on single-thread performance.

                          As the old proverb goes, it's neither cat nor mouse....

                           

                            • Bulldozer: modules, cores, threads
                              avk

                              It seems that some software (mostly games) runs slightly faster on Bulldozer when using the "1 thread per module" method (The Method) instead of "2 threads per module"; just look at this and this. You, AMD, can use this method to improve game performance on Bulldozer right now, in Windows XP/Vista/7, not in Windows 8, in one of these ways:

                              1. Update Bulldozer's BIOS by implementing some kind of "Performance mode" core enumeration.
                              2. Cooperate with Microsoft and create an update for Windows with "Performance mode" core enumeration.
                              3. Implement The Method as a part of Catalyst AI. Of course, it won't help nVidia, but who cares?
                              4. Revive your Dual-core Optimizer by implementing The Method.
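For readers who want to try "1 thread per module" by hand, here is a minimal sketch of the idea. It is not any AMD or Microsoft tool. It assumes that logical CPUs are numbered so that IDs (0,1), (2,3), ... share a module, which should be checked on the actual machine, and it uses `os.sched_setaffinity`, which exists only on Linux; on Windows a different affinity API would be needed. The helper names are hypothetical.

```python
import os

def one_cpu_per_module(logical_cpus, cpus_per_module=2):
    """Return the lowest-numbered logical CPU of each module.

    ASSUMPTION: consecutive CPU IDs (0,1), (2,3), ... share a module.
    """
    first_in_module = {}
    for cpu in sorted(logical_cpus):
        # Keep only the first CPU seen for each module index.
        first_in_module.setdefault(cpu // cpus_per_module, cpu)
    return sorted(first_in_module.values())

def pin_current_process(cpus):
    """Restrict the calling process to the given CPUs; returns False if unsupported."""
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        os.sched_setaffinity(0, set(cpus))
        return True
    return False

if __name__ == "__main__":
    # Hypothetical 4-module / 8-thread part: 8 logical CPUs.
    print("one CPU per module:", one_cpu_per_module(range(8)))
```

A game launcher could call `pin_current_process(one_cpu_per_module(range(8)))` before starting its worker threads. Whether this helps depends on the workload, since it also halves the number of usable threads.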

                               

                                • Bulldozer: modules, cores, threads
                                  Brane2

                                  I have read the developers' thread about a patch for Linux kernel 3.2, and from what I understand it is a matter of cache aliasing in the L1i cache.

                                  A different scheduling scheme was proposed, but it would only lessen the problem, not solve it, so a different solution had to be implemented.

                                    • Bulldozer: modules, cores, threads
                                      avk

                                      I'm not sure that in the cases I've mentioned the problem is cache aliasing in L1I. Think about this: with the "1 thread per module" method there is no competition between two threads in one module. Therefore, many module resources belong to a single thread: L2, FPU, scheduler.

                                        • Bulldozer: modules, cores, threads
                                          Brane2

                                          True, but it goes deeper than that.

                                          If they execute different programs, each "core" in the module can knock out L1i content that the other core needs, so it aggravates the effect.

                                          Cache aliasing problems are neither new nor unique to AMD or BD, but here they are much more visible, since the L1 instruction cache is shared between cores.
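To make the eviction effect concrete, here is a toy model of a set-associative cache with LRU replacement. It only illustrates generic set-index conflicts; it is not the exact Fam15h mechanism discussed in the kernel thread. The geometry (64 KB, 2-way, 64-byte lines), the addresses and all names are assumptions chosen for the example.

```python
from collections import OrderedDict

# ASSUMED geometry for illustration: 64 KB, 2-way, 64-byte lines.
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 64
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 512 sets
ALIAS_STRIDE = NUM_SETS * LINE_BYTES           # addresses this far apart share a set

def set_index(addr: int) -> int:
    """Set that the cache line containing addr maps to."""
    return (addr // LINE_BYTES) % NUM_SETS

def count_misses(addresses, ways=WAYS):
    """Replay an access sequence through an LRU set-associative cache."""
    sets = {}
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        lru = sets.setdefault(set_index(addr), OrderedDict())
        if line in lru:
            lru.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return misses

if __name__ == "__main__":
    a = 0x400000            # hypothetical code address used by one core
    b = a + ALIAS_STRIDE    # code of the other core, same set
    c = a + 2 * ALIAS_STRIDE
    print("two aliasing lines:  ", count_misses([a, b] * 4), "misses")
    print("three aliasing lines:", count_misses([a, b, c] * 4), "misses")
```

With two aliasing lines, both fit in the 2-way set and only the first two accesses miss. With three, every one of the 12 accesses misses, which is the kind of mutual eviction two different programs sharing one instruction cache can cause.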

                                          It's just a shame that I had to read about it in the Linux kernel developers' thread and not in the "Fam15h SW Optimisation Guide", which AMD published relatively recently and which does mention cache aliasing, but not for the L1 instruction cache.

                                          All in all, BD is a nice idea with much potential, and I would really like it if AMD could take it a step further, so one would have e.g. four integer execution units and four float/SSE units, so they could be shared across more threads or something like that.

                                          Also, it would be nice for AMD to be less stingy with modules: I want an 8-module chip, even if I have to give up some of the L2 and L3 to get it.

                                          Furthermore, it is not clear why they insist on a BIG L3 on the same die and don't go for two dies, with cores and L1/L2 on one die and memory controller/L3 on the other. They could use fat and fast HT links on the chip, so bandwidth & latencies wouldn't be an issue, especially given current L3 latencies...

                                           

                                            • Bulldozer: modules, cores, threads
                                              avk

                                              Bulldozer is a server/workstation CPU architecture, and therefore it is not suitable for the notebook & desktop markets due to its very weak single-thread performance.

                                              Who knows, maybe Bulldozer's upcoming successors (Piledriver, Steamroller, Excavator) will be faster, but I'm sure that the performance increase won't be astonishing. Just two integer instructions per thread per cycle is not enough to compete with monsters like Sandy Bridge/Ivy Bridge, which can execute up to four or even five in some cases.

                                              Nonetheless, I hope that Trinity (a Piledriver-based APU without L3 cache) will be successful, because its relatively weak CPU cores will be compensated by their high frequency (up to 4.1 GHz) and the most powerful integrated GPU in the world. Many people will choose Trinity over Sandy/Ivy Bridge just because of its relatively powerful graphics.

                                                • Bulldozer: modules, cores, threads
                                                  Brane2

                                                  Again, you have got it wrong.

                                                   

                                                  1. Two integer units per half-module (= "core") and four per module. All four are available for execution of the program when only one core is executing.

                                                  One module can behave as one fat-core or two slim ones or mix of both on a cycle-per-cycle basis. Which isn't half-bad, as you can't always utilize all resources anyways.

                                                  2. Unicore performance is not so weak that it would make the new core unacceptable. It's somewhat lower, but nothing catastrophic. Fanboys will of course scream at a 10% performance loss and go berserk at 30%, but this is perfectly usable as it is, as long as it delivers in multithreaded performance. Which it doesn't really ATM.

                                                   

                                                  3. There is nothing in the architecture per se that would be responsible for weak unicore performance. It is the execution that sucks. They seem to have sc**ed up a few things with cache latency and performance. All in all, even the existing FX-8150 is a capable performer on the Linux platform, where programs can be optimized for it. Had my 955BE died, I'd go for it, but as it is, it doesn't justify exchanging the motherboard for that bit of extra performance, especially as my 955BE is working just fine.

                                                  4. Don't expect x86 to make any considerable performance leaps, either on AMD's or Intel's side. x86 as a CPU and platform is nearing its end. You can see that in Intel's new models. Look at Sandy Bridge-E. They have filled a friggin' 4cm^2 on 32nm and got just a little bit of extra performance.

                                                  Even with Ivy Bridge on 22nm, don't expect miracles in unicore performance. You'll get some more cores, some extra SSE instructions perhaps, and more capable graphics units on the chips that have them.

                                                  I see Bulldozer as a prudent step in the right direction to lengthen x86 lifetime.

                                                  The only problem I see is with existing models and to some degree with the upcoming Piledriver. AMD's marketing is trying to sell us one module as two cores, whereas it's more like one full Intel SB core.

                                                  If you look at an FX-8150 as a quad-core that can execute up to 4 extra threads (and thus compare it with Intel's HT quad-core), it behaves just fine, especially with optimized code. I was hoping to see a true 8-module chip, but it was not to be...

                                                   

                                                    • Bulldozer: modules, cores, threads
                                                      avk

                                                       

                                                      Originally posted by: Brane2 1. Two integer units per half-module (= "core") and four per module. All four are available for execution of the program when only one core is executing.
                                                      If you were right, then BD would be faster than SB in single-threaded work at the same frequency, which it is not.

                                                       

                                                      Originally posted by: Brane2 2. Unicore performance is not so weak that it would make the new core unacceptable. It's somewhat lower, but nothing catastrophic.
                                                      "Somewhat lower"? Actually, it's catastrophically lower.

                                                       

                                                      Originally posted by: Brane2 All in all, even the existing FX-8150 is a capable performer on the Linux platform, where programs can be optimized for it.
                                                      That's fine for Linux and its users, but please remember that at least 95% of PC users are still Windows users. And many Windows applications run even slower on BD than on K10, and that is a catastrophe.

                                                       

                                                      Originally posted by: Brane2 Had my 955BE died, I'd go for it, but as it is, it doesn't justify exchanging the motherboard for that bit of extra performance, especially as my 955BE is working just fine.
                                                      That's your call, but many people wouldn't agree with you.

                                                       

                                                      Originally posted by: Brane2 4. Don't expect x86 to make any considerable performance leaps, either on AMD's or Intel's side.
                                                      Check it out: the upcoming Core i7 3770 (based on Ivy Bridge) is significantly faster than its predecessor, the Core i7 2600 (based on Sandy Bridge), at the same frequency (3.4 GHz) and with the same L3 cache (8 MB). Is that not enough for a considerable performance leap?

                                                       

                                                      Originally posted by: Brane2 Look at Sandy Bridge-E. They have filled a friggin' 4cm^2 on 32nm and got just a little bit of extra performance.
                                                      It's a different market - servers and workstations. Soon, the software vendors will improve their software in order to utilize all the threads of the modern CPUs, and you will see the performance leap.

                                                       

                                                      Originally posted by: Brane2 I see Bulldozer as a prudent step in the right direction to lengthen x86 lifetime.
                                                      I disagree. Bulldozer is a server architecture, and it shouldn't be the basis of desktop and notebook CPUs.

                                                       

                                                      Originally posted by: Brane2 The only problem I see is with existing models and to some degree with the upcoming Piledriver. AMD's marketing is trying to sell us one module as two cores, whereas it's more like one full Intel SB core.
                                                      Agreed. That was stupid on AMD's part.

                                                       

                                                      Originally posted by: Brane2 If you look at an FX-8150 as a quad-core that can execute up to 4 extra threads (and thus compare it with Intel's HT quad-core), it behaves just fine, especially with optimized code.
                                                      Heh, who will write optimized code for Bulldozer? Nobody, I guess, because AMD has almost nothing comparable to Intel's software in terms of quality and speed. So, who will bother?

                                                       

                                                      Originally posted by: Brane2 I was hoping to see a true 8-module chip, but it was not to be...
                                                      As soon as AMD transitions to 22 or 20 nm, you will see an 8-module chip.

                                                        • Bulldozer: modules, cores, threads
                                                          Brane2

                                                           

                                                           

                                                          Originally posted by: avk
                                                          Originally posted by: Brane2 1. Two integer units per half-module (= "core") and four per module. All four are available for execution of the program when only one core is executing.
                                                          If you were right, then BD would be faster than SB in single-threaded work at the same frequency, which it is not.


                                                           

                                                          Please stop trolling. If you want to know, read the relevant literature. This is, after all, a developers' forum. If you want to join in the "users" lamenting, do it in the users' forum.

                                                           

                                                           

                                                          Originally posted by: Brane2 2. Unicore performance is not so weak that it would make the new core unacceptable. It's somewhat lower, but nothing catastrophic.
                                                          "Somewhat lower"? Actually, it's catastrophically lower.


                                                           

                                                          Why? What could you do with the old K10 that you couldn't with BD? For me (and probably 99.5% of all users), BD's unicore performance is more than satisfactory. If I got great multicore performance, I wouldn't think twice about unicore stats.

                                                           

                                                           

                                                          Originally posted by: Brane2 All in all, even the existing FX-8150 is a capable performer on the Linux platform, where programs can be optimized for it.
                                                          That's fine for Linux and its users, but please remember that at least 95% of PC users are still Windows users. And many Windows applications run even slower on BD than on K10, and that is a catastrophe.


                                                           

                                                          1. Even Windows will migrate to non-x86 platforms soon.

                                                          2. AMD's market share is about the same as the percentage of Linux users, if not smaller. So the Linux market alone could consume everything AMD produces at the moment without stunting AMD's growth.

                                                           

                                                           

                                                          Originally posted by: Brane2 Had my 955BE died, I'd go for it, but as it is, it doesn't justify exchanging the motherboard for that bit of extra performance, especially as my 955BE is working just fine.
                                                          That's your call, but many people wouldn't agree with you.


                                                           

                                                          Many people would disagree about just about anything. After all, AMD would probably live just fine with many people choosing other options.

                                                           

                                                           

                                                           

                                                          Originally posted by: Brane2 4. Don't expect x86 to make any considerable performance leaps, either on AMD's or Intel's side.
                                                          Check it out: the upcoming Core i7 3770 (based on Ivy Bridge) is significantly faster than its predecessor, the Core i7 2600 (based on Sandy Bridge), at the same frequency (3.4 GHz) and with the same L3 cache (8 MB). Is that not enough for a considerable performance leap?


                                                           

                                                          No, that's more or less peanuts. It shows a significant speedup in those few apps that can utilize a few new SSE instructions, and much less, if anything, everywhere else.

                                                           

                                                           

                                                          Originally posted by: Brane2 Look at Sandy Bridge-E. They have filled a friggin' 4cm^2 on 32nm and got just a little bit of extra performance.
                                                          It's a different market - servers and workstations. Soon, the software vendors will improve their software in order to utilize all the threads of the modern CPUs, and you will see the performance leap.


                                                           

                                                          Which is when you will see BD's performance rise, too. And the performance of other solutions (ARM etc.), which BTW won't be that soon. With many cores, the interconnect becomes the bottleneck, and this hasn't been updated for quite some time.

                                                           

                                                           

                                                          Originally posted by: Brane2 I see Bulldozer as a prudent step in the right direction to lengthen x86 lifetime.
                                                          I disagree. Bulldozer is a server architecture, and it shouldn't be the basis of desktop and notebook CPUs.


                                                           

                                                          What is so purely server-oriented about BD? It tries very smartly to economise on HW and power utilisation without (in principle) a lot of performance loss.

                                                          As things stand now, it seems that an unpolished first attempt is more to blame than the architecture per se.

                                                           

                                                           

                                                           

                                                          Originally posted by: Brane2 If you look at an FX-8150 as a quad-core that can execute up to 4 extra threads (and thus compare it with Intel's HT quad-core), it behaves just fine, especially with optimized code.
                                                          Heh, who will write optimized code for Bulldozer? Nobody, I guess, because AMD has almost nothing comparable to Intel's software in terms of quality and speed. So, who will bother?


                                                           

                                                          All software I have on the machine is compiler-optimized for my 955BE. When I buy an FX-8xxx, I'll recompile my Gentoo for the new chip. No big deal.

                                                          Memory management and the scheduler in the new Linux kernel (3.2) are already adapted for Bulldozer.

                                                           

                                                           

                                                           

                                                          Originally posted by: Brane2 I was hoping to see a true 8-module chip, but it was not to be...
                                                          As soon as AMD transitions to 22 or 20 nm, you will see an 8-module chip.


                                                           

                                                          As I said, I would much rather see a smaller L2 (e.g. 512KB per core) and L3 (e.g. 512 KB per module) and an extra 4 modules on that area...

                                                          Also, as said, on 32nm AMD should be able to put more modules on the chip easily. I don't know what stopped them, but it is obvious that they hit some big bump with GloFo's 32nm process...



                                                            • Bulldozer: modules, cores, threads
                                                              avk

                                                              Hey, I'm not trolling you or anybody else at all. BTW, what kind of relevant literature do you want me to read? What about #47414, where, in Chapter 2.1 Key Features, AMD states that Bulldozer has just two-way integer execution? Do you know what this means? This means that one BD x86 core can perform just two integer instructions per cycle. Do you remember how many Intel CPUs can? Since Core 2 (2006), up to four or even five.

                                                              As for the weak single-thread performance of Bulldozer: some buyers prefer to read Internet reviews before spending their money. And what do they see? They see that popular PC games run much slower on BD than on SB (1.5-2.0 times, link). So they ask themselves: "Why should I buy BD instead of SB?" Do you have something to tell them? If so, I'd like to read it :).

                                                              About AMD's market share: it's about 20%. Are you sure that the Linux sector is as large as AMD's?

                                                              About the Ivy Bridge speed boost: I strongly doubt that Microsoft Office Excel 2010 supports the new IB instructions, but it shows about a 25% speed boost over Sandy Bridge.

                                                              About BD's purely server orientation: I believe that a server CPU shouldn't have been released in the desktop and mobile markets, because they are very different ones.

                                                              About your ability to recompile all the software you have: heck, you're a lucky man :). But hey, what about the 99.9999999% of other people who can't or don't want to do the same?

                                                                • Bulldozer: modules, cores, threads
                                                                  Brane2

                                                                   

                                                                  Originally posted by: avk Hey, I'm not trolling you or anybody else at all. BTW, what kind of relevant literature do you want me to read? What about #47414, where, in Chapter 2.1 Key Features, AMD states that Bulldozer has just two-way integer execution? Do you know what this means? This means that one BD x86 core can perform just two integer instructions per cycle. Do you remember how many Intel CPUs can? Since Core 2 (2006), up to four or even five.

                                                                   

                                                                  The keyword here being "_can_" perform. But how many actually get used on average during execution? This is not a GPU. CPUs do have to worry about conditional and computed jumps and loops, instruction interdependencies, scheduling conflicts, etc.

                                                                  From that perspective, a dual-core module is a great idea. You have two lean cores that can execute low-to-medium instruction loads, but both cores have the ability to double their resources as needed on a cycle-by-cycle basis. One module performs more or less as well as one of Intel's fat cores, but can work in split mode. Its resources are therefore far better utilized.

                                                                   

                                                                  As for the weak single-thread performance of Bulldozer: some buyers prefer to read Internet reviews before spending their money. And what do they see? They see that popular PC games run much slower on BD than on SB (1.5-2.0 times, link). So they ask themselves: "Why should I buy BD instead of SB?" Do you have something to tell them? If so, I'd like to read it :).



                                                                   

                                                                  Yes. Then they should look after their interests and buy SB, if that is the optimal solution for them. As recent reports show, AMD has no problem selling Bulldozer, so there are obviously customers who see that solution as a perfect fit.

                                                                  As for gaming performance: as far as I have checked, BD performs very well with many of the titles. Most folks couldn't care less about the actual framerate, only whether they will be able to actually play the game without a problem, and the answer is obviously "yes" for many of them...

                                                                   

                                                                  About AMD's market share: it's about 20%. Are you sure that the Linux sector is as large as AMD's?



                                                                   

                                                                  Last time I checked, it was lower than that. On the server front it's more like 5%, and those folks don't have any problems running Linux when it's optimal to do so. On other fronts they don't go for the performance crown but for the bang/buck ratio, so as long as a Linux-optimized solution does at least satisfactorily in the Windows world and has a low price, it would sell, just as it is selling now.

                                                                   

                                                                  About the Ivy Bridge speed boost: I strongly doubt that Microsoft Office Excel 2010 supports the new IB instructions, yet it shows about a 25% speed boost over Sandy Bridge.



                                                                   

                                                                  25% is not a speed boost, it's more like a speed blip. Practically unnoticeable in everyday work. Gone are the days when each next-gen chip would easily be twice as fast as the current ones, or faster. Besides, have you ever seen someone complaining about the slowness of their Excel?

                                                                   

                                                                  About BD's purely server orientation: I believe that a server CPU shouldn't have been released on the desktop and mobile markets, because they are very different.



                                                                   

                                                                   

                                                                  About your ability to recompile all the software you have. Heck, you're a lucky man :).



                                                                   

                                                                   

                                                                  No luck, just prudence. Say what you want about Gentoo, but "emerge -uD --keep-going @world" REALLY RECOMPILES YOUR WORLD ;o)

                                                                   

                                                                   

                                                                   

                                                                   But hey, what about the other 99.9999999% of people, who can't or don't want to do the same?


                                                                  Life is always a balance of things that you want or don't want to do on one side, and things that you have to or must not do on the other.

                                                                  Once you limit yourself on one side, you have to pay on the other.



                                                                    • Bulldozer: modules, cores, threads
                                                                      Brane2

                                                                      And, btw, I, too, am beginning to look at Piledriver.

                                                                      My existing 955 is enough for now, but I'm starting to have problems on existing boards with electrolytic caps and that #$!*$ RoHS. While caps can be replaced (I tried really good tantalum caps and boy, they rock!), I'd rather not reball and resolder the NB and SB, albeit I will have to look into it someday...

                                                                      So, if I am about to go for new gear, and if I support AMD on principle (a small, innovative player, above all playing relatively honestly and offering open source solutions whenever they can), it might as well be Piledriver.