12 Replies Latest reply on Jan 19, 2017 4:58 PM by hardcoregames™

    Playing H265 video on entry level AMD processors with acceptable performance

      Recently I have tried H265 playback on many different platforms. None of them supports hardware H265 decoding. Some results were pleasant, some disappointing. The most surprising one is the Celeron 1000M, an Ivy Bridge processor that runs at only 1.8GHz maximum, yet plays H265 video smoothly with comparatively low CPU usage, lower than the Pentium G6950 (2.8GHz), Core 2 E4600 (2.4GHz) and Pentium E5200 (2.5GHz). Those three processors can also play it smoothly. All four Intel processors decode such video files without any hardware acceleration from the GPU.

       

      The AMD entry-level processors that I tested are the Sempron X2 180/190, Athlon II X2 245/250 and A4-3300/3400. None of these six processors could play the same video file with acceptable performance. For the A4-3300/3400, however, I found a workaround: use a decoder that can utilise the integrated GPU (GPGPU) to accelerate H265 decoding. Strongene OpenCL H.265/HEVC Decoder for Windows is a good choice; I tested it on the A4-3300/3400 platforms. The performance is acceptable, and so is the CPU usage.

       

      Update:

      I also tested an AMD A4-5300 processor yesterday. This processor is excellent at decoding H265 on the CPU alone. With this Chinese OpenCL decoder, CPU usage becomes very high and performance gets worse. The reason might be that the overhead of setting up the OpenCL environment consumes more computing resources than simply decoding on the processor itself.

       

      The Sempron X2 180/190 and Athlon II X2 245/250 are not strong enough to decode H265 1080p, so I tried adding a discrete NVidia 8800GT GPU to those systems. This GPU lacks hardware H265 decoding, but its CUDA cores also support OpenCL computing. With this decoder, H265 1080p video files also play back smoothly with acceptable performance.

        • Re: Playing H265 video on entry level AMD processors with acceptable performance

          How strong is the Core Microarchitecture?

           

          In order to decode H265 1080p video, I also obtained a Pentium E6500 processor, which has a slightly higher frequency than the Athlon II X2 245 (2.93GHz vs 2.9GHz) and the same size of L2 cache. Besides that, the Pentium E6500 lacks the SSE4 instruction set, while the Athlon II X2 lacks SSSE3 but supports SSE4A and Nested Paging for virtualisation. I used dual channel DDR2 667 for the E6500 and dual channel DDR3 1333 (actually running at 1066MT/s) for the Athlon II X2 245, with the same OS (Windows 10 x64 RS1), the same player (Media Player Classic Home Cinema, MPC) and the same discrete GPU (NVidia 8800GT). But the performance is completely different! MPC does not support OpenCL, so all decoding is done by the processors themselves. On the Athlon II X2 system, CPU usage tends to reach 100%, and the picture and sound stutter to an unacceptable degree. On the Pentium E6500, CPU usage is less than 60% and the video is enjoyable. With the OpenCL-enabled decoder in Windows Media Player, both give acceptable performance, but CPU usage on the Athlon II X2 is comparatively high!

           

          For comparison with the AMD Sempron 180/190 and A4-3300/3400, I also obtained an Intel Celeron E3400 processor. It clocks at 2.6GHz, between 2.5GHz (Sempron 190/195 and A4-3300) and 2.7GHz (A4-3400), and like those AMD processors it has 1MB of L2 cache. As I expected, this processor plays the same H265 video smoothly through MPC, with CPU usage of about 50% to 78%. With the OpenCL-enabled decoder in Windows Media Player, the processor even gets the chance to clock down from 2.6GHz towards 1.2GHz while playing that video.

           

          Why can't the Athlon II X2 245 at 2.9GHz play that video smoothly, when the Celeron E3400 at 2.6GHz can?

          There could be many reasons. Maybe the decoders provided by MPC are optimised for Intel processors with the Intel compiler, heavily utilising additional instruction sets such as SSSE3 and/or AES, which AMD 10h based processors lack. But the most important reason might lie in the microarchitecture. The Core Microarchitecture has been strong ever since it was released more than 10 years ago. Its wider execution engines and smart decoding algorithms have let it serve the computer industry for the past ten years. Even the latest Core i7 and Xeon processors are based on or evolved from the Core Microarchitecture. The Core Microarchitecture is essentially the 64-bit version of the P6 Microarchitecture, first introduced with the Pentium Pro. The AMD 10h microarchitecture, in turn, is an updated version of the K8 microarchitecture, which is the 64-bit version of K7 (Thunderbird). In early 2009, I happened to find the initial but later abandoned AMD Hammer microarchitecture (Talk:Bulldozer (microarchitecture) - Wikipedia). That prototype resembles the unreleased Alpha processor and is somewhat similar to the Bulldozer microarchitecture, though not the same. Hammer's distinctive feature is the ability to merge two thinner threads into a thicker one. This partitioned multi-threading might have some advantages over Intel Hyper-Threading, since in either mode no resources are wasted. Most entry-level processors from Intel, even including Core 2 processors, are HT-disabled parts: HT is merely disabled rather than absent. In that situation, one thread pipeline is idle or shut down all the time.

           

          I have no idea about the actual microarchitecture of Zen or Ryzen, or how they implement multi-threading. Rather than choosing the strongest, I would always prefer the entry-level one, because I am always using the computer itself, whereas general users are essentially using the applications.

           

           

          Multiple Cores for H265 decoding:

           

          I have also tested that H265 video on tablet and netbook computers. The netbook is based on the Atom N2840 dual core processor. With MPC, the processor stays at 100% usage and performance is unacceptable. Using its integrated GPU resources through OpenCL, it gives acceptable performance, but CPU utilisation still stays at 100%.

           

          The most unbelievable thing happened when testing the Atom X5-Z8300! Playing the same video in MPC, processor utilisation is below 50%. With the OpenCL-enabled decoder in Windows Media Player, I looked at Task Manager and CPU usage was only around 20% or even lower! Unbelievable for something I had already written off as a toy.

           

          As to the AMD counterparts, I know those processors are based on the Jaguar microarchitecture (similar to the processors used in the PS4 and XB1), codenamed Kabini. I tried to buy a Sempron 3850 as a test sample in late October 2014 in a shop in Mong Kok, Kowloon. The shopkeeper promised I could have it the next day, but I ended up waiting about three days, only to be told to come back the following week. I was very angry about that and dropped the purchase. All those Kabini processors are made in Taiwan, and in mainland China they are hard to find on shelves. If you have such a processor and want to use a system based on it as an HTPC, you can also try playing H265 video on it. Any comments are welcome here...

          • Re: Playing H265 video on entry level AMD processors with acceptable performance

            [Attached screenshots: gpgpuR.png, gpgpuD.png, gpgpu2.png, gpgpu.png, gpgpu3.png, gpgpu.png]

             

            The latency added by Hyper-Transport: in order to test the bandwidth of PCI-E 2.0 x16, I chose the NVidia 8800GT 512MB 256-bit as the test card, on four machines. For the Intel side, I chose the Pentium E6500, which provides a 1066MT/s FSB, on an Intel G43 chipset with the integrated GPU disabled. As shown in the first picture, I equipped it with a single DDR3 1333 module (actually running at 1066MT/s). The score to compare is GPU Memory Copy, which exposes the bandwidth of PCI-E 2.0 and the system memory controller. As I guessed, it gets the highest score, despite the limited system memory bandwidth.

             

             

            I also chose two entry-level processors from AMD, the Sempron 180 and the Athlon 5000 (45nm), because they are equipped with different Hyper-Transport buses. Selected versions of the Athlon 5000 can be unlocked to two extra cores, 6MB of L3 cache and 1600MHz HT. I did not unlock the cores, but just overclocked the HT from 1000MHz to 1600MHz, the same as its NB frequency. I tested this processor on two different machines, using the AMD/ATI 780G (iGPU disabled) and the NVidia GeForce 8200 (iGPU disabled), as shown in the last two pictures. They are equipped with different configurations of unganged dual channel DDR2 667 modules. Memory module performance aside, the NVidia chipset gives better results with the same processor and NVidia GPU. One could also benchmark the processor cache: the NVidia platform would lower the scores of the L2 or last level cache, because its activity on HT is much more frequent than with the AMD/ATI solution.

             

             

            As to the Sempron 180, it supports both DDR2 and DDR3. I chose two machines with the AMD/ATI 780G (iGPU disabled), supporting DDR2 and DDR3 respectively. For DDR2, I kept the same configuration as in the Athlon 5000 test. For DDR3, I overclocked the memory controller from 1066MT/s to 1600MT/s, with dual channel DDR3 1333 system memory. The motherboard offers a ganged mode, but with it the system failed to boot into Windows, so I kept testing in unganged mode. From the comparison with the Athlon 5000, one can easily see that HT bandwidth affects the actual performance of GPU memory transactions. Compared with the Intel platform, even though this setup provides the maximum memory bandwidth, the GPU Memory Copy score exposes the latency caused by the serialisation of the HT architecture. So an HT-based platform is not always good for a professional graphics workstation.

             

            Update:

            Finally, I managed to find an A4-3400 processor, from the very first generation of AMD APUs. Its core is similar to the Sempron 180 and Athlon 5000, but it also integrates the GPU and the PCI-E 2.0 bus onto the die, and drops DDR2 support. The Hyper-Transport bus has been replaced by an internal connection. This internal connection is similar to yesterday's Front Side Bus, with no need to serialise data onto an HT bridge anymore. There is also a separate data bus for the iGPU to access system memory directly. I disabled the iGPU in favour of the GeForce 8800GT again. This GPU connects directly to the integrated PCI-E 2.0 unit, which encodes and decodes data between PCI-E and the internal connection. There is no HT operation, so that latency is eliminated. I equipped it with only single channel DDR3-1333, but as I expected, it gets the top score!

            • Re: Playing H265 video on entry level AMD processors with acceptable performance

              Front Side Bus, Hyper-Transport and QPI

               

              AMD introduced an enhanced version of the Front Side Bus with its K7 based processor, the Athlon. This enhanced Front Side Bus differs from its predecessor in that it is double-pumped, transferring data twice per clock. So the unit for rating the Front Side Bus changed from MHz to MT/s (mega transfers per second). The bus clock feeds the Thunderbird processor with a 100MHz base clock, but it has a theoretical bandwidth of 64-bit x 200 MT/s = 12,800 Mbits/sec, the same as a single-pumped FSB working at 200MHz. Later Intel released their Netburst based processor, the Pentium 4, with an even more enhanced, quad-pumped Front Side Bus. With a 100MHz base clock, that gives 100 x 4 = 400 MT/s. This lets the FSB easily reach 800MT/s with a bus clock of only 200MHz. For the AMD version of the FSB to reach 800MT/s, the bus clock would have to rise to 400MHz, which was almost impossible in the days when the Athlon XP was popular. The dropped AMD/DEC Hammer processor also had a Front Side Bus; only the server version was equipped with Hyper-Transport.
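
A small Python sketch of the pumping arithmetic above, for anyone who wants to plug in other clocks. The function names are mine, and it assumes a 64-bit data bus throughout, as in the figures quoted in this reply.

```python
# Sketch of the FSB arithmetic from the reply above.
# Assumption: the data bus is 64 bits wide; function names are illustrative only.

def transfer_rate_mts(base_clock_mhz, pumps):
    """Transfers per second (MT/s) for a bus that moves `pumps` transfers per clock."""
    return base_clock_mhz * pumps


def bandwidth_mbits(rate_mts, width_bits=64):
    """Theoretical bandwidth in Mbits/sec."""
    return rate_mts * width_bits


def required_clock_mhz(target_mts, pumps):
    """Base clock needed to reach a target transfer rate."""
    return target_mts / pumps


athlon_rate = transfer_rate_mts(100, pumps=2)      # double-pumped K7 bus
pentium4_rate = transfer_rate_mts(100, pumps=4)    # quad-pumped Netburst bus

print(athlon_rate, bandwidth_mbits(athlon_rate))   # 200 12800
print(pentium4_rate)                               # 400
print(required_clock_mhz(800, pumps=4))            # 200.0 MHz is enough when quad-pumped
print(required_clock_mhz(800, pumps=2))            # 400.0 MHz needed when double-pumped
```

The last two lines are the point of the comparison: the same 800MT/s target needs twice the base clock on the double-pumped bus.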

               

               

              In order to reach a bandwidth of 800MT/s x 64 = 51,200 Mbits/sec or even more in a practical way, AMD designed another solution, similar to serialising the ATA bus, which gains higher bandwidth with less effort, because far fewer data lines generate less electromagnetic interference at high speed. The product of their serialised FSB is the Hyper-Transport bus. But another problem arose: serialisation involves more work than a parallel connection. All data and commands must be packaged into packets, which are what the comparatively few lines (one or several, but not many) can transfer directly. Encoding and decoding operations are necessarily introduced into the connection, which further increases latency. A parallel bus can convey different types of data on the same bus simultaneously, but a serial bus cannot; to make up for this loss, it has to provide a connection based on lanes (bi-directional) at a comparatively higher frequency. The eventual version of Hyper-Transport for the AMD K8 processor is a serial bus comprising two lanes, each an 8-bit connection, double-pumped. So with an 800MHz base clock, the bandwidth in a single direction is 8-bit x 2 lanes x 800MHz x 2 pumped = 25,600 Mbits/sec, equivalent to a 400MT/s Front Side Bus, without considering the overhead that HT adds. Even though the bi-directional bandwidth is equivalent to an 800MT/s Front Side Bus, the actual performance is much poorer than the latter. By the same analogy, for the HT found on the most popular AM3 processors, the single-direction bandwidth at 2000MHz is equivalent to a 1000 MT/s Front Side Bus, just enough to feed PCI-E 2.0 x16, no less, no more.
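
The Hyper-Transport figures above can be checked the same way. This is only a sketch using the link layout described in this reply (two 8-bit lanes, double-pumped); the function names are mine. The PCI-E 2.0 x16 line is not from the post: it uses the standard 5 GT/s per lane with 8b/10b encoding, added only to show why 2000MHz HT "just feeds" that slot.

```python
# Sketch of the Hyper-Transport arithmetic from the reply above.
# Assumptions: link layout as described in the post (2 lanes x 8 bits,
# double-pumped) and a 64-bit FSB as the yardstick. Names are illustrative.

FSB_WIDTH_BITS = 64


def ht_one_way_mbits(clock_mhz, lanes=2, lane_width_bits=8, pumps=2):
    """Single-direction HT bandwidth in Mbits/sec, ignoring packet overhead."""
    return lane_width_bits * lanes * clock_mhz * pumps


def equivalent_fsb_mts(bandwidth_mbits):
    """Transfer rate of a 64-bit FSB with the same raw bandwidth."""
    return bandwidth_mbits / FSB_WIDTH_BITS


k8 = ht_one_way_mbits(800)
am3 = ht_one_way_mbits(2000)
print(k8, equivalent_fsb_mts(k8))      # 25600 400.0
print(am3, equivalent_fsb_mts(am3))    # 64000 1000.0

# Not from the post: PCI-E 2.0 x16, one direction, after 8b/10b encoding.
pcie2_x16_mbits = 16 * 5000 * 8 // 10
print(pcie2_x16_mbits)                 # 64000, the same as 2000MHz HT one way
```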

               

               

              Intel later designed their own serialised Front Side Bus, QPI. In consumer products, they just use it to replace the Front Side Bus, conveying data only to and from the processor cores. Similar to Hyper-Transport, it provides four double-pumped lanes of 5 bits each; but unlike the former, QPI puts the transaction on an even higher layer. It is based on a virtual 80-bit connection, of which 64 bits are effective. Those 80 bits can be transferred over a 20-bit, 10-bit or 5-bit physical connection over several cycles. Future versions of QPI could also implement even wider connections, such as 40-bit, 80-bit and so forth. Take the Intel Pentium G6950 for example, which uses a 2.4GHz 20-bit QPI to connect the processor core die and the GMCH die: 2.4GHz / (80-bit / (20-bit x 2 pumped)) x 64-bit = 76.8 Gbits/sec, equivalent to a 1200 MT/s FSB in a single direction, with the overhead that QPI adds taken into account. Serialisation introduces extra operations such as encoding and decoding, and adds further latency to transactions between the cores and the memory controller. To make up for it, a comparatively larger L3 cache is also introduced on those processors.
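
Here is the QPI calculation above written out step by step, as a rough sketch. It follows the model used in this reply (an 80-bit unit carrying 64 effective bits, sent over a 20-bit double-pumped link); the function and parameter names are my own.

```python
# Sketch of the QPI arithmetic from the reply above.
# Assumptions: an 80-bit unit with 64 effective bits, moved over a
# double-pumped physical link, as described in the post. Names are illustrative.

def qpi_one_way_gbits(clock_ghz, link_width_bits=20, pumps=2,
                      unit_bits=80, effective_bits=64):
    """Effective single-direction bandwidth in Gbits/sec."""
    cycles_per_unit = unit_bits / (link_width_bits * pumps)   # 80 / 40 = 2 cycles
    units_per_ns = clock_ghz / cycles_per_unit                # 80-bit units per nanosecond
    return units_per_ns * effective_bits


def equivalent_fsb_mts(bandwidth_gbits, fsb_width_bits=64):
    """Transfer rate of a 64-bit FSB with the same effective bandwidth."""
    return bandwidth_gbits * 1000 / fsb_width_bits


g6950 = qpi_one_way_gbits(2.4)
print(round(g6950, 1))                      # 76.8
print(round(equivalent_fsb_mts(g6950)))     # 1200

# Narrower links need more cycles per unit, so bandwidth scales down with width.
print(round(qpi_one_way_gbits(2.4, link_width_bits=10), 1))   # 38.4
```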

              • Re: Playing H265 video on entry level AMD processors with acceptable performance

                [Attached screenshots: gpgpu.png, gpgpug6950.png]

                 

                Now I have also run that benchmark on other machines. Sadly, after benchmarking on the AMD machine, that graphics card died; the GPU memory was worn out after around ten years of use. I had to find another GeForce 8800GT card, which is a slightly faster one: the GPU runs at 650MHz and the GPU memory at 950MHz, both slightly overclocked over the official specification. I have no idea how to set both clocks back, so I just ran another test on the Pentium G6950 based machine. For the AMD Athlon II X2 250 machine, I used dual channel DDR2-800 modules, overclocked to 1066MT/s to best suit this processor. As we know, the performance of DDR2 1066 is much better than that of DDR3 1066, but I will not explain that in this section. As to the Pentium G6950 based machine, I equipped it with just a single channel DDR3 1333 memory module (working at 1066MT/s).

                 

                As I worked out in my previous reply in this thread, the QPI bandwidth (single direction) between the cores and the QPI bridge is similar to an FSB working at 1200 MT/s, which is in turn equivalent to the bandwidth of DDR2/DDR3 1200, much less than DDR3 1333. Perhaps for this reason, the Pentium G6950 supports DDR3 1066 memory rather than DDR3 1333, the same as all Athlon II series processors. Because the memory controller of the Athlon II is integrated onto the processor die rather than on the bridge chipset, I had to use dual channel DDR2 1066 to satisfy the bandwidth needs of PCI-E 2.0 x16. As you can see in the results, the Pentium G6950 performs better, even though the bandwidth of single channel DDR3 1066 is comparatively poor. But its QPI bridge (GMCH) is after all connected through the on-package QPI link rather than integrated onto the core as in the A4-3400, so even though the GPU memory clock is higher, it still loses to the A4-3400 in the Memory Copy benchmark.

                 

                Update:

                [Attached screenshot: gpgpu.png]

                 

                Now I have removed that GeForce 8800GT card and equipped this machine with dual channel DDR3-1333 (working at 1066MT/s). Clearly, the dual channel configuration brings almost no improvement on the processor side. As I mentioned in the previous reply, the 2400MHz QPI link between the processor die and the GMCH die acts just like a 1200MT/s FSB in a single direction. But please pay attention to the Memory Copy score: with data transferring in both directions simultaneously, it breaks the 9600MB/s bound (the bandwidth of a 1200MT/s FSB). Unfortunately, the SHA-1 hash score does not benefit; it needs a real bandwidth improvement towards the processor cores rather than towards the memory controller.
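
For reference, the 9600MB/s bound mentioned above comes from the same kind of calculation, this time in bytes. A minimal sketch, assuming a 64-bit (8-byte) data path per transfer and nominal peak rates only; the names are mine, and the helper at the end is just a convenience for comparing a benchmark score against the bound.

```python
# Sketch of the MB/s bound used in the update above.
# Assumptions: 8 bytes per transfer (64-bit data path), nominal peak rates,
# no protocol overhead. Names are illustrative.

BYTES_PER_TRANSFER = 8


def peak_mbytes_per_sec(rate_mts, channels=1):
    """Nominal peak bandwidth in MB/s."""
    return rate_mts * BYTES_PER_TRANSFER * channels


qpi_one_way_bound = peak_mbytes_per_sec(1200)            # 1200MT/s FSB equivalent
dual_channel_ddr3 = peak_mbytes_per_sec(1066, channels=2)

print(qpi_one_way_bound)    # 9600
print(dual_channel_ddr3)    # 17056, well above what one QPI direction can carry


def exceeds_one_way_bound(score_mbytes_per_sec):
    """True if a Memory Copy score can only be explained by traffic in both directions."""
    return score_mbytes_per_sec > qpi_one_way_bound
```

On these nominal numbers the dual channel memory can supply more than one QPI direction can deliver, which fits the observation that the extra channel barely helps the processor side.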

                • Re: Playing H265 video on entry level AMD processors with acceptable performance

                  Maximum Physical Address Size in FSB, QPI and HyperTransport

                   

                  When I was a high school student, I put together a Slot 1 Celeron based computer, and that was when I first became aware of Maximum Physical Address Size. That Celeron processor is like a stripped-down version of the Pentium II, which supports PAE-36. In other words, 36-bit addresses can be used to address up to 64GB of memory. The Front Side Bus of that processor has just 32 address lines, addressing a physical space of 4G units of 16 bytes each, which is exactly 64GB. All such systems can potentially support up to 64GB of system memory (with holes), but it was a long journey from 64MB to 64GB, and no manufacturer designed such bridge chipsets. 12 years ago, I took my very first step into real, tangible 64-bit computing with a 754-pin Athlon 64 2800+. It has a Maximum Physical Address Size of 40 bits, which is as large as 1TB. But ridiculously, the integrated memory controller can address only up to 16GB of DDR memory, far away from the presumed 1TB. I repeatedly asked myself, is it a joke, or a lie to fool fools such as me? I put the question to many people, but the answers I got were not what I wanted. Today, processors from Intel such as the Pentium G6950 have answered the question for me.

                   

                  As I described for the Pentium G6950 in previous replies, it places its memory controller on the QPI bridge die (GMCH). Similarly, in a system with multiple Opteron processors, system memory is partitioned among the processors. Each processor has its own partition, which can be shared with the others through the Hyper-Transport bus. Yes, the Hyper-Transport bus can be used as a system memory bus, like the Front Side Bus, but in a very different way. The data has to be packaged into packets together with addresses and commands. One cannot easily change Front Side Bus lines once they are fixed in hardware, but the packet formats can easily be updated without changing the physical infrastructure. The 40-bit system memory space is filled with memory modules across the entire system, through all the processors (not a single processor). That is not a joke or a lie, but something real that has existed since the very first day those processors were born. AMD 10h microarchitecture based processors extended it from 40 bits to 48 bits without needing to change the Hyper-Transport bus, so they can potentially address 256TB of system memory across processors. I also guess one could put a memory controller on the Hyper-Transport bridge chipset, which would further extend the memory space, but at a comparatively high latency. As to Athlon II X2 processors, the 2000MHz Hyper-Transport bus could only transfer data at a bandwidth like DDR3 1000, as second-level memory.
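
The address sizes mentioned in this reply are easy to verify. A minimal sketch, assuming byte-addressable memory; the function names are mine, and the output uses binary units (GiB/TiB), which is what the GB/TB figures in this thread mean.

```python
# Sketch: how much memory a given physical address width can reach.
# Assumption: byte-addressable memory. Names are illustrative.

def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits


def human_readable(num_bytes):
    """Format a power-of-two byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = num_bytes
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value} {unit}"
        value //= 1024


for name, bits in [("PAE-36 (Pentium II class)", 36),
                   ("Athlon 64 (K8)", 40),
                   ("AMD 10h", 48)]:
    print(f"{name}: {bits}-bit -> {human_readable(addressable_bytes(bits))}")

# PAE-36 (Pentium II class): 36-bit -> 64 GiB
# Athlon 64 (K8): 40-bit -> 1 TiB
# AMD 10h: 48-bit -> 256 TiB
```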