2 Replies Latest reply on Apr 25, 2017 11:20 PM by thornhale

    What is the Machine Learning Roadmap on AMD Hardware?


      Hi AMD community,


      Background: There is a lot of hype around the Ryzen release and the imminent Vega release. Outside of work, I am getting more and more involved in deep learning approaches to solving different problems. Deep learning requires a lot of computational power. I currently have an A10-7850 Kaveri APU on which I am experimenting with deep learning. Example: one epoch of training takes about 1000 seconds in one of my scenarios using InceptionV3 in TensorFlow. This is painfully slow. In an effort to speed things up, I compiled TensorFlow myself with various extended instruction sets enabled (AVX, SSE4.1, SSE4.2). This reduced training time to about 600 seconds/epoch. Clearly, 10 minutes per epoch is still too slow, as it would take 8-12 hours to complete one learning cycle. I would like to be able to complete a learning cycle in about an hour so that I can iterate faster.
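      To put numbers on the gap, here is a minimal sketch of the arithmetic, using only the figures quoted above (600 seconds/epoch, 8-12 hours per learning cycle, a 1-hour target). The function name and parameters are hypothetical and purely illustrative, and the sketch assumes the number of epochs per cycle stays the same.

```python
def required_speedup(seconds_per_epoch, cycle_hours, target_hours=1.0):
    """Return (epochs per cycle, target seconds per epoch, speedup factor needed).

    Assumes the number of epochs in a learning cycle does not change;
    only the time spent on each epoch is reduced.
    """
    epochs = cycle_hours * 3600 / seconds_per_epoch
    target_seconds_per_epoch = target_hours * 3600 / epochs
    speedup = seconds_per_epoch / target_seconds_per_epoch
    return epochs, target_seconds_per_epoch, speedup


# Speedup already gained from the AVX/SSE build: 1000 s -> 600 s per epoch.
avx_speedup = 1000 / 600

for hours in (8, 12):
    epochs, target, speedup = required_speedup(600, hours)
    print(f"{hours} h cycle: {epochs:.0f} epochs, "
          f"need {target:.0f} s/epoch ({speedup:.0f}x faster)")
```

      In other words, the custom build bought roughly a 1.7x speedup, while a one-hour cycle needs a further 8-12x, i.e. about 50-75 seconds per epoch.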


      To speed up the training time per epoch even further, I am trying to make use of TensorFlow's experimental OpenCL support. This has been a very rough patch without resolution so far. I am stuck trying to get the application to compile successfully (I will make another attempt tonight).


      Here is the situation I find myself in: AMD is proposing ROCm as its solution for deep learning. In fact, ROCm is only a set of tools that would enable someone to build a deep learning library. It is not a deep learning compute library like CUDA+cuDNN, which NVIDIA offers and on top of which deep learning frameworks like TensorFlow, Torch, and Caffe are built.


      Even my simple attempts to get TensorFlow running on my A10-7850 have been frustrating. I really want AMD to succeed so that there are alternatives to just using NVIDIA GPUs, but I don't see how that will happen, because this is what I see:


      - I see no discussion of, or significant progress towards, CUDA/cuDNN equivalents that just work with any major deep learning framework such as Torch, TensorFlow, Theano, or Caffe. There are experimental branches trying to use OpenCL, but those branches have existed since roughly 2015 and have not made sufficient progress since then. Updates are infrequent, support is limited, and the user community is small as a result.

      - There is no clear road map provided by AMD or any of its partners.

      - I don't see AMD or any of its partners putting resources into developing the infrastructure.

      - There are no major discussions here on the AMD forums. For that matter, there is not even a machine learning or deep learning subforum to discuss any of this.


      In about 3-6 months, I will have to decide what build to go with. I would like to be able to do the same kind of deep learning on either a Ryzen 5 1600(X) CPU or a Raven Ridge 1500S (<- if such a thing gets released), and pair the CPU with a GPU. I want to at least explore how viable a non-NVIDIA approach to deep learning is before deciding. Since there is still time to consider, I am asking AMD and the community at large:


      What is the Machine Learning Roadmap on AMD Hardware?


      This is important because hardware support alone is insufficient. If there is not enough software support available, AMD will not make inroads into deep learning. At this point, I don't want to write tools to do deep learning. I want to just do deep learning.


      Message was edited by: Anthony Le: Cleaned up some grammar errors and clarified contribution of CUDA+cuDNN.

        • Re: What is the Machine Learning Roadmap on AMD Hardware?

          For CUDA-level tools, have you been following AMD's HIP and HCC?


          AMD needs people comfortable developing products, tools, and other infrastructure for new uncertain markets, but may be reserving investment until they understand better how competitive Vega will be in this market.  So I hope some engineers are working on testing Vega's competitiveness in prototypes, if not yet in the product infrastructure you seek, so they have some results to announce with Vega.


          In the week after your message, AMD happened to announce possible leadership positions in this area: [platform engineering director] [product manager]. I hope this is a sign that new results are bolstering their confidence in Instinct. Hopefully these will be leaders who are familiar with the evolving machine learning/inference market and with how the various GPGPU, FPGA, and ASIC approaches compete with or complement each other, and who can locate opportunities that fit the aptitudes of AMD technology and propose a roadmap to build the infrastructure to reach those opportunities.


          2017-05-12: AMD has now announced possible positions for [software engineers Ma / N.Cal / S.Cal / Tx] and [graduate engineering co-ops] for leaders who prefer to start hands-on developing a solid technology base in this engineering-led company.

            • Re: What is the Machine Learning Roadmap on AMD Hardware?

              I have seen HIP and HCC. To follow your analogy:


              For deep learning, NVIDIA has cuDNN, which is built on top of CUDA. cuDNN is then what is incorporated into all major deep learning libraries (e.g., Torch, Caffe/Caffe2, TensorFlow, Theano, etc.).


              AMD currently does not have a cuDNN equivalent. There has been talk about MIOpen, but:


              - The library has not been released yet.

              - Consequently, MIOpen has not yet been incorporated into any of the above-mentioned frameworks.


              Therefore, it is very uncertain how/when it will be possible to do DL on AMD GPU/APU equipment. I hope this situation will change within the next 2-5 months. But if it is as you say (the positions have been announced but not filled yet), this looks rather grim for me.


              A broader question in this context:


              - How does AMD + e.g. TensorFlow + OpenCL fit into the picture? For 2-3 years now, there has been only tepid development of TensorFlow with OpenCL support. The code is still very much unoptimized. It is still being developed against OpenCL 1.2 and not 2.2.


              As someone at the crossroads of making a commitment, it is very unclear to me how deep learning on AMD GPUs will best be supported, since there are at least 2 different paths forward. This is an advantage for NVIDIA: if I were to pick NVIDIA, I would know that I could use any DL framework built on cuDNN, which in turn is built on CUDA. End of story.