Instinct Accelerators Blog - Page 2

sam-amd
Community Manager

[Originally posted on 11/16/17 by Carlos E. Perez]

AMD’s newly released Vega architecture has several unique features that can be leveraged in Deep Learning training and inference workloads.

The first noteworthy feature is the ability to perform FP16 operations at twice the rate of FP32, and INT8 operations at four times the rate of FP32. This translates to a peak of 24 teraflops for FP16 and 48 trillion operations per second for INT8. Deep Learning workloads have long been known to work well with lower-precision arithmetic; it is as if AMD’s architects were aware of this reality and designed Vega to exploit it. The second noteworthy feature of Vega is its new memory architecture, which permits addressing up to 512TB of memory. The third benefit is favorable coupling with AMD’s Threadripper and EPYC lines of microprocessors.

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. The current revival of interest in all things “Artificial Intelligence” (AI) is driven by the spectacular results achieved with deep learning. There are other AI technologies, such as expert systems, semantic knowledge bases, logic programming and Bayesian systems, but most of classical AI has not changed much, if at all, in the last five years. The recent quantum leap has been driven disproportionately by progress in deep learning.

When Google embarked on converting its natural language translation software to deep learning, it was surprised to discover major gains. This was best described in a recent article published in the New York Times, “The Great AI Awakening”:

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

Let’s pause to recognize what happened at Google. Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, the old translation system’s gains over its entire lifetime of development added up to roughly 0.4 points on the user-perception scale. Google’s very first deep learning implementation delivered a seven-point improvement on the translation metric.

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

ROCm

ROCm is software that supports High Performance Computing (HPC) workloads on AMD hardware. ROCm includes a C/C++ compiler called the Heterogeneous Compute Compiler (HCC). HCC is based on the open-source LLVM compiler infrastructure project. The HCC compiler supports direct generation of the native Radeon GPU instruction set (known as the GCN ISA). Targeting native GPU instructions is crucial for getting maximum performance. All the libraries under ROCm support the GCN ISA.

Included with the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. The HCC compiler is based on previous work in heterogeneous computing at the HSA foundation. The design allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

[Diagram: relationships between the ROCm components]

The diagram above depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code. It uses different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GCN ISA assembler can also be used as a source for the GCN target.

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with the HSA diagram). The application communicates with the ROCr driver, which resides in user space in Linux. The ROCr driver uses a low-latency, packet-based mechanism (AQL) to coordinate with the ROCk Kernel Driver.

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.
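To give a feel for what HIP code looks like, here is a minimal vector-addition sketch written directly against the HIP runtime API, the same shape of code the porting tool produces from a CUDA original. This is an illustrative example rather than AMD’s own sample code, with error checking omitted for brevity:

  #include <hip/hip_runtime.h>
  #include <vector>

  // Element-wise add kernel; the body is identical to its CUDA counterpart.
  __global__ void vector_add(const float* a, const float* b, float* c, int n) {
      int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;  // blockIdx.x etc. in CUDA
      if (i < n) c[i] = a[i] + b[i];
  }

  int main() {
      const int n = 1 << 20;
      const size_t bytes = n * sizeof(float);
      std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

      // hipMalloc/hipMemcpy mirror cudaMalloc/cudaMemcpy one for one.
      float *da, *db, *dc;
      hipMalloc((void**)&da, bytes);
      hipMalloc((void**)&db, bytes);
      hipMalloc((void**)&dc, bytes);
      hipMemcpy(da, ha.data(), bytes, hipMemcpyHostToDevice);
      hipMemcpy(db, hb.data(), bytes, hipMemcpyHostToDevice);

      // hipLaunchKernelGGL replaces CUDA's <<<grid, block>>> launch syntax.
      hipLaunchKernelGGL(vector_add, dim3((n + 255) / 256), dim3(256), 0, 0, da, db, dc, n);
      hipDeviceSynchronize();

      hipMemcpy(hc.data(), dc, bytes, hipMemcpyDeviceToHost);
      hipFree(da); hipFree(db); hipFree(dc);
      return 0;
  }

The same source can be built for Nvidia GPUs through HIP’s CUDA back end and for AMD GPUs through HCC, which is what makes the porting path attractive.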

The ROCm initiative provides the handcrafted libraries and assembly language tooling that will allow developers to extract every ounce of performance from AMD hardware. These include rocBLAS, a BLAS implementation written from scratch with a HIP interface, and rocFFT, an FFT library that is also written against HIP interfaces. MIOpen is a native library tuned for Deep Learning workloads; it is AMD’s alternative to Nvidia’s cuDNN library and includes Radeon GPU-specific optimizations.
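As a rough sketch of what calling rocBLAS through its HIP interface looks like, the snippet below multiplies two square matrices with a single-precision GEMM. Exact header paths and error handling are omitted and may differ between rocBLAS versions, so treat this as illustrative rather than canonical:

  #include <hip/hip_runtime.h>
  #include <rocblas.h>
  #include <vector>

  int main() {
      const int n = 512;
      const float alpha = 1.0f, beta = 0.0f;
      const size_t bytes = size_t(n) * n * sizeof(float);
      std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

      float *dA, *dB, *dC;
      hipMalloc((void**)&dA, bytes);
      hipMalloc((void**)&dB, bytes);
      hipMalloc((void**)&dC, bytes);
      hipMemcpy(dA, hA.data(), bytes, hipMemcpyHostToDevice);
      hipMemcpy(dB, hB.data(), bytes, hipMemcpyHostToDevice);
      hipMemcpy(dC, hC.data(), bytes, hipMemcpyHostToDevice);

      rocblas_handle handle;
      rocblas_create_handle(&handle);

      // C = alpha * A * B + beta * C, column-major, no transposition.
      rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

      hipMemcpy(hC.data(), dC, bytes, hipMemcpyDeviceToHost);

      rocblas_destroy_handle(handle);
      hipFree(dA); hipFree(dB); hipFree(dC);
      return 0;
  }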

hipCaffe

AMD has ported Caffe to run on the ROCm stack. You can try the examples here. I ran some of the benchmarks found here, and here is a chart of the results:

[Chart: hipCaffe benchmark results]

Caffe is run on unspecified GPU hardware.

I don’t know the specific hardware that was used in these benchmarks; however, the comparison does show that the performance improvement is quite significant compared to the alternatives. One thing to observe is that the speedup is most impressive on a complex network like GoogLeNet as compared to a simpler one like VGG. This reflects the amount of hand-tuning that AMD has done on the MIOpen library.

Deep Learning Standard Virtual Machines

Deep learning frameworks like Caffe have internal computational graphs. These graphs specify the execution order of mathematical operations, much like a dataflow graph. The frameworks use the graph to orchestrate execution across groups of CPUs and GPUs. The execution is highly parallel, which is one reason GPUs are ideal for this kind of computation. There are, however, plenty of untapped opportunities to improve the orchestration between the CPU and GPU.

The current state of Deep Learning frameworks is similar to the fragmented state before the creation of common code generation backends like LLVM. In the chaotic good old days, every programming language had to re-invent its own way of generating machine code. With the development of LLVM, many languages now share the same backend: well-known examples include Ada, C#, Common Lisp, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. The frontend only needs to parse and translate source code to an intermediate representation (IR).

Deep Learning frameworks will eventually need their own “IR”, and that IR is, of course, the computational graph. Frameworks like Caffe and TensorFlow are, in this sense, merely convenient front ends to their internal graphs, just as language front ends are to LLVM’s IR.
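To make the “graph as IR” idea concrete, here is a toy data structure of the kind a framework builds internally before anything runs on a device. It is purely illustrative (the names are invented for this sketch) and is not the actual representation used by Caffe, TensorFlow or NNVM:

  #include <memory>
  #include <string>
  #include <vector>

  // A toy intermediate representation: each node names an operation, its
  // inputs and the device it is assigned to. Optimizers rewrite this graph
  // (fusing ops, reusing buffers, moving work between devices) before any
  // kernel is launched.
  struct Node {
      std::string op;                             // e.g. "conv2d", "relu", "add"
      std::vector<std::shared_ptr<Node>> inputs;  // data dependencies
      std::string device;                         // e.g. "cpu", "gpu:0"
  };

  std::shared_ptr<Node> make_node(std::string op,
                                  std::vector<std::shared_ptr<Node>> inputs,
                                  std::string device = "gpu:0") {
      return std::make_shared<Node>(Node{std::move(op), std::move(inputs), std::move(device)});
  }

  int main() {
      // y = relu(conv2d(x, w)) expressed as a graph rather than as eager calls.
      auto x = make_node("input", {});
      auto w = make_node("weight", {});
      auto y = make_node("relu", {make_node("conv2d", {x, w})});

      // A graph-level optimizer is free to rewrite this, for example fusing
      // conv2d and relu into a single kernel before scheduling it on the GPU.
      auto fused = make_node("fused_conv2d_relu", {x, w});
      (void)y; (void)fused;
      return 0;
  }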

New research is exploring ways to optimize the computational graph that go beyond single-device optimization and toward more global, multi-device optimization. NNVM is one such framework: it performs computation-graph optimization over an intermediate representation. The goal is for NNVM optimizers to reduce memory use and device allocation while preserving the original computational semantics.

A more recent development is the port of NNVM to support AMD GPUs. The NNVM compiler can compile to the TVM stack, an end-to-end compilation stack that supports multiple backends. TVM compiles a high-level computation description written in its frontend down to optimized native GPU code, leveraging TVM’s LLVM-based code generator and LLVM’s ROCm capabilities. This new project can be found at: https://github.com/ROCmSoftwarePlatform/nnvm-rocm.

The NNVM and TVM stacks perform optimizations in a global manner across either the computational graph or an alternative declarative specification. Conventional DL frameworks, however, have code generation and execution intertwined with their code base, making optimization work less portable. Ideally, one would like to see a common standard, a DL virtual machine instruction set, to which the community can collectively contribute optimization routines. Open Neural Network eXchange (ONNX) is one such standard. ONNX is a project supported by Facebook and Microsoft, with support being built for Caffe2, PyTorch and Cognitive Toolkit. The recent TVM port reveals the potential of AMD support for a wider range of DL frameworks:

[Diagram: deep learning frameworks compiling through NNVM and TVM to multiple hardware backends, including AMD GPUs]

TVM transforms the computational graph by minimizing memory use, optimizing data layout and fusing computational kernels. It is a reusable framework designed to support multiple hardware back ends. NNVM provides a high-level intermediate representation for task scheduling and memory management, while TVM is a low-level IR for optimizing computation. A proof of concept showed that optimizing low-level operations in this way led to around a 35% improvement over hand-engineered kernels. This end-to-end optimization, combined with AMD’s open-source computational libraries like MIOpen, is a very promising development.

Conclusion

There are many Deep Learning frameworks in existence today, each with its own strengths and weaknesses. The field is making good progress toward standardization that allows these frameworks to interoperate through a common Deep Learning virtual machine. ONNX is one of the more recent such standards.

In addition to standardization, global optimization of the computational graph found in Deep Learning frameworks is a means towards higher performance. The TVM framework and its integration with AMD’s LLVM based backend opens up the opportunity for end-to-end optimization of not only AMD GPUs but also the combination of CPUs and GPUs.

sam-amd
Community Manager

[Originally posted on 10/20/17]

The recent release of ROCm 1.6, which includes a cuDNN-like library called MIOpen and a port of the Caffe deep learning framework (the AMD version is called hipCaffe), has opened up the opportunity to run deep learning projects on AMD Radeon GPUs. In this article we demonstrate six projects that you can start using with AMD’s new hardware accelerators.

Most GPU-enabled deep learning frameworks rely on Nvidia’s CUDA and cuDNN libraries. AMD is, however, pursuing an aggressive effort to port many deep learning frameworks, such as Caffe, Torch, MXNet and TensorFlow, to run on its hardware. Developers are now able to convert CUDA code to portable C++ code, thanks to AMD’s porting tools and libraries such as HIP.

The deep learning framework Caffe has recently been ported using HIP, allowing Deep Learning practitioners to run Caffe projects on AMD GPUs. This port can be downloaded from here.
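Because the port keeps Caffe’s familiar programming interface, existing Caffe code should need little or no change to run on a Radeon GPU. As a rough illustration (file names are placeholders and error handling is omitted), loading a trained model and running one forward pass on the GPU looks roughly like this:

  #include <caffe/caffe.hpp>
  #include <algorithm>
  #include <vector>

  int main() {
      // Select the GPU backend, exactly as with stock Caffe.
      caffe::Caffe::set_mode(caffe::Caffe::GPU);
      caffe::Caffe::SetDevice(0);

      // "deploy.prototxt" and "model.caffemodel" are placeholders for whatever
      // network definition and trained weights a given project provides.
      caffe::Net<float> net("deploy.prototxt", caffe::TEST);
      net.CopyTrainedLayersFrom("model.caffemodel");

      // Fill the input blob (real code would do image loading and preprocessing).
      caffe::Blob<float>* input = net.input_blobs()[0];
      std::vector<float> image(input->count(), 0.0f);
      std::copy(image.begin(), image.end(), input->mutable_cpu_data());

      // Run inference and read back the output scores.
      const std::vector<caffe::Blob<float>*>& output = net.Forward();
      const float* scores = output[0]->cpu_data();
      (void)scores;
      return 0;
  }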

1. Traffic Sign Recognition

An interesting image classification problem is the recognition of traffic signs. This project classifies 43 different German traffic signs. A data set of 50,000 images is used.

2. Image Synthesizer

University of Wyoming’s Evolving AI Lab has a project whose goal is to understand how deep neural networks (DNNs) work by synthesizing preferred stimuli that strongly activate particular neurons. A deep generator network (DGN) is used as a prior for the DNN being studied; the DGN outputs synthetic images that look as similar as possible to real images from the ImageNet dataset.

Below are a few results from running the sample scripts in the project:

[Images: examples of synthesized preferred stimuli]

The project’s paper is available here. The code needed to reproduce some of the results in the paper is on GitHub.

3. Traffic Light Detection

David Brailovsky from Israel writes on Medium about Recognizing Traffic Lights with Deep Learning (see here). Source code for his project can be found here.

4. Cat/Dog Classifier

This introductory tutorial by Adil Moujahid shows how to train a model and how to use a pre-existing model to distinguish cats from dogs in pictures. A Kaggle dataset is used for this tutorial. For the trained model, the BVLC CaffeNet Model is used.

The Caffe project already has pre-trained models (e.g. VGG, ImageNet) that can be used as a starting point for developing other kinds of image classification.

5. Visual Development Environment

Fabrik is an open source application for building, visualizing and training deep learning models. Fabrik provides simple drag-and-drop tools to streamline deep learning development. The application currently supports importing, editing and exporting of Caffe based models. This is a convenient way to view and edit your models.

6. Model Conversion Tools

Finally, there are many more projects that have been developed in frameworks other than Caffe. For these projects, there are tools that can convert models into a form compatible with Caffe. This GitHub project provides a listing of tools for converting one framework’s models into another’s.

MXNet to Caffe

The code from this GitHub repository allows you to convert an MXNet model to a Caffe model.

PyTorch to Caffe

This project allows you to convert between PyTorch, Caffe, and Darknet models.

Torch to Caffe

Facebook has a converter that converts Torch models to Caffe.

Summary

In this article, we explored a number of deep learning projects that you can now run using AMD Radeon Instinct hardware. We have included projects that you can test out with minimal effort. Other projects have customized Caffe with elements such as new kinds of layers and activation functions; these may require porting CUDA-specific code using AMD’s HIP tooling. Aside from the projects explored here, you can find other projects in the Caffe Model Zoo.

The smartest companies in the world are migrating their infrastructure to support this new paradigm. Daily, the press continues to report the amazing progress of AI. Furthermore, you hear about firms like Google and Microsoft changing their entire software DNA to move into AI. The reason for this massive migration is Deep Learning.

Deep Learning is supporting work by not only providing assistive capabilities, but also by enabling more creative generative capabilities. Assistive capabilities can happen in real time as well as in the backend. There are certain professions where the ability to curate and analyze information is extremely valuable. We can enhance these curation and analysis capabilities by reducing the deluge of information into smaller chunks that are more quickly digestible.

Generative capabilities are a new kind of capability that is becoming more pervasive. By now, we’ve all experienced the mobile app Prisma, which can re-render photographs in the style of different artists.

In this article, we highlighted several deep learning projects that explore both assistive and generative capabilities found in Deep Learning. We also covered some tools that allow you to port models from other projects as well as an IDE. Software that supports Radeon Instinct accelerators is still in its infancy. However, despite being out for just a few months, there are now plenty of interesting applications that can be used as a springboard to developing more complex solutions.

Albert J. De Vera and Carlos E. Perez are Co-Founders at Intuition Machine. They specialize in Deep Learning patterns, methodology and strategy. Many of their other writings on Artificial Intelligence can be found on Medium. Their postings are their own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

sam-amd
Community Manager

[Originally posted on 04/03/17]

When a company starts using disruptive technology or a disruptive business model, the results can be spectacular and can leave the competition eating dust.

The reason for this is that although the company’s growth seems linear at first, it eventually reveals itself as being exponential. When a company reaches this point, it becomes very difficult, if not impossible, for competitors to catch up.

This article explores AMD’s open source deep learning strategy and explains the benefits of AMD’s ROCm initiative to accelerating deep learning development. It asks if AMD’s competitors need to be concerned with the disruptive nature of what AMD is doing.

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. One author found it so revolutionary that he described it as “The Last Invention of Man” [KHAT] – strong words indeed!

Currently, the revival of interest in all things “Artificial Intelligence” (AI) is primarily due to the spectacular results achieved with deep learning research. I must, however, emphasize that this revival is not due to other classical AI technologies like expert systems, semantic knowledge bases, logic programming or Bayesian systems. Most of classical AI has not changed much, if at all, in the last five years. The recent quantum leap has been driven solely by deep learning successes.

For some perspective on the extent of deep learning development, look at this graph from Google that shows the adoption of deep learning technology in their applications:

[Graph: adoption of deep learning across Google applications]

Source: https://www.slideshare.net/HadoopSummit/machine-intelligence-at-google-scale-tensorflow

As you can see, the adoption at Google has been exponential and the statistics are likely similar for many of the other big Internet firms like Facebook and Microsoft.

When Google embarked on converting its natural language translation software to deep learning, it was surprised to discover major gains. This was best described in a recent article published in the New York Times, “The Great AI Awakening” [LEW]:

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

Let’s pause to recognize what happened at Google.

Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, the old translation system’s gains over its entire lifetime of development added up to roughly 0.4 points on the user-perception scale. Google’s very first deep learning implementation delivered a seven-point improvement on the translation metric.

This translates to more gains than the entire lifetime of improvements!

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

Deep Learning is unexpectedly, and disruptively, taking over the world

Google’s founder Sergey Brin, an extremely talented computer scientist himself, stated in a recent World Economic Forum [CHA] discussion that he did not foresee deep learning:

“The revolution in deep nets has been very profound, it definitely surprised me, even though I was sitting right there.”

Deep learning’s progress has taken the academic community by storm. Two articles by practitioners of classical machine learning summarize why they think DL is taking over the world. Chris Manning, a renowned expert in NLP, writes about the “Deep Learning Tsunami” [MAN]:

Deep learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse.

The same sentiment is expressed by Nikos Paragios, who works in the field of computer vision. Paragios writes in “Computer Vision Research: the Deep Depression” [PAR]:

It might be simply because deep learning on highly complex, hugely determined in terms of degrees of freedom graphs once endowed with massive amount of annotated data and unthinkable — until very recently — computing power can solve all computer vision problems. If this is the case, well it is simply a matter of time that industry (which seems to be already the case) takes over, research in computer vision becomes a marginal academic objective and the field follows the path of computer graphics (in terms of activity and volume of academic research).

Although I don’t want to detail the many deep learning developments of the past several years, Nell Watson provides a quick, short summary when she writes in “Artificial Intuition” [WAT]:

To sum up, machine intelligence can do a lot of creative things; it can mash up existing content [SHO], reframe it to fit a new context [PARK], fill in gaps in an appropriate fashion [CON], or generate potential solutions given a range of parameters [AUTO].

Make no mistake – Deep Learning is a “Disruptive” technology that is taking over operations of the most advanced technology companies in the world.

On Disruptiveness

Of late, the business world has become much more difficult and competitive. This situation has been made worse by disruptive changes in the global economy. The potential of nimbler competitors to disrupt the businesses of incumbents has never been greater. Peter Diamandis describes the Six D’s of Exponentials as consisting of the following:

  • Digitization – Anything that can be digitized can ride the same exponential growth we find in computing. Anything that is digitized or virtualized is unencumbered by physical constraints: it costs less to mass produce and spreads faster.
  • Deception – Once digitized or virtualized, initial growth deceptively appears linear. However, given time, exponential growth becomes obvious. For many it is too late to react once growth of a competitor hits this transition.
  • Disruption – New markets that are more effective and less costly are created. Existing markets that are tied to the physical world will eventually become extinct. We’ve seen this in music, photography and many other areas.
  • Demonetization – As cost heads towards zero, so does the ability to solicit a payment for it. Thus, a business has to reinvent its revenue model, or come up with new ways of monetization.
  • Dematerialization – Physical products disappear and are replaced by a more convenient and accessible alternative.
  • Democratization – More people now have access to technology at a lower cost. The means of production have become more accessible to everyone; this access is no longer confined to big corporations or the wealthy. We see this everywhere, with producers publishing their own books, music and videos. The effect feeds back into itself, and smaller players become able to compete.

To survive this disruption, there is an ever-pressing need for enterprises to take drastic action by re-engineering how they run their businesses.

John Hagel proposes four kinds of platforms [HAG] that leverage networking effects as an organizational mechanism to combat disruptive technologies. The four platforms that Hagel proposes are Aggregation platforms (example: Marketplaces), Social platforms (example: Social Networks), Mobilization platforms (example: Complex supply chains) and Learning platforms.

Learning platforms

Learning platforms are dynamic and adaptive environments where people come together to collectively learn how to address complex problems. Members can connect to ask questions, share experiences and offer advice. An open source project that is actively managed with distributed source control, test-driven development, issue tracking, and continuous integration is a good example of a learning platform. The key ingredient is a learning mechanism whose lessons get codified continuously. The fact that we find this in software development should not come as a surprise, as software development is essentially a learning process.

John Hagel describes an intriguing property of a Learning platform:

What if we change the assumption, though? What if each fax machine acquired more features and functions as it connected with more fax machines? What if its features multiplied at a faster rate as more fax machines joined the network? Now, we’d have a second level of network effect — we’d still have the network effects that come by simply increasing the number of fax machines, but now there’s an additional network effect that accrues as each fax machine adds more and more features as a result of interacting with other fax machines.

What Hagel is saying is that the members of the network adaptively become more effective and capable as a participant in the learning network. In other words, not only is there the conventional networking effect, but another mechanism kicks network effects into overdrive. Learning platforms such as an open source community can further accelerate the disruptiveness of an already disruptive technology.

Historically, an open source strategy has been quite effective in many disruptive technology areas. On the Internet, open source dominates: Linux in back-end infrastructure services (79%), Google’s Chrome in browsers (58%), Android in mobile (65%), and Apache and Nginx in web servers (65% combined). It should not surprise anyone if an open source strategy in the disruptive deep learning space eventually emerges as the dominant platform.

There are only a few semiconductor manufacturers that have the economies of scale to be competitive in high-performance computing. These are Nvidia, Intel, AMD, Qualcomm and Xilinx. We will now explore AMD’s deep learning solution and detail their unique open source strategy. We will also look at how it gives the company a competitive advantage.

Deep learning as a disruptive technology is critically enabled by hardware. AMD is one of the few semiconductor companies that actually exploits neural networks in its hardware: the SenseMI feature set in AMD’s Zen processors (the same generation that introduced Infinity Fabric, an evolution of AMD’s HyperTransport interconnect technology) uses “perceptrons” to support branch prediction. AMD’s GPU hardware has always been competitive with Nvidia’s, and when algorithms are extensively optimized, AMD hardware is in fact often favored, as shown by the many cryptocurrency proof-of-work algorithms that run best on AMD GPUs. Raja Koduri, head of AMD’s Radeon products, recently noted that AMD has offered more compute per buck since 2005.

AMD’s Open Source Deep Learning Stack

Before we get into the details of AMD’s deep learning stack, let’s look at the philosophy behind the development tooling. AMD, uniquely positioned as both a CPU and a GPU vendor, has been promoting the concept of a Heterogeneous System Architecture (HSA) for a number of years. Unlike most development tools from other vendors, AMD’s tooling is designed to support both its x86-based CPUs and its GPUs. AMD shares the HSA design and implementations through the HSA Foundation (founded in 2012), a non-profit organization whose members include other CPU vendors such as ARM, Qualcomm and Samsung.

The HSA foundation has an informative graphic that illustrates the HSA stack:

[Diagram: the HSA software stack]

As you can see, the middleware (i.e. HSA Runtime Infrastructure) provides an abstraction layer between the different kinds of compute devices that reside in a single system. One can think of this as a virtual machine that allows the same program to be run on both a CPU and a GPU.

In November 2015, AMD announced the ROCm initiative to support High Performance Computing (HPC) workloads, and to provide an alternative to Nvidia’s CUDA platform. The initiative released an open source 64-bit Linux driver (known as the ROCk Kernel Driver) and an extended (i.e. non-standard) HSA runtime (known as the ROCr Runtime). ROCm also inherits previous HSA innovations such as AQL packets, user-mode queues and context-switching.

ROCm also released a C/C++ compiler called the Heterogeneous Compute Compiler (HCC), targeted at HPC applications. HCC is based on the open-source LLVM compiler infrastructure project [WIKI]. Many other languages have open source implementations built on LLVM; some examples are Ada, C#, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. This rich ecosystem opens the possibility of alternative languages on the ROCm platform. One promising development of this kind is the Python compiler Numba.

Added to the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. HCC supports other parallel programming APIs, but to avoid further confusion, I will not mention them here.
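As a flavor of what HC code looks like, here is a rough sketch of a vector addition using HC’s C++AMP-style constructs (array_view, parallel_for_each and the [[hc]] kernel annotation). The exact class and attribute names here are assumptions from memory and may differ between HCC releases, so treat this as an illustration rather than reference code:

  #include <hc.hpp>
  #include <vector>

  int main() {
      const int n = 1024;
      std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

      // array_view wraps host memory; the runtime moves data between CPU and
      // GPU as needed, reflecting the unified-memory model described above.
      hc::array_view<const float, 1> av(n, a);
      hc::array_view<const float, 1> bv(n, b);
      hc::array_view<float, 1> cv(n, c);

      // parallel_for_each launches the lambda across the GPU.
      hc::parallel_for_each(hc::extent<1>(n), [=](hc::index<1> i) [[hc]] {
          cv[i] = av[i] + bv[i];
      });

      cv.synchronize();  // make the results visible on the host again
      return 0;
  }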

The HCC compiler is based on work at the HSA foundation. This allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.

AMD took the Caffe framework, with 55,000 lines of optimized CUDA code, and applied its HIP tooling. 99.6% of the code was translated automatically; the remaining portion took a single developer about a week to complete. Once ported, the HIP code performed as well as the original CUDA version.

HIP is not 100% compatible with CUDA, but it does provide a migration path for developers to support an alternative GPU platform. This is great for developers who already have a large CUDA code base.

Early this year AMD decided to get even “closer to the metal” by announcing the “Lightning Compiler Initiative.” The HCC compiler now supports direct generation of the Radeon GPU instruction set (known as the GCN ISA) instead of HSAIL.

As we shall see later, directly targeting native GPU instructions is critical for higher performance. All the libraries under ROCm support the GCN ISA.

[Diagram: relationships between the ROCm components]

The diagram depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code. It uses different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GCN ISA assembler can also be used as a source for the GCN target.

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with the HSA diagram). The application communicates with the ROCr driver, which resides in user space in Linux. The ROCr driver uses a low-latency, packet-based mechanism (AQL) to coordinate with the ROCk Kernel Driver.

This raises two key points about what is required for high-performance computation:

1. The ability to perform work at the assembly language level of a device.

2. The availability of highly optimized libraries.

In 2015, Pete Warden wrote “Why GEMM is at the heart of deep learning” [WAR] about the importance of optimized matrix libraries. BLAS (Basic Linear Algebra Subprograms) are hand-optimized libraries that trace their origins back to Fortran code. Warden writes:

The Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs.
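For context, the computation those decades of effort keep improving is, in its textbook form, just three nested loops. The sketch below is illustrative C++, not code from any BLAS; blocked, vectorized and cache-aware implementations of this same loop nest are where the large speedups come from:

  #include <vector>

  // Naive GEMM: C = A * B for n x n row-major matrices.
  void naive_gemm(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n) {
      for (int i = 0; i < n; ++i) {
          for (int j = 0; j < n; ++j) {
              float acc = 0.0f;
              for (int k = 0; k < n; ++k) {
                  // The strided walk through B is exactly the kind of memory
                  // access pattern that hand-tuned kernels reorganize.
                  acc += A[i * n + k] * B[k * n + j];
              }
              C[i * n + j] = acc;
          }
      }
  }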

The attention that BLAS authors pay to every detailed memory access is hard to replicate, despite our advances in compiler technology. Warden went even further in 2017 when he wrote “Why Deep Learning Needs Assembler Hackers” [WAR2]:

I spend a large amount of my time worrying about instruction dependencies and all the other hardware details that we were supposed to be able to escape in the 21st century.

Despite being a very recent technology, software that enables deep learning is a complex stack. A common perception is that most deep learning frameworks (i.e. TensorFlow, Torch, Caffe etc) are open source. These frameworks are however built on highly optimized kernels that are often proprietary. Developers can go to great lengths to squeeze every ounce of performance from their hardware.

As an example, Scott Gray of Nervana Systems had to reverse engineer Nvidia’s instruction set [GRAY] to create an assembler:

I basically came to the conclusion that it was not possible to fully utilize the hardware I bought with the tools Nvidia provides. Nvidia, unfortunately, doesn’t believe in eating their own dog food and they hand assemble their library routines, rather than use ptxas like the rest of us have to.

Gray used assembly language to write his kernels, creating algorithms that bested the proprietary alternatives. Now imagine how much less work he would have had to do if the assembly language had been available and documented. This is what AMD is bringing to the table.

The ROCm initiative provides the handcrafted libraries and assembly language tooling that will allow developers to extract every ounce of performance from AMD hardware. This includes rocBLAS [KNOX], an implementation of BLAS that provides the following capabilities:

BLAS Level-1:

  • amax, amin, asum, axpy, copy, dot, nrm2, scal, swap

BLAS Level-2:

  • gemv

BLAS Level-3:

  • gemm, trtri, batched-trtri

rocBLAS is implemented from scratch with a HIP interface. AMD has even provided a tool, called Tensile, that supports the benchmarking of rocBLAS. AMD also provides an FFT library called rocFFT that is also written with HIP interfaces.

I wonder if Facebook’s fbcunn (Deep learning extensions for CUDA) [GIT], a library that employs FFTs to accelerate convolutions, could be ported using the HIP tooling.

Deep learning algorithms continue to evolve at a rapid pace. In the beginning, frameworks exploited the available matrix multiplication libraries. These finely tuned algorithms have been developed over decades. As research continued, newer kinds of algorithms were proposed.

Thus came the need to go beyond generic matrix multiplication. Convolutional networks came along and this resulted in even more innovative algorithms. Today, many of these algorithms are crafted by hand using assembly language.

Here is a partial list of deep learning specific optimizations that are performed by a proprietary library:

Activation Functions: ReLU, Sigmoid, Tanh, Pooling, Softmax, Log Softmax

Higher Order Tensor Operations: Ordering, Striding, Padding, Subregions

Forward and Backward Convolutions: 2D, FFT, Tiled, 3×3

Small Data Types: FP16, Half2

Normalization: Batch, Local Response

Recurrent Neural Network: LSTM

These low-level tweaks can lead to remarkable performance improvements. For some operations (i.e. batch normalization), the performance increases 14 times compared to a non-optimized solution.

AMD is set to release a library called MIOpen that includes handcrafted optimizations. The library includes Radeon GPU-specific optimizations for operations and will likely include many of those described above. MIOpen is scheduled for release in the first half of this year, coinciding with releases of ROCm-enabled versions of popular deep learning frameworks such as Caffe, Torch7, and TensorFlow. This will allow application code that uses these frameworks to perform competitively on Radeon GPU hardware.

Many other state-of-the-art methods have not yet worked their way into proprietary deep learning libraries. New ones are proposed almost every day as papers are published on arXiv.

Here are just a few:

  • CReLU
  • PReLU
  • Hierarchical Softmax
  • Adaptive Softmax
  • Layer Normalization
  • Weight Normalization
  • Wasserstein Loss
  • Z-Loss

It would be very difficult for any vendor to keep up with such a furious pace. In the current situation, given the lack of transparency in development tools, developers are forced to wait, although they would rather be performing the coding and optimizations themselves. Fortunately, the open source ROCm initiative solves the problem.

ROCm includes an open source GCN ISA based assembler and disassembler.

System Wide Optimization

At a recent investor meeting, Intel shared some of its statistics:

Among servers used for deep learning applications, the chipmaker says that 91% use just Intel Xeon processors to handle the computations, 7% use Xeon processors paired with graphics processing units, while 2% use alternative architectures altogether.

The mix will change as the value of deep learning is understood better. The point here is that CPUs will always be required, even if most of the computations are performed by GPUs. That being said, it is important to recognize that system-wide optimizations are equally critical. This is where AMD’s original investments in Heterogeneous System Architecture may pay big dividends. I would however like to point out that new research efforts are underway to optimize the code that is emitted by deep learning frameworks further.

Deep learning frameworks like Caffe and TensorFlow have internal computational graphs. These graphs specify the execution order of mathematical operations, much like a dataflow graph. The frameworks use the graph to orchestrate execution across groups of CPUs and GPUs. The execution is highly parallel, which is one reason GPUs are ideal for this kind of computation. There are, however, plenty of untapped opportunities to improve the orchestration between the CPU and GPU.

The current state of Deep Learning frameworks is similar to the state before the creation of a common code generation backend like LLVM. In the past, every programming language had its own way of generating machine code. With the development of LLVM, many languages now share the same backend code. The frontend code only needs to translate source code to an intermediate representation (IR). Deep Learning frameworks will eventually need a similar IR, and for Deep Learning that IR is the computational graph.

New research is exploring ways to optimize the computational graph in a way that goes beyond just single device optimization and towards more global multi-device optimization.

An example of this is the research project XLA (Accelerated Linear Algebra) from the TensorFlow developers. XLA supports both just-in-time (JIT) and ahead-of-time (AOT) compilation. XLA is a high-level optimizer that performs its work by optimizing the interplay of the CPUs, GPUs and FPGAs.

The optimizations planned include:

  • Fusing of pipelined operations
  • Aggressive constant propagation
  • Reduction of storage buffers
  • Fusing of low-level operators

There are two other open source projects that are also exploring computational graph optimization. NNVM, from the MXNet developers, is another computation-graph optimization framework that, similar to XLA, provides an intermediate representation. The goal is for its optimizers to reduce memory and device allocation while preserving the original computational semantics.

NGraph from Intel is exploring optimizations that include:

  • Kernel fusion
  • Buffer allocation
  • Training optimizations
  • Inference optimizations
  • Data layout
  • Distributed training

There are certainly plenty of ideas around for how to improve performance.

AMD has developed a runtime framework that takes heterogeneous CPU-GPU systems into account. It is called the Asynchronous Task and Memory Interface (ATMI). The ATMI runtime is driven by a declarative description of high-level tasks and handles their scheduling and memory placement in an optimal manner.

ATMI is also open source and can be exploited to drive deep learning based computational graphs like the ones found in XLA, NNVM or NGraph. The future of Deep Learning software will revolve around a common computational graph and optimizations will take the orchestration of the entire system into consideration.

Operations and Virtualization

What we have been discussing so far are the opportunities to squeeze as much performance from hardware as possible, but there is more to a complete solution than just raw performance.

Every complex system requires good manageability to ensure continued and sustained operations. The ROCm initiative does not overlook this need and provides open source implementations. ROC-smi, ROCm-Docker and ROCm-profiler are three open source projects that provide support for operations.

AMD’s GPU hardware and drivers have also been designed to support GPU virtualization (see: MxGPU). This permits GPU hardware to be shared by multiple users. I will discuss the operational aspects of AMD’s offerings in a future article.

Deployment

Throughout this article, we’ve discussed the promising aspects of the ROCm software stack. When the rubber meets the road, we need to discuss the kind of hardware that software will run on. There are many different scenarios where it makes sense to deploy deep learning. Contrary to popular belief, not everything needs to reside in the cloud. Self-driving cars or universal translation devices need to operate without connectivity.

Deep learning also has two primary modes of operation – “training” and “inference”. In training mode, you would like to have the biggest, fastest GPUs on the planet, and you want many of them. In inference mode, you still want speed, but the emphasis is on economical power consumption. We don’t want to drive our businesses into the ground by paying for expensive power.

In summary, you want a variety of hardware that operates in different contexts. That’s where AMD is in a good position. AMD has recently announced some pretty impressive hardware geared toward deep learning workloads. The product line is called Radeon Instinct and it consists of several GPU cards: the MI6, MI8, and MI25. The number in each name roughly corresponds to the number of operations the card can crank out: an MI6 can perform roughly 6 trillion floating-point operations per second (6 teraflops).

The Radeon Instinct MI6, with a planned 16GB of GDDR5 memory, is a low-cost inference and training solution. The MI8, with 4GB of HBM, is designed primarily for inference workloads. The MI25 is designed for large training workloads and will be based on the soon-to-be-released Vega architecture. Shuttling data back and forth between GPU and CPU is one of the bottlenecks in training deep learning systems; Vega’s unique memory architecture, capable of addressing 512TB, gives it a distinct advantage.

There’s also a lot more to say about GPU and CPU integration. I’ll briefly mention some points. On the server-side, AMD has partnered with Supermicro and Inventec to come up with some impressive hardware. At the top of the line, the Inventec K888 (dubbed “Falconwitch”) is a 400-teraflop 4U monster. By comparison, the Nvidia flagship DGX-1 3U server can muster a mere 170 teraflops.

There is also promise at the embedded device level. AMD already supplies custom CPU-GPU chips for Microsoft’s Xbox and Sony’s PlayStation. An AMD APU (i.e. a CPU with an integrated GPU) can also provide solutions for smaller form factor devices. The beauty of AMD’s strategy is that the same HSA-based architecture is available to the developer in the smallest of footprints as well as in the fastest servers. This breadth of hardware offerings gives deep learning developers a wealth of flexibility in deploying their solutions. Deep learning is progressing at breakneck speed, and one can never predict the best way to deploy a solution.

Conclusion

Deep learning is a disruptive technology like the Internet and mobile computing that came before. Open source software has been the dominant platform that has enabled these technologies.

AMD combines these powerful principles with its open source ROCm initiative. On its own, this has the potential to accelerate deep learning development. ROCm provides a comprehensive set of components that address high performance computing needs, including tools that are closer to the metal, hand-tuned libraries and support for assembly language tooling.

Future deep learning software will demand even greater optimizations that span many kinds of computing cores. In my view, AMD’s strategic vision of investing heavily in heterogeneous system architectures gives their platform a distinct edge.

AMD’s open source strategy is uniquely positioned to disrupt and take the lead in future deep learning developments.

Carlos E. Perez is Co-Founder at Intuition Machine. He specializes in Deep Learning patterns, methodology and strategy. Many of his other writings on Artificial Intelligence can be found on Medium. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

FOOTNOTES:

[AUTO] Autodesk. http://www.autodesk.com/solutions/generative-design

[CHA] Chainey, Ross. “Google co-founder Sergey Brin: I didn’t see AI coming.” https://www.weforum.org/agenda/2017/01/google-sergey-brin-i-didn-t-see-ai-coming/

[CON] Conner-Simons, Adam. “Artificial intelligence produces realistic sounds that fool humans.” http://news.mit.edu/2016/06/13/artificial-intelligence-produces-realistic-sounds-0613

[GIT] Facebook FAIR. “fbcunn.” https://github.com/facebook/fbcunn

[GRAY] Gray, Scott. “MaxAs Assembler.” https://github.com/NervanaSystems/maxas/wiki/Introduction

[HAG] Hagel, John. “Harnessing the Full Potential of Platforms.” http://www.marketingjournal.org/2016/04/05/john-hagel-harnessing-the-full-potential-of-platforms/

[HSA] “HSA-Debugger-AMD.” https://github.com/HSAFoundation/HSA-Debugger-AMD/blob/master/TUTORIAL.md

[KHAT] Khatchadourian, Raffi. “The Doomsday Invention.” http://www.newyorker.com/magazine/2015/11/23/doomsday-invention-artificial-intelligence-nick-bostrom

[KNOX] Knox, Kent. “rocBLAS.” https://github.com/RadeonOpenCompute/rocBLAS/wiki

[LEW] Lewis-Kraus, Gideon. “The Great A.I. Awakening.” https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html?_r=1

[MAN] Manning, Christopher. “Computational Linguistics and Deep Learning.” http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00239

[PAR] Paragios, Nikos. “Computer Vision Research: ‘The deep depression.’” https://www.linkedin.com/pulse/computer-vision-research-my-deep-depression-nikos-paragios

[PARK] Parkinson, Hannah Jane. “Computer algorithm recreates Van Gogh painting in one hour.” https://www.theguardian.com/technology/2015/sep/02/computer-algorithm-recreates-van-gogh-painting-pi...

[SHO] Shontell, Alyson. “A startup that uses robots to write news gets acquired for $80 million in cash.” http://www.businessinsider.com/automated-insights-gets-acquired-by-vista-for-80-million-20...

[WAR] Warden, Pete. “Why GEMM is at the heart of deep learning.” https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/

[WAR2] Warden, Pete. “Why Deep Learning Needs Assembler Hackers.” https://petewarden.com/2017/01/03/why-deep-learning-needs-assembler-hackers/

[WAT] Watson, Nell. “Artificial Intuition; The Limitations (and ridiculous power) of Deep Learning Creativity.” https://medium.com/intuitionmachine/artificial-intuition-3418fac2eb9c

[WIKI] “LLVM.” https://en.wikipedia.org/wiki/LLVM

derek_bouius
Staff

Today in San Francisco, California, AMD held a special event where we announced the newest additions to the Radeon Instinct™ family of compute products: the AMD Radeon Instinct™ MI60 and Radeon Instinct™ MI50. In step with the new hardware, the Radeon Open eCosystem (ROCm) has been updated with massive improvements in the device drivers, the compilers and supporting tools. The low-level math libraries, along with MIOpen, the machine intelligence library, have been optimized to really make deep learning applications sing.

ROCm is an open software platform for GPU-enabled HPC computing. It was created with developers in mind to accommodate future technologies including machine learning and artificial intelligence. As an open platform, the ROCm ecosystem provides a rich foundation of modern programming languages, designed to speed development of high-performance, energy-efficient heterogeneous computing systems.

We enabled AMD’s ROCm capable GPUs in the Linux ecosystem for easy deployment of deep learning applications in Linux distributions. The amdkfd device driver is now supported in the mainline kernel and this kernel is picked up by all the major distributions for their standard releases. Now we also support MI60 and MI50, based on the new Vega architecture, in the linux-next repository. For distributions not using the latest kernel, a DKMS build is still a viable option to add support for the MI60 and MI50 GPUs.

We have updated the LLVM-based clang compiler to support the new GPU architecture, including new compute instructions targeted at accelerating machine learning computations. These low-level instructions implement compute operations all the way from single-bit precision to 64-bit floating point. The most beneficial instruction for accelerating deep learning training is a float16 dot product that accumulates into a 32-bit result, maintaining the accuracy of the operation.
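Conceptually, that instruction multiplies half-precision inputs while keeping the running sum in single precision. The scalar C++ sketch below is only an illustration of the idea, not the GPU instruction itself, and it assumes a compiler that supports the _Float16 extension:

  #include <vector>

  // Multiply fp16 operands, accumulate in fp32 to preserve accuracy.
  float dot_fp16_accumulate_fp32(const std::vector<_Float16>& a,
                                 const std::vector<_Float16>& b) {
      float acc = 0.0f;  // 32-bit accumulator
      for (size_t i = 0; i < a.size(); ++i) {
          acc += static_cast<float>(a[i]) * static_cast<float>(b[i]);
      }
      return acc;
  }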

Profiling and debugging tools required updates to support the new hardware. These tools enable developers to get the most out of the GPU compute cycles and understand where the bottlenecks occur in their applications. Follow the development on our GitHub site.

Math libraries were customized with the hardware architecture in mind, resulting in a highly optimized solution. There are many different ways to optimize these math operations, and each specific matrix and convolution size needs to be tuned, so AMD built a tool to help automate the optimization process. This tool is called Tensile and is very useful for creating a library for GEMMs, GEMM-like problems (such as batched GEMM), N-dimensional tensor contractions, and anything else that multiplies two multi-dimensional objects together on a GPU. MIOpen also underwent massive optimizations and updates to realize the incredible benefits of the foundational math libraries when integrated with deep learning frameworks.

One of the most exciting developments over the past year is the integration and progress with the machine learning frameworks. ROCm has been updated to support the TensorFlow framework API v1.11, and we are actively upstreaming the code into the main repository. Check out the TensorFlow GitHub to follow the updates, or see our GitHub page for PyTorch, Caffe2, Caffe and other framework developments.

To try out the newest packages, develop an application and easily deploy a ROCm solution, get the most recent Docker images here, which saves you the time of collecting all the libraries and building them specifically for your platform.

We are always looking for skilled developers excited to work in this rapidly changing field. Check out our job listings at amd.com.
