

[Originally posted on 11/06/18]

 

Today in San Francisco, California, AMD held a special event where we announced the newest additions to the Radeon Instinct™ family of compute products. The AMD Radeon Instinct™ MI60 and Radeon Instinct™ MI50 accelerators are the first GPUs in the world based on the advanced 7nm FinFET process technology. Moving to 7nm allows us to put more transistors onto an even smaller package than was possible before – in this case, the MI60 contains 13.2 billion transistors on a package size of 331.46mm², while the previous-generation Radeon Instinct™ MI25 had 12.5 billion transistors on a package size of 494.8mm² – a 58% improvement in the number of transistors per mm². This allows us to provide a more powerful and robust product, capable of tackling a wide range of workloads, from training and inference to high performance computing.
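A quick check of those density numbers (using only the figures quoted above):

13.2 billion transistors / 331.46 mm² ≈ 39.8 million per mm² (MI60)
12.5 billion transistors / 494.8 mm² ≈ 25.3 million per mm² (MI25)
39.8 / 25.3 ≈ 1.58, i.e. roughly 58% more transistors per mm²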

 

 

Supercharged Deep Learning Operations – Ideal for Training and Inference

 

We’ve made numerous improvements in these new products, including optimized deep learning operations. In addition to native half-precision (FP16) performance, the MI60 and MI50 now support INT8 and INT4 operations, delivering up to a whopping 118 TOPS of INT4 peak performance on the MI60. The supercharged compute capabilities of these new products are designed to meet today’s demanding system requirements: handling large data efficiently for training complex neural networks and running inference against those trained networks.

 

[Image: Vega 20 GPU]

 

World’s Fastest Double Precision PCIe® Based Accelerator

 

On the other end of the compute spectrum are FP64 calculations, used primarily in high performance computing workloads. These workloads require extreme accuracy and speed, and the MI60 and MI50 deliver both. The Radeon Instinct MI60 is the fastest double-precision PCIe® based accelerator1, delivering up to 7.4 TFLOPS of FP64 peak performance, while the MI50 is not far behind at 6.7 TFLOPS. In addition to fast FP64 performance, the MI60 and MI50 both sport full-chip ECC memory3 as well as RAS4. This allows scientists and researchers across several industries, including life sciences, energy, automotive and aerospace, government and more, to achieve results with both speed and accuracy.

 

[Image: Radeon Instinct accelerator, front angle]

Finely Balanced, Ultra-Scalable Datacenter Solution

 

Most of the improvements we’ve talked about so far have been at the chip level, but we didn’t stop there; there are also a number of new benefits beyond the chip. We meticulously designed the MI60 and MI50 to deliver finely tuned and balanced performance. We looked at some of the common bottlenecks found in previous generations and made improvements to ensure your data is processed in the most efficient manner possible. This includes making these cards PCIe® Gen 4* capable, delivering up to 2x more bandwidth (64 GB/s vs. 32 GB/s) than PCIe® Gen 3 when communicating over the bus. In addition to improved performance between GPU and CPU, we’ve also built into these products a peer-to-peer GPU communication feature called Infinity Fabric™ Link technology. Each card includes two physical Infinity Fabric™ Links, allowing you to directly connect four GPUs together in a GPU hive ring and up to two of these hives in an 8 GPU server. Each GPU card provides up to 200 GB/s of bandwidth between peer GPUs, which is up to 6x faster than PCIe Gen 3 alone2. We have also doubled memory bandwidth from our previous-generation Radeon Instinct MI25 accelerator5, delivering up to 1TB/s of memory bandwidth on both the MI50 and MI60 accelerators – the first GPUs to achieve this speed.
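Those multipliers follow directly from the link rates quoted above:

64 GB/s ÷ 32 GB/s = 2x (PCIe® Gen 4 vs. Gen 3 over the bus)
200 GB/s ÷ 32 GB/s ≈ 6.25x, quoted conservatively as "up to 6x" (Infinity Fabric™ Link peer-to-peer vs. PCIe Gen 3)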

 

[Image: Four Radeon Instinct MI60 accelerators]

 

With improved performance both within the GPU and between GPUs and CPUs, these new finely balanced, ultra-fast and scalable solutions are the ideal datacenter compute solution for all your needs, whether they’re inference, training or HPC related.

 

Learn More About the AMD Radeon Instinct MI60

Learn More About the AMD Radeon Instinct MI50

Learn More About AMD’s “Vega 7nm” Technology

Learn More About ROCm

 

 

Warren Eng is a Product Marketing Manager for professional graphics and compute at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-5


 

[Originally posted on 11/21/17]

 

This year at SC17, AMD showcased Radeon Instinct™ accelerators, AMD EPYC™ processors and the ROCm open software platform – a complete ecosystem to drive a new era in the datacenter. Our booth was packed with server racks from partners like Inventec, Gigabyte, Supermicro and BOXX. Attendees had the opportunity to check out Project 47, both on display and running demos, offering 1 PetaFLOPS of compute power.

 

The much anticipated TensorFlow support with ROCm 1.7 was revealed in our booth alongside a demo of deep learning inference from a trained Caffe model. AMD also offered hourly Tech Talks, diving into a wide range of topics – from AMD EPYC™ performance to Radeon technology powering the exploration of dark energy with the CHIME radio telescope.

 

Thank you to everyone that joined us at SC17. For those that were unable to attend, check out our photo gallery below. We hope to see you next year at SC18!

 


 

Daniel Skrba is a Marketing and Communications Specialist for the Radeon Technologies Group at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.


 

[Originally posted on 10/27/17]

 

Visit AMD at our SC17 booth #825 and learn how AMD, together with our partners, is bringing about a new era in the datacenter that is revolutionizing High Performance Computing with our new AMD EPYC™ processors and Radeon Instinct™ accelerators. On top of this year’s show-stopping demos, you will have the opportunity to attend one of our interactive and educational booth Tech Talks – check out the schedule below.

 

Featured AMD Tech Talks

 

Tuesday, Nov. 14th, 2017

 

  • 11AM: Reconfigurable Acceleration at Cloud Scale, Manish Muthal, Vice President of Data Center Marketing, Xilinx
  • 1PM: Introducing AMD EPYC™: A New Standard of Performance and Innovation, Girish Kulkarni, Director of Product Marketing, AMD Server Group, AMD
  • 2PM: Exploring Dark Energy with the CHIME Radio Telescope, powered by Radeon™ Technology, Andre Renard, CHIME Computing Specialist, Dunlap Institute for Astronomy & Astrophysics, University of Toronto
  • 3PM: AMD EPYC™ for HPC, Joshua Mora, PhD, Manager Field Application Engineering, AMD
  • 4PM: AMD Radeon Instinct™ Accelerators, Niles Burbank, Sr. Product Manager, AMD
  • 5PM: Redefining HPC Performance with EPYC-based Supermicro Servers, Super Micro Computer, Inc.

 

Wednesday, Nov. 15th, 2017

 

  • 11AM: Interconnect Your Future with Mellanox “Smart” Interconnect, Gilad Shainer, Vice President of Marketing, Mellanox Technologies
  • 1:00 PM: Accelerating 3D Acoustics With HCC-C++, Reid Atcheson, Accelerator Software Engineer, NAG
  • 2PM: AMD EPYC™ for HPC, Joshua Mora, PhD, Manager Field Application Engineering, AMD
  • 3PM: Advances in GPU Networking at AMD, Michael Lebeane, Sr. Design Engineer, AMD Research
  • 4PM: Running TensorFlow on AMD’s ROCm software platform with HIP, Ben Sander, Sr. Fellow, Software Engineer, AMD

 

Venue: COLORADO CONVENTION CENTER (Denver, CO)

 

We hope to see you in Denver!


 

[Originally posted on 10/10/17 - by Gregory Stoner]

 

AMD is excited to see the emergence of the Open Neural Network Exchange (ONNX) format, which creates a common model format bridging three industry-leading deep learning frameworks (PyTorch, Caffe2, and Cognitive Toolkit) and gives our customers simpler paths to explore their networks via rich framework interoperability.

 

The ONNX format, via its extensible computation graph model, built-in operators, and standard data types, will allow our team to focus on deeper optimization with our Radeon Instinct hardware and on a more productive solution set via our open-source MIOpen deep learning solver library and ROCm compiler technology. It also gives us a path to explore new foundations beyond traditional frameworks, bringing lighter-weight, more optimized solutions to our hardware.

 

It is great to see Facebook and Microsoft continuing to follow open software development practice with ONNX, building on their open-source projects PyTorch, Caffe2, and Cognitive Toolkit. Open software development aligns with our philosophy of delivering an open-source software platform, tools, and drivers that give the research community a more powerful ability to explore a broader deep learning design space.

 

We feel this is an excellent step toward opening these platforms up to a broader set of diverse architectures. We look forward to working with the project and helping it grow in the coming months.

 

 

Gregory Stoner is Sr. Director of Radeon Open Compute. Links to third-party sites and references to third-party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third-party endorsement of AMD or any of its products is implied. Use of third-party names or marks is for informational purposes only and no endorsement of or by AMD is intended or implied.


 

[Originally posted on 09/08/17 by Albert J. De Vera]

 

Deep learning, an advanced form of machine learning, has generated a lot of interest due to its wide range of applications on complex data sets. Current technologies and the availability of very large amounts of complex data have made analytics on such data sets more tractable.

 

With deep neural networks as the basis for deep learning algorithms, GPUs are now being used in deep learning applications because they provide many processing units. These processing units carry out the neural network’s computation on the data in parallel, which lets neural networks scale and improve the extraction of information from data.

 

ROCm and The AMD Deep Learning Stack

The AMD Deep Learning Stack is the result of AMD’s initiative to enable DL applications using their GPUs such as the Radeon Instinct product line. Currently, deep learning frameworks such as Caffe, Torch, and TensorFlow are being ported and tested to run on the AMD DL stack. Supporting these frameworks is MIOpen, AMD’s open-source deep learning library built for the Radeon Instinct line of compute accelerators.

 

AMD’s ROCm platform serves as the foundation of this DL stack. ROCm enables the seamless integration of the CPU and GPU for high performance computing (HPC) and ultra-scale class computing. To achieve this, ROCm is built for language independence and takes advantage of the Heterogeneous System Architecture (HSA) Runtime API.3 This is the basis of the ROCr System Runtime, a thin user-mode API providing access to graphics hardware driven by the AMDGPU driver and the ROCk kernel driver.

 

[Image: ROCm software stack diagram]

 

For now, OS support for ROCm is limited to Ubuntu 14.04, Ubuntu 16.04, and Fedora 23. For these OSs, AMD provides a modified Linux version 4.6 kernel with patches to the HSA kernel driver (amdkfd) and the AMDGPU (amdgpu) kernel driver currently in the mainline Linux kernel.5

 

Using Docker With The AMD Deep Learning Stack

 

Docker Containers

Software containers isolate the application and its dependencies from other software installed on the host. They abstract the underlying operating system while keeping its own resources (filesystem, memory, CPU) and environment separate from other containers.

 

In contrast to virtual machines, all containers running on the same host share a single operating system without the need to virtualize a complete machine with its own OS. This makes software containers perform much faster than virtual machines because of the lack of overhead from the guest OS and the hypervisor.

 

Docker is the most popular software container platform today. It is available for Linux, macOS, and Microsoft Windows. Docker containers can run under any OS with the Docker platform installed.6

 

Installing Docker and The AMD Deep Learning Stack

The ROCm-enabled Linux kernel and the ROCk driver, together with other needed kernel modules, must be installed on all hosts that run Docker containers. This is because the containers do not have the kernel installed inside them. Instead, the containers share the host kernel.7

 

The installation procedure described here is for Ubuntu 16.04, which is currently the most tested OS for ROCm.

 

Installing ROCm

The next step is to install ROCm and the ROCm kernel on each host. The procedure described below is based on instructions found in https://rocm.github.io/install.html.

 

Grab and install the GPG key for the repository:

wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -

 

You should get the message ‘OK’. You can check if it’s there using apt-key:

apt-key list

 

In /etc/apt/sources.list.d, create a file named rocm.list and place the following line in it:

deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main
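Equivalently, the file can be created from the shell with a single command (a sketch):

echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list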

 

Update the repository information by running ‘apt update’. If you get a warning because of the key signature, you may ignore it since the repository administrator will update this in the future.

 

Install the ROCm Runtime software stack using ‘apt install rocm’:

 

[root@pegasus ~]# apt install rocm

Reading package lists… Done

Building dependency tree

Reading state information… Done

 

 

The following packages were automatically installed and are no longer required:

hcblas hcfft hcrng miopengemm

Use ‘sudo apt autoremove’ to remove them.

The following additional packages will be installed:

hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm-dev

rocm-device-libs rocm-profiler rocm-smi rocm-utils

 

 

Suggested packages:

linux-firmware-image-4.11.0-kfd-compute-rocm-rel-1.6-148

 

 

The following NEW packages will be installed:

hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm rocm-dev

rocm-device-libs rocm-profiler rocm-smi rocm-utils

0 upgraded, 10 newly installed, 0 to remove and 0 not upgraded.

Need to get 321 MB of archives.

After this operation, 1,934 MB of additional disk space will be used.

Do you want to continue? [Y/n]

Get:1 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-utils amd64 1.0.0 [30.7 kB]

Get:2 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hcc amd64 1.0.17312 [255 MB]

Get:3 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hip_hcc amd64 1.2.17305 [876 kB]

Get:4 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [10.8 MB]

Get:5 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [46.5 MB]

Get:6 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-device-libs amd64 0.0.1 [587 kB]

Get:7 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-smi amd64 1.0.0-25-gbdb99b4 [8,158 B]

Get:8 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-profiler amd64 5.1.6400 [7,427 kB]

Get:9 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-dev amd64 1.6.148 [902 B]

Get:10 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm amd64 1.6.148 [1,044 B]

Fetched 321 MB in 31s (10.1 MB/s)

Selecting previously unselected package rocm-utils.

(Reading database … 254059 files and directories currently installed.)

Preparing to unpack …/rocm-utils_1.0.0_amd64.deb …

Unpacking rocm-utils (1.0.0) …

Selecting previously unselected package hcc.

Preparing to unpack …/hcc_1.0.17312_amd64.deb …

Unpacking hcc (1.0.17312) …

Selecting previously unselected package hip_hcc.

Preparing to unpack …/hip%5fhcc_1.2.17305_amd64.deb …

Unpacking hip_hcc (1.2.17305) …

Selecting previously unselected package linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148.

Preparing to unpack …/linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …

Unpacking linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

Selecting previously unselected package linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148.

Preparing to unpack …/linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …

Unpacking linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

Selecting previously unselected package rocm-device-libs.

Preparing to unpack …/rocm-device-libs_0.0.1_amd64.deb …

Unpacking rocm-device-libs (0.0.1) …

Selecting previously unselected package rocm-smi.

Preparing to unpack …/rocm-smi_1.0.0-25-gbdb99b4_amd64.deb …

Unpacking rocm-smi (1.0.0-25-gbdb99b4) …

Selecting previously unselected package rocm-profiler.

Preparing to unpack …/rocm-profiler_5.1.6400_amd64.deb …

Unpacking rocm-profiler (5.1.6400) …

Selecting previously unselected package rocm-dev.

Preparing to unpack …/rocm-dev_1.6.148_amd64.deb …

Unpacking rocm-dev (1.6.148) …

Selecting previously unselected package rocm.

Preparing to unpack …/rocm_1.6.148_amd64.deb …

Unpacking rocm (1.6.148) …

Setting up rocm-utils (1.0.0) …

Setting up hcc (1.0.17312) …

Setting up hip_hcc (1.2.17305) …

Setting up linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

Setting up linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

update-initramfs: Generating /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148

W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.

Generating grub configuration file …

Found linux image: /boot/vmlinuz-4.11.0-kfd-compute-rocm-rel-1.6-148

Found initrd image: /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148

Found linux image: /boot/vmlinuz-4.4.0-93-generic

Found initrd image: /boot/initrd.img-4.4.0-93-generic

Found memtest86+ image: /memtest86+.elf

Found memtest86+ image: /memtest86+.bin

done

Setting up rocm-device-libs (0.0.1) …

Setting up rocm-smi (1.0.0-25-gbdb99b4) …

Setting up rocm-profiler (5.1.6400) …

Setting up rocm-dev (1.6.148) …

Setting up rocm (1.6.148) …

KERNEL=="kfd", MODE="0666"

 

 

Reboot the server. Make sure that the Linux ROCm kernel is running:

 

Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.11.0-kfd-compute-rocm-rel-1.6-148 x86_64)

 

* Documentation: https://help.ubuntu.com

* Management: https://landscape.canonical.com

* Support: https://ubuntu.com/advantage

 

0 packages can be updated.

0 updates are security updates.

 

Test if your installation works with this sample program:

 

cd /opt/rocm/hsa/sample

make

./vector_copy

 

You should get an output similar to this:

 

Initializing the hsa runtime succeeded.

Checking finalizer 1.0 extension support succeeded.

Generating function table for finalizer succeeded.

Getting a gpu agent succeeded.

Querying the agent name succeeded.

The agent name is gfx803.

Querying the agent maximum queue size succeeded.

The maximum queue size is 131072.

Creating the queue succeeded.

“Obtaining machine model” succeeded.

“Getting agent profile” succeeded.

Create the program succeeded.

Adding the brig module to the program succeeded.

Query the agents isa succeeded.

Finalizing the program succeeded.

Destroying the program succeeded.

Create the executable succeeded.

Loading the code object succeeded.

Freeze the executable succeeded.

Extract the symbol from the executable succeeded.

Extracting the symbol from the executable succeeded.

Extracting the kernarg segment size from the executable succeeded.

Extracting the group segment size from the executable succeeded.

Extracting the private segment from the executable succeeded.

Creating a HSA signal succeeded.

Finding a fine grained memory region succeeded.

Allocating argument memory for input parameter succeeded.

Allocating argument memory for output parameter succeeded.

Finding a kernarg memory region succeeded.

Allocating kernel argument memory buffer succeeded.

Dispatching the kernel succeeded.

Passed validation.

Freeing kernel argument memory buffer succeeded.

Destroying the signal succeeded.

Destroying the executable succeeded.

Destroying the code object succeeded.

Destroying the queue succeeded.

Freeing in argument memory buffer succeeded.

Freeing out argument memory buffer succeeded.

Shutting down the runtime succeeded.

 

Installing Docker

We are installing the Docker Community Edition (also called Docker CE) on the host by using Docker’s apt repository. Our procedure is based on documentation published by Docker.8 There may be some slight differences from the original documentation. Note that the installation is done as the superuser. You can also use sudo to install Docker.

 

First, remove old versions of Docker:

apt remove docker docker-engine

 

If they are not installed, you will simply get a message that they are missing.

 

Install the following prerequisite packages using apt:

 

apt-transport-https

ca-certificates

curl

software-properties-common
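For example, a single apt invocation covers all four packages (run as root, or prefix with sudo):

apt install apt-transport-https ca-certificates curl software-properties-common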

 

Add the Docker GPG key to your host:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg |

sudo apt-key add -

The GPG fingerprint should be 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88. Use the command

 

apt-key fingerprint 0EBFCD88

to verify this.

 

Now add the repository information:

add-apt-repository \

"deb [arch=amd64] https://download.docker.com/linux/ubuntu \

$(lsb_release -cs) \

stable"

 

Finally, issue the command ‘apt update’.

 

Installing Docker CE should be done with ‘apt install docker-ce’. After the installation is complete, verify that Docker is properly configured and installed using the command ‘docker run hello-world’.
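Putting those last steps together (a sketch of the commands just described, run as root or with sudo):

apt update
apt install docker-ce
docker run hello-world   # prints a short welcome message if Docker is installed correctly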

 

Running ROCm Docker Images

AMD provides a Docker image of the ROCm software framework.9 The image can be pulled from the official Docker repository:

 

sudo docker pull rocm/rocm-terminal

The image is about 1.5 GB in size and contains the necessary libraries to run ROCm-based applications. Create a container out of this image and look at the installed software in /opt/rocm:

 

sudo docker run -it --rm --device=/dev/kfd rocm/rocm-terminal

You can check for the ROCm libraries using ldconfig:

 

ldconfig -NXv

The command above should list all the libraries in the library path including the ROCm libraries.

 

The ROCm-docker source is available from GitHub:

 

mkdir ~/tmp

cd ~/tmp

git clone https://github.com/RadeonOpenCompute/ROCm-docker.git

Creating A ROCm Application Docker Image

We can use the rocm/rocm-terminal Docker image to build our own ROCm application Docker image. In the following examples, we use a couple of the sample applications that come with the ROCm development package. The first is /opt/rocm/hip/samples/1_Utils/hipInfo.

 

Assuming the host has the complete ROCm development tools, we just do the following:

cd /opt/rocm/hip/samples/1_Utils/hipInfo

make

 

The outcome of the make command shall be a binary called hipInfo.

 

If the compiler complains because of a missing shared library called libsupc++, we will need to install that somewhere in the host’s library path. In our case, we shall place the shared library in /usr/local/lib and make sure that ldconfig can find it. You can simply create a shared library from the installed static library /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a:

 

 

mkdir -p ~/tmp/libsupc++

cd ~/tmp/libsupc++

ar x /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a

ls -l *.o

gcc -shared -o libsupc++.so *.o

sudo cp -p libsupc++.so /usr/local/lib/

sudo ldconfig -v

Make sure that /usr/local/lib is seen by ldconfig. You may have to specify this directory in /etc/ld.so.conf.d if it is not found. Simply add a file named local_lib.conf with the line /usr/local/lib by itself.
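For example (a sketch of the two steps just described):

echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/local_lib.conf
sudo ldconfig -v | grep "/usr/local/lib"   # confirm the directory is now scanned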

 

 

Check the output of hipInfo by running it. You should get something like this (it will be slightly different from the literal output below depending on what type of GPU configuration you have):

 

 

$ ./hipInfo

compiler: hcc version=1.0.17312-d1f4a8a-19aa706-56b5abe, workweek (YYWWD) = 17312

——————————————————————————–

device# 0

Name: Device 67df

pciBusID: 1

pciDeviceID: 0

multiProcessorCount: 36

maxThreadsPerMultiProcessor: 2560

isMultiGpuBoard: 1

clockRate: 1303 Mhz

memoryClockRate: 2000 Mhz

memoryBusWidth: 256

clockInstructionRate: 1000 Mhz

totalGlobalMem: 8.00 GB

maxSharedMemoryPerMultiProcessor: 8.00 GB

totalConstMem: 16384

sharedMemPerBlock: 64.00 KB

regsPerBlock: 0

warpSize: 64

l2CacheSize: 0

computeMode: 0

maxThreadsPerBlock: 1024

maxThreadsDim.x: 1024

maxThreadsDim.y: 1024

maxThreadsDim.z: 1024

maxGridSize.x: 2147483647

maxGridSize.y: 2147483647

maxGridSize.z: 2147483647

major: 2

minor: 0

concurrentKernels: 1

arch.hasGlobalInt32Atomics: 1

arch.hasGlobalFloatAtomicExch: 1

arch.hasSharedInt32Atomics: 1

arch.hasSharedFloatAtomicExch: 1

arch.hasFloatAtomicAdd: 0

arch.hasGlobalInt64Atomics: 1

arch.hasSharedInt64Atomics: 1

arch.hasDoubles: 1

arch.hasWarpVote: 1

arch.hasWarpBallot: 1

arch.hasWarpShuffle: 1

arch.hasFunnelShift: 0

arch.hasThreadFenceSystem: 0

arch.hasSyncThreadsExt: 0

arch.hasSurfaceFuncs: 0

arch.has3dGrid: 1

arch.hasDynamicParallelism: 0

peers:

non-peers: device#0

memInfo.total: 8.00 GB

memInfo.free: 7.75 GB (97%)

 

Now that hipInfo is compiled and has been tested, let us create a Docker image with it. Create a directory for building an image with Docker.

 

mkdir ~/tmp/my_rocm_hipinfo

cd ~/tmp/my_rocm_hipinfo

 

Copy the necessary files for the Docker image to run properly:

 

cp -p /usr/local/lib/libsupc++.so . # If hipInfo needs this

cp -p /opt/rocm/hip/samples/1_Utils/hipInfo/hipInfo .

Create a file named Dockerfile in the current directory. It should contain this:

 

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/

COPY hipInfo /usr/local/bin/

RUN sudo ldconfig

 

USER rocm-user

WORKDIR /home/rocm-user

ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/bin"

 

ENTRYPOINT ["hipInfo"]

 

Build the Docker image:

 

sudo docker build -t my_rocm_hipinfo .

Create and run a container based on the new image:

 

 

sudo docker run --rm --device="/dev/kfd" my_rocm_hipinfo

The device /dev/kfd is the kernel fusion driver. You should get output similar to what you saw when running the hipInfo binary directly on the host.

 

 

Without the --rm parameter, the container will persist. You can then run the same container again and get some output:

 

 

sudo docker run --device="/dev/kfd" --name nifty_hugle my_rocm_hipinfo

The Docker container shall persist:

 

 

sudo docker ps -a

The output should include an entry for the nifty_hugle container.

 

 

Now, try this command and you should see the output from hipInfo again:

 

 

sudo docker start -i nifty_hugle

The second Docker image we shall create will contain the sample binary called vector_copy. The source is in /opt/rocm/hsa/sample. As done with hipInfo, use make to build the binary. Note that this binary also depends on the files with the .brig extension to run.

 

 

We do the following before we build the image:

 

 

mkdir ~/tmp/my_rocm_vectorcopy

cd ~/tmp/my_rocm_vectorcopy

mkdir vector_copy

cp -p /usr/local/lib/libsupc++.so . # Do this if necessary

cd vector_copy

cp -p /opt/rocm/hsa/sample/vector_copy .

cp -p /opt/rocm/hsa/sample/vector_copy*.brig .

cd .. # Back to ~/tmp/my_rocm_vectorcopy

For our Dockerfile, we have this:

 

 

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/

RUN sudo mkdir /usr/local/vector_copy

COPY vector_copy/* /usr/local/vector_copy/

RUN sudo ldconfig

 

 

USER rocm-user

ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/vector_copy"

 

 

WORKDIR /usr/local/vector_copy

ENTRYPOINT ["vector_copy"]

 

 

 

 

 

Building the Docker image for vector_copy should be familiar by now.
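For reference, the build and run commands mirror the hipInfo image (a sketch reusing the same device flag):

sudo docker build -t my_rocm_vectorcopy .
sudo docker run --rm --device="/dev/kfd" my_rocm_vectorcopy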

 

 

As an exercise, run the Docker image to see what output you get. Try with or without --rm and with the ‘docker start’ command.

 

 

 

 

 

For our last example, we shall use a Docker container for the Caffe deep learning framework. We are going to use the HIP port of Caffe which can be targeted to both AMD ROCm and Nvidia CUDA devices.10 Converting CUDA code to portable C++ is enabled by HIP. For more information on HIP, see https://github.com/ROCm-Developer-Tools/HIP.

 

 

 

 

 

Let us pull the hip-caffe image from the Docker registry:

 

 

docker pull intuitionfabric/hip-caffe

Test the image by running a device query on the AMD GPUs:

 

 

sudo docker run --name my_caffe -it --device=/dev/kfd --rm \

intuitionfabric/hip-caffe ./build/tools/caffe device_query -gpu all

You should get an output similar to the one below. Note that your output may differ due to your own host configuration.

I0831 19:05:30.814853 1 caffe.cpp:138] Querying GPUs all

I0831 19:05:30.815135 1 common.cpp:179] Device id: 0

I0831 19:05:30.815145 1 common.cpp:180] Major revision number: 2

I0831 19:05:30.815148 1 common.cpp:181] Minor revision number: 0

I0831 19:05:30.815153 1 common.cpp:182] Name: Device 67df

I0831 19:05:30.815158 1 common.cpp:183] Total global memory: 8589934592

I0831 19:05:30.815178 1 common.cpp:184] Total shared memory per block: 65536

I0831 19:05:30.815192 1 common.cpp:185] Total registers per block: 0

I0831 19:05:30.815196 1 common.cpp:186] Warp size: 64

I0831 19:05:30.815201 1 common.cpp:188] Maximum threads per block: 1024

I0831 19:05:30.815207 1 common.cpp:189] Maximum dimension of block: 1024, 1024, 1024

I0831 19:05:30.815210 1 common.cpp:192] Maximum dimension of grid: 2147483647, 2147483647, 2147483647

I0831 19:05:30.815215 1 common.cpp:195] Clock rate: 1303000

I0831 19:05:30.815219 1 common.cpp:196] Total constant memory: 16384

I0831 19:05:30.815223 1 common.cpp:200] Number of multiprocessors: 36

 

 

 

Let us now run Caffe in a container. We begin by creating a container for this purpose.

 

 

 

 

 

sudo docker run -it --device=/dev/kfd --rm intuitionfabric/hip-caffe

Once the above command is executed, you should be inside the container, where you can run the MNIST example.

First, get the raw MNIST data:

 

 

./data/mnist/get_mnist.sh

Make sure you format the data for Caffe:

 

 

./examples/mnist/create_mnist.sh

Once that’s done, proceed with training the network:

 

 

./examples/mnist/train_lenet.sh

You should get an output similar to this:

 

 

I0831 18:43:19.290951 37 caffe.cpp:217] Using GPUs 0

I0831 18:43:19.291165 37 caffe.cpp:222] GPU 0: Device 67df

I0831 18:43:19.294853 37 solver.cpp:48] Initializing solver from parameters:

test_iter: 100

test_interval: 500

base_lr: 0.01

display: 100

max_iter: 10000

lr_policy: "inv"

gamma: 0.0001

power: 0.75

momentum: 0.9

weight_decay: 0.0005

snapshot: 5000

snapshot_prefix: "examples/mnist/lenet"

solver_mode: GPU

device_id: 0

net: "examples/mnist/lenet_train_test.prototxt"

train_state {

level: 0

stage: ""

}

I0831 18:43:19.294972 37 solver.cpp:91] Creating training net from net file: examples/mnist/lenet_train_test.prototxt

I0831 18:43:19.295145 37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist

I0831 18:43:19.295169 37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy

I0831 18:43:19.295181 37 net.cpp:58] Initializing net from parameters:

name: "LeNet"

state {

phase: TRAIN

level: 0

stage: ""

}

layer {

name: "mnist"

type: "Data"

top: "data"

top: "label"

include {

phase: TRAIN

}

transform_param {

scale: 0.00390625

}

data_param {

source: "examples/mnist/mnist_train_lmdb"

batch_size: 64

backend: LMDB

}

}

layer {

name: "conv1"

type: "Convolution"

bottom: "data"

top: "conv1"

param {

lr_mult: 1

}

param {

lr_mult: 2

}

convolution_param {

num_output: 20

kernel_size: 5

stride: 1

weight_filler {

type: "xavier"

}

bias_filler {

type: "constant"

}

}

}

….….layer {

name: "loss"

type: "SoftmaxWithLoss"

bottom: "ip2"

bottom: "label"

top: "loss"

}

I0831 18:43:19.295332 37 layer_factory.hpp:77] Creating layer mnist

I0831 18:43:19.295426 37 net.cpp:100] Creating Layer mnist

I0831 18:43:19.295444 37 net.cpp:408] mnist -> data

I0831 18:43:19.295478 37 net.cpp:408] mnist -> label

I0831 18:43:19.304414 40 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb

I0831 18:43:19.304760 37 data_layer.cpp:41] output data size: 64,1,28,28

I0831 18:43:19.305835 37 net.cpp:150] Setting up mnist

I0831 18:43:19.305842 37 net.cpp:157] Top shape: 64 1 28 28 (50176)

I0831 18:43:19.305848 37 net.cpp:157] Top shape: 64 (64)

I0831 18:43:19.305851 37 net.cpp:165] Memory required for data: 200960

I0831 18:43:19.305874 37 layer_factory.hpp:77] Creating layer conv1

I0831 18:43:19.305907 37 net.cpp:100] Creating Layer conv1

I0831 18:43:19.305912 37 net.cpp:434] conv1 <- data

I0831 18:43:19.305940 37 net.cpp:408] conv1 -> conv1

I0831 18:43:19.314159 37 cudnn_conv_layer.cpp:259] Before miopenConvolution*GetWorkSpaceSize

I0831 18:43:19.319051 37 cudnn_conv_layer.cpp:295] After miopenConvolution*GetWorkSpaceSize

I0831 18:43:19.319625 37 cudnn_conv_layer.cpp:468] Before miopenFindConvolutionForwardAlgorithm

I0831 18:43:19.927783 37 cudnn_conv_layer.cpp:493] fwd_algo_[0]: 1

I0831 18:43:19.927809 37 cudnn_conv_layer.cpp:494] workspace_fwd_sizes_[0]:57600

I0831 18:43:19.928071 37 cudnn_conv_layer.cpp:500] Before miopenFindConvolutionBackwardWeightsAlgorithm

….….I0831 18:43:23.296785 37 net.cpp:228] mnist does not need backward computation.

I0831 18:43:23.296789 37 net.cpp:270] This network produces output loss

I0831 18:43:23.296799 37 net.cpp:283] Network initialization done.

I0831 18:43:23.296967 37 solver.cpp:181] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt

I0831 18:43:23.296985 37 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist

I0831 18:43:23.296995 37 net.cpp:58] Initializing net from parameters:

name: "LeNet"

state {

phase: TEST

}

layer {

name: "mnist"

type: "Data"

top: "data"

top: "label"

include {

phase: TEST

}

transform_param {

scale: 0.00390625

}

data_param {

source: "examples/mnist/mnist_test_lmdb"

batch_size: 100

backend: LMDB

}

}……

I0831 18:44:12.620506 37 solver.cpp:404] Test net output #1: loss = 0.0299084 (* 1 = 0.0299084 loss)

I0831 18:44:12.624415 37 solver.cpp:228] Iteration 9000, loss = 0.011652

I0831 18:44:12.624441 37 solver.cpp:244] Train net output #0: loss = 0.011652 (* 1 = 0.011652 loss)

I0831 18:44:12.624449 37 sgd_solver.cpp:106] Iteration 9000, lr = 0.00617924

I0831 18:44:13.055759 37 solver.cpp:228] Iteration 9100, loss = 0.0061008

I0831 18:44:13.055778 37 solver.cpp:244] Train net output #0: loss = 0.0061008 (* 1 = 0.0061008 loss)

I0831 18:44:13.055800 37 sgd_solver.cpp:106] Iteration 9100, lr = 0.00615496

I0831 18:44:13.497696 37 solver.cpp:228] Iteration 9200, loss = 0.00277705

I0831 18:44:13.497715 37 solver.cpp:244] Train net output #0: loss = 0.00277706 (* 1 = 0.00277706 loss)

I0831 18:44:13.497720 37 sgd_solver.cpp:106] Iteration 9200, lr = 0.0061309

I0831 18:44:13.941920 37 solver.cpp:228] Iteration 9300, loss = 0.0111398

I0831 18:44:13.941941 37 solver.cpp:244] Train net output #0: loss = 0.0111398 (* 1 = 0.0111398 loss)

I0831 18:44:13.941946 37 sgd_solver.cpp:106] Iteration 9300, lr = 0.00610706

I0831 18:44:14.386647 37 solver.cpp:228] Iteration 9400, loss = 0.0179196

I0831 18:44:14.386667 37 solver.cpp:244] Train net output #0: loss = 0.0179195 (* 1 = 0.0179195 loss)

I0831 18:44:14.386672 37 sgd_solver.cpp:106] Iteration 9400, lr = 0.00608343

I0831 18:44:14.828459 37 solver.cpp:337] Iteration 9500, Testing net (#0)

I0831 18:44:14.983165 37 solver.cpp:404] Test net output #0: accuracy = 0.9884

I0831 18:44:14.983183 37 solver.cpp:404] Test net output #1: loss = 0.0393952 (* 1 = 0.0393952 loss)

I0831 18:44:14.987198 37 solver.cpp:228] Iteration 9500, loss = 0.00496538

I0831 18:44:14.987211 37 solver.cpp:244] Train net output #0: loss = 0.00496537 (* 1 = 0.00496537 loss)

I0831 18:44:14.987217 37 sgd_solver.cpp:106] Iteration 9500, lr = 0.00606002

I0831 18:44:15.433176 37 solver.cpp:228] Iteration 9600, loss = 0.00308157

I0831 18:44:15.433193 37 solver.cpp:244] Train net output #0: loss = 0.00308157 (* 1 = 0.00308157 loss)

I0831 18:44:15.433200 37 sgd_solver.cpp:106] Iteration 9600, lr = 0.00603682

I0831 18:44:15.878787 37 solver.cpp:228] Iteration 9700, loss = 0.00220143

I0831 18:44:15.878806 37 solver.cpp:244] Train net output #0: loss = 0.00220143 (* 1 = 0.00220143 loss)

I0831 18:44:15.878813 37 sgd_solver.cpp:106] Iteration 9700, lr = 0.00601382

I0831 18:44:16.321408 37 solver.cpp:228] Iteration 9800, loss = 0.0108761

I0831 18:44:16.321426 37 solver.cpp:244] Train net output #0: loss = 0.0108761 (* 1 = 0.0108761 loss)

I0831 18:44:16.321432 37 sgd_solver.cpp:106] Iteration 9800, lr = 0.00599102

I0831 18:44:16.765200 37 solver.cpp:228] Iteration 9900, loss = 0.00478531

I0831 18:44:16.765219 37 solver.cpp:244] Train net output #0: loss = 0.00478531 (* 1 = 0.00478531 loss)

I0831 18:44:16.765226 37 sgd_solver.cpp:106] Iteration 9900, lr = 0.00596843

I0831 18:44:17.204908 37 solver.cpp:454] Snapshotting to binary proto file examples/mnist/lenet_iter_10000.caffemodel

I0831 18:44:17.208767 37 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_10000.solverstate

I0831 18:44:17.211735 37 solver.cpp:317] Iteration 10000, loss = 0.0044067

I0831 18:44:17.211750 37 solver.cpp:337] Iteration 10000, Testing net (#0)

I0831 18:44:17.364528 37 solver.cpp:404] Test net output #0: accuracy = 0.9902

I0831 18:44:17.364547 37 solver.cpp:404] Test net output #1: loss = 0.0303562 (* 1 = 0.0303562 loss)

I0831 18:44:17.364552 37 solver.cpp:322] Optimization Done.

I0831 18:44:17.364555 37 caffe.cpp:254] Optimization Done.

 

 

Conclusion

In this article, we provided you with a guide on how to use AMD’s ROCm framework with Docker container technology. This should serve as a good jumpstart for your deep learning development on AMD’s platform.

 

Docker has become an essential technology for containing the complexity of deep learning development. Deep learning frameworks and tools have many dependencies, and leveraging Docker to isolate those dependencies within a Linux container leads not only to greater reliability and robustness but also to greater agility and flexibility. Many frameworks and tools are still emerging, so it is best practice to have a robust way of managing these disparate parts. Docker containers have become standard practice in deep learning, and the technology is well supported by AMD’s ROCm framework.


 

[Originally posted on 06/20/17 by Ogi Brkic]

 

Back in December 2016, we first announced our Radeon Instinct initiative, combining our strength in compute with our dedication to open software. We later announced our Radeon Vega Frontier Edition, an enabler of Radeon Instinct.

 

Today, we’re excited to tell you about the next chapter in our vision for instinctive computing. AMD’s Radeon Instinct™ accelerators will soon ship to our partners (including Boxx, Colfax, Exxact Corporation, Gigabyte, Inventec and Supermicro, among others) and power their deep learning and HPC solutions starting in Q3 2017.

 

Artificial intelligence and machine learning are changing the world in ways we never could have imagined only a few years ago, enabling life-changing breakthroughs that can solve previously unsolvable problems. Radeon Instinct™ MI25, MI8, and MI6, together with AMD’s open ROCm 1.6 software platform, can dramatically increase performance, efficiency, and ease of implementation, speeding through deep learning inference and training workloads. We’re not just looking to accelerate the drive to machine intelligence, but to power the next era of true heterogeneous compute.

 

New Radeon Instinct Accelerators

Through our Radeon Instinct server accelerator products and open ecosystem approach, we’re able to offer our customers cost-effective machine and deep learning training, edge-training and inference solutions, where workloads can take the most advantage of the GPU’s highly parallel computing capabilities.

 

We’ve also designed the three initial Radeon Instinct accelerators to address a wide range of machine intelligence applications, including data-centric HPC-class systems in academia, government labs, energy, life sciences, financial services, automotive and other industries:

 

[Image: Radeon Instinct MI25 accelerator]

The Radeon Instinct™ MI25 accelerator, based on the new “Vega” GPU architecture with a 14nm FinFET process, will be the world’s ultimate training accelerator for large-scale machine intelligence and deep learning datacenter applications. The MI25 will deliver superior FP16 and FP32 performance in a passively-cooled single GPU server card with 24.6 TFLOPS of FP16 or 12.3 TFLOPS of FP32 peak performance through its 64 compute units (4,096 stream processors). With 16GB of ultra-high-bandwidth HBM2 ECC GPU memory and up to 484 GB/s of memory bandwidth, the Radeon Instinct MI25’s design is optimized for massively parallel applications with large datasets for Machine Intelligence and HPC-class systems.
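As a rough sanity check on those peak figures (the ~1.5 GHz peak engine clock is an assumption, not stated in this post): each stream processor can retire one fused multiply-add, i.e. 2 FLOPS, per clock, and packed FP16 doubles that rate.

4,096 stream processors × 2 FLOPS × ~1.5 GHz ≈ 12.3 TFLOPS FP32
12.3 TFLOPS × 2 (packed FP16) ≈ 24.6 TFLOPS FP16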

 

[Image: Radeon Instinct MI8 accelerator]

 

The Radeon Instinct™ MI8 accelerator, harnessing the high performance and energy efficiency of the “Fiji” GPU architecture, is a small form factor HPC and inference accelerator with 8.2 TFLOPS of peak FP16|FP32 performance at less than 175W board power and 4GB of High-Bandwidth Memory (HBM) on a 512-bit memory interface. The MI8 is well suited for machine learning inference and HPC applications.

 

[Image: Radeon Instinct MI6 accelerator]

 

The Radeon Instinct™ MI6 accelerator, based on the acclaimed “Polaris” GPU architecture, is a passively cooled inference accelerator with 5.7 TFLOPS of peak FP16|FP32 performance at 150W board power and 16GB of ultra-fast GDDR5 GPU memory on a 256-bit memory interface. The MI6 is a versatile accelerator ideal for HPC and machine learning inference and edge-training deployments.

 

 

Radeon Instinct hardware is fueled by our open-source software platform, including:

  • Planned for a June 29th rollout, the ROCm 1.6 software platform, with performance improvements and new support for MIOpen 1.0, is scalable and fully open source, providing a flexible, powerful heterogeneous compute solution for a new class of hybrid Hyperscale and HPC-class systems. Comprised of an open-source Linux® driver optimized for scalable multi-GPU computing, the ROCm software platform provides multiple programming models, the HIP CUDA conversion tool, and support for GPU acceleration using the Heterogeneous Computing Compiler (HCC).

 

  • The open-source MIOpen GPU-accelerated library, available June 29th with the ROCm platform, supports machine intelligence frameworks, with planned support for Caffe®, TensorFlow® and Torch®.

 

Revolutionizing the Datacenter with “Zen”-based EPYC™ Servers and Radeon Instinct Accelerators

The Radeon Instinct MI25, combined with our new “Zen”-based EPYC™ servers and the revolutionary ROCm open software platform, will provide a progressive approach to open heterogeneous compute and machine learning from the metal forward.

 

We plan to ship Radeon Instinct products to our technology partners in Q3 for design in their deep learning and HPC solutions, giving customers a real choice of vendors for open, scale-out machine learning solutions.

 

For more details and specifications on these cards, please check out the product pages below.

 

Radeon Instinct MI25

Radeon Instinct MI8

Radeon Instinct MI6


AMD Powers Project 47


[Originally posted on 07/30/17 - by Mark Hirsch]

 

1 PetaFLOPS of Performance for the Ultimate Virtualization and Machine Intelligence Solution

Today at Capsaicin SIGGRAPH, AMD showcased what can be achieved when the world’s greatest server CPU is combined with the world’s greatest GPU, based on AMD’s revolutionary “Vega” architecture. Developed by AMD in collaboration with Inventec, Project 47 is based on Inventec’s P-series massively parallel computing platform, and is a rack designed to excel in a range of tasks, from graphics virtualization to machine intelligence.

 

Project 47 boasts 1 PetaFLOPS of compute power at full 32-bit precision delivering a stunning 30 GigaFLOPS/W, demonstrating dramatic compute efficiency.1 It boasts more cores, threads, compute units, IO lanes and memory channels in use at one time than in any other similarly configured system ever before. The incredible performance-per-dollar and performance-per-watt of Project 47 makes supercomputing a more affordable reality than ever before, whether for machine learning, virtualization or rendering.

 

[Image: Project 47 rack]

Project 47 is made up of a rack of individual servers, each harnessing one EPYC™ 7601 processor to drive up to four “Vega”-based Radeon Instinct™ MI25 accelerators using 128 PCIe® lanes, in contrast to the costly dual-CPU and PLX switch setups typically needed on competing platforms in order to run four GPUs. With Project 47, AMD showcased the ease with which multiple servers can be daisy-chained, demonstrating a rack of 20 servers running 20 EPYC SoCs and 80 Radeon Instinct MI25 accelerators.
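The headline figure lines up with the MI25’s peak single-precision rate quoted earlier in this collection (12.3 TFLOPS of FP32):

80 MI25 accelerators × 12.3 TFLOPS ≈ 984 TFLOPS, i.e. roughly 1 PetaFLOPS at full 32-bit precision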

 

To bring Project 47 to life, AMD worked closely with Samsung Electronics with respect to the HBM2 memory used across the “Vega”-based product lines including the Radeon Instinct MI25 accelerators. Samsung also provided high-performance NVMe SSD storage and high-speed DDR4 memory to enable the 1 PetaFLOPS of performance. AMD also collaborated with Mellanox Technologies, leveraging their InfiniBand solution to deliver 100Gb connectivity through the rack.

 

Project 47 is expected to be available from Inventec and their principal distributor AMAX in Q4 of this year.

 

 

 

 

 

Mark Hirsch is Corporate Vice President, Systems & Solutions for the Radeon Technologies Group at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.


[Originally posted on 11/16/17 by Carlos E. Perez]

 

AMD’s newly released Vega architecture has several unique features that can be leveraged in Deep Learning training and inference workloads.

 

The first noteworthy feature is the ability to perform FP16 at twice the rate of FP32, and INT8 at four times the rate of FP32. This translates to a peak performance of 24 teraflops for FP16 and 48 trillion operations per second for INT8. Deep learning workloads are known to work well with lower-precision arithmetic, and it is as if AMD’s architects were aware of this reality and designed Vega to exploit it. The second noteworthy feature of Vega is its new memory architecture, which permits the addressability of up to 512GB of memory. The third benefit is favorable coupling with AMD’s Threadripper and EPYC lines of microprocessors.
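Working backwards from the ratios quoted above:

24 TFLOPS FP16 ÷ 2 ≈ 12 TFLOPS FP32
48 TOPS INT8 ÷ 4 ≈ 12, matching the same ~12 TFLOPS FP32 baseline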

 

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. The current revival of interest in all things “Artificial Intelligence” (AI) is driven by the spectacular results achieved with deep learning. There are other AI technologies, like expert systems, semantic knowledge bases, logic programming and Bayesian systems, but most of classical AI has not changed much, if at all, in the last 5 years. The recent quantum leap has been driven disproportionately by progress in deep learning.

 

When Google embarked on converting their natural language translation software to deep learning, they were surprised to discover major gains. This was best described in a recent article published in the New York Times, “The Great AI Awakening”:

 

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

 

Let’s pause to recognize what happened at Google. Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, their average gain for improvement per year was only 0.4%. In Google’s first implementation, the improvement due to DL was 7 percentage points better.

 

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

 

ROCm

ROCm is software that supports High Performance Computing (HPC) workloads on AMD hardware. ROCm includes a C/C++ compiler called the Heterogeneous Compute Compiler (HCC). HCC is based on the open-source LLVM compiler infrastructure project. The HCC compiler supports direct generation of the native Radeon GPU instruction set (known as the GCN ISA). Targeting native GPU instructions is crucial for getting maximum performance, and all the libraries under ROCm support the GCN ISA.

 

Included with the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. The HCC compiler is based on previous work in heterogeneous computing at the HSA foundation. The design allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

[Image: ROCm software stack components]

The diagram above depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code, using different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GCN ISA assembler can also be used as a source for the GCN target [1].

 

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with HSA diagram). The application communicates with the ROCr driver that resides in user space in Linux. The ROCr driver uses a low latency mechanism (packet based AQL) to coordinate with the ROCk Kernel Driver.

 

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.
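In practice the conversion is driven by the hipify tools that ship with HIP; a minimal sketch (the file names here are hypothetical):

hipify-perl my_kernel.cu > my_kernel.cpp   # rewrite CUDA API calls as their HIP equivalents
hipcc my_kernel.cpp -o my_kernel           # compile the resulting HIP source for the installed GPU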

 

[Image: HIP porting flow]

 

The ROCm initiative provides the handcrafted libraries and assembly-language tooling that allow developers to extract every ounce of performance from AMD hardware. This includes rocBLAS, a BLAS library implemented from scratch with a HIP interface. AMD also provides an FFT library called rocFFT that is likewise written with HIP interfaces. MIOpen is a native library tuned for deep learning workloads; it is AMD’s alternative to Nvidia’s cuDNN library and includes Radeon GPU-specific optimizations.

 

hipCaffe

AMD currently has ported Caffe to run using the ROCm stack. You can try examples here. I ran some benchmarks found here and here is a chart of the results:

 

[Image: hipCaffe benchmark speedup chart]

 

Caffe is run on unspecified GPU hardware.

I don’t know the specific hardware that was used in these benchmarks; however, this comparison does show that the performance improvement is quite significant compared to the alternatives. One thing to observe is that the speedup is most impressive with a complex network like GoogleNet as compared to a simpler one like VGG. This reflects the amount of hand-tuning that AMD has done on the MIOpen library.

 

Deep Learning Standard Virtual Machines

Deep learning frameworks like Caffe have internal computational graphs. These graphs specify the execution order of mathematical operations, similar to a dataflow. These frameworks use the graph to orchestrate its execution on groups of CPUs and GPUs. The execution is parallel and this is one reason why GPUs are ideal for this kind of computation. There are however plenty of untapped opportunities to improve the orchestration between the CPU and GPU.

 

The current state of Deep Learning frameworks is similar to the fragmented state of compilers before the creation of common code-generation backends like LLVM. In the chaotic old days, every programming language had to re-invent its own way of generating machine code. With the development of LLVM, many languages now share the same backend: the frontend only needs to parse and translate source code to an intermediate representation (IR). Well-known examples of languages that use LLVM as their backend include Ada, C#, Common Lisp, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift.

 

Deep Learning frameworks will eventually need their own “IR”. The IR for Deep Learning is, of course, the computational graph. Frameworks like Caffe and TensorFlow each have their own internal computational graphs, and the frameworks themselves are merely convenient fronts to those graphs.

 

New research is exploring ways to optimize the computational graph that go beyond single-device optimization and towards more global, multi-device optimization. NNVM is one such framework: it performs computational graph optimization using an intermediate representation. The goal is for NNVM optimizers to reduce memory and device allocation while preserving the original computational semantics.

 

A more recent development is the port of NNVM to support AMD GPUs. The NNVM compiler can compile to the TVM stack, an end-to-end compilation stack that supports multiple backends. TVM compiles a high-level computation description written in the TVM frontend down to optimized native GPU code. It leverages an LLVM-based code generator in TVM and LLVM’s ROCm capabilities. This new project can be found at: https://github.com/ROCmSoftwarePlatform/nnvm-rocm.

 

The NNVM and TVM stacks perform optimizations in a global manner across either the computational graph or an alternative declarative specification. Conventional DL frameworks, however, have code generation and execution intertwined with their code bases, making optimization solutions less portable. Ideally, one would like to see a common standard, a DL virtual machine instruction set, to which the community can collectively contribute optimization routines. Open Neural Network eXchange (ONNX) is one such standard. ONNX is a project supported by Facebook and Microsoft, who are building support for Caffe2, PyTorch and Cognitive Toolkit. The recent TVM port reveals the potential of AMD support for a wider range of DL frameworks:

 

4-1440x700.png

TVM transforms the computational graph by minimizing memory use, optimizing data layout and fusing computational kernels. It is a reusable framework designed to support multiple hardware back-ends. NNVM provides a high-level intermediate representation that captures task scheduling and memory management, while TVM is a low-level IR for optimizing computation. A proof of concept showed that this approach of optimizing low-level operations led to around a 35% improvement over hand-engineered kernels. This end-to-end optimization, combined with AMD's open-sourced computational libraries like MIOpen, is a very promising development.

 

Conclusion

There are many Deep Learning frameworks in existence today, each with its own strengths and weaknesses. The field is making good progress toward standardization that allows these frameworks to interoperate, through a common standard Deep Learning virtual machine. ONNX is one of the more recent such standards.
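As a small illustration of that interoperability, exporting a trained model to ONNX is typically a one-liner; the sketch below uses PyTorch, and the model choice and file name are only placeholders.

import torch
import torchvision

# Any trained torch.nn.Module will do; resnet18 is just a stand-in here.
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# Trace the model with one example input and write the graph to an .onnx file.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx")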

 

In addition to standardization, global optimization of the computational graph found in Deep Learning frameworks is a path to higher performance. The TVM framework and its integration with AMD's LLVM-based backend open up the opportunity for end-to-end optimization, not only of AMD GPUs but of the combination of CPUs and GPUs.

Capture.JPG

[Originally posted on 10/20/17]

 

The recent release of ROCm 1.6, which includes a cuDNN-like library called MIOpen and a port of the Caffe deep learning framework (the AMD version is called hipCaffe), has opened up the opportunity to run deep learning projects on AMD Radeon GPUs. In this article we demonstrate 6 projects that you can start using with AMD's new hardware accelerators.

 

Most GPU-enabled deep learning frameworks rely on Nvidia's CUDA and cuDNN libraries. AMD, however, is making an aggressive effort to port many deep learning frameworks such as Caffe, Torch, MXNet and TensorFlow to run on its hardware. Developers are now able to convert CUDA code to portable C++ code, thanks to AMD's porting tools and libraries such as HIP.

 

The deep learning framework Caffe has recently been ported using HIP, allowing Deep Learning practitioners to run Caffe projects on AMD GPUs. This port can be downloaded from https://github.com/ROCmSoftwarePlatform/hipCaffe.

 

1. Traffic Sign Recognition

 

traffic.png

Source: http://benchmark.ini.rub.de/?section=gtsrb&subsection=news

 

An interesting image classification problem is the recognition of traffic signs. This project (https://github.com/magnusja/GTSRB-caffe-model) classifies 43 different German traffic signs. A data set of 50,000 images is used.

 

2. Image Synthesizer

 

5.png

Source: http://www.evolvingai.org/synthesizing

 

The University of Wyoming's Evolving AI Lab has a project whose goal is to understand how deep neural networks (DNNs) work by synthesizing the preferred stimuli that highly activate particular neurons. A deep generator network (DGN) is used as a prior for the DNN being studied. The DGN outputs synthetic images that are as similar as possible to real images from the ImageNet dataset.

 

Below are a few results from running the sample scripts in the project:

 

2.jpg

 

The project’s paper is available from http://www.evolvingai.org/files/nguyen2016synthesizing.pdf. The code needed to reproduce some of the results in the paper is on GitHub: https://github.com/Evolving-AI-Lab/synthesizing.

 

3. Traffic Light Detection

 

6.png

Source: https://github.com/davidbrai/deep-learning-traffic-lights

 

David Brailovsky from Israel writes on Medium about Recognizing Traffic Lights with Deep Learning (see: https://medium.freecodecamp.org/recognizing-traffic-lights-with-deep-learning-23dae23287cc ). Source code for his project can be found here: https://github.com/davidbrai/deep-learning-traffic-lights

 

4. Cat/Dog Classifier

 

8.jpg

Source: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/

 

This introductory tutorial by Adil Moujahid shows how to train a model, and how to use a pre-existing model, to distinguish cats from dogs in pictures. A Kaggle dataset is used for the tutorial, and the BVLC CaffeNet model serves as the pre-trained model.

 

The Caffe project already provides pre-trained models (e.g. VGG and models trained on ImageNet) that can be used as a starting point for developing other kinds of image classification.
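To give a flavor of what such a classification script looks like, here is a hedged sketch of single-image inference with a pre-trained CaffeNet through Caffe's standard Python API, which hipCaffe preserves; the file paths are placeholders for your own model files.

import numpy as np
import caffe

caffe.set_device(0)   # first Radeon GPU visible to the ROCm stack
caffe.set_mode_gpu()

# Placeholder paths; substitute your own deploy prototxt and weights.
net = caffe.Net("deploy.prototxt", "bvlc_reference_caffenet.caffemodel", caffe.TEST)
net.blobs["data"].reshape(1, 3, 227, 227)

# Standard CaffeNet preprocessing: HxWxC RGB [0,1] -> CxHxW BGR [0,255].
transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))
transformer.set_raw_scale("data", 255)
transformer.set_channel_swap("data", (2, 1, 0))

image = caffe.io.load_image("cat.jpg")
net.blobs["data"].data[...] = transformer.preprocess("data", image)
probs = net.forward()["prob"][0]
print("predicted class index:", int(np.argmax(probs)))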

 

5. Visual Development Environment

 

9.png

 

Fabrik is an open source application for building, visualizing and training deep learning models. Fabrik provides simple drag-and-drop tools to streamline deep learning development. The application currently supports importing, editing and exporting of Caffe based models. This is a convenient way to view and edit your models.

 

6. Model Conversion Tools

Finally, there are many more projects that have been developed in frameworks other than Caffe. For these projects, there are tools that can convert their models into ones compatible with Caffe. This GitHub project https://github.com/ysh329/deep-learning-model-convertor provides a listing of tools for converting one framework's model into another's.

 

MXNet to Caffe

The code from this GitHub repository allows you to convert an MXNet model to a Caffe model.

 

PyTorch to Caffe

This project allows you to convert between PyTorch, Caffe, and Darknet models.

 

Torch to Caffe

Facebook has a converter (see: https://github.com/facebook/fb-caffe-exts ) that converts Torch models to Caffe.

 

Summary

In this article, we explored the many deep learning projects that you can now run using AMD Radeon Instinct hardware. We have included projects that you can test out with minimal effort. There are other projects that have customized Caffe with elements like new kinds of layers and activation functions; for these, you may need to port the CUDA-specific code using AMD's HIP tooling. Aside from the projects explored here, you can find other projects in the Caffe Model Zoo (see: https://github.com/BVLC/caffe/wiki/Model-Zoo).

 

The smartest companies in the world are migrating their infrastructure to support this new paradigm. Daily, the press continues to report the amazing progress of AI. Furthermore, you hear about firms like Google and Microsoft changing their entire software DNA to move into AI. The reason for this massive migration is Deep Learning.

 

Deep Learning is supporting work by not only providing assistive capabilities, but also by enabling more creative generative capabilities. Assistive capabilities can happen in real time as well as in the backend. There are certain professions where the ability to curate and analyze information is extremely valuable. We can enhance these curation and analysis capabilities by reducing the deluge of information into smaller chunks that are more quickly digestible.

 

Generative capabilities are a new kind of capability that is becoming more pervasive. By now, we've all experienced the mobile app Prisma, which re-renders photographs in the style of different artists.

 

In this article, we highlighted several deep learning projects that explore both the assistive and generative capabilities found in Deep Learning. We also covered a visual development environment and some tools that let you port models from other frameworks. Software that supports Radeon Instinct accelerators is still in its infancy. However, despite being out for just a few months, there are already plenty of interesting applications that can be used as a springboard for developing more complex solutions.

 

 

 

Albert J. De Vera and Carlos E. Perez are Co-Founders at Intuition Machine. They specialize in Deep Learning patterns, methodology and strategy. Many of their other writings on Artificial Intelligence can be found on Medium. Their postings are their own opinions and may not represent AMD's positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

Capture.JPG

[Originally posted on 04/03/17]

 

When a company starts using disruptive technology or a disruptive business model, the results can be spectacular and can leave the competition eating dust.

 

The reason for this is that although the company’s growth seems linear at first, it eventually reveals itself as being exponential. When a company reaches this point, it becomes very difficult, if not impossible, for competitors to catch up.

 

This article explores AMD’s open source deep learning strategy and explains the benefits of AMD’s ROCm initiative to accelerating deep learning development. It asks if AMD’s competitors need to be concerned with the disruptive nature of what AMD is doing.

 

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. One author found it so revolutionary that he described it as “The Last Invention of Man” [KHAT] – strong words indeed!

 

Currently, the revival of interest in all things "Artificial Intelligence" (AI) is primarily due to the spectacular results achieved with deep learning research. I must emphasize, however, that this revival is not due to other classical AI technologies like expert systems, semantic knowledge bases, logic programming or Bayesian systems. Most of classical AI has not changed much, if at all, in the last 5 years. The recent quantum leap has been driven solely by deep learning successes.

 

For some perspective on the extent of deep learning development, look at this graph from Google that shows the adoption of deep learning technology in their applications:

 

deep-learning-google-1-1244x700.png

 

Source: https://www.slideshare.net/HadoopSummit/machine-intelligence-at-google-scale-tensorflow

 

As you can see, the adoption at Google has been exponential and the statistics are likely similar for many of the other big Internet firms like Facebook and Microsoft.

 

When Google embarked on converting their natural language translation software to deep learning, they were surprised to discover major gains. This was best described in a recent article published in the New York Times, "The Great AI Awakening" [LEW]:

 

 

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

 

Let’s pause to recognize what happened at Google.

Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, the old translation system's handcrafted gains accumulated slowly: about a 0.4 improvement in user-perception scores over its entire lifetime of development. Google's first deep learning implementation improved the English-French pair by seven points and lifted user-perception scores by that same 0.4 in one step.

 

In other words, a single implementation delivered as much gain as the old system's entire lifetime of improvements!

 

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

 

Deep Learning is unexpectedly, and disruptively, taking over the world

Google’s founder Sergey Brin, an extremely talented computer scientist himself, stated in a recent World Economic Forum [CHA] discussion that he did not foresee deep learning:

 

“The revolution in deep nets has been very profound, it definitely surprised me, even though I was sitting right there.”

 

Deep learning's progress has been taking the academic community by storm. Two articles by practitioners of classical machine learning summarize why they think DL is taking over the world. Chris Manning, a renowned expert in NLP, writes about the "Deep Learning Tsunami" [MAN]:

 

 

Deep learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse.

 

The same sentiment is expressed by Nicholas Paragios, who works in the field of computer vision. Paragios writes in “Computer Vision Research: the Deep Depression“ [PAR]:

 

It might be simply because deep learning on highly complex, hugely determined in terms of degrees of freedom graphs once endowed with massive amount of annotated data and unthinkable — until very recently — computing power can solve all computer vision problems. If this is the case, well it is simply a matter of time that industry (which seems to be already the case) takes over, research in computer vision becomes a marginal academic objective and the field follows the path of computer graphics (in terms of activity and volume of academic research).

 

Although I don’t want to detail the many deep learning developments of the past several years, Nell Watson provides a quick, short summary when she writes in “Artificial Intuition” [WAT]:

 

To sum up, machine intelligence can do a lot of creative things; it can mash up existing content [SHO], reframe it to fit a new context [PARK], fill in gaps in an appropriate fashion [CON], or generate potential solutions given a range of parameters [AUTO].

 

Make no mistake – Deep Learning is a “Disruptive” technology that is taking over operations of the most advanced technology companies in the world.

 

On Disruptiveness

Of late, the business world has become much more difficult and competitive. This situation has been made worse by disruptive changes in the global economy. The potential of nimbler competitors to disrupt the businesses of incumbents has never been greater. Peter Diamandis describes the Six D’s of Exponentials as consisting of the following:

 

  • Digitization – Anything that can be digitized can ride the same exponential growth we find in computing. Anything digitized or virtualized is unencumbered by physical constraints; it costs less to mass-produce and spreads faster.
  • Deception – Once digitized or virtualized, initial growth deceptively appears linear. Given time, however, exponential growth becomes obvious. For many it is too late to react once a competitor's growth hits this transition.
  • Disruption – New markets that are more effective and less costly are created. Existing markets that are tied to the physical world eventually become extinct. We've seen this in music, photography and many other areas.
  • Demonetization – As cost heads towards zero, so does the ability to charge for the product. A business has to reinvent its revenue model, or come up with new ways of monetization.
  • Dematerialization – Physical products disappear and are replaced by more convenient and accessible alternatives.
  • Democratization – More people now have access to technology at a lower cost. The means of production are no longer confined to the big corporation or the wealthy. We see this everywhere: producers publish their own books, music and videos, and this feeds back on itself as smaller players become able to compete.

 

To survive this disruption, there is an ever-pressing need for enterprises to take drastic action by re-engineering how they run their businesses.

 

John Hagel proposes four kinds of platforms [HAG] that leverage networking effects as an organizational mechanism to combat disruptive technologies. The four platforms that Hagel proposes are Aggregation platforms (example: Marketplaces), Social platforms (example: Social Networks), Mobilization platforms (example: Complex supply chains) and Learning platforms.

 

 

Learning platforms

Learning platforms are dynamic and adaptive environments where people come together to collectively learn how to address complex problems. Members can connect to ask questions, share experiences and offer advice. An open source project that is actively managed with distributed source control, test-driven development, issue tracking, and continuous integration is a good example of a learning platform. The key ingredient is a learning mechanism that gets codified continuously. The fact that we find this in software development should not come as a surprise, as software development is essentially a learning process.

 

John Hagel describes an intriguing property of a Learning platform:

 

What if we change the assumption, though? What if each fax machine acquired more features and functions as it connected with more fax machines? What if its features multiplied at a faster rate as more fax machines joined the network? Now, we’d have a second level of network effect — we’d still have the network effects that come by simply increasing the number of fax machines, but now there’s an additional network effect that accrues as each fax machine adds more and more features as a result of interacting with other fax machines.

 

What Hagel is saying is that the members of the network adaptively become more effective and capable as participants in the learning network. In other words, not only is there the conventional network effect, but another mechanism kicks network effects into overdrive. A learning platform such as an open source community can further accelerate the disruptiveness of an already disruptive technology.

 

Historically, an open source strategy has been quite effective in many disruptive technology areas. On the Internet, open source dominates: Linux (79%) in back-end infrastructure services, Google's Chrome (58%) in browsers, Android (65%) in mobile, and Apache plus Nginx (65%) among web servers. It should not surprise anyone when an open source strategy in the disruptive deep learning space eventually emerges as the dominant platform.

 

There are only a few semiconductor manufacturers that have the economies of scale to be competitive in high-performance computing. These are Nvidia, Intel, AMD, Qualcomm and Xilinx. We will now explore AMD’s deep learning solution and detail their unique open source strategy. We will also look at how it gives the company a competitive advantage.

 

Deep learning as a disruptive technology is critically enabled by hardware. AMD is one of the few semiconductor companies that actually exploits neural networks in its hardware: the SenseMI technology in its CPUs uses "perceptrons" to support branch prediction, and its Infinity Fabric interconnect is an evolution of AMD's HyperTransport technology. AMD's GPU hardware has always been competitive with Nvidia's, and when algorithms are extensively optimized, AMD hardware is in fact favored, as shown by the many cryptocurrency proof-of-work algorithms that run best on AMD GPUs. Raja Koduri, head of AMD Radeon products, recently noted that AMD has delivered more compute per buck since 2005.

 

AMD’s Open Source Deep Learning Stack

Before we get into the details of AMD's deep learning stack, let's look at the philosophy behind the development tooling. AMD, in the unique position of being both a CPU and a GPU vendor, has been promoting the concept of a Heterogeneous System Architecture (HSA) for a number of years. Unlike most development tools from other vendors, AMD's tooling is designed to support both its x86-based CPUs and its GPUs. AMD shares the HSA design and implementations through the HSA Foundation (founded in 2012), a non-profit organization whose members include other CPU vendors such as ARM, Qualcomm and Samsung.

 

The HSA foundation has an informative graphic that illustrates the HSA stack:

 

hsa.png

As you can see, the middleware (i.e. HSA Runtime Infrastructure) provides an abstraction layer between the different kinds of compute devices that reside in a single system. One can think of this as a virtual machine that allows the same program to be run on both a CPU and a GPU.

 

In November 2015, AMD announced the ROCm initiative to support High Performance Computing (HPC) workloads, and to provide an alternative to Nvidia’s CUDA platform. The initiative released an open source 64-bit Linux driver (known as the ROCk Kernel Driver) and an extended (i.e. non-standard) HSA runtime (known as the ROCr Runtime). ROCm also inherits previous HSA innovations such as AQL packets, user-mode queues and context-switching.

 

ROCm also released a C/C++ compiler called the Heterogeneous Compute Compiler (HCC), targeted at HPC applications. HCC is based on the open-source LLVM compiler infrastructure project [WIKI]. Many other open source languages use LLVM; some examples are Ada, C#, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. This rich ecosystem opens the possibility of alternative languages on the ROCm platform. One promising development of this kind is Numba, an LLVM-based JIT compiler for Python.
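As a hypothetical illustration of that possibility, the sketch below writes a trivial GPU kernel with Numba's ROC target (numba.roc). It assumes a Numba build with ROC support and follows Numba's historical ROC examples, so treat the exact API as an assumption rather than a guarantee.

import numpy as np
from numba import roc   # assumes a Numba build with ROC/HSA support

@roc.jit
def vector_add(a, b, out):
    i = roc.get_global_id(0)       # global work-item index
    if i < a.shape[0]:
        out[i] = a[i] + b[i]

n = 1024
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

# Launch 4 work-groups of 256 work-items each (following Numba's ROC examples).
vector_add[4, 256](a, b, out)
print(np.allclose(out, a + b))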

 

Added to the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. HCC supports other parallel programming APIs, but to avoid further confusion, I will not mention them here.

 

The HCC compiler is based on work at the HSA foundation. This allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

 

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.

 

AMD1-211x300.png

AMD took the Caffe framework, with 55,000 lines of optimized CUDA code, and applied its HIP tooling. 99.6% of the code was translated automatically; the remainder took a single developer about a week to complete. Once ported, the HIP code performed as well as the original CUDA version.

 

HIP is not 100% compatible with CUDA, but it does provide a migration path for developers to support an alternative GPU platform. This is great for developers who already have a large CUDA code base.

 

Early this year AMD decided to get even "closer to the metal" by announcing the "Lightning Compiler Initiative." The HCC compiler now supports direct generation of the Radeon GPU instruction set (known as the GCN ISA) instead of HSAIL.

 

As we shall see later, directly targeting native GPU instructions is critical to getting higher performance. All the libraries under ROCm support the GCN ISA.

 

AMD2.png

The diagram depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code, using different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GCN ISA assembler can also be used as a source for the GCN target.

 

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with the HSA diagram above). The application communicates with the ROCr driver, which resides in user space on Linux. The ROCr driver uses a low-latency mechanism (packet-based AQL) to coordinate with the ROCk Kernel Driver.

 

This raises two key points about what is required for high-performance computation:

 

1. The ability to perform work at the assembly language level of a device.

2. The availability of highly optimized libraries.

 

In 2015, Peter Warden wrote, “Why GEMM is at the heart of deep learning” [WAR] about the importance of optimized matrix libraries. BLAS (Basic Linear Algebra Subprograms) are hand-optimized libraries that trace their origins way back to Fortran code. Warden writes:

 

The Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs.
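Concretely, the regular memory-access pattern Warden describes comes from lowering a convolution to one large matrix multiply (the im2col trick). The numpy sketch below is only a rough illustration, with no padding or striding, of why a single GEMM call ends up doing nearly all the work.

import numpy as np

def conv2d_via_gemm(x, w):
    # x: (C, H, W) input, w: (K, C, R, S) filters
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1

    # im2col: unfold every receptive field into a column of a big matrix.
    cols = np.empty((C * R * S, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = x[:, i:i + R, j:j + S].ravel()
            col += 1

    # One large matrix multiply does all the work; this is the GEMM call
    # that hand-optimized BLAS libraries accelerate.
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, out_h, out_w)

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
print(conv2d_via_gemm(x, w).shape)   # (16, 6, 6)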

 

This kind of attention to every detailed memory access is hard to replicate despite our advances in compiler technology. Warden went even further in 2017 when he wrote, “Why Deep learning Needs Assembler Hackers” [WAR2]:

 

I spend a large amount of my time worrying about instruction dependencies and all the other hardware details that we were supposed to be able to escape in the 21st century.

 

Despite being a very recent technology, the software that enables deep learning is a complex stack. Most deep learning frameworks (e.g. TensorFlow, Torch, Caffe) are perceived as open source, but they are built on highly optimized kernels that are often proprietary. Developers can go to great lengths to squeeze every ounce of performance from their hardware.

 

As an example, Scott Gray of Nervana Systems had to reverse engineer Nvidia's instruction set [GRAY] to create an assembler:

 

I basically came to the conclusion that it was not possible to fully utilize the hardware I bought with the tools Nvidia provides. Nvidia, unfortunately, doesn’t believe in eating their own dog food and they hand assemble their library routines, rather than use ptxas like the rest of us have to.

 

Gray used assembly language to write his kernels, creating algorithms that bested the proprietary alternatives. Now imagine how much less work he would have had to do if the assembly language had been available and documented. This is what AMD is bringing to the table.

 

The ROCm initiative provides the handcrafted libraries and assembly language tooling that allow developers to extract every ounce of performance from AMD hardware. This includes rocBLAS [KNOX], an implementation of BLAS that provides the following capabilities at each level:

 

BLAS Level-1:

  • amax, amin, asum, axpy, copy, dot, nrm2, scal, swap

 

BLAS Level-2:

  • gemv

 

BLAS Level-3:

  • gemm, trtri, batched-trtri

 

rocBLAS is implemented from scratch with a HIP interface. AMD has even provided a tool, Tensile, that supports the benchmarking of rocBLAS. AMD also provides an FFT library called rocFFT, likewise written with HIP interfaces.

 

I wonder if Facebook’s fbcunn (Deep learning extensions for CUDA) [GIT], a library that employs FFTs to accelerate convolutions, could be ported using the HIP tooling.

 

Deep learning algorithms continue to evolve at a rapid pace. In the beginning, frameworks exploited the available matrix multiplication libraries. These finely tuned algorithms have been developed over decades. As research continued, newer kinds of algorithms were proposed.

 

Thus came the need to go beyond generic matrix multiplication. Convolutional networks came along and this resulted in even more innovative algorithms. Today, many of these algorithms are crafted by hand using assembly language.

 

Here is a partial list of deep learning specific optimizations that are performed by a proprietary library:

 

 

  • Activation Functions: ReLU, Sigmoid, Tanh, Pooling, Softmax, Log Softmax
  • Higher Order Tensor Operations: Ordering, Striding, Padding, Subregions
  • Forward and Backward Convolutions: 2D, FFT, Tiled, 3×3
  • Small Data Types: FP16, Half2
  • Normalization: Batch, Local Response
  • Recurrent Neural Networks: LSTM

 

These low-level tweaks can lead to remarkable performance improvements. For some operations (e.g. batch normalization), performance increases 14 times compared to a non-optimized solution.
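To see why fusing a memory-bound operation like batch normalization pays off, consider that a naive implementation makes several full passes over the activation tensor, while a fused kernel does the same arithmetic in one pass. The numpy sketch below shows the unfused computation (using batch statistics); it is illustrative only.

import numpy as np

def batchnorm_unfused(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); statistics and scale/shift are per channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # pass 1 over x
    var = x.var(axis=(0, 2, 3), keepdims=True)     # pass 2 over x
    x_hat = (x - mean) / np.sqrt(var + eps)        # passes 3 and 4 over x
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)  # pass 5

x = np.random.rand(8, 16, 32, 32).astype(np.float32)
gamma = np.ones(16, dtype=np.float32)
beta = np.zeros(16, dtype=np.float32)
print(batchnorm_unfused(x, gamma, beta).shape)   # (8, 16, 32, 32)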

 

AMD is set to release a library called MIOpen that includes such handcrafted optimizations. The library includes Radeon GPU-specific optimizations for these operations and will likely cover many of those described above. MIOpen is scheduled for release in the first half of this year, and its release will coincide with ROCm-enabled ports of popular deep learning frameworks such as Caffe, Torch7, and TensorFlow. This will allow application code that uses these frameworks to perform competitively on Radeon GPU hardware.

 

Many other state-of-the-art methods have not yet worked their way into proprietary deep learning libraries; new ones are proposed almost every day as papers are published on arXiv.

Here are just a few:

  • CReLU
  • PReLU
  • Hierarchical Softmax
  • Adaptive Softmax
  • Layer Normalization
  • Weight Normalization
  • Wasserstein Loss
  • Z-Loss

 

It would be very difficult for any vendor to keep up with such a furious pace. Today, given the lack of transparency in development tools, developers are forced to wait when they would rather be doing the coding and optimization themselves. Fortunately, the open source ROCm initiative solves this problem.

 

ROCm includes an open source GCN ISA based assembler and disassembler.

 

System Wide Optimization

At a recent Intel investor meeting, the company shared some statistics:

 

Among servers used for deep learning applications, the chipmaker says that 91% use just Intel Xeon processors to handle the computations, 7% use Xeon processors paired with graphics processing units, while 2% use alternative architectures altogether.

 

The mix will change as the value of deep learning is understood better. The point here is that CPUs will always be required, even if most of the computations are performed by GPUs. That being said, it is important to recognize that system-wide optimizations are equally critical. This is where AMD’s original investments in Heterogeneous System Architecture may pay big dividends. I would however like to point out that new research efforts are underway to optimize the code that is emitted by deep learning frameworks further.

 

Deep learning frameworks like Caffe and TensorFlow have internal computational graphs. These graphs specify the execution order of mathematical operations, similar to a dataflow. The frameworks use the graph to orchestrate execution across groups of CPUs and GPUs. The execution is parallel, which is one reason why GPUs are ideal for this kind of computation. There are, however, plenty of untapped opportunities to improve the orchestration between the CPU and GPU.

 

The current state of Deep Learning frameworks is similar to the state of compilers before the creation of a common code-generation backend like LLVM. In the past, every programming language had its own way of generating machine code. With the development of LLVM, many languages now share the same backend; the frontend only needs to translate source code to an intermediate representation (IR). Deep Learning frameworks will eventually need a similar IR, and the IR for Deep Learning is the computational graph.

 

New research is exploring ways to optimize the computational graph in a way that goes beyond just single device optimization and towards more global multi-device optimization.

 

An example of this is XLA (Accelerated Linear Algebra), a research project from the TensorFlow developers. XLA supports both Just-in-Time (JIT) and Ahead-of-Time (AOT) compilation. It is a high-level optimizer that works by optimizing the interplay of CPUs, GPUs and FPGAs (a usage sketch follows the list below).

 

The optimizations planned include:

 

  • Fusing of pipelined operations
  • Aggressive constant propagation
  • Reduction of storage buffers
  • Fusing of low-level operators
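For reference, TensorFlow 1.x exposed the JIT path through a session option; the sketch below is a minimal illustration of turning it on, with a tiny graph of fusable elementwise ops standing in for a real model.

import numpy as np
import tensorflow as tf

# Turn on XLA's JIT compilation for the whole session (TensorFlow 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# A small chain of elementwise ops that XLA can fuse into a single kernel.
x = tf.placeholder(tf.float32, shape=[None, 1024])
y = tf.nn.relu(x * 2.0 + 1.0)

with tf.Session(config=config) as sess:
    out = sess.run(y, feed_dict={x: np.random.rand(8, 1024)})
    print(out.shape)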

 

There are two other open source projects that are exploring computational graph optimization. NNVM, from the MXNet developers, is another computation graph optimization framework that, like XLA, provides an intermediate representation. The goal is for optimizers to reduce memory use and device allocation while preserving the original computational semantics.

NGraph from Intel is exploring optimizations that include:

 

  • Kernel fusion
  • Buffer allocation
  • Training optimizations
  • Inference optimizations
  • Data layout
  • Distributed training

 

There are certainly plenty of ideas around for how to improve performance.

 

AMD has developed a runtime framework that takes heterogeneous CPU-GPU systems into account. It is called the Asynchronous Task and Memory Interface (ATMI). The ATMI runtime is driven by a declarative description of high-level tasks and then executes the scheduling and memory movement in an optimal manner.

 

ATMI is also open source and can be exploited to drive deep learning computational graphs like the ones found in XLA, NNVM or NGraph. The future of Deep Learning software will revolve around a common computational graph, with optimizations that take the orchestration of the entire system into consideration.

 

Operations and Virtualization

What we have been discussing so far are the opportunities to squeeze as much performance from hardware as possible, but there is more to a complete solution than just raw performance.

 

Every complex system requires good manageability to ensure continued and sustained operations. The ROCm initiative does not overlook this need and provides open source implementations. ROC-smi, ROCm-Docker and ROCm-profiler are three open source projects that provide support for operations.

 

AMD's GPU hardware and drivers have also been designed to support GPU virtualization (see: MxGPU). This permits GPU hardware to be shared by multiple users. I will discuss the operational aspects of AMD's offerings in a future article.

 

Deployment

Throughout this article, we've discussed the promising aspects of the ROCm software stack. When the rubber meets the road, we also need to discuss the kind of hardware the software will run on. There are many different scenarios where it makes sense to deploy deep learning. Contrary to popular belief, not everything needs to reside in the cloud: self-driving cars and universal translation devices need to operate without connectivity.

 

Deep learning also has two primary modes of operation: "training" and "inference". In training mode, you want the biggest, fastest GPUs on the planet, and you want many of them. In inference mode, you still want speed, but the emphasis is on economical power consumption; we don't want to drive our businesses into the ground by paying for expensive power.

 

In summary, you want a variety of hardware that operates in different contexts. That's where AMD is in a good position. AMD has recently announced some pretty impressive hardware geared toward deep learning workloads. The product line is called Radeon Instinct and it consists of several GPU cards: the MI6, MI8, and MI25. The number roughly corresponds to the number of operations the card can crank out: an MI6 can perform roughly 6 trillion floating-point operations per second (6 teraflops).

 

The Radeon Instinct MI6, with a planned 16GB of GDDR5 memory, is a low-cost inference and training solution. The MI8, with 4GB of HBM, is designed primarily for inference workloads. The MI25 is designed for large training workloads and will be based on the soon-to-be-released Vega architecture. Shuttling data back and forth between GPU and CPU is one of the bottlenecks in training deep learning systems, and Vega's unique architecture, capable of addressing 512TB of memory, gives it a distinct advantage here.

 

There’s also a lot more to say about GPU and CPU integration. I’ll briefly mention some points. On the server-side, AMD has partnered with Supermicro and Inventec to come up with some impressive hardware. At the top of the line, the Inventec K888 (dubbed “Falconwitch”) is a 400-teraflop 4U monster. By comparison, the Nvidia flagship DGX-1 3U server can muster a mere 170 teraflops.

 

There is also promise at the embedded device level. AMD already supplies custom CPU-GPU chips for Microsoft's Xbox and Sony's PlayStation. An AMD APU (a CPU with an integrated GPU) can also provide solutions for smaller form-factor devices. The beauty of AMD's strategy is that the same HSA-based architecture is available to the developer in the smallest of footprints as well as in the fastest servers. This breadth of hardware offerings gives deep learning developers a wealth of flexibility in deploying their solutions. Deep learning is progressing at breakneck speed and one can never predict the best way to deploy a solution.

 

Conclusion

Deep learning is a disruptive technology like the Internet and mobile computing that came before. Open source software has been the dominant platform that has enabled these technologies.

 

AMD combines these powerful principles with its open source ROCm initiative. On its own, this definitely has the potential to accelerate deep learning development. ROCm provides a comprehensive set of components that address high performance computing needs, such as tools that are closer to the metal. These include hand-tuned libraries and support for assembly language tooling.

 

Future deep learning software will demand even greater optimizations that span many kinds of computing cores. In my view, AMD’s strategic vision of investing heavily in heterogeneous system architectures gives their platform a distinct edge.

 

AMD’s open source strategy is uniquely positioned to disrupt and take the lead in future deep learning developments.

 

 

Carlos E. Perez is Co-Founder at Intuition Machine. He specializes in Deep Learning patterns, methodology and strategy. Many of his other writings on Artificial Intelligence can be found on Medium. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

Today in San Francisco, California, AMD held a special event where we announced the newest additions to the Radeon Instinct™ family of compute products: the AMD Radeon Instinct™ MI60 and Radeon Instinct™ MI50. In step with the new hardware, the Radeon Open eCosystem (ROCm) has been updated with massive improvements in the device drivers, the compilers and the supporting tools. The low-level math libraries, along with MIOpen, the machine intelligence library, have been optimized to really make deep learning applications sing.

 

ROCm is an open software platform for GPU-enabled HPC computing. It was created with developers in mind to accommodate future technologies including machine learning and artificial intelligence. As an open platform, the ROCm ecosystem provides a rich foundation of modern programming languages, designed to speed development of high-performance, energy-efficient heterogeneous computing systems.

 

We enabled AMD's ROCm-capable GPUs in the Linux ecosystem for easy deployment of deep learning applications across Linux distributions. The amdkfd device driver is now supported in the mainline kernel, and this kernel is picked up by all the major distributions for their standard releases. We now also support the MI60 and MI50, based on the new Vega architecture, in the linux-next repository. For distributions not using the latest kernel, a DKMS build is still a viable option to add support for the MI60 and MI50 GPUs.

 

We have updated the LLVM-based clang compiler to support the new GPU architecture, including the new compute instructions targeted at accelerating machine learning computations. These low-level instructions implement compute operations all the way from single-bit precision to 64-bit floating point. The most beneficial instruction for accelerating deep learning training is a float16 dot product that accumulates into a 32-bit result, maintaining the accuracy of the operation.
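The value of accumulating half-precision products into a 32-bit result is easy to see numerically. The sketch below emulates, in numpy, the difference between keeping the running sum of a dot product in FP16 versus FP32; it is an illustration of the numerics, not of the instruction itself.

import numpy as np

n = 10000
a = np.random.rand(n).astype(np.float16)
b = np.random.rand(n).astype(np.float16)

acc16 = np.float16(0.0)   # running sum kept in half precision
acc32 = np.float32(0.0)   # running sum kept in single precision
for x, y in zip(a, b):
    p = x * y                      # the product itself is computed in FP16
    acc16 = np.float16(acc16 + p)  # FP16 accumulation gradually loses accuracy
    acc32 = acc32 + np.float32(p)  # FP32 accumulation preserves it

reference = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
print("FP16 accumulate:", float(acc16))
print("FP32 accumulate:", float(acc32))
print("FP64 reference: ", reference)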

 

Profiling and debugging tools also required updates to support the new hardware. These tools enable developers to get the most out of the GPU compute cycles and understand where the bottlenecks occur in their applications. Follow the development on our GitHub site.

 

Math libraries were customized with the hardware architecture in mind, resulting in a highly optimized solution. There are many different ways to optimize these math operations, and each specific matrix and convolution size needs to be tuned, so AMD built a tool to help automate the optimization process. This tool is called Tensile, and it is very useful for creating a library for GEMMs, GEMM-like problems (such as batched GEMM), N-dimensional tensor contractions, and anything else that multiplies two multi-dimensional objects together on a GPU. MIOpen also underwent massive optimizations and updates to realize the incredible benefits of the foundational math libraries when integrated with deep learning frameworks.

 

One of the most exciting developments over the past year is the integration and progress with the machine learning frameworks. ROCm has been updated to support the TensorFlow framework API v1.11, and we are actively upstreaming the code into the main repository. Check out the TensorFlow GitHub to follow the updates, or see our GitHub page for PyTorch, Caffe2, Caffe and other framework developments.
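If you are trying one of these builds, a quick way to confirm that the ROCm TensorFlow package can see your GPU is to list the local devices; a minimal sketch for TensorFlow 1.x follows.

import tensorflow as tf
from tensorflow.python.client import device_lib

# Print the TensorFlow version and every device the runtime can see;
# the ROCm GPU should show up with device_type "GPU".
print(tf.__version__)
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)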

 

To try out the newest packages, develop an application and easily deploy a ROCm solution, get the most recent Docker images here; this saves you the time of collecting all the libraries and building them specifically for your platform.

 

We are always looking for skilled developers excited to work in this rapidly changing field. Check out our job listings at amd.com.