Instinct Accelerators Blog


This week at Supercomputing ’20, AMD unveiled our new AMD Instinct™ MI100 accelerator, the world’s fastest HPC GPU accelerator for scientific workloads and the first to surpass the 10 teraflops (FP64) performance barrier. 

 


Learn how AMD EPYC Processors are powering the new HPE Apollo 6500 Gen10 Plus System: the GPU workhorse for HPC and AI workloads.


Ever since Microsoft’s introduction of its new NVv4 instances in Microsoft Azure, a lot of attention has been rightly focused on the underlying technology. And to be sure, the 2nd Gen AMD EPYC processors and AMD Radeon Instinct GPUs underpinning the NVv4 physical platform enable state-of-the-art virtualization solutions for GPU-accelerated workloads in the public cloud.

But while everybody has been busy looking at the "speeds and feeds," I think the experiences and functionality made possible by NVv4 are more interesting to talk about. With that in mind, I thought it would be useful to begin a series of blogs to help answer the questions "What can you do with NVv4?" and "What does that experience look like?" from the perspective of end users: the workers, makers, doers, and creators who will most directly make daily use of the offering.

 

To be clear, GPU acceleration in the cloud is not new; however, NVv4 rewrites the rules in substantial ways. With the arrival of NVv4, GPU acceleration (and virtualization!) in the public cloud is finally coming of age. NVv4 allows for fine-grained provisioning of virtual machines in a golden zone of matched price and performance across the broadest range of requirements for cloud-based Windows 10 desktops, as well as interactive and immersive applications. In short, NVv4 is the multitool of the modern visual cloud.

 

In the past, GPUs in the cloud were limited to a one-GPU-per-user model, and they were expensive. While it was possible to provide GPU support this way, an enterprise had to pay quite a bit to reserve the needed cloud resources. NVv4 is different because it enables one to right-size cloud-based GPU capacity and performance to fit the job. This flexibility makes a wider range of options feasible to better support both office productivity and power users.

 

Why does GPU support matter?

One of the challenges for cloud-based operations has been the fact that many, and frankly most, modern applications require GPU acceleration to run smoothly and effectively. That is true not only of high-end design software. The everyday business productivity applications used by millions of workers need GPU support too, including Microsoft Office staples such as Word, Excel, and PowerPoint, as well as video conferencing.

 

Added to that, many offices have a population of power users who need to work with media, from a bit of light video/photo editing to desktop design. They expect the application experience to be snappy. In fact, application responsiveness is vital for people to stay in their creative flow, whether that is crunching numbers, drafting a document, or designing a newsletter. GPU support is the secret ingredient that can make sure they all enjoy a great user experience, which leads to better productivity and happy workers.

What is a great user experience?

In the context of the Cloud, a great user experience is one that is comparable to a modern local desktop experience. Of course, that starts with basic application responsiveness. But it also refers to must-have capabilities such as support for high-resolution monitors, multiple monitors, and rich multimedia capabilities. Simply put, when the work moves to the Cloud, users should not be asked to make tradeoffs to get there.

 

How does NVv4 deliver?

To support workers from the cloud, NVv4 for Azure uses Windows Virtual Desktop. Since the applications used are not going to demand all the resources of a high-performance GPU, the platform is designed so that a number of people can happily share the resources of a single CPU and a single GPU. This session-based, desktop as a service (DaaS) solution is well suited to supporting large numbers of workers while making efficient use of processing resources and, therefore, budgets.

 

NVv4 would be compelling even if it only migrated existing office workflows to the Cloud. Not only can this approach deliver a great experience to the worker's desktop, but it opens remarkable opportunities for flexibility and mobility. Because users no longer need a high-powered desktop or laptop, NVv4 desktop virtualization, combined with ubiquitous broadband access at home and on the road, means more people can be as productive away from their desks as they have been in the office.

Things are a lot different today than they were even a year ago in terms of options, pricing, and capabilities for virtualized environments for the majority of office work. Ultimately, NVv4 presents an opportunity to revisit and challenge our preconceptions about what is possible for GPU-accelerated DaaS from the Cloud.

Side-by-side comparison of a session-based deployment, with and without GPU enabled

In future blogs, we will dig more deeply into both the user experience and the IT management considerations. In particular, I am excited to explore how NVv4 and its AMD-powered GPU support can solve challenges for workers in different industries such as design, manufacturing, architecture, engineering, construction, finance, and others. The opportunities extend far and wide!

 

Other resources to consider: 

 

Adam Glick is a Technical Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. Use of third-party marks / logos/ products is for informational purposes only and no endorsement of or by AMD is intended or implied.


With the arrival of NVv4 instances for Microsoft Azure, decision-makers in many industries, education and government are asking themselves whether cloud-based virtualized desktops can meet their stringent requirements, both in terms of high productivity and financial feasibility.

 

With the further announcement of NVv4 being successfully tested and recommended by Esri for its flagship ArcGIS Pro application, the answer is now a clear yes. After undergoing rigorous testing and detailed evaluation, IT managers and users have the assurance of reliability they need to take their trusted workstation and desktop working environments to the Cloud. This verification and validation is critical because it provides affirmation that NVv4 has been carefully evaluated by Esri to ensure that it is fully optimized to meet the expectations of Esri users and that they can rely on a fully vendor-supported solution.

 

So what is NVv4?

NVv4 instances for Azure are virtualization solutions that use the power of 2nd generation AMD EPYC processors and Radeon Instinct GPUs from the Cloud. The close, balanced interplay between these resources is the key to making affordable, fully cloud-based desktop environments capable of addressing the computing needs of a wide variety of workers, from those using everyday office productivity applications to full-blown high-performance workstation tools.

 

The Opportunity for Esri Workflows

Complex GIS (Geographic Information System) software such as ArcGIS Pro requires GPU support to deliver a smooth, reliable user experience. However, not all applications or use cases can make use of the capabilities of a complete server GPU. In the past, this has been a limiting factor to mass adoption, as the only available option was to dedicate an entire server GPU in Azure (16GB) to each user's VM. This was an inefficient and costly approach. While the most demanding visualization power users, data analysts, or geophysicists may very well require an NVv4 option with a full, dedicated GPU to support their workflow, a desktop user viewing and modifying data may only require one-eighth of a GPU (2GB) to have a great experience from the Cloud.

 

One of the significant innovations found in NVv4 is fractional GPU capability. Made possible by AMD’s implementation of SR-IOV technology in its AMD Radeon Instinct GPUs, fractional GPU means that individual AMD GPUs in Azure can be shared among multiple users. With NVv4, each individual user enjoys an experience comparable to that which they would expect from a locally installed GPU, even when the GPU they access is shared among multiple users. Hardware resources are physically isolated, separating each VM from others even when a GPU is shared, which helps ensure security within the environment. Optimizations resulting from the collaborative effort of Microsoft, Esri, and AMD further underpin the powerful experience for the user.

 

Further Information

With demand for validated remote and home working solutions rising, Esri has released a number of resources documenting its support for the NVv4 instances, including a detailed whitepaper, "ArcGIS Pro Virtualization," and a collection of resources targeted at higher education, covering architectures to support remote working, online classes and labs, and on-campus use: "Virtualization of ArcGIS from the Cloud and On-Premise Platforms to Support Higher Education."

 

Esri has also released a detailed guide to the performance, functionality, and benchmarking tests it performed on NVv4, alongside resource-planning advice to aid those choosing between NVv4 options for specific use cases; see "ArcGIS Pro on the Azure NVv4-series."

 

Esri testing and endorsement may rewrite the rules that dictate where and how people work. Even the most demanding application requirements can be addressed from wherever the user is located and using whatever device is available to them. One need no longer be shackled to high-performance workstations: engineers, geologists, data analysts and data visualization experts can access their Esri tools whenever and wherever work or life takes them.

 

For more resources:

  • NVv4 Microsoft GA blog: Link
  • NVv4 pricing: Link
  • AMD.com/Nvv4: Link
  • NVv4 for Education: Link
  • NVv4 for Design and Manufacturing: Link
  • NVv4 for Architecture, Engineering and Construction (AEC): Link
  • ESRI NVv4 blog: Link
  • ESRI in higher education: Link

 

George Watkins is a Product Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

 


With the arrival of NVv4 instances for Microsoft Azure, decision-makers in many industries and universities are asking themselves whether cloud-based virtualized desktops can meet their stringent requirements, both in terms of high productivity and financial feasibility.

 

With the further announcement of NVv4 certification by Autodesk for its AutoCAD, Revit, and Inventor applications, the answer is now a clear yes. After undergoing rigorous testing and detailed evaluation, IT managers and users have the assurance of reliability, and the support from Autodesk, they need to take their trusted workstation and desktop working environments to the Cloud. This certification is critical because it provides affirmation that NVv4 has been carefully evaluated by Autodesk and is fully supported, meeting the expectations of Autodesk's 3D CAD, AEC, and VFX users.

 

So what is NVv4?

NVv4 instances for Azure are virtualization solutions that use the power of 2nd generation AMD EPYC processors and Radeon Instinct GPUs from the Cloud. The close, balanced interplay between these resources is the key to making cost-effective, fully cloud-based desktop environments capable of addressing the computing needs of a wide variety of workers, from those using everyday office productivity applications to full-blown high-performance workstation tools.

 

The Opportunity for Autodesk Workflows

The AutoCAD, Revit and Inventor applications from Autodesk require GPU support to deliver a smooth, reliable user experience. However, not all applications or use cases make use of the capabilities of a complete server GPU. In the past, this has been a limiting factor as the only available option was to dedicate an entire server GPU (often 16GB per user) in Azure to each user’s VM. This was an inefficient and costly approach, limiting server density. While the most demanding design visualization power users may very well require a full, dedicated GPU to support their workflow, a desktop user preparing technical publications may only require one-eighth of a GPU to have a great experience from the Cloud. 

 

One of the significant innovations found in NVv4 is fractional GPU capability. Made possible by AMD's implementation of SR-IOV technology in its Radeon Instinct GPUs, fractional GPU means that individual AMD GPUs in Azure NVv4 instances can be shared among multiple users. With NVv4, each individual user can enjoy an experience comparable to what they would expect from a locally installed professional-grade GPU, including professional drivers, even when the GPU they access is shared among multiple users. Hardware resources are physically isolated, separating each VM from others even when a GPU is shared, which helps ensure security within the environment. Optimizations resulting from the collaborative effort of Microsoft, Autodesk, and AMD further underpin the powerful experience for the user.

 

Autodesk certification means the offer of full vendor support, regardless of the user's location and the device being used. One need no longer be shackled to high-performance workstations: architects, designers, engineers, and visual effects (VFX) experts can access their Autodesk tools whenever and wherever work or life takes them.

The certifications for 2019 and 2020 versions of AutoCAD, Revit and Inventor can be found here and are listed as "Radeon Instinct MI25 MxGPU". 2021 certifications are coming soon.

 

For more resources:

  • NVv4 Microsoft GA blog: Link
  • NVv4 pricing: Link
  • AMD.com/Nvv4: Link
  • NVv4 for Education: Link
  • NVv4 for Design and Manufacturing: Link
  • NVv4 for Architecture, Engineering and Construction (AEC): Link

 

George Watkins is a Product Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.


The Oil and Gas sector places tremendous demand upon IT infrastructure. Extraction, mining, and drilling projects may cost billions, span multiple years, and are often geographically distributed. With the arrival of Microsoft Azure’s latest GPU-enabled NVv4 instances, oil and gas companies now have a new virtual desktop option that offers significant potential benefits to their workflows, productivity, and IT costs when considering how to address the breadth of their IT requirements.

At the high end, these companies rely on some of the most demanding workloads in existence, processing massive datasets with 2D and 3D simulation and modelling software in order to plan and manage vast engineering sites, rigs, and construction projects. Combining AMD Radeon Instinct™ MI25 GPUs with up to 16GB of dedicated memory and 64-core AMD EPYC™ 7742 CPUs, NVv4 instances in Azure deliver virtual machines capable of reviewing, processing, and analyzing large datasets while delivering workstation-class experiences from the Cloud.

NVv4 is a compelling new virtual desktop and workstation solution that enables geologists and engineers to prototype, scale, and adapt rapidly without the usual risks of long-term commitment or project changes that may render hardware and infrastructure decisions invalid. 

Oil and gas companies also rely on huge numbers of people using office applications, collaboration and communication software (such as Microsoft Teams, Jabber, Hangouts), PLM, SAP, and CRM systems. These applications require a small, but critical, amount of GPU processing to deliver a modern user experience. NVv4 fractional GPU capability makes it possible to support these use cases using virtual desktops, partitioning the GPU resources to satisfy performance, mobility, security, and budget requirements while addressing IT management and security requirements.

Let’s explore in detail some of the features and benefits NVv4 offers:

Secure Remote Access
Migrating workstation-class workloads and user access to the Cloud ensures data is centralized, managed, and secured. Application, graphics, and compute processing all take place in the data center. Users receive only a stream of display pixels, protecting against scenarios such as losing a laptop loaded with sensitive data, data loss caused by the failure of local workstations, or viruses. Azure portal-based access removes the need for VPNs and other security measures that can be compromised on an unmanaged endpoint.

Now, key user groups like geologists, engineers, and project managers can access files at the office, on-site, or while at home or traveling. Geographically dispersed teams can collaborate on files confident that data is protected and that they're all working on a single master file. Significantly, the NVv4 portfolio offers a range of performance options that can deliver rich, modern desktop computing experiences to nearly any internet-connected device, including tablets, mobile phones, and PCs. Key user groups can access and work with data while in the harshest physical environments or most sensitive political regions, all while the data remains secure in the data center.

The NVv4 instance is fully supported by Windows® Virtual Desktop, Citrix® Cloud, and Teradici® Cloud Access. This broad support gives IT managers the ability to choose their preferred remote protocols, management, and admin tools. This flexibility helps to mitigate the challenges of moving from a private data center to Microsoft Azure by enabling IT managers to work with familiar, preferred solutions and tools.

Shift from CAPEX to OPEX to Manage Costs
With new oil and gas projects already costing tens of billions of dollars, it is important that their associated IT operations deliver infrastructure requirements while being efficient and flexible. By shifting to a Desktop-as-a-Service (DaaS) deployment, a third-party provider like Microsoft Azure provides the IT infrastructure, tests and helps provision and deploy resources for the customer while managing all the necessary hardware in the cloud as-a-service. This makes it possible for IT operations to switch from a rigid CAPEX spending model, requiring the purchase of server hardware, to a more flexible OPEX model, renting cloud-based services on a monthly basis and adjusting as dictated by needs.

Scalability and Rapid Project Initiation

Faced with managing multiple, simultaneous projects distributed around the world, IT departments need effective and reliable tools to scale and deploy infrastructure across different settings. Azure facilitates remote troubleshooting, application updates, and delivery of security patches throughout a project’s lifecycle. Rapid scaling and management of IT resources accelerates production schedules, ensures productivity is enhanced from day one, and makes it possible to eliminate ongoing costs when projects are complete.  

Azure Guaranteed High-Availability to Reduce Costly Downtime
The scale of investment in oil and gas projects means that downtime and delay can quickly accumulate into millions lost. A virtual IT infrastructure provides redundancy, stability, and flexibility that protects against the unforeseen, from minor disruptions to significant man-made or natural occurrences. Vital data resources and applications can remain online and accessible to staff, who can continue to work remotely and securely. Azure's Service Level Agreements (SLAs) for virtual machines (VMs) typically guarantee in excess of 99.9% availability, offering organizations assurance of high availability. More information on SLA: click here
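To put that availability figure in perspective, here is a small, illustrative C++ sketch that converts an SLA percentage into an approximate downtime budget. The 99.9% value comes from the paragraph above; the hours-per-month constant is simple calendar arithmetic, not a statement of Azure's contractual terms.

```cpp
// Illustrative arithmetic: convert an availability SLA into a rough downtime budget.
#include <cstdio>

int main() {
    const double availability = 0.999;      // 99.9% monthly SLA (from the text above)
    const double hours_per_month = 730.0;   // ~365.25 days * 24 hours / 12 months
    const double downtime_hours = (1.0 - availability) * hours_per_month;
    std::printf("99.9%% availability allows roughly %.1f minutes of downtime per month\n",
                downtime_hours * 60.0);
    return 0;
}
```

Run as written, it reports roughly 44 minutes of allowable downtime per month for a 99.9% monthly SLA.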

 

Secure Flexibility through SR-IOV
Microsoft Azure's fractional GPU capability (GPU-P), built on the Single Root I/O Virtualization (SR-IOV) standard and unique to AMD-powered NVv4 instances, enhances security when GPU resources are shared among multiple users in a public environment. This cloud-native, SR-IOV-based virtualization provides improved security compared to software-based GPU virtualization approaches, as it enables isolation of PCIe® hardware resources, helping prevent unauthorized access to the data of one VM by users of other VMs sharing the GPU.

 

License Management
The high cost of specialized software can be a barrier to increasing the number of Geologists/Geophysicists assigned to a project. Leveraging DaaS allows for concurrently licensed software to be brokered and rationalized, e.g., in scenarios where some analysts only require access occasionally. IT managers can maintain greater overall awareness of a distributed environment that may include offices and staff around the world. With greater visibility, IT administrators can better optimize usage of costly software licenses, manage costs, and widen access to precious licenses.

 

NVv4 GPU Options: Optimized Resourcing
Analysts, geologists, and engineers rely on a range of 3D and graphical applications to perform complex analysis, including Schlumberger Petrel E&P and INTERSECT; Halliburton DecisionSpace and Nexus; CGG GeoSoftware; Ansys Fluent; Autodesk AutoCAD; Dassault Systèmes SOLIDWORKS and CATIA; Siemens NX and Teamcenter; Esri ArcGIS; and Spatial Energy Petra. The requirements of individual users can vary considerably. Some only view specific datasets or lighter-weight 2D CAD models, while others may undertake GPU-intensive CFD simulations. The range of GPU sizes offered in the NVv4 series provides an opportunity for cost savings by enabling IT managers to adjust VMs to fit the needs of different workloads, upsizing or downsizing resources to match users' real production workloads.

Industry Certification and Professional Graphics Drivers Included
The AMD EPYC™ 7002 Series processors have robust compatibility with virtually all software available in the market today. AMD works with the open source community and major software vendors to help ensure key industry applications and enabling software work exceptionally well with AMD EPYC™ processors. All AMD-supported Azure instances include professional GPU drivers with no licensing cost. ISV certifications and optimizations for professional industry visualization applications, including Esri, help to assure a reliable, productive user experience.

Matching NVv4 to Requirements:

  • Geologists, geophysicists, and reservoir engineers: remotely viewing and editing massive datasets and complex 2D/3D images. Recommended: Standard_NV32as_v4
  • Drilling engineers and CAD/CAE users: remotely viewing and editing 2D and 3D mechanical images. Recommended: Standard_NV8as_v4 or Standard_NV16as_v4
  • Supporting staff, e.g., accounting, marketing, human resources: general-purpose Windows 10 virtual desktops and office productivity applications. Recommended: Standard_NV4as_v4 or ...
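As a rough illustration of the right-sizing idea behind the list above, the sketch below picks the smallest NVv4 option whose GPU memory covers an estimated requirement. The 2GB/4GB/8GB/16GB mapping is an assumption drawn from the fractional-GPU sizes discussed in these posts and Microsoft's published NVv4 documentation; treat it as illustrative, not as sizing guidance.

```cpp
// Illustrative sizing helper: pick the smallest NVv4 option whose GPU memory
// covers a workload's estimated frame-buffer requirement.
// The GB-per-size mapping is an assumption based on published NVv4 specs,
// not an AMD or Azure API.
#include <array>
#include <iostream>
#include <string>

struct Nvv4Option {
    std::string name;
    int gpu_memory_gb;   // portion of the Radeon Instinct MI25's 16GB
};

std::string recommend_nvv4(int required_gpu_gb) {
    static const std::array<Nvv4Option, 4> options = {{
        {"Standard_NV4as_v4",  2},   // 1/8 GPU
        {"Standard_NV8as_v4",  4},   // 1/4 GPU
        {"Standard_NV16as_v4", 8},   // 1/2 GPU
        {"Standard_NV32as_v4", 16},  // full GPU
    }};
    for (const auto& opt : options)
        if (opt.gpu_memory_gb >= required_gpu_gb) return opt.name;
    return "Workload exceeds a single NVv4 GPU; consider multiple VMs";
}

int main() {
    std::cout << "Office productivity (2GB):  " << recommend_nvv4(2)  << "\n";
    std::cout << "2D/3D CAD review (6GB):     " << recommend_nvv4(6)  << "\n";
    std::cout << "Large seismic model (16GB): " << recommend_nvv4(16) << "\n";
    return 0;
}
```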

 

Other resources to consider: 

 

George Watkins is a Product Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. Use of third party marks / logos/ products is for informational purposes only and no endorsement of or by AMD is intended or implied.


Welcome, developers, to the first in a series of blogs about AMD ROCm. I'm Terry Deem, Product Manager for ROCm. In these blogs, I will let you know about upcoming new releases, features, training, and case studies surrounding ROCm. The ROCm SDK is a set of tools, libraries, and APIs for developing HPC applications that use GPUs for computing. You can learn more about ROCm with the introduction video located here.

After watching the introduction video, you might want to know more about HIP. HIP is the API used to develop your application to run on either an AMD or an NVIDIA GPU. This powerful API makes it possible, with minimal effort, to compile the same source code for both AMD and NVIDIA GPUs. If your application is already written in CUDA and you want to expand it to work on AMD GPUs, use the HIPIFY tool, which automatically converts the source from CUDA to HIP.
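To give a feel for what HIP code looks like, here is a minimal, hedged vector-add sketch; the kernel, problem size, and launch configuration are my own illustrative choices rather than an official ROCm sample. It should build with hipcc on a ROCm system, and the same source can also target NVIDIA GPUs through HIP's CUDA backend.

```cpp
// Minimal HIP example: add two vectors on the GPU.
// Illustrative sketch only; build with something like `hipcc vector_add.cpp -o vector_add`.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

    float *da = nullptr, *db = nullptr, *dc = nullptr;
    hipMalloc((void**)&da, n * sizeof(float));
    hipMalloc((void**)&db, n * sizeof(float));
    hipMalloc((void**)&dc, n * sizeof(float));

    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    hipLaunchKernelGGL(vector_add, dim3(blocks), dim3(threads), 0, 0, da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", hc[0]);

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```

Because the HIP calls mirror their CUDA counterparts, this is also roughly what HIPIFY would produce from the equivalent CUDA source.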

In this blog, I am happy to announce our first set of on-demand videos on ROCm technology. You can find them below. In these videos you will learn about AMD GPUs and how to develop applications that harness their compute power. You will learn how the GPU works, how threading works on it, and how to write your programs using the HIP API in the ROCm SDK.

 

ROCm Video Series 

1) Introduction to AMD GPU Hardware: Link 

2) GPU Programming Concepts Part 1 - Porting with HIP: Link 

3) GPU Programming Concepts Part 2 - Device Management, Synchronization and MPI Programming: Link 

4) GPU Programming Concepts Part 3 - Device Code, Shared Memory and Thread Synchronization: Link 

5) GPU Programming Software - Compilers, Libraries and Tools: Link 

6) Porting CUDA to HIP: Link 

 

ROCm and HIP are foundational to the applications that will run on the two exascale systems that were recently announced, Frontier and El Capitan. You can learn more about ROCm on our documentation site located here. We are excited to see what you can do with HIP and look forward to hearing from you.

 

Resources:  

 

Terry Deem is a Sr. Product Manager for ROCm at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. 


With all the excitement around the general availability of Microsoft's Azure NVv4 instances, I wanted to reshare this MxGPU white paper that AMD's Tonny Wong created when we first launched the SR-IOV-based GPU virtualization architecture. This is a great paper for anyone wanting to understand and learn more about the underlying technology within our GPU architecture. (Note: we have made a few updates to the paper below to keep it current.)

 

White Paper - AMD MULTIUSER GPU

Originally created by Tonny Wong, Radeon Technologies Group

 

 

SR-IOV-BASED GPU VIRTUALIZATION FOR A TRUE WORKSTATION EXPERIENCE

 

 

Overview

Virtual Desktop Infrastructure (VDI) has evolved over the last few years, enabling richer user experiences and improved manageability and ease of deployment. Many traditional VDI enterprise customers have gained productivity and lowered Total Cost of Ownership (TCO) for their desktop users. The continued growth of VDI depends on addressing the needs of "greenfield" users: those organizations that want the benefits of secure hosted desktops but with a deployment model that is more consistent with their traditional desk-side workstations. These deployments need to abide by existing datacenter standards for hypervisors while leveraging capabilities that match traditional workstations.

 

The Trend Toward VDI

Remote graphics protocols have greatly improved user experiences, delivering the feel of a local workstation computing resource for LAN users and optimizing multimedia and graphics capabilities for WAN users. These remote protocols can deliver GPU-rendered content from the datacenter, allowing virtual machines with standard desktop OSs to be the main deployment method for users of all types. From demanding workstation applications with high 3D GPU needs all the way to standard enterprise desktop users who want GPU-enriched desktop experiences, this range of users can take advantage of a vast array of VDI solutions now in the market.

 

VDI is a great way to help improve desktop security by hosting out of an enterprise private cloud (on-premises datacenter) or via offerings from cloud service providers, either fully public or via hybrid public/private clouds. However, the capabilities should match what users expect from their local workstation systems and not be limited to a subset of features. Enterprise VDI deployments should have access to GPU resources in the datacenter or at the service provider that deliver 3D capabilities across many users while still making all graphics and compute API standards available, just like on local workstation systems.

 

What AMD GPUs Bring to the Virtual Desktop

GPU technology for VDI allows users migrating from physical workstation desktop systems or notebooks to capture the same or better graphics capabilities as their desktop workstation, with good productivity, while enabling more user types to migrate to VDI. In supporting this migration to VDI, GPU vendors need to ensure that, when enabling a GPU for virtualization across many users, the GPU delivers deterministic performance, helping to better gauge user types and the number of GPU resources needed.

 

AMD has spent the last few years implementing features in our GPU hardware to prepare for virtualized platforms.  Implementation in our silicon allows our new AMD Multiuser GPU technology to share the GPU resource across multiple users or virtual machines while giving the expanded capabilities users expect from local workstations utilizing discrete GPUs. The AMD Multiuser GPU products can provide enterprise customers with a choice for their GPU and 3D processing needs that can help make GPU use more pervasive on VDI deployments.

 

VDI with GPUs: Lifting Performance and User Experience

With Virtual Desktop Infrastructure (VDI), one can gain the benefits of security, manageability, and remote access to deploy and support enterprise desktop users, and may additionally experience lower total cost of ownership (TCO). For knowledge-worker and task-worker user types, VDI deployments help apply better control of user environments while enabling increased performance by virtue of virtual machines being closer to datacenter-hosted datasets and applications.

 

Users who required higher computing power, specifically GPU technology for 3D and GPU compute applications, were either left on physical desktop systems or deployed with comparatively expensive pass-through GPU technology, losing the benefit of distributing the graphics card cost among multiple users. Early virtualized GPU technologies addressed some of these areas by adapting a standard GPU architecture to virtualization via software in the hypervisor, but this is not the ideal way to mimic true discrete-GPU-like performance. Features like GPU compute functionality are not available, forcing some applications to fall back to the CPU where a desktop workstation would have leveraged a GPU. Initial pricing for these virtualized GPU solutions was compelling compared to multiple pass-through GPU devices, but they can still cost much more than multiple desktop discrete GPUs.

Standard VDI technologies utilize software-emulated GPUs, specifically in VMware vSphere with Horizon View, where the base-level graphics capabilities are limited. This works fine for knowledge workers, where enabling software 3D emulation with Virtual Shared Graphics Acceleration (vSGA) allows basic applications to run, albeit with higher CPU utilization. vSGA performance is further enhanced by leveraging a hardware GPU with appropriate vSGA drivers from graphics vendors. Even with hardware vSGA support, however, it does not necessarily meet the requirements of more intensive 3D graphics and compute users. Certifications for applications (CAD/CAE, for example) are not available due to the limited support level in graphics APIs like OpenGL® or DirectX®.

 

Virtualized GPUs allow workstation and power-user categories to migrate to VDI with acceptable GPU performance. Workstation users from CAD/CAE, M&E, and specialized segments can leverage workstation-class drivers on applicable platforms to support applications with certification requirements. Power users who rely on desktop publishing (DTP) or internal enterprise applications that need GPU support can migrate to VDI environments.

 

AMD Multiuser GPU – Technology Foundation

Rather than repurposing an existing GPU and adding a software layer to accommodate virtualization requirements, AMD's Multiuser GPU approach is to create an entirely new class of GPU architecture with virtualization capabilities built into the silicon. AMD challenged the notion that support for GPU virtualization required a proprietary software solution. Compliant with the well-established PCIe® virtualization standard, the SR-IOV (Single Root I/O Virtualization) specification, AMD has implemented a hardware-based GPU architecture. The culmination of these efforts resulted in the creation of the industry's first hardware-virtualized GPU.

 

The SR-IOV specification defines a virtualized PCIe device that exposes one or more physical functions (PFs) plus a number of virtual functions (VFs) on the PCIe bus. The specification also defines a standard method for the virtual functions to be enabled by system software such as the hypervisor or its delegate. These VFs may inherit the same graphics capabilities of the physical GPU, allowing each to become fully capable of supporting the GPU's graphics functionality. Through the PF, system software controls enablement and access permissions of the VFs, internally mapping resources such as the graphics cores and GPU local memory.
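As a rough illustration of how system software sees an SR-IOV device, the sketch below walks the generic Linux sysfs interface and reports PCIe physical functions that expose virtual functions. This is a hedged, host-side example using standard kernel interfaces; it is not part of AMD's MxGPU tooling or the hypervisor integration described here.

```cpp
// Illustrative sketch: list PCIe physical functions that expose SR-IOV VFs
// via the generic Linux sysfs interface (run on a host/hypervisor with suitable permissions).
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

static std::string read_value(const fs::path& p) {
    std::ifstream in(p);
    std::string v;
    std::getline(in, v);
    return v;
}

int main() {
    const fs::path pci_root{"/sys/bus/pci/devices"};
    for (const auto& dev : fs::directory_iterator(pci_root)) {
        const fs::path total = dev.path() / "sriov_totalvfs";   // present only on SR-IOV PFs
        if (!fs::exists(total)) continue;

        const std::string vendor = read_value(dev.path() / "vendor");       // 0x1002 is AMD
        const std::string numvfs = read_value(dev.path() / "sriov_numvfs"); // currently enabled VFs
        std::cout << dev.path().filename().string()
                  << " vendor=" << vendor
                  << " enabled VFs=" << numvfs
                  << " max VFs=" << read_value(total) << "\n";
        // Enabling VFs is done by writing a count to sriov_numvfs; the number of VFs
        // actually supported depends on the GPU and driver.
    }
    return 0;
}
```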

 

The task of GPU virtualization management can therefore leverage the existing standard PCIE device management logic in the hypervisor, unburdening the hypervisor from proprietary and complex software implementations. To further simplify the deployment, an optional driver can be loaded to help the hypervisor to enable/disable virtual functions and to manage the Multiuser GPU’s resources.

 

The PF manages sharing of graphical resources by scheduling the GPU cores across VFs and allocating graphics memory to each of these VFs. The PF also assigns internal register spaces to each VF ensuring an orderly and structured method for the VFs to access hardware resources and data, at the same time helping keep that data secure. Because each GPU VF is designed to inherit the attributes of the physical GPU, it supports full GPU capabilities allowing the support of graphics and compute features.

 

When these VFs are passed through to their assigned virtual machines, they appear as full-featured graphical devices to the virtual machine's guest OS. Since the guest OS sees the VF as a native graphics device, AMD's native Radeon™ Pro graphics drivers, which are designed for professional graphics devices, can be loaded within the virtual machine to unlock the GPU's graphics and compute capabilities.

 

A number of Radeon Pro graphics products already support passthrough mode, allowing remote users to access a GPU installed on a host server from a client device. AMD Multiuser GPUs evolved this architecture to support from 1 to 16 VFs, allowing each to appear as a passthrough device with added security and quality of service. Mapping one VF to a virtual machine allows the creation of up to 16 independent guest OSs accelerated by a single GPU. User density is limited only by the availability of PCIe slots.

Key Benefits

Predictable Performance

A key benefit of hardware-based virtualization is that hardware-controlled scheduling cycles deliver predictable quality of service (QoS). The fixed scheduling cycles apportioned to each VF ensure that each VF receives its fair share of GPU services.
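To illustrate why fixed scheduling cycles yield predictable performance, here is a deliberately simplified toy model in C++: it divides a scheduling period evenly across however many VFs are enabled. It is a conceptual sketch only, with an arbitrary example period, and does not describe the actual behavior or parameters of AMD's hardware scheduler.

```cpp
// Toy model only: even time-slicing of a scheduling period across enabled VFs.
// A conceptual illustration, not the behavior of AMD's hardware scheduler.
#include <cstdio>

int main() {
    const double period_ms = 16.0;                 // arbitrary example period
    for (int enabled_vfs : {1, 2, 4, 8, 16}) {
        const double slice_ms = period_ms / enabled_vfs;
        // Each VF's share depends only on how many VFs are enabled,
        // not on how busy its neighbors are, hence the deterministic QoS.
        std::printf("%2d VFs enabled -> each VF gets %.2f ms of every %.0f ms period\n",
                    enabled_vfs, slice_ms, period_ms);
    }
    return 0;
}
```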

  

Predictable performance or deterministic QoS results in smooth transitions from proof-of-concept pilots to organization-wide deployments. Pilot managers determine the capabilities of the GPU during the proof-of-concept phase and scale up or scale down user density (number of users per GPU) as required. 

Being able to determine the GPU needs of the user base ties back to an organization’s ability to forecast and plan its resources. Under-forecasting results in failing to meet users’ performance expectations; over-forecasting results in under-utilizing a configuration. The predictable nature of AMD’s Multiuser GPU solution helps avoid these unwanted outcomes.

 

Secure Implementation

The push towards virtualization is in part driven by the needs of centralizing and securing data and resources. The cornerstone of AMD’s Multiuser GPU technology is its ability to preserve the data integrity of virtualized desktops and their application data. The hardware-enforced memory isolation logic provides strong data security among the VFs, which helps prevent one VM from being able to access another VM’s data.

  

With security being a bare minimum requirement for any virtualization solution, AMD’s hardware-based virtualized GPU solution offers a strong deterrent to unauthorized users who traverse the software or application layers seeking means to extract or corrupt GPU user data from the virtual machines. Although a VF can access full GPU capabilities at its own GPU partition, it does not have access to the dedicated local memory of its sibling VFs.

 

Uncompromising Support for APIs and Features

The AMD Multiuser GPU technology exposes all graphics functionality of the GPU to the VF at its partition allowing for not only full support for graphics APIs like DirectX and OpenGL but also GPU compute APIs like OpenCL™.  Code written in these standards for the physical device need not be adapted or altered to function in the virtual environment. AMD is the first GPU vendor to support hardware-based native GPU compute features within the virtual environment. Since VFs are allowed access to all of the GPU’s rendering resources during their respective time slices, the need to perform post-processing operations to partition data or tasks is not necessary.
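To show what that looks like from inside a guest, the sketch below uses the standard OpenCL C API to enumerate the visible GPU devices; on a VM backed by a Multiuser GPU virtual function, the VF would be expected to appear here just like a locally installed GPU. This is a generic, hedged OpenCL query rather than AMD-specific code.

```cpp
// Illustrative sketch: enumerate OpenCL GPU devices visible to the guest OS.
// Link against the OpenCL ICD loader (e.g. -lOpenCL).
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, 0, nullptr, &num_devices) != CL_SUCCESS)
            continue;
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, num_devices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char name[256] = {0};
            cl_ulong mem = 0;
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(mem), &mem, nullptr);
            // Inside an SR-IOV VM, the reported memory is the portion assigned to this VF.
            printf("GPU: %s, global memory: %llu MiB\n",
                   name, static_cast<unsigned long long>(mem >> 20));
        }
    }
    return 0;
}
```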

  

AMD operates on the principle of creating customer-centric designs, offering useful features and allowing customers to build usages around these features. Limits are added to control quality, not to constrain utility. Radeon Pro professional graphics, AMD’s workstation brand of graphics products, can drive up to six displays per GPU as a standard offering on select AMD Radeon Pro W-series products. Because the Multiuser GPU resides among the FirePro brand of products, the ability to drive up to six displays is an inherent feature. Multiuser GPU products extend this feature by allowing each VF to drive up to six displays within the virtual machine (note that this may be dependent on the remoting protocol and client being used).

 

 

Conclusion

The desire to share storage and network resources sparked innovation in technologies for these devices. The need to centralize all these resources and to secure them in a remote datacenter continues to drive the migration to virtualization. GPU virtualization is a relatively late participant in this migration, with early proprietary software-based solutions offering limited GPU capabilities. To become ubiquitous, GPU virtualization technology has to be transparent and standardized, giving users near-desktop experiences without their being aware that they are in a virtualized environment.

 

AMD Multiuser GPUs push GPU virtualization closer to complete transparency and ubiquity by innovating with a hardware-based solution that conforms to the virtualization industry standard, making it easy to adopt and integrate into existing hypervisor ecosystems.


The financial services industry is no stranger to virtualization, having already come to appreciate the advantages it offers for satisfying important IT requirements such as centralized data security, enhanced mobility, and improved disaster recovery capability. The advent of Microsoft's new NVv4 instances for Microsoft Azure, with fractional GPU capability, now makes it feasible to expand the use cases, practicality, and opportunities for using virtual machines (VMs) to support finance operations.

 

The Virtualization Challenge

One of the barriers to broad adoption of virtualization across many more essential financial applications has been the fact that most widely used software solutions, such as trading consoles and visual analytics workstations, require GPU support to ensure responsive interactivity under real-time demands. Prior to NVv4, this was only possible by providing each user's computer or workstation with access to a full, dedicated GPU in the data center. This was highly inefficient, as many applications really only require a small, but nonetheless critical, amount of GPU processing to deliver a great user experience. Thus, the approach was expensive on a per-user basis and did not sufficiently reduce the maintenance burden on IT departments. The need to offer the highest level of security for these environments has further complicated the switch to virtualized topologies.

 

NVv4 Changes the Virtualization Equation

Azure NVv4 instances powered by AMD 2nd Gen EPYC™ processors and AMD Radeon Instinct™ GPUs tackle these challenges. Financial services organizations can deploy cost-effective, fully cloud-based desktop environments that meet the performance, flexibility, security, and cost requirements of their critical applications. NVv4 also addresses the management requirements and security standards demanded by IT management and corporate governance. Specific benefits include:

  • AMD’s SR-IOV technologies enable IT managers to deliver the right amount of GPU service to individual desktops and workstations based on application needs while sharing a high-powered GPU among multiple users.
  • Four AMD powered NVv4 options make it possible to provide configurations that align with the particular computing workloads of different users.
  • Azure VMs keep data under control because data never leaves the datacenter; only pixel information is sent to the device.
  • With AMD’ SR-IOV-based GPU virtualization architecture, each virtual desktop is physically isolated, even when a single GPU is shared by multiple users.
  • Based in the Cloud, Azure can reduce reliance and expenditure on physical IT infrastructure such as on-premises data centers.
  • NVv4 offers instances that can support 4K displays, 60Hz screen refresh rates, and multi-monitor configurations of up to 4 monitors.

Let's consider just a few of the use cases that NVv4 now makes possible for the financial services sector.

 

Branch offices

Azure is centralized in the Cloud, so it enables IT departments of large financial organizations to remotely deliver and update applications and roll-out security patches. This can also help IT retain greater situational awareness of their entire distributed environment, which may include hundreds or thousands of branch offices, affording improved control and compliance oversight. With greater visibility, IT administrators can better optimize usage of costly software licenses and better manage costs. 

 

Azure supports end-users with an ultra-low-latency global data backbone that delivers a highly productive experience. The combination of AMD enterprise-grade CPU and GPU hardware with the NVv4 Windows® 10 virtual instance helps ensure optimal compression for remote protocols that can overcome local limitations in networking and bandwidth, relieving IT of the need to install and modify leased offices. As tablets and other portable devices become common in local banks, a virtualized approach makes it possible for such devices to access powerful tools, enabling staff to assist customers from convenient, comfortable locations rather than behind a bulky workstation at a fixed desk.  

 

Trading environments

The Windows 10 environment and key business applications such as Bloomberg, Capital IQ, FactSet, and Thomson Reuters Eikon, all require GPU support to deliver the responsive, low-latency interactive experience users such as traders demand. Powered by the combination of AMD 2nd Gen EPYC processors and AMD Radeon Instinct GPUs, NVv4 instances address that challenge while providing IT managers with flexibility to choose the right-sized configuration for different types of users.  Unlike on-premises data centers, where IT managers must purchase hardware and licenses, then install and service servers, NVv4 enables IT managers to simply and quickly provision resources from the Cloud when adding new users to the workforce.  

 

Data Security and Regulatory Compliance

Secure remote access gives financial services companies the assurance that data stays centralized in the data center, where it can be backed up and managed, rather than being replicated to unmanaged endpoints.

 

Business Continuity and Disaster Recovery 

In today's electronic trading environments, downtime can lead to missed opportunities and significant financial loss. If an office, municipality, or large region is impacted by a natural or man-made disruption, a virtualized infrastructure can provide critical redundancy. It can help ensure that vital data sources, compute/simulation resources, real-time analytics tools, and trading desktops remain online and accessible, enabling staff to work remotely and securely. Azure Service Level Agreements (SLAs) for VMs typically guarantee in excess of 99.9 percent availability.

 

Channel Partner Access

Financial products are often sold via brokers or agents, particularly in the consumer insurance and mortgage sectors. Virtualization can allow financial institutions to provide sales channel partners with secure, limited, ring-fenced access to applications or data as needed. This is critical to maintaining compliance with FSA and GDPR legislation. Azure has a proven track record of supporting the compliance needs of enterprise, global financial services, and banking organizations.

 

The financial services industry faces some of the most challenging IT configuration and management issues. The flexibility of NVv4 is well worth a look for those seeking to streamline some of that complexity and better control costs, without sacrificing performance.

 

Other resources to consider:

George Watkins is a Product Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.


Microsoft’s announcement of its new NVv4 virtual desktop instances got me thinking about the many industries that may benefit from expanding virtualization. With fractional GPU functionality built on AMD Radeon GPUs, NVv4 suddenly makes it feasible to apply Desktop as a Service (DaaS) to use cases previously burdened with compromises. So, over my next few blogs, I’ll explore some of those industries, beginning with a favorite of mine, Education.

IT Managers in education work magic, forever balancing technical progress, rising user expectations, and, above all, cost. Microsoft Azure NVv4 is exciting because it addresses the breadth of those challenges. By making it possible to share GPU resources in a third-party, cloud-based managed data center, NVv4 enables education IT to:

  • reduce the need to invest in, manage, and upgrade expensive private data centers
  • define and scale virtual data centers to deal with the evolving demands
  • optimize usage of computing resources
  • deliver a custom-fit, great user experience to the differing needs of students and faculty
  • increase security and accessibility on- and off-campus

DaaS: The Right-Sized Approach to Education IT Needs

DaaS shares the appealing capabilities of on-premises VDI (Virtual Desktop Infrastructure), but with the massive added benefit that a third-party provider like Azure now designs, procures, deploys, and manages all the necessary hardware and VDI software. Education facilities instead rent cloud-based services on a monthly basis. 

IT operations can switch from a rigid CAPEX spending model to a flexible OPEX model, paying for only what they use. This may be the answer to the reduced demand of summer holidays, term breaks, and variations in teaching and learning hours. 

Device Flexibility

Virtual desktops are accessible from students’ own devices, regardless of technical specifications. This is possible because all performance and data are in the Cloud. Only the final info needed for display is sent to the user. This can extend the life of devices and make it possible to support affordable low-power PCs, Chromebooks, or tablets without the concern of performance or application compatibility issues. In fact, students can generally choose between Macs, PCs, or Chromebooks for courses without compatibility concerns. IT administrators can be freed from maintaining physical PCs and workstations while centralization also simplifies the management of software licenses. 

Fractional GPU with AMD Changes the Equation for Education

Until NVv4, it was only possible to choose between expensive full-GPU, high-specification VMs or non-GPU VMs. Configurations without any GPU don't meet the demands of even a basic modern web browser. While a full GPU makes sense for high-end workstation applications, that level of service was costly overkill for users of basic productivity and collaboration software who require only a small portion of a GPU to enjoy a great experience.

GPU partitioning in Azure NVv4 instances allows IT administrators to fit VM resources to application and course requirements. For example, initial undergraduate courses using SolidWorks are unlikely to have the same demanding requirements as professionals in CAD/CAM industries. An NVv4 option with 4GB of GPU memory is usually sufficient to provide a high-quality experience at a lower cost for many engineering applications as well as Windows 10 and video streaming. Larger GPU options are also available to support heavyweight users and researchers doing more intensive CAD work or sophisticated CFD (Computational Fluid Dynamics) simulations.

The Tools for Great User Experiences

Remote display applications and protocols are key to good user experiences with VDI/DaaS in the Cloud, and NVv4 does not disappoint, offering Windows Remote Desktop (RDP) 10, Teradici PCoIP, and Citrix HDX 3D Pro for remoting flexibility, regardless of the intended use case. The AMD Radeon GPUs also support native graphics APIs, including DirectX 9 through 12, OpenGL 4.6, and Vulkan 1.1, ensuring a true graphics experience in the Cloud. AMD Radeon Pro professional graphics drivers are included license-free with all AMD GPU-enabled Azure instances, with no restrictions on the number of users for multi-user Windows Virtual Desktop and Remote Desktop Session Host, providing IT departments with administrative freedom.
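For the graphics-API side, the hedged sketch below uses the standard Vulkan API to list the physical devices a guest can see; on an NVv4 desktop, the partitioned Radeon Instinct GPU would be expected to show up here behind the Radeon Pro driver. It is a generic illustration, not AMD or Microsoft sample code, and the application name is a hypothetical placeholder.

```cpp
// Illustrative sketch: enumerate Vulkan physical devices visible in the guest.
// Link against the Vulkan loader (e.g. -lvulkan).
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app{};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.pApplicationName = "nvv4-device-query";   // hypothetical name for this example
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo info{};
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo = &app;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) {
        std::printf("No Vulkan-capable driver found\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props{};
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("Device: %s, Vulkan API %u.%u\n",
                    props.deviceName,
                    VK_VERSION_MAJOR(props.apiVersion),
                    VK_VERSION_MINOR(props.apiVersion));
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```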

Addressing the Modern Education Environment

Data Security
Virtual desktop environments are essentially sandboxed and centralized, with Azure running the Hyper-V hypervisor. IT administrators no longer need to worry about the security patching of BYOD laptops and can be assured that educational resources are not abused for gaming, bitcoin mining, or accessing inappropriate material. Azure’s regions and data controls are already proven and trusted for handling sensitive research projects and data in collaboration with military, government, and industrial collaborators.

Increased Access with Virtualized Classrooms, Labs, and Distance Learning

Students can work anywhere: in libraries, residence halls, off-site, or around the globe. NVv4 helps schools overcome weather, distance, and time constraints, and increases their capacity to remove barriers to access through online programs. Curricula can be rapidly refreshed, centrally deployed, and managed, enabling universities and high schools to provide online courses and to deploy new course materials and resources instantly. Azure's high-availability guarantees and regional data centers provide low-latency access globally. Courses in other time zones can also rely on Microsoft-supported infrastructure, avoiding not only the need for hardware but also out-of-hours IT support.

Support demanding graphical, collaborative and processing-intensive curricula

The new NVv4 instances are powered by the 64-core AMD EPYC 7742 CPU and the AMD Radeon Instinct MI25 GPU, with GPU sizes from 2GB to 8GB available and full AMD Radeon Pro professional graphics drivers. By removing the need for students to be tied to high-performance workstations, even design, engineering, animation, and visual effects courses can be supported virtually, using professional 3D software applications from Dassault Systèmes (SolidWorks and CATIA), Autodesk, PTC, Siemens (NX), and Adobe (Creative Cloud). NVv4 similarly delivers a great foundation for modern collaboration applications with rich media.

 

I believe that NVv4 has the potential to dramatically reshape the IT landscape for education. It creates remarkable new opportunities for IT managers to better balance what have been competing demands for up-to-date technology, security, cost management, and great user experiences for faculty and students.  

Find out more

If you'd like to find out more, please visit AMD.com.

Additional links 

 

George Watkins is a Product Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.


3rd party advertisement

Are you interested in deploying in the cloud? Do you want to learn more about AMD-powered desktops and workstations in the cloud? Well, join us Wednesday, April 1st, for a live webinar as we launch the all-new AMD-powered desktops and workstations on Azure with Workspot's turnkey, enterprise-ready cloud desktop platform. We'll also discuss how IT organizations can quickly deploy this turnkey VDI solution to their users to work remotely, whether at home, in the office, or onsite.


 

Watch the webinar recording now

What: Live webinar broadcast AMD-Powered Workspot workstations on Azure
Who: Hear from the following cloud experts:

  • Adam Glick, DaaS Cloud Tech Marketing at AMD
  • Kevin Raines, HPC Specialist at Microsoft
  • Brad Peterson, VP at Workspot
  • Andy Knauf, CIO at Mead & Hunt
  • Doug Dahlberg, Dir of IT at ASTI

 

A few of the topics examined: 

  • Why move to Azure (the Cloud)?
  • Why choose Workspot cloud desktops and the new AMD-powered offering
  • How do I quickly deploy Workspot cloud desktops & workstations on Azure to address remote working

 

Other resources:

AMD.com landing page - Click here

AMD blog - Click here

MSFT blog - Click here

The information contained in this blog represents the view of AMD or the third-party presenter as of the date presented. AMD and/or the third-party presenters have no obligation to update any forward-looking content in the above presentations. AMD is not responsible for the content of any third-party presentations and does not necessarily endorse the comments made therein.

 

George Watkins is a Product Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.


Virtualized environments can pose some challenges for companies. In order to bring a more consistent and user-friendly experience to virtual environments, AMD and Microsoft have been working together to offer a whole new cloud experience for desktop and workstation users.

Microsoft Azure NVv4 instances are the first desktop as a service (DaaS) virtual machines (VMs) powered by the combination of 2nd Gen AMD EPYC processors and AMD Radeon Instinct GPUs. As of today, NVv4 is generally available to the public.

 

NVv4 represents a convergence of innovative technologies to make modern desktop experiences possible from the cloud. Enterprises can deploy affordable, cloud-native, GPU-accelerated desktop environments that meet the performance and flexibility demands needed for high employee productivity. Just as important, NVv4 also offers state-of-the-art IT management tools to help drive the success of IT organizations.

 

How is this possible? NVv4 instances are built on three fundamental pillars to enable cloud-native modern desktop and workstation experiences.

 

GPU-Accelerated Performance

Today’s digital workforce relies on modern applications, and modern applications are built with GPU acceleration at their core. From the most powerful 3D design tools, to common office productivity tools, and even web browsing, everyday applications are designed to require, or at least benefit from, built-in graphics acceleration support. In other words, virtual machines without GPU acceleration will often struggle with some of the most common desktop tasks.

 

As the first VMs on Azure to take advantage of AMD’s SR-IOV technology to enable GPU partitioning, NVv4 provides IT decision-makers with four VM options calibrated to meet the variety of use cases in the modern workplace. Whether they are a professional running a workstation-class design application or support staff using Microsoft Office 365, all users receive the performance and reliability of 2nd Gen AMD EPYC processors and Radeon Instinct GPUs. ISV certifications and optimizations for professional 3D applications further reinforce the user experience.

 

Support for the latest Windows 10, Windows Server and Windows 10 Enterprise multi-session operating systems provides IT with the flexibility to specify single- or multi-session configurations as needs dictate. Even when the GPU is partitioned, the individual user’s experience is indistinguishable from the experience of a locally installed GPU to which they are accustomed.

 

IT managers can continue to rely on the traditional remote protocols, management, and administration tools they prefer. NVv4 instances are fully supported by Windows Virtual Desktop, Citrix Cloud, Teradici Cloud Access and Workspot Cloud VDI so the migration to Azure is both smooth and familiar.

“The flexibility that Azure NVv4 with AMD-powered GPU partitioning provides for users to share and access GPU resources as needed is a valuable feature that we see will benefit many Teradici customers. We are excited to be working with Microsoft and AMD to enable more flexible, cost-effective GPU options for virtual desktop and virtual workstation use cases such as AEC.”

– Ziad Lammam, Vice President of Product Management at Teradici

Uncompromised Security

Security is at the core of nearly every IT conversation. In an infrastructure where resources are shared across users and services, companies need to be confident that each individual user’s data is fully protected. Azure is built on world-class security technologies, and the way the GPU is virtualized matters here too.

 

Security runs deep into the hardware of AMD-powered Azure environments. While traditional GPUs rely on software techniques for security in virtualized environments, NVv4 is powered by SR-IOV-based GPU virtualization, enabling isolation of PCIe hardware resources to prevent unauthorized access to one VM’s data by users of other VMs. Each VM can only access the physical resources that have been allocated to it and is physically isolated from the others, even when a single GPU is shared by multiple users. SR-IOV is recognised and established in the industry as one of the key standards for resource isolation, which is why Microsoft is including this technology as part of its comprehensive plan to keep its customers safe and protected when virtualised.

"The diversity of the new AMD-based Workspot cloud desktops on Microsoft Azure is a huge deal for us. Based on the application requirements of each engineer, we can dedicate all or a fraction of the AMD GPU to their Workspot workstation on Azure. This finer resolution of control gives us the financial edge we need to move more people to Workspot cloud desktops on Azure and increase our overall productivity."

– Eric Quinn, CTO at C & S Companies

Cloud-like Affordability

One of the biggest promises of the cloud is that businesses can reduce their costs by renting exactly what they need. Yet for businesses looking to deploy GPU-accelerated VMs, this was not possible. Prior to NVv4, users could only choose between more expensive full-GPU VMs or non-GPU VMs. Even if the user didn’t need the entire performance headroom of a full GPU, they would be required to rent it. While the cost of a full GPU could be justified for the highest-end workstation workloads, most desktop experiences need only a fraction of a GPU for an optimal experience.

 

One of the key benefits of AMD-powered GPU partitioning in Azure is the ability to deliver fractions of a GPU at more affordable price points. Four AMD-powered NVv4 options are available to IT managers, making it possible to provide virtual desktop configurations that closely match the particular computing workloads of different users. NVv4 instances deliver GPU-powered desktop experiences in which a single GPU can be shared by eight, four, or two users, or dedicated to a single user, as their application needs dictate.
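If you want to see what those options look like in practice, the Azure CLI can list them once NVv4 is available in your region. This is a minimal, illustrative sketch; the region below is only an example, and size names and availability may vary.

# List the NVv4 sizes offered in a region (example region shown)
az vm list-sizes --location southcentralus --output table | grep "NV.*as_v4"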

“As more organizations start migrating Citrix workloads to Microsoft Azure, they want to ensure that they’re delivering the same level of experience as their previous on-prem deployments. We’re excited to be partnering with AMD and Microsoft on the release of the NVv4 instance, as this ensures organizations can deliver graphically accelerated Citrix Workspaces with superior user experiences while also optimizing their costs.”

– Nitin Sharma, Sr Product Marketing Manager for Workspace Services at Citrix

Promises Fulfilled

AMD CPU- and GPU-powered NVv4 instances are the first GPU-accelerated virtual desktops for Azure built on AMD technology. They help businesses balance the need for productivity, the absolute requirement for security, and the ever-present pressure to manage costs, all while providing users with an adaptable, flexible, high-performance cloud-based work environment that addresses the breadth of expectations of the modern workplace.

 

Businesses interested in assessing and testing DaaS environments for their operations can work with Microsoft partners like Cloud Jumper and Workspot, whose professional and experienced teams can help assess business needs every step of the way, from proof of concept (POC) to deployment and migration.

 

Find out more:

George Watkins is a Product Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

more
0 0 3,551
Staff

A few months ago at Microsoft Ignite, in the AMD booth, I had the opportunity to showcase the first GPU-partitioned and shared instances (NVv4) available for Microsoft’s Azure cloud, featuring the AMD Radeon Instinct MI25 accelerator, along with AMD’s other EUC (End User Computing) and data center products. News about the Microsoft and IGEL partnership relating to WVD (Windows Virtual Desktop) also attracted interest from our cloud, Citrix, and related customers. Although WVD has been available in preview, no Linux-based WVD client had been available, which resulted in increased interest in the IGEL offering. And at the recent Disrupt 2020 event, IGEL announced the first Linux client to support WVD. The Microsoft SDK that makes this integration possible has the potential to enable other thin-client vendors to offer their own solutions.

 

While the use of AMD CPUs and server GPUs is well-known, AMD is also a major player in providing the CPU and graphics/GPU hardware within many of the most popular thin clients.

 

The joint IGEL and Microsoft announcement was particularly satisfying for me, as it heavily featured IGEL’s flagship UD7 client, which targets graphical use cases and is built around AMD technologies. For example, the technical specifications for the UD7 client feature the AMD Embedded RX-216GD system on a chip (SoC), a dual-core part running at 1.6 GHz (up to 3.0 GHz in boost mode). With the option of an additional graphics card, the AMD Embedded Radeon™ E9173 discrete GPU can extend the UD7 to support the simultaneous use of up to four digital monitors at 60 Hz over DisplayPort (two in 4K and two in 2K). The flagship UD7 client also features IGEL’s latest security enhancements, a benefit for scenarios where security is a concern for thin clients.

 

Last week at IGEL Disrupt Munich, a new version of the UD3 client was announced on BrianMadden.com. The UD3 is powered by a specially optimized AMD Ryzen Embedded R1505G that uses less power (about 10 watts), features hardware optimizations for PCoIP (PC over IP) Ultra, and leverages AMD Secure Processor checks to help assure the UEFI is signed by IGEL. Availability is expected in May 2020, but in the meantime information is already available on the specifications and in IGEL solution architect blogs, including one by Fredrik Brattstig.

 

My role at AMD is largely associated with evaluating the performance of our Data Center and Cloud products, including the AMD Radeon Pro V340 and AMD Radeon Instinct MI25 server GPUs. The evaluations are conducted within the context of the protocols and EUC/VDI environments used in scenarios featuring Azure, RDP, Citrix, VMware, and Teradici. Most remoting protocols have a feature often referred to as “back-pressure”: a process whereby the end client tracks whether it is keeping up with the server frame rate and alerts the server accordingly. There is no point churning out frames the endpoint cannot handle, so a suitably powerful endpoint is important and can become the most significant factor in the overall user experience. IGEL, supported by AMD solutions, has proved very popular, and you can learn from IGEL about the use cases and features of the UD3 and UD7.

 

The IGEL and Microsoft partnership, WVD support, and the AMD-enabled NVv4 Azure instances were all featured by the independent blogger Bas van Kaam. His blog offers a good summary of Ignite and can be found here.

 

Now that these major events have concluded, I’m eager to get back in the AMD lab to “kick the tires” of WVD and the NVv4 Azure instances with the WVD-supported IGEL UD7. My goal is to blog about my findings, but I’m eager to discover others’ experiences with thin clients, especially if there are additional factors for consideration. If you want to try out NVv4 with WVD, I recommend a useful video guide available from Microsoft’s Stefan Georgiev on YouTube.

 

Recommended Links

 

Joe DaSilva is a Cloud Graphics Solutions Architect for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

more
0 0 784

AMD GPUs deliver the first shared GPU instances for Microsoft Azure – NVv4 instances

Today, the first Azure instances utilising GPU partitioning technology became available. These instances effectively enable a large server GPU to be partitioned, supplying VMs with an appropriately sized GPU and opening the way for potential savings in the cost of GPU-enabled cloud VMs.

Key to the adoption of AMD GPUs by Microsoft Azure was the alignment of our SR-IOV based MxGPU hardware-sharing technologies with Microsoft Hyper-V’s own GPU-P technology. This is clear validation of AMD’s multi-year strategy of working with Microsoft to align with their roadmap, resulting in the first GPU sharing solution on Azure that is acceptable in terms of user segregation, security features, and quality of service. Our virtualised GPU sharing technologies have already been proven with other hypervisors, including VMware ESXi and Citrix Hypervisor (XenServer); this is, however, the first time GPU sharing has been enabled on a Hyper-V based platform with Azure.

The result is a portfolio of instances leveraging both AMD CPUs and GPUs that are sized to the realistic needs of users, ranging from smaller instances that align to the needs of office workers or mobile CAD workstations (2 and 4 GB of equivalent GPU resource) to larger instances that can support heavier graphical and session-sharing needs. AMD professional GPU drivers are offered free along with these instances.

Size                 vCPU   Memory   GPU memory   Azure network
Standard_NV4as_v4    4      14 GB    2 GB         50 Gbps
Standard_NV8as_v4    8      28 GB    4 GB         50 Gbps
Standard_NV16as_v4   16     56 GB    8 GB         50 Gbps
Standard_NV32as_v4   32     112 GB   16 GB        50 Gbps
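For illustration, requesting one of these sizes with the Azure CLI might look like the sketch below once the instances are available in your subscription; the resource group, VM name, image, and credentials are placeholder example values, not part of the announcement.

# Create an NVv4 VM sized for a mid-range workstation workload (example values only)
az vm create --resource-group my-rg --name nvv4-demo --image Win2019Datacenter --size Standard_NV8as_v4 --admin-username azureuser --admin-password "<choose-a-strong-password>"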

 

Initially, NVv4 instances will be available early next year in the South Central US and West Europe Azure regions.

 

Sign-up for preview using this link:  https://aka.ms/NVv4Signup

 

AMD technology enabled Microsoft Azure at Ignite 2019, where Microsoft announced the preview; the interest it attracted, both in the booth and in the End User Computing (EUC) and similar communities, was fantastic, and it was great to speak to so many users about their enthusiasm for the options. I was cheered to see a blog by cloud community expert Marius Sandbu that covered the announcement and also caught the spirit of what we had hoped to convey.

 

Useful Links:

 

AMD at Microsoft Ignite



George Watkins is a Datacenter GPU Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

more
1 0 3,849

AMD based Microsoft Azure virtual desktops deliver a workstation-class experience in the Cloud

 

Autodesk University is the place to be for professional architects, designers, engineers, and media creators. Of course, AMD will be there, returning as a Gold Sponsor of this important event to provide demonstrations of our most powerful desktop processors and graphics cards, to discuss your biggest challenges, and to reveal the latest technology innovations that enhance the workstation experience.

 

Taking centre stage in our booth, AE310, will be live demonstrations of the Microsoft Azure stack, leveraging the new NVv4 instances. This is the first Windows Azure virtual desktop to be supported by both 2nd Gen AMD EPYC processors and Radeon Instinct MI25 GPUs. If you are one of those people who design, make, and build the world around us and rely on the highest performance from applications like Autodesk’s to make things happen, then you owe it to yourself to learn more about the NVv4 instance.

 

Wondering what NVv4 stands for? “N” = GPU-accelerated VM family in Azure. “V” = Visualization. “4” = Generation 4, which means NVv4 is the latest generation of GPU-enabled virtual desktop services from Azure.

 

Be more productive and collaborate by extending workstations to the Cloud

Modern-day designers, architects, and engineers demand the most of their critical tools. Whether in the office or at home, traveling or onsite, they need a workstation-class experience that provides flexibility and reliability no matter where in the world a project might take them. The NVv4 virtual desktops bring the full power of a traditional workstation configuration to bear whenever and wherever it’s needed. AMD GPU-enabled NVv4 virtual desktops make it possible to finally overcome the difficulty of balancing performance, mobility, and cost when addressing traditional Architecture, Engineering, and Construction (AEC) workloads.

 

Just what are Microsoft Azure NVv4 instances?

The NVv4 is a new virtual desktop solution in Microsoft Azure that takes advantage of SR-IOV (single-root input/output virtualization) technology to introduce, for the first time, GPU partitioning (GPU-P). This gives customers maximum flexibility and choice by providing dedicated CPU/GPU-supported virtual desktops that best suit their workloads and price points. In fact, NVv4 will offer four distinct instance options to choose from, scaled to share a single GPU’s resources among as many as eight virtual machines.

 

Alternatively, IT managers can maximize the user density of NVv4 with Windows 10 EVD, supported by Windows Virtual Desktop and available plug-ins from Citrix and Teradici. Anyone interested in trying the NVv4 experience for themselves can do so by signing up for AMD’s customer preview.

 

What will AMD be showing at AU?

Throughout Autodesk University, we will be showcasing our preliminary test environment, based on the planned NVv4 hardware and software stack, available in Microsoft Azure. You will get the chance to see a variety of the latest Autodesk applications for AEC and CAD workloads. AU19 will be a great opportunity to speak to the AMD team and explore how AMD-enabled virtual desktops in Microsoft Azure may help your organization. 

George Watkins is a Datacenter GPU Marketing Manager for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

more
0 0 856

AMD technology makes GPU-enabled virtual desktops possible across the entire enterprise!

Talk about being in the right place at the right time! My first opportunity to participate in the 2019 Microsoft Ignite conference promises to set a new high-water mark for impactful demonstrations, learning opportunities, and meaningful collaboration between AMD technology and the Microsoft ecosystem.

This year the AMD booth will be packed with technologies and demonstrations of many of the latest AMD solutions with Microsoft, including the latest high-performance laptops and virtual desktops. For me, though, the highlight at Ignite is the exciting news around Microsoft Azure NVv4 instances: the first Windows Azure virtual desktops supported by 2nd Gen AMD EPYC™ processors and Radeon Instinct™ GPUs.

Wondering what NVv4 stands for? “N” = GPU-accelerated VM family in Azure. “V” = Visualization. “4” = Generation 4, which means NVv4 is the latest generation of GPU-enabled virtual desktop services from Azure.

Modern day applications want more

This is an important distinction because many modern productivity applications, like Office 365, video conferencing, and web browsing, are designed to harness the GPU to deliver the best possible application experience. Many non-GPU VMs, however, struggle to deliver that experience, while previous GPU-accelerated VMs could only be configured, and priced, to deliver a full GPU as a workstation experience, making them too costly for everyday users.

Re-evaluate GPU enabled Virtual desktops

The introduction of AMD-powered NVv4 instances is shifting expectations for VM deployments and is sure to have IT managers taking note. What’s changed? Well, the NVv4 instance is the first VM on Microsoft Azure to take advantage of SR-IOV (single-root input/output virtualization) technology and introduces GPU partitioning across four new options. This gives customers greater flexibility, enabling the entire enterprise to enjoy dedicated CPU/GPU virtual desktops that deliver the best application experience regardless of the workload. In fact, NVv4 will offer four distinct instance options to choose from, scaled to share a single GPU’s resources among as many as eight virtual machines. Alternatively, IT managers can maximize the user density of NVv4 with Windows 10 multi-session, supported by Windows Virtual Desktop with available plug-ins from Citrix and Teradici. Anyone interested in trying the NVv4 experience for themselves can do so by signing up for the customer preview.

Attending Ignite?

During Ignite, there will be a great opportunity to speak to our team about the benefits of all the AMD-supported Azure instances and to sign up for the NVv4 customer preview at the AMD booth #249. If you want to learn more about the technologies powering NVv4, you might like to join these AMD sessions: Technical (BRK1114, Thursday 7th Nov, 11:30am), Hub (THR1086, 9am, Tuesday 5th Nov), and an NVv4-dedicated session by Microsoft (BRK3121), if you are lucky enough to be there in person.

From Azure to Windows, we love Microsoft! Come visit AMD Booth #249 to experience all of our technology demonstrations and discuss how we can address your business needs!

more
0 0 2,311
Staff

In virtualization, single-root input/output virtualization (SR-IOV) is a specification that allows the isolation of PCI Express resources between different users. It is already the standard used to share networking resources (NICs) and secure network traffic. Each resource has Virtual Functions (VFs) associated with it, and each virtual machine (VM) can only access the physical resource via its own allocated VF.

AMD MxGPU is the industry’s first SR-IOV based GPU sharing technology designed for the cloud and the datacenter. So why did we choose SR-IOV?

  • Industry standard. SR-IOV is the long-established industry standard for virtualising PCIe devices. As such, the standard is openly scrutinised for security.
  • Security. The isolation provided by VFs helps ensure each VM is isolated from the others; for example, memory is secured and not shared.
  • Scalability. We believe SR-IOV is a base technology that will allow for scalability and higher user densities over the long term, because it minimises context-switching overheads.
  • Stability and reliability. SR-IOV allows us to provide each VM with its own dedicated share of a GPU that does not compete with other users, helping ensure the available resource is consistent; users can avoid the unreliability associated with noisy neighbours and experience deterministic QoS.
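To make the Virtual Function idea concrete, here is a generic way to inspect SR-IOV support for a PCIe device on a bare-metal Linux host. This is an illustrative sketch only, not specific to MxGPU or to Azure, and the PCI address shown is a placeholder you would replace with your own device.

# Show whether the device advertises the SR-IOV capability (placeholder PCI address)
sudo lspci -vvv -s 03:00.0 | grep -A3 "SR-IOV"

# The same information is exposed through sysfs: total VFs supported and VFs currently enabled
cat /sys/bus/pci/devices/0000:03:00.0/sriov_totalvfs
cat /sys/bus/pci/devices/0000:03:00.0/sriov_numvfs

Each enabled VF then shows up as its own PCI function that can be allocated to a VM.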

SR-IOV: a technology that has evolved with and for the cloud

Back in 2009, veteran blogger Scott Lowe wrote an introduction to SR-IOV predicting it would become mainstream; it provides great context on the environment and technology of the time. Whilst we could have accelerated to market using a bespoke, proprietary memory management unit (MMU), we instead chose to work with the major hardware, hypervisor, and operating system vendors to evolve the technologies into an industry-wide fit for our long-term needs.

The evolution of SR-IOV was carefully managed, and in 2016 AMD was able to release the world’s first SR-IOV based GPU sharing solution for cloud and virtualisation. Beyond the obvious security and quality benefits of aligning to the core technology, the standard offers the potential for long-term scalability that a bespoke implementation wouldn’t have offered us.

We are seeing increasing rewards from this approach now, as other vendors, particularly Microsoft, have placed SR-IOV at the core of their technologies and infrastructure. This alignment has streamlined our joint projects, leading to the announcement of MxGPU in the Azure cloud to enable cost-effectively sized and priced GPU-enabled VMs. (You can register interest with Microsoft in the release availability, here.) MxGPU SR-IOV support is also available and proven for Citrix XenServer, XenDesktop and XenApp, VMware ESXi, Horizon View, and open source KVM. Read more here.

SR-IOV and MxGPU at Ignite

Our product management team will be at Microsoft Ignite (4-8 November), and you can find us on booth #249. You might also like to join these AMD sessions: technical session (BRK1114, Friday 8th Nov, 9am) and hub session (THR1086, 9am, Tuesday 5th Nov) if you are lucky enough to be there in person.

Learn More

  • Microsoft’s strong commitment to and investment in integrating the SR-IOV standards into the core of its platforms, such as Windows and Hyper-V, is significant, and as such it has published substantial information on this approach, including overviews and architectural deep-dives.
  • Our hypervisor and virtualisation partners have also been investing in core SR-IOV technologies, as well as releasing information on the benefits of and reasons for this approach. In September 2018, Citrix released XenServer 7.6; the release notes are available to read and, amongst other features, cover Citrix’s and XenServer’s adoption of SR-IOV for networking (NICs, Network Interface Cards).

The SR-IOV standard

The SR-IOV standard is controlled and maintained by PCI-SIG. The regulation and scrutiny of the standard are maintained through cross-industry membership and funding, alongside a compliance programme and a certified integrator list.

MxGPU: more than SR-IOV

Of course, there is more to MxGPU than SR-IOV; it is just one of the core technologies on top of which we have built our GPU sharing and virtualisation products. We are, however, pleased that we were the first vendor to achieve GPU sharing with the SR-IOV ‘gold standard’.

more
0 0 10.6K

There have been numerous opinions offered from all corners of the gaming community about the impact of Google Stadia. Gaming and business journalists, bloggers, and avid gamers all have opinions to share. And while a few revert to familiar hardware “spec” comparisons to gauge the value of new technology, the introduction of Google Stadia is clearly about much more. Google Stadia marks an evolution of the gaming landscape that’ll rapidly reshape the industry.

 

In the short time since Stadia was announced, several themes have emerged that will likely drive increased cloud gaming adoption.

  • Consistent Premium Performance
  • Transparent Maintenance
  • On-Demand Gaming & Social Integration
  • Device Access & Mobility
  • Cloud gaming value chain
  • Subscription based services

While this discussion is primarily based on Google Stadia, many of the value propositions introduced can be applied more generally to cloud gaming services such as Microsoft xCloud and Sony’s PlayStation Now.

 

Below we’ll introduce each theme and in future blogs dive deeper into each to explore their impact on the industry.

 

Consistent Premium Performance

Performance and hardware specs will continue to drive the conversation near term for two reasons: the industry is familiar with them, and they can be measured. While this understanding is important, moving forward the conversation will likely shift to focus on delivering a consistent premium experience. Stadia allows gamers to reconsider the entirety of the gaming experience and the context within which we view performance.

 

The choice of custom AMD “Vega”-based GPUs as a starting point for this service launch reflects Google’s strong commitment to what makes gamers happy and a deep understanding of what makes datacenters tick. Gaming is a part of the AMD DNA, delivering high performance GPUs for the latest game consoles, high-end gaming PCs, and the datacenter. The AMD “Vega”-based GPUs for Stadia are a proven platform featuring 56 compute units, up to 10.7 teraflops, integrated HBM2 memory, and support for the Vulkan® high-performance real-time 3D graphics API in the driver. That’s easily more power than the top two previous-generation consoles combined and a foundation for success that can deliver a next-generation console experience today1.

But for the player, all that matters is the experience, which at resolutions up to 4K and 60 frames per second, with HDR and surround sound, promises to be fantastic and substantially better than what many gamers enjoy today.

 

Transparent Maintenance

How many times has a user tried to launch a game only to be met with a time-consuming multi-gigabyte patch? With cloud gaming, software maintenance happens in the background, transparent to the user. In addition, the centralized design of Stadia also means they will not have to worry about hardware upgrades. The datacenter can be upgraded to keep pace with changing requirements, transparent to the user. In short, more play, less hassle.

 

On-Demand Gaming and Social Integration

Stadia will enable the ~200 million people who watch game-related content such as trailers and live streams on YouTube to lean into their enthusiasm and join the action with just a tap on their phone, tablet, or computer. Social integration allows for instant broadcasting, archiving, and sharing of your and your team’s latest achievements. E-sports fans and stream audiences can simply click a link on their favorite social media site and instantly launch into the latest titles.

 

Game downloads are becoming a thing of the past; like music and movies before them, many games are now available “on-demand”.

 

Device Access and Mobility

Google Stadia delivers the AAA gaming experience to the widest audience. That means great games, streamed via standard Internet connections, to a variety of devices, and all while enhancing the social aspects of and accessibility to those experiences to better match the preferences of today’s consumer.  

 

This vision is made possible by shifting the focus of the gaming world to the datacenter: the organizing principle of gaming becomes the datacenter rather than the individual’s device. Google’s 7500 edge nodes worldwide will put powerful gaming hardware essentially everywhere and within reach of virtually everyone.

With cloud gaming, if you need to take your gaming on the go, there is no need to start over. You can simply save state on your home theater or Chromebook and pick up seamlessly on your mobile device. That flexibility promises to change how players weave gaming into their everyday lives.

Evolving Business Model 

The transition of gaming to the Cloud will impact many companies, including console providers, game developers, and publishers. Traditionally, publishers have had a variety of platform options on which to distribute their game titles and reach their audience. One challenge they have faced, however, is the large fees required to gain access to each distinct platform. The introduction of new, high-performance cloud platforms like Stadia gives game publishers more choice.

Another interesting consideration that Stadia has introduced for many game developers and publishers is access to nearly unlimited resources on which to build their games. In the past, console hardware has tended to follow a slower refresh cycle than gaming PCs, so AAA games that appeared later in a console cycle had to be developed to support both older console technologies and more recent platforms. Those resource constraints sometimes restricted what a game developer could create. With the cloud, it could be proposed that the datacenter is the console; better still, it can be continuously updated to maintain the highest levels of performance, removing the need to buy the latest GPUs. By default, developers have access to the best gaming platform for their next blockbuster title.

Manufacturers building game-specific hardware including consoles have also recognized the potential of cloud gaming. They can see a future where they have an opportunity to shift their efforts away from developing hardware with costly components and fighting expensive PR battles centered on hardware superiority, and instead drive wholeheartedly at creating the best games environment. Datacenter-based gaming provides a new, more cost-efficient and sustainable direction that can consolidate and balance costs. It means the business of games can stop competing on specs and instead compete on content. That’s something every gamer can appreciate.

Subscription Based Gaming

Google Stadia breathes new life into the gaming conversation, triggering a dialogue about the liberation offered by cross-platform play, blurring the lines between gameplay viewers and players, and establishing a flexible infrastructure that adapts to the innovation of developers.

This is a conversation I'm excited to continue over the coming months.

George Watkins | Marketing Manager | Datacenter GPU BU

These views are my own and do not reflect those of AMD.

©2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Thunderbolt is a trademark of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

 

Footnotes:

1) 4th September 2019, based on PS4 Pro GPU performance (4.2 TFLOPS) and Xbox One X GPU performance (6 TFLOPS) compared with Google Stadia GPU performance (10.7 TFLOPS)

https://www.digitaltrends.com/gaming/xbox-one-x-vs-ps4-pro/ https://www.techadvisor.co.uk/news/game/google-stadia-news-3693903/

more
1 0 1,018


[Originally posted on 11/06/18]

Today in San Francisco, California, AMD held a special event where we announced the newest additions to the Radeon Instinct™ family of compute products. The AMD Radeon Instinct™ MI60 and Radeon Instinct™ MI50 accelerators are the first GPUs in the world that are based on the advanced 7nm FinFET process technology. The ability to go down to 7nm allows us to put more transistors onto an even smaller package than was possible before. In this case, the MI60 contains 13.2 billion transistors on a package size of 331.46mm2, while the previous-generation Radeon Instinct™ MI25 had 12.5 billion transistors on a package size of 494.8mm2, a 58% improvement in the number of transistors per mm2. This allows us to provide a more powerful and robust product, capable of tackling a wide range of workloads from training and inference to high performance computing.
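For anyone who wants to sanity-check that figure, the numbers quoted above can be run through a quick command-line calculation (nothing more than arithmetic on the values in this paragraph):

# (MI60 transistors per mm^2) divided by (MI25 transistors per mm^2)
echo "scale=6; (13.2/331.46) / (12.5/494.8)" | bc -l
# Prints roughly 1.576, i.e. about a 58% improvement in transistors per mm^2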

Supercharged Deep Learning Operations – Ideal for Training and Inference

We’ve made numerous improvements on these new products, including optimized deep learning operations. In addition to native half-precision (FP16) performance, the MI60 and MI50 now support INT8 and INT4 operations, delivering up to a whopping 118 TOPS of INT4 peak performance on the MI60. The supercharged compute capabilities of these new products are designed to meet today’s demanding system requirements of handling large data efficiently for training complex neural networks and running inference against those neural networks used in deep learning.


World’s Fastest Double Precision PCIe® Based Accelerator

On the other end of the compute spectrum are FP64 calculations primarily used in high performance compute workloads. These types of workloads require extreme accuracy and speed, which the MI60 and MI50 deliver. The Radeon Instinct MI60 is the fastest double precision PCIe® based accelerator1, delivering up to 7.4 TFLOPS of FP64 peak performance, while the MI50 is not far behind at 6.7 TFLOPS. In addition to fast FP64 performance, the MI60 and MI50 both sport full-chip ECC memory3 as well as RAS4. This allows scientists and researchers across several industries including life sciences, energy, automotive and aerospace, government and more to achieve results with both speed and accuracy.


Finely Balanced, Ultra-Scalable Datacenter Solution

Most of the improvements we’ve talked about so far have been at the chip level, but we didn’t stop there: there are also a number of new benefits beyond the chip. We meticulously designed the MI60 and MI50 to deliver finely tuned and balanced performance. We took a look at some of the common bottlenecks found in previous generations and made improvements to ensure your data is processed in the most efficient manner possible. This includes making these cards PCIe® Gen 4* capable, delivering up to 2x more bandwidth (64 GB/s vs. 32 GB/s) than PCIe® Gen 3 when communicating over the bus. In addition to improved performance between GPU and CPU, we’ve also built into these products a peer-to-peer GPU communication feature called Infinity Fabric™ Link technology. Each card includes two physical Infinity Fabric™ Links, allowing you to directly connect four GPUs together in a GPU hive ring and up to two of these hives in an 8 GPU server. Each GPU card provides up to 200 GB/s bandwidth between peer GPUs, which is up to 6x faster than PCIe Gen 3 alone2. We have also doubled memory bandwidth speeds from our previous generation Radeon Instinct MI25 accelerator5, delivering up to 1TB/s memory bandwidth on both the MI50 and MI60 accelerators, the first GPUs to achieve this speed.


With improved performance from both within the GPU and between GPUs and CPUs, these new finely-balanced, ultra-fast and scalable solutions are the ideal datacenter compute solution for all your needs whether they’re inference, training or HPC related.

Learn More About the AMD Radeon Instinct MI60

Learn More About the AMD Radeon Instinct MI50

Learn More About AMD’s “Vega 7nm” Technology

Learn More About ROCm

Warren Eng is a Product Marketing Manager for professional graphics and compute at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-5

more
0 0 1,268


[Originally posted on 11/21/17]

This year at SC17, AMD showcased Radeon Instinct™ accelerators, AMD EPYC™ processors and the ROCm open software platform – a complete ecosystem to drive a new era in the datacenter. Our booth was packed with server racks from partners like Inventec, Gigabyte, Supermicro and BOXX. Attendees had the opportunity to check out Project 47, both on display and running demos, offering 1 PetaFLOPS of compute power.

The much anticipated TensorFlow support with ROCm 1.7 was revealed in our booth alongside a demo of deep learning inference from a trained Caffe model. AMD also offered hourly Tech Talks, diving into a wide range of topics – from AMD EPYC™ performance to Radeon technology powering the exploration of dark energy with the CHIME radio telescope.

Thank you to everyone that joined us at SC17. For those that were unable to attend, check out our photo gallery below. We hope to see you next year at SC18!


Daniel Skrba is a Marketing and Communications Specialist for the Radeon Technologies Group at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

more
0 0 621


[Originally posted on 10/27/17]

Visit AMD at our SC17 booth #825 and learn how AMD, together with our partners, is bringing about a new era in the datacenter that is revolutionizing High Performance Computing with our new AMD EPYC™ processors and Radeon Instinct™ accelerators. On top of this year’s show-stopping demos, you will have the opportunity to attend one of our interactive and educational booth Tech Talks; check out the schedule below.

Featured AMD Tech Talks

Tuesday, Nov. 14th, 2017

  • 11AM: Reconfigurable Acceleration at Cloud Scale, Manish Muthal, Vice President of Data Center Marketing, Xilinx
  • 1PM: Introducing AMD EPYC™: A New Standard of Performance and Innovation, Girish Kulkarni, Director of Product Marketing, AMD Server Group, AMD
  • 2PM: Exploring Dark Energy with the CHIME Radio Telescope, powered by Radeon™ Technology, Andre Renard, Chime Computing Specialist, Dunlap institute for Astronomy & Astrophysics, University of Toronto
  • 3PM: AMD EPYC™ for HPC, Joshua Mora, PhD, Manager Field Application Engineering, AMD
  • 4PM: AMD Radeon Instinct™ Accelerators, Niles Burbank, Sr. Product Manager, AMD
  • 5PM: Redefining HPC Performance with EPYC-based Supermicro Servers, Super Micro Computer, Inc.

Wednesday, Nov. 15th, 2017

  • 11AM: Interconnect Your Future with Mellanox “Smart” Interconnect, Gilad Shainer, Vice president of Marketing, Mellanox Technologies
  • 1PM: Accelerating 3D Acoustics With HCC-C++, Reid Atcheson, Accelerator Software Engineer, NAG
  • 2PM: AMD EPYC™ for HPC, Joshua Mora, PhD, Manager Field Application Engineering, AMD
  • 3PM: Advances in GPU Networking at AMD, Michael Lebeane, Sr. Design Engineer, AMD Research
  • 4PM: Running TensorFlow on AMD’s ROCm software platform with HIP, Ben Sander, Sr. Fellow, Software Engineer, AMD
AMD Booth #825 Tech Talks, November 14-15, 2017

Venue: COLORADO CONVENTION CENTER (Denver, CO)

We hope to see you in Denver!

more
0 0 963
Staff


[Originally posted on 10/10/17 - by Gregory Stoner]

AMD is excited to see the emergence of the Open Neural Network Exchange (ONNX) format, which creates a common model format to bridge three industry-leading deep learning frameworks (PyTorch, Caffe2, and Cognitive Toolkit) and give our customers simpler paths to explore their networks via rich framework interoperability.

The ONNX format, via its extensible computation graph model, built-in operators, and standard data types, will allow our team to focus on more in-depth optimization with our Radeon Instinct hardware and a more productive solution set via our open source MIOpen deep learning solver library and ROCm compiler technology. It also gives us a path to explore new foundations beyond traditional frameworks, bringing lighter-weight, more optimized solutions to production on our hardware.

It is great to see the collaboration of Facebook and Microsoft continuing to follow the path of open software development practice with ONNX, building on their open source projects PyTorch, Caffe2, and Cognitive Toolkit. Open software development aligns with our philosophy of providing an open source software platform, tools, and drivers to give the research community a more powerful ability to explore the broader deep learning design space.

We feel this is an excellent step for the community, opening up these platforms to a broader set of diverse architectures. We look forward to working with the project and helping it grow in the coming months.

Gregory Stoner is Sr. Director of Radeon Open Compute. Links to third-party sites and references to third-party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third-party endorsement of AMD or any of its products is implied. Use of third-party names or marks is for informational purposes only and no endorsement of or by AMD is intended or implied.

more
0 0 1,404
Community Manager


[Originally posted on 09/08/17 by Albert J. De Vera]

Deep Learning, an advanced form of machine learning, has generated a lot of interest due to the wide range of applications on complex data sets. Current technologies and the availability of very large amounts of complex data have made analytics on the latter more tractable.

With deep neural networks as the basis for deep learning algorithms, GPUs are now being used in deep learning applications because they provide many processing units. These processing units simulate a neural network that does the computation on data. Neural networks can therefore scale and improve the extraction of information from data.

ROCm and The AMD Deep Learning Stack

The AMD Deep Learning Stack is the result of AMD’s initiative to enable DL applications using its GPUs, such as the Radeon Instinct product line. Currently, deep learning frameworks such as Caffe, Torch, and TensorFlow are being ported and tested to run on the AMD DL stack. Supporting these frameworks is MIOpen, AMD’s open-source deep learning library built for the Radeon Instinct line of compute accelerators.

AMD’s ROCm platform serves as the foundation of this DL stack. ROCm enables the seamless integration of the CPU and GPU for high performance computing (HPC) and ultra-scale class computing. To achieve this, ROCm is built for language independence and takes advantage of the Heterogeneous System Architecture (HSA) Runtime API.3 This is the basis of the ROCr System Runtime, a thin user-mode API providing access to graphics hardware driven by the AMDGPU driver and the ROCk kernel driver.


For now, OS support for ROCm is limited to Ubuntu 14.04, Ubuntu 16.04, and Fedora 23. For these OSs, AMD provides a modified Linux version 4.6 kernel with patches to the HSA kernel driver (amdkfd) and the AMDGPU (amdgpu) kernel driver currently in the mainline Linux kernel.5
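Once that modified kernel is installed (the package steps are shown later in this post), a quick sanity check that the drivers described above are present might look like this; module names and output can differ slightly between ROCm releases, so treat it as a rough guide only:

# Confirm the HSA (amdkfd) and AMDGPU kernel drivers are loaded
lsmod | grep -E "amdkfd|amdgpu"

# The kernel fusion driver exposes this device node when it is active
ls -l /dev/kfd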

Using Docker With The AMD Deep Learning Stack

Docker Containers

Software containers isolate the application and its dependencies from other software installed on the host. They abstract the underlying operating system while keeping its own resources (filesystem, memory, CPU) and environment separate from other containers.

In contrast to virtual machines, all containers running on the same host share a single operating system without the need to virtualize a complete machine with its own OS. This makes software containers perform much faster than virtual machines because of the lack of overhead from the guest OS and the hypervisor.

Docker is the most popular software container platform today. It is available for Linux, macOS, and Microsoft Windows. Docker containers can run under any OS with the Docker platform installed.6

Installing Docker and The AMD Deep Learning Stack

The ROCm-enabled Linux kernel and the ROCk driver, together with other needed kernel modules, must be installed on all hosts that run Docker containers. This is because the containers do not have the kernel installed inside them. Instead, the containers share the host kernel.7
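A simple way to see this kernel sharing for yourself, assuming Docker and the rocm/rocm-terminal image used later in this post are already installed, is to compare kernel versions:

# Kernel version reported on the host...
uname -r

# ...and inside a container: the two match, because the container shares the host kernel
sudo docker run --rm rocm/rocm-terminal uname -r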

The installation procedure described here is for Ubuntu 16.04. Ubuntu 16.04 is currently the most tested OS for ROCm.

Installing ROCm

The next step is to install ROCm and the ROCm kernel on each host. The procedure described below is based on instructions found in https://rocm.github.io/install.html.

Grab and install the GPG key for the repository:

wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -

You should get the message ‘OK’. You can check if it’s there using apt-key:

apt-key list

In /etc/apt/sources.list.d, create a file named rocm.list and place the following line in it:

deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main

Update the repository information by running ‘apt update’. If you get a warning because of the key signature, you may ignore it since the repository administrator will update this in the future.

Install the ROCm Runtime software stack using ‘apt install rocm’:

[root@pegasus ~]# apt install rocm

Reading package lists… Done

Building dependency tree

Reading state information… Done

The following packages were automatically installed and are no longer required:

hcblas hcfft hcrng miopengemm

Use ‘sudo apt autoremove’ to remove them.

The following additional packages will be installed:

hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm-dev

rocm-device-libs rocm-profiler rocm-smi rocm-utils

Suggested packages:

linux-firmware-image-4.11.0-kfd-compute-rocm-rel-1.6-148

The following NEW packages will be installed:

hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm rocm-dev

rocm-device-libs rocm-profiler rocm-smi rocm-utils

0 upgraded, 10 newly installed, 0 to remove and 0 not upgraded.

Need to get 321 MB of archives.

After this operation, 1,934 MB of additional disk space will be used.

Do you want to continue? [Y/n]

Get:1 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-utils amd64 1.0.0 [30.7 kB]

Get:2 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hcc amd64 1.0.17312 [255 MB]

Get:3 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hip_hcc amd64 1.2.17305 [876 kB]

Get:4 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [10.8 MB]

Get:5 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [46.5 MB]

Get:6 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-device-libs amd64 0.0.1 [587 kB]

Get:7 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-smi amd64 1.0.0-25-gbdb99b4 [8,158 B]

Get:8 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-profiler amd64 5.1.6400 [7,427 kB]

Get:9 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-dev amd64 1.6.148 [902 B]

Get:10 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm amd64 1.6.148 [1,044 B]

Fetched 321 MB in 31s (10.1 MB/s)

Selecting previously unselected package rocm-utils.

(Reading database … 254059 files and directories currently installed.)

Preparing to unpack …/rocm-utils_1.0.0_amd64.deb …

Unpacking rocm-utils (1.0.0) …

Selecting previously unselected package hcc.

Preparing to unpack …/hcc_1.0.17312_amd64.deb …

Unpacking hcc (1.0.17312) …

Selecting previously unselected package hip_hcc.

Preparing to unpack …/hip%5fhcc_1.2.17305_amd64.deb …

Unpacking hip_hcc (1.2.17305) …

Selecting previously unselected package linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148.

Preparing to unpack …/linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …

Unpacking linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

Selecting previously unselected package linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148.

Preparing to unpack …/linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …

Unpacking linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

Selecting previously unselected package rocm-device-libs.

Preparing to unpack …/rocm-device-libs_0.0.1_amd64.deb …

Unpacking rocm-device-libs (0.0.1) …

Selecting previously unselected package rocm-smi.

Preparing to unpack …/rocm-smi_1.0.0-25-gbdb99b4_amd64.deb …

Unpacking rocm-smi (1.0.0-25-gbdb99b4) …

Selecting previously unselected package rocm-profiler.

Preparing to unpack …/rocm-profiler_5.1.6400_amd64.deb …

Unpacking rocm-profiler (5.1.6400) …

Selecting previously unselected package rocm-dev.

Preparing to unpack …/rocm-dev_1.6.148_amd64.deb …

Unpacking rocm-dev (1.6.148) …

Selecting previously unselected package rocm.

Preparing to unpack …/rocm_1.6.148_amd64.deb …

Unpacking rocm (1.6.148) …

Setting up rocm-utils (1.0.0) …

Setting up hcc (1.0.17312) …

Setting up hip_hcc (1.2.17305) …

Setting up linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

Setting up linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …

update-initramfs: Generating /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148

W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.

Generating grub configuration file …

Found linux image: /boot/vmlinuz-4.11.0-kfd-compute-rocm-rel-1.6-148

Found initrd image: /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148

Found linux image: /boot/vmlinuz-4.4.0-93-generic

Found initrd image: /boot/initrd.img-4.4.0-93-generic

Found memtest86+ image: /memtest86+.elf

Found memtest86+ image: /memtest86+.bin

done

Setting up rocm-device-libs (0.0.1) …

Setting up rocm-smi (1.0.0-25-gbdb99b4) …

Setting up rocm-profiler (5.1.6400) …

Setting up rocm-dev (1.6.148) …

Setting up rocm (1.6.148) …

KERNEL=="kfd", MODE="0666"

(This last line is the udev rule that the installation writes so that non-root users can access the /dev/kfd device node.)

Reboot the server. Make sure that the Linux ROCm kernel is running:

Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.11.0-kfd-compute-rocm-rel-1.6-148 x86_64)

* Documentation: https://help.ubuntu.com

* Management: https://landscape.canonical.com

* Support: https://ubuntu.com/advantage

0 packages can be updated.

0 updates are security updates.

Test if your installation works with this sample program:

cd /opt/rocm/hsa/sample

make

./vector_copy

You should get an output similar to this:

Initializing the hsa runtime succeeded.

Checking finalizer 1.0 extension support succeeded.

Generating function table for finalizer succeeded.

Getting a gpu agent succeeded.

Querying the agent name succeeded.

The agent name is gfx803.

Querying the agent maximum queue size succeeded.

The maximum queue size is 131072.

Creating the queue succeeded.

"Obtaining machine model" succeeded.

"Getting agent profile" succeeded.

Create the program succeeded.

Adding the brig module to the program succeeded.

Query the agents isa succeeded.

Finalizing the program succeeded.

Destroying the program succeeded.

Create the executable succeeded.

Loading the code object succeeded.

Freeze the executable succeeded.

Extract the symbol from the executable succeeded.

Extracting the symbol from the executable succeeded.

Extracting the kernarg segment size from the executable succeeded.

Extracting the group segment size from the executable succeeded.

Extracting the private segment from the executable succeeded.

Creating a HSA signal succeeded.

Finding a fine grained memory region succeeded.

Allocating argument memory for input parameter succeeded.

Allocating argument memory for output parameter succeeded.

Finding a kernarg memory region succeeded.

Allocating kernel argument memory buffer succeeded.

Dispatching the kernel succeeded.

Passed validation.

Freeing kernel argument memory buffer succeeded.

Destroying the signal succeeded.

Destroying the executable succeeded.

Destroying the code object succeeded.

Destroying the queue succeeded.

Freeing in argument memory buffer succeeded.

Freeing out argument memory buffer succeeded.

Shutting down the runtime succeeded.

Installing Docker

We are installing the Docker Community Edition (also called Docker CE) on the host by using Docker’s apt repository. Our procedure is based on documentation published by Docker.8 There may be some slight differences from the original documentation. Note that the installation is done as the superuser. You can also use sudo to install Docker.

First, remove old versions of Docker:

apt remove docker docker-engine

If they are not installed, you will simply get a message that they are missing.

Install the following prerequisite packages using apt:

apt-transport-https

ca-certificates

curl

software-properties-common

Add the Docker GPG key to your host:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg |

sudo apt-key add -

The GPG fingerprint should be 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88. Use the command

apt-key fingerprint 0EBFCD88

to verify this.

Now add the repository information:

add-apt-repository \

"deb [arch=amd64] https://download.docker.com/linux/ubuntu \

$(lsb_release -cs) \

stable"

Finally, issue the command ‘apt update’.

Installing Docker CE should be done with ‘apt install docker-ce’. After the installation is complete, verify that Docker is properly configured and installed using the command ‘docker run hello-world’.

Running ROCm Docker Images

AMD provides a Docker image of the ROCm software framework.9 The image can be pulled from the official Docker repository:

sudo docker pull rocm/rocm-terminal

The image is about 1.5 GB in size and contains the necessary libraries to run ROCm-based applications. Create a container out of this image and look at the installed software in /opt/rocm:

sudo docker run -it --rm --device=/dev/kfd rocm/rocm-terminal

You can check for the ROCm libraries using ldconfig:

ldconfig -NXv

The command above should list all the libraries in the library path including the ROCm libraries.
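If you only want to confirm the ROCm pieces are present rather than read the whole listing, simply looking under /opt/rocm from inside the container is a quicker check (the stock image installs the libraries there, as noted above):

ls /opt/rocm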

The ROCm-docker source is available from GitHub:

mkdir ~/tmp

cd ~/tmp

git clone https://github.com/RadeonOpenCompute/ROCm-docker.git

Creating A ROCm Application Docker Image

We can use the rocm/rocm-terminal Docker image to build our own ROCm application Docker image. In the following examples, we use a couple of the sample applications that come with the ROCm development package. One of them is /opt/rocm/hip/samples/1_Utils/hipInfo.

Assuming the host has the complete ROCm development tools, we just do the following:

cd /opt/rocm/hip/samples/1_Utils/hipInfo

make

The outcome of the make command shall be a binary called hipInfo.

If the compiler complains because of a missing shared library called libsupc++, we will need to install that somewhere in the host’s library path. In our case, we shall place the shared library in /usr/local/lib and make sure that ldconfig can find it. You can simply create a shared library from the installed static library /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a:

mkdir -p ~/tmp/libsupc++

cd ~/tmp/libsupc++

ar x /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a

ls -l *.o

gcc -shared -o libsupc++.so *.o

sudo cp -p libsupc++.so /usr/local/lib/

sudo ldconfig -v

Make sure that /usr/local/lib is seen by ldconfig. You may have to specify this directory in /etc/ld.so.conf.d if it is not found. Simply add a file named local_lib.conf with the line /usr/local/lib by itself.
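One way to do this is:

echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/local_lib.conf

sudo ldconfig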

Check the output of hipInfo by running it. You should get something like this (it will be slightly different from the literal output below depending on what type of GPU configuration you have):

$ ./hipInfo

compiler: hcc version=1.0.17312-d1f4a8a-19aa706-56b5abe, workweek (YYWWD) = 17312

--------------------------------------------------------------------------------

device# 0

Name: Device 67df

pciBusID: 1

pciDeviceID: 0

multiProcessorCount: 36

maxThreadsPerMultiProcessor: 2560

isMultiGpuBoard: 1

clockRate: 1303 Mhz

memoryClockRate: 2000 Mhz

memoryBusWidth: 256

clockInstructionRate: 1000 Mhz

totalGlobalMem: 8.00 GB

maxSharedMemoryPerMultiProcessor: 8.00 GB

totalConstMem: 16384

sharedMemPerBlock: 64.00 KB

regsPerBlock: 0

warpSize: 64

l2CacheSize: 0

computeMode: 0

maxThreadsPerBlock: 1024

maxThreadsDim.x: 1024

maxThreadsDim.y: 1024

maxThreadsDim.z: 1024

maxGridSize.x: 2147483647

maxGridSize.y: 2147483647

maxGridSize.z: 2147483647

major: 2

minor: 0

concurrentKernels: 1

arch.hasGlobalInt32Atomics: 1

arch.hasGlobalFloatAtomicExch: 1

arch.hasSharedInt32Atomics: 1

arch.hasSharedFloatAtomicExch: 1

arch.hasFloatAtomicAdd: 0

arch.hasGlobalInt64Atomics: 1

arch.hasSharedInt64Atomics: 1

arch.hasDoubles: 1

arch.hasWarpVote: 1

arch.hasWarpBallot: 1

arch.hasWarpShuffle: 1

arch.hasFunnelShift: 0

arch.hasThreadFenceSystem: 0

arch.hasSyncThreadsExt: 0

arch.hasSurfaceFuncs: 0

arch.has3dGrid: 1

arch.hasDynamicParallelism: 0

peers:

non-peers: device#0

memInfo.total: 8.00 GB

memInfo.free: 7.75 GB (97%)

Now that hipInfo is compiled and has been tested, let us create a Docker image with it. Create a directory for building an image with Docker.

mkdir ~/tmp/my_rocm_hipinfo

cd ~/tmp/my_rocm_hipinfo

Copy the necessary files for the Docker image to run properly:

cp -p /usr/local/lib/libsupc++.so . # If hipInfo needs this

cp -p /opt/rocm/hip/samples/1_Utils/hipInfo/hipInfo .

Create a file named Dockerfile in the current directory. It should contain this:

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/

COPY hipInfo /usr/local/bin/

RUN sudo ldconfig

USER rocm-user

WORKDIR /home/rocm-user

ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/bin"

ENTRYPOINT ["hipInfo"]

Build the Docker image:

sudo docker build -t my_rocm_hipinfo .

Create and run a container based on the new image:

sudo docker run --rm --device="/dev/kfd" my_rocm_hipinfo

The device /dev/kfd is the kernel fusion driver. You should be getting a similar output as if you ran the hipInfo binary directly on the host.

Without the --rm parameter, the container will persist. You can then run the same container again and get some output:

sudo docker run --device="/dev/kfd" --name nifty_hugle my_rocm_hipinfo

Confirm that the container persists:

sudo docker ps -a

The output should list the stopped container (named nifty_hugle). Now, try this command and you should see the output from hipInfo again:

sudo docker start -i nifty_hugle
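When you no longer need the persisted container, you can remove it by name (the image itself is kept for future runs):

sudo docker rm nifty_hugle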

The second Docker image we shall create will contain the sample binary called vector_copy. The source is in /opt/rocm/hsa/sample. As done with hipInfo, use make to build the binary. Note that this binary also depends on the files with the .brig extension to run.
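Assuming the sample ships with a Makefile like the hipInfo sample does, building it looks something like this:

cd /opt/rocm/hsa/sample

make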

We do the following before we build the image:

mkdir ~/tmp/my_rocm_vectorcopy

cd ~/tmp/my_rocm_vectorcopy

mkdir vector_copy

cp -p /usr/local/lib/libsupc++.so . # Do this if necessary

cd vector_copy

cp -p /opt/rocm/hsa/sample/vector_copy .

cp -p /opt/rocm/hsa/sample/vector_copy*.brig .

cd .. # Back to ~/tmp/my_rocm_vectorcopy

For our Dockerfile, we have this:

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/

RUN sudo mkdir /usr/local/vector_copy

COPY vector_copy/* /usr/local/vector_copy/

RUN sudo ldconfig

USER rocm-user

ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/vector_copy"

WORKDIR /usr/local/vector_copy

ENTRYPOINT ["vector_copy"]

Building the Docker image for vector_copy should be familiar by now.

As an exercise, run the Docker image to see what output you get. Try with or without --rm and with the 'docker start' command.
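For reference, the build-and-run sequence mirrors the hipInfo image (the image name my_rocm_vectorcopy is simply our choice):

sudo docker build -t my_rocm_vectorcopy .

sudo docker run --rm --device="/dev/kfd" my_rocm_vectorcopy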

For our last example, we shall use a Docker container for the Caffe deep learning framework. We are going to use the HIP port of Caffe, which can target both AMD ROCm and Nvidia CUDA devices.10 HIP is what makes converting CUDA code to portable C++ possible. For more information on HIP, see https://github.com/ROCm-Developer-Tools/HIP.

Let us pull the hip-caffe image from the Docker registry:

docker pull intuitionfabric/hip-caffe

Test the image by running a device query on the AMD GPUs:

sudo docker run --name my_caffe -it --device=/dev/kfd --rm \

intuitionfabric/hip-caffe ./build/tools/caffe device_query -gpu all

You should get an output similar to the one below. Note that your output may differ due to your own host configuration.

I0831 19:05:30.814853 1 caffe.cpp:138] Querying GPUs all

I0831 19:05:30.815135 1 common.cpp:179] Device id: 0

I0831 19:05:30.815145 1 common.cpp:180] Major revision number: 2

I0831 19:05:30.815148 1 common.cpp:181] Minor revision number: 0

I0831 19:05:30.815153 1 common.cpp:182] Name: Device 67df

I0831 19:05:30.815158 1 common.cpp:183] Total global memory: 8589934592

I0831 19:05:30.815178 1 common.cpp:184] Total shared memory per block: 65536

I0831 19:05:30.815192 1 common.cpp:185] Total registers per block: 0

I0831 19:05:30.815196 1 common.cpp:186] Warp size: 64

I0831 19:05:30.815201 1 common.cpp:188] Maximum threads per block: 1024

I0831 19:05:30.815207 1 common.cpp:189] Maximum dimension of block: 1024, 1024, 1024

I0831 19:05:30.815210 1 common.cpp:192] Maximum dimension of grid: 2147483647, 2147483647, 2147483647

I0831 19:05:30.815215 1 common.cpp:195] Clock rate: 1303000

I0831 19:05:30.815219 1 common.cpp:196] Total constant memory: 16384

I0831 19:05:30.815223 1 common.cpp:200] Number of multiprocessors: 36

Let us now run Caffe in a container. We begin by creating a container for this purpose.

sudo docker run -it --device=/dev/kfd --rm intuitionfabric/hip-caffe

Once the above command is executed, you should be inside the container, where you can run the MNIST example.

First, get the raw MNIST data:

./data/mnist/get_mnist.sh

Make sure you format the data for Caffe:

./examples/mnist/create_mnist.sh

Once that’s done, proceed with training the network:

./examples/mnist/train_lenet.sh

You should get an output similar to this:

I0831 18:43:19.290951 37 caffe.cpp:217] Using GPUs 0

I0831 18:43:19.291165 37 caffe.cpp:222] GPU 0: Device 67df

I0831 18:43:19.294853 37 solver.cpp:48] Initializing solver from parameters:

test_iter: 100

test_interval: 500

base_lr: 0.01

display: 100

max_iter: 10000

lr_policy: “inv”

gamma: 0.0001

power: 0.75

momentum: 0.9

weight_decay: 0.0005

snapshot: 5000

snapshot_prefix: “examples/mnist/lenet”

solver_mode: GPU

device_id: 0

net: “examples/mnist/lenet_train_test.prototxt”

train_state {

level: 0

stage: “”

}

I0831 18:43:19.294972 37 solver.cpp:91] Creating training net from net file: examples/mnist/lenet_train_test.prototxt

I0831 18:43:19.295145 37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist

I0831 18:43:19.295169 37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy

I0831 18:43:19.295181 37 net.cpp:58] Initializing net from parameters:

name: “LeNet”

state {

phase: TRAIN

level: 0

stage: “”

}

layer {

name: “mnist”

type: “Data”

top: “data”

top: “label”

include {

phase: TRAIN

}

transform_param {

scale: 0.00390625

}

data_param {

source: “examples/mnist/mnist_train_lmdb”

batch_size: 64

backend: LMDB

}

}

layer {

name: “conv1”

type: “Convolution”

bottom: “data”

top: “conv1”

param {

lr_mult: 1

}

param {

lr_mult: 2

}

convolution_param {

num_output: 20

kernel_size: 5

stride: 1

weight_filler {

type: “xavier”

}

bias_filler {

type: “constant”

}

}

}

….….layer {

name: “loss”

type: “SoftmaxWithLoss”

bottom: “ip2”

bottom: “label”

top: “loss”

}

I0831 18:43:19.295332 37 layer_factory.hpp:77] Creating layer mnist

I0831 18:43:19.295426 37 net.cpp:100] Creating Layer mnist

I0831 18:43:19.295444 37 net.cpp:408] mnist -> data

I0831 18:43:19.295478 37 net.cpp:408] mnist -> label

I0831 18:43:19.304414 40 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb

I0831 18:43:19.304760 37 data_layer.cpp:41] output data size: 64,1,28,28

I0831 18:43:19.305835 37 net.cpp:150] Setting up mnist

I0831 18:43:19.305842 37 net.cpp:157] Top shape: 64 1 28 28 (50176)

I0831 18:43:19.305848 37 net.cpp:157] Top shape: 64 (64)

I0831 18:43:19.305851 37 net.cpp:165] Memory required for data: 200960

I0831 18:43:19.305874 37 layer_factory.hpp:77] Creating layer conv1

I0831 18:43:19.305907 37 net.cpp:100] Creating Layer conv1

I0831 18:43:19.305912 37 net.cpp:434] conv1 <- data

I0831 18:43:19.305940 37 net.cpp:408] conv1 -> conv1

I0831 18:43:19.314159 37 cudnn_conv_layer.cpp:259] Before miopenConvolution*GetWorkSpaceSize

I0831 18:43:19.319051 37 cudnn_conv_layer.cpp:295] After miopenConvolution*GetWorkSpaceSize

I0831 18:43:19.319625 37 cudnn_conv_layer.cpp:468] Before miopenFindConvolutionForwardAlgorithm

I0831 18:43:19.927783 37 cudnn_conv_layer.cpp:493] fwd_algo_[0]: 1

I0831 18:43:19.927809 37 cudnn_conv_layer.cpp:494] workspace_fwd_sizes_[0]:57600

I0831 18:43:19.928071 37 cudnn_conv_layer.cpp:500] Before miopenFindConvolutionBackwardWeightsAlgorithm

….….I0831 18:43:23.296785 37 net.cpp:228] mnist does not need backward computation.

I0831 18:43:23.296789 37 net.cpp:270] This network produces output loss

I0831 18:43:23.296799 37 net.cpp:283] Network initialization done.

I0831 18:43:23.296967 37 solver.cpp:181] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt

I0831 18:43:23.296985 37 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist

I0831 18:43:23.296995 37 net.cpp:58] Initializing net from parameters:

name: “LeNet”

state {

phase: TEST

}

layer {

name: “mnist”

type: “Data”

top: “data”

top: “label”

include {

phase: TEST

}

transform_param {

scale: 0.00390625

}

data_param {

source: “examples/mnist/mnist_test_lmdb”

batch_size: 100

backend: LMDB

}

}……

I0831 18:44:12.620506 37 solver.cpp:404] Test net output #1: loss = 0.0299084 (* 1 = 0.0299084 loss)

I0831 18:44:12.624415 37 solver.cpp:228] Iteration 9000, loss = 0.011652

I0831 18:44:12.624441 37 solver.cpp:244] Train net output #0: loss = 0.011652 (* 1 = 0.011652 loss)

I0831 18:44:12.624449 37 sgd_solver.cpp:106] Iteration 9000, lr = 0.00617924

I0831 18:44:13.055759 37 solver.cpp:228] Iteration 9100, loss = 0.0061008

I0831 18:44:13.055778 37 solver.cpp:244] Train net output #0: loss = 0.0061008 (* 1 = 0.0061008 loss)

I0831 18:44:13.055800 37 sgd_solver.cpp:106] Iteration 9100, lr = 0.00615496

I0831 18:44:13.497696 37 solver.cpp:228] Iteration 9200, loss = 0.00277705

I0831 18:44:13.497715 37 solver.cpp:244] Train net output #0: loss = 0.00277706 (* 1 = 0.00277706 loss)

I0831 18:44:13.497720 37 sgd_solver.cpp:106] Iteration 9200, lr = 0.0061309

I0831 18:44:13.941920 37 solver.cpp:228] Iteration 9300, loss = 0.0111398

I0831 18:44:13.941941 37 solver.cpp:244] Train net output #0: loss = 0.0111398 (* 1 = 0.0111398 loss)

I0831 18:44:13.941946 37 sgd_solver.cpp:106] Iteration 9300, lr = 0.00610706

I0831 18:44:14.386647 37 solver.cpp:228] Iteration 9400, loss = 0.0179196

I0831 18:44:14.386667 37 solver.cpp:244] Train net output #0: loss = 0.0179195 (* 1 = 0.0179195 loss)

I0831 18:44:14.386672 37 sgd_solver.cpp:106] Iteration 9400, lr = 0.00608343

I0831 18:44:14.828459 37 solver.cpp:337] Iteration 9500, Testing net (#0)

I0831 18:44:14.983165 37 solver.cpp:404] Test net output #0: accuracy = 0.9884

I0831 18:44:14.983183 37 solver.cpp:404] Test net output #1: loss = 0.0393952 (* 1 = 0.0393952 loss)

I0831 18:44:14.987198 37 solver.cpp:228] Iteration 9500, loss = 0.00496538

I0831 18:44:14.987211 37 solver.cpp:244] Train net output #0: loss = 0.00496537 (* 1 = 0.00496537 loss)

I0831 18:44:14.987217 37 sgd_solver.cpp:106] Iteration 9500, lr = 0.00606002

I0831 18:44:15.433176 37 solver.cpp:228] Iteration 9600, loss = 0.00308157

I0831 18:44:15.433193 37 solver.cpp:244] Train net output #0: loss = 0.00308157 (* 1 = 0.00308157 loss)

I0831 18:44:15.433200 37 sgd_solver.cpp:106] Iteration 9600, lr = 0.00603682

I0831 18:44:15.878787 37 solver.cpp:228] Iteration 9700, loss = 0.00220143

I0831 18:44:15.878806 37 solver.cpp:244] Train net output #0: loss = 0.00220143 (* 1 = 0.00220143 loss)

I0831 18:44:15.878813 37 sgd_solver.cpp:106] Iteration 9700, lr = 0.00601382

I0831 18:44:16.321408 37 solver.cpp:228] Iteration 9800, loss = 0.0108761

I0831 18:44:16.321426 37 solver.cpp:244] Train net output #0: loss = 0.0108761 (* 1 = 0.0108761 loss)

I0831 18:44:16.321432 37 sgd_solver.cpp:106] Iteration 9800, lr = 0.00599102

I0831 18:44:16.765200 37 solver.cpp:228] Iteration 9900, loss = 0.00478531

I0831 18:44:16.765219 37 solver.cpp:244] Train net output #0: loss = 0.00478531 (* 1 = 0.00478531 loss)

I0831 18:44:16.765226 37 sgd_solver.cpp:106] Iteration 9900, lr = 0.00596843

I0831 18:44:17.204908 37 solver.cpp:454] Snapshotting to binary proto file examples/mnist/lenet_iter_10000.caffemodel

I0831 18:44:17.208767 37 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_10000.solverstate

I0831 18:44:17.211735 37 solver.cpp:317] Iteration 10000, loss = 0.0044067

I0831 18:44:17.211750 37 solver.cpp:337] Iteration 10000, Testing net (#0)

I0831 18:44:17.364528 37 solver.cpp:404] Test net output #0: accuracy = 0.9902

I0831 18:44:17.364547 37 solver.cpp:404] Test net output #1: loss = 0.0303562 (* 1 = 0.0303562 loss)

I0831 18:44:17.364552 37 solver.cpp:322] Optimization Done.

I0831 18:44:17.364555 37 caffe.cpp:254] Optimization Done.
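As an optional extra step, the snapshot written at iteration 10000 can be evaluated with Caffe’s standard test command. This is not part of the training script; it assumes the usual Caffe command-line tool, and the iteration count of 100 simply matches the solver’s test_iter setting:

./build/tools/caffe test -model examples/mnist/lenet_train_test.prototxt \

-weights examples/mnist/lenet_iter_10000.caffemodel -gpu 0 -iterations 100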

Conclusion

In this article, we provided you with a guide on how to use AMD’s ROCm framework with Docker container technology. This should serve as a good jumpstart for beginning your Deep Learning development on AMD’s platform.

Docker has become an essential technology for containing the complexity of Deep Learning development. Deep Learning frameworks and tools have many dependencies, and using Docker to isolate those dependencies within a Linux container leads not only to greater reliability and robustness but also to greater agility and flexibility. With many frameworks and tools still emerging, it is best practice to have a robust way of managing these disparate parts. Docker containers have become standard practice in Deep Learning, and the technology is well supported by AMD’s ROCm framework.

FOOTNOTES:

1. import.io. Andrew Ng, Chief Scientist at Baidu, 2015. https://youtu.be/O0VN0pGgBZM.

2. Smith, Ryan. “AMD Announces Radeon Instinct: GPU Accelerators for Deep Learning, Coming In 2017.” AnandTech: Hardware News and Tech Reviews Since 1997, 12 Dec. 2016, http://www.anandtech.com/show/10905/amd-announces-radeon-instinct-deep-learning-2017/.

3. “ROCm. A New Era in GPU Computing.” ROCm, A New Era in Open GPU Computing, 16 Dec. 2016, https://rocm.github.io/index.html.

4. “RadeonOpenCompute/ROCR-Runtime.” GitHub, https://github.com/RadeonOpenCompute/ROCR-Runtime.

5. “ROCK-Kernel-Driver/README.md at Roc-1.6.0.” GitHub.com, 16 Nov. 2016, https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/roc-1.6.x/README.md.

6. “What Is Docker?” Docker - Build, Ship, and Run Any App, Anywhere, https://www.docker.com/what-docker.

7. “ROCm-Docker.” GitHub - ROCM-Docker, https://github.com/RadeonOpenCompute/ROCm-docker. Accessed 24 Mar. 2017.

8. “Get Docker for Ubuntu.” Docker - Build, Ship, and Run Any App, Anywhere, https://docs.docker.com/engine/installation/linux/ubuntu/. Accessed 27 Mar. 2017.

9. “ROCm-Docker.” GitHub - ROCM-Docker, https://github.com/RadeonOpenCompute/ROCm-docker. Accessed 24 Mar. 2017.

10. “hipCaffe: The HIP Port of Caffe.” GitHub.com, https://github.com/ROCmSoftwarePlatform/hipCaffe/blob/hip/README.ROCm.md. Accessed 01 Jun. 2017.


[Originally posted on 06/20/17 by Ogi Brkic]

Back in December 2016, we first announced our Radeon Instinct initiative, combining our strength in compute with our dedication to open software. We later announced our Radeon Vega Frontier Edition, an enabler of Radeon Instinct.

Today, we’re excited to tell you about the next chapter in our vision for instinctive computing. AMD’s Radeon Instinct™ accelerators will soon ship to our partners (including Boxx, Colfax, Exxact Corporation, Gigabyte, Inventec and Supermicro, among others) and power their deep learning and HPC solutions starting in Q3 2017.

Artificial intelligence and machine learning are changing the world in ways we never could have imagined only a few years ago, enabling life-changing breakthroughs that can solve previously unsolvable problems. Radeon Instinct™ MI25, MI8, and MI6, together with AMD’s open ROCm 1.6 software platform, can dramatically increase performance, efficiency, and ease of implementation, speeding through deep learning inference and training workloads. We’re not just looking to accelerate the drive to machine intelligence, but to power the next era of true heterogeneous compute.

New Radeon Instinct Accelerators

Through our Radeon Instinct server accelerator products and open ecosystem approach, we’re able to offer our customers cost-effective machine and deep learning training, edge-training and inference solutions, where workloads can take the most advantage of the GPU’s highly parallel computing capabilities.

We’ve also designed the three initial Radeon Instinct accelerators to address a wide range of machine intelligence applications, which includes data-centric HPC-class systems in academics, government labs, energy, life science, financial, automotive and other industries:


The Radeon Instinct™ MI25 accelerator, based on the new “Vega” GPU architecture with a 14nm FinFET process, will be the world’s ultimate training accelerator for large-scale machine intelligence and deep learning datacenter applications. The MI25 will deliver superior FP16 and FP32 performance in a passively-cooled single GPU server card with 24.6 TFLOPS of FP16 or 12.3 TFLOPS of FP32 peak performance through its 64 compute units (4,096 stream processors). With 16GB of ultra-high bandwidth HBM2 ECC GPU memory and up to 484 GB/s of memory bandwidth, the Radeon Instinct MI25’s design is optimized for massively parallel applications with large datasets for Machine Intelligence and HPC-class systems.


The Radeon Instinct™ MI8 accelerator, harnessing the high performance and energy efficiency of the “Fiji” GPU architecture, is a small form factor HPC and inference accelerator with 8.2 TFLOPS of peak FP16|FP32 performance at less than 175W board power and 4GB of High-Bandwidth Memory (HBM) on a 512-bit memory interface. The MI8 is well suited for machine learning inference and HPC applications.


The Radeon Instinct™ MI6 accelerator, based on the acclaimed “Polaris” GPU architecture, is a passively cooled inference accelerator with 5.7 TFLOPS of peak FP16|FP32 performance at 150W board power and 16GB of ultra-fast GDDR5 GPU memory on a 256-bit memory interface. The MI6 is a versatile accelerator ideal for HPC and machine learning inference and edge-training deployments.

Radeon Instinct hardware is fueled by our open-source software platform, including:

  • Planned for a June 29th rollout, the ROCm 1.6 software platform adds performance improvements and support for MIOpen 1.0. It is scalable and fully open source, providing a flexible, powerful heterogeneous compute solution for a new class of hybrid Hyperscale and HPC-class systems. Comprised of an open-source Linux® driver optimized for scalable multi-GPU computing, the ROCm software platform provides multiple programming models, the HIP CUDA conversion tool, and support for GPU acceleration using the Heterogeneous Computing Compiler (HCC).

  • The open-source MIOpen GPU-accelerated library, available June 29th with the ROCm platform, supports machine intelligence frameworks, with planned support for Caffe®, TensorFlow® and Torch®.

Revolutionizing the Datacenter with “Zen”-based Epyc™ Servers and Radeon Instinct Accelerators

The Radeon Instinct MI25, combined with our new “Zen”-based Epyc servers and the revolutionary ROCm open software platform, will provide a progressive approach to open heterogeneous compute and machine learning from the metal forward.

We plan to ship Radeon Instinct products to our technology partners in Q3 for design in their deep learning and HPC solutions, giving customers a real choice of vendors for open, scale-out machine learning solutions.

For more details and specifications on these cards, please check out the product pages below.

Radeon Instinct MI25

Radeon Instinct MI8

Radeon Instinct MI6


[Originally posted on 07/30/17 - by Mark Hirsch]

1 PetaFLOPS of Performance for the Ultimate Virtualization and Machine Intelligence Solution

Today at Capsaicin SIGGRAPH, AMD showcased what can be achieved when the world’s greatest server CPU is combined with the world’s greatest GPU, based on AMD’s revolutionary “Vega” architecture. Developed by AMD in collaboration with Inventec, Project 47 is based on Inventec’s P-series massively parallel computing platform, and is a rack designed to excel in a range of tasks, from graphics virtualization to machine intelligence.

Project 47 boasts 1 PetaFLOPS of compute power at full 32-bit precision delivering a stunning 30 GigaFLOPS/W, demonstrating dramatic compute efficiency.1 It boasts more cores, threads, compute units, IO lanes and memory channels in use at one time than in any other similarly configured system ever before. The incredible performance-per-dollar and performance-per-watt of Project 47 makes supercomputing a more affordable reality than ever before, whether for machine learning, virtualization or rendering.


Project 47 is made up of a rack of individual servers, each harnessing one EPYC™ 7601 processor to drive up to four “Vega”-based Radeon Instinct™ MI25 accelerators using 128 PCIe® lanes, in contrast to the costly dual-CPU and PLX switch setups typically needed on competing platforms in order to run four GPUs. With Project 47, AMD showcased the ease with which multiple servers can be daisy-chained, demonstrating a rack of 20 servers running 20 EPYC SoCs and 80 Radeon Instinct MI25 accelerators.

To bring Project 47 to life, AMD worked closely with Samsung Electronics with respect to the HBM2 memory used across the “Vega”-based product lines including the Radeon Instinct MI25 accelerators. Samsung also provided high-performance NVMe SSD storage and high-speed DDR4 memory to enable the 1 PetaFLOPS of performance. AMD also collaborated with Mellanox Technologies, leveraging their InfiniBand solution to deliver 100Gb connectivity through the rack.

Project 47 is expected to be available from Inventec and their principal distributor AMAX in Q4 of this year.

Mark Hirsch, Corporate Vice President, Systems & Solutions for the Radeon Technologies Group at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

FOOTNOTES:

1. Project 47 has a total rack power of 34,200 Watts and delivers a performance of 1,027,600 GigaFLOPS for 30.05 GigaFLOPS/W in single precision performance.


[Originally posted on 11/16/17 by Carlos E. Perez]

AMD’s newly released Vega architecture has several unique features that can be leveraged in Deep Learning training and inference workloads.

The first noteworthy feature is the capability to perform FP16 at twice the speed of FP32, and INT8 at four times the speed of FP32. This translates to a peak performance of 24 teraflops on FP16 and 48 trillion operations per second on INT8. Deep Learning workloads are known to work well with lower-precision arithmetic. It is as if AMD’s architects were aware of this reality and designed Vega to exploit this characteristic. The second noteworthy feature of Vega is its new memory architecture, which permits the addressability of up to 512GB of memory. The third benefit is favorable coupling with AMD’s Threadripper and EPYC lines of microprocessors.

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. The current revival of interest in all things “Artificial Intelligence” (AI) is driven by the spectacular results achieved with deep learning. There are other AI technologies, like expert systems, semantic knowledge bases, logic programming and Bayesian systems, but most of classical AI has not changed much, if at all, in the last 5 years. The recent quantum leap has been driven disproportionately by deep learning progress.

When Google embarked on converting their natural language translation software into using deep learning, they were surprised to discover major gains. This was best described in a recent article published in the New York Times, “The Great AI Awakening”:

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

Let’s pause to recognize what happened at Google. Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, their average gain for improvement per year was only 0.4%. In Google’s first implementation, the improvement due to DL was 7 percentage points better.

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

ROCm

ROCm is software that supports High Performance Computing (HPC) workloads on AMD hardware. ROCm includes a C/C++ compiler called the Heterogeneous Compute Compiler (HCC). HCC is based on the open-source LLVM compiler infrastructure project. The HCC compiler supports the direct generation of the native Radeon GPU instruction set (known as GCN ISA). Targeting native GPU instructions is crucial to getting maximum performance. All the libraries under ROCm support GCN ISA.

Included with the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. The HCC compiler is based on previous work in heterogeneous computing at the HSA foundation. The design allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

[Diagram: the ROCm software stack and its components]

The diagram above depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code. It uses different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GCN ISA assembler can also [1] be used as a source for the GCN target.

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with HSA diagram). The application communicates with the ROCr driver that resides in user space in Linux. The ROCr driver uses a low latency mechanism (packet based AQL) to coordinate with the ROCk Kernel Driver.

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.
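As a rough sketch of that workflow (the conversion script has shipped under slightly different names across HIP releases, such as hipify or hipify-perl, so treat the tool and file names here as illustrative):

hipify-perl my_kernel.cu > my_kernel.cpp   # translate CUDA API calls to their HIP equivalents

hipcc my_kernel.cpp -o my_kernel   # compile the portable source for the installed GPU backend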


The ROCm initiative provides the handcrafted libraries and assembly language tooling that allow developers to extract every ounce of performance from AMD hardware. This includes rocBLAS, a BLAS implementation written from scratch with a HIP interface. AMD also provides an FFT library called rocFFT, likewise written with HIP interfaces. MIOpen is a native library tuned for Deep Learning workloads; it is AMD’s alternative to Nvidia’s cuDNN library and includes Radeon GPU-specific optimizations.

hipCaffe

AMD currently has ported Caffe to run using the ROCm stack. You can try examples here. I ran some benchmarks found here and here is a chart of the results:

[Chart: hipCaffe benchmark results]

Caffe is run on unspecified GPU hardware.

I don’t know the specific hardware that was used in these benchmarks; however, the comparison does show that the performance improvement is quite significant relative to the alternatives. One thing to observe is that the speedup is most impressive with a complex network like GoogleNet as compared to a simpler one like VGG. This is a reflection of the amount of hand-tuning that AMD has done on the MIOpen library.

Deep Learning Standard Virtual Machines

Deep learning frameworks like Caffe have internal computational graphs. These graphs specify the execution order of mathematical operations, similar to a dataflow. These frameworks use the graph to orchestrate its execution on groups of CPUs and GPUs. The execution is parallel and this is one reason why GPUs are ideal for this kind of computation. There are however plenty of untapped opportunities to improve the orchestration between the CPU and GPU.

The current state of Deep Learning frameworks is similar to the fragmented state of compilers before the creation of common code generation backends like LLVM. In the chaotic good old days, every programming language had to re-invent its own way of generating machine code. With the development of LLVM, many languages now share the same backend: well-known examples include Ada, C#, Common Lisp, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. The frontend only needs to parse and translate source code to an intermediate representation (IR).

Deep Learning frameworks will eventually need their own “IR”. The IR for Deep Learning is, of course, the computational graph. Frameworks like Caffe and TensorFlow already have internal computational graphs, and each framework is merely a convenient front end to its graph. As noted above, the graph specifies the execution order of mathematical operations and the orchestration of CPUs and GPUs, and there remain plenty of untapped opportunities to improve that orchestration.

New research is exploring ways to optimize the computational graph in a way that goes beyond just single device optimization and towards more global multi-device optimization. NNVM is one such framework that performs a computation graph optimization framework using an intermediate representation. The goal is for NNVM optimizers to reduce memory and device allocation while preserving the original computational semantics.

A more recent development is the port of NNVM to support AMD GPUs. The NNVM compiler can compile to the TVM stack, an end-to-end compilation stack that supports multiple backends. TVM compiles a high-level computation description written in the TVM frontend down to optimized native GPU code, leveraging an LLVM-based code generator in TVM and LLVM’s ROCm capabilities. This new project can be found at: https://github.com/ROCmSoftwarePlatform/nnvm-rocm.
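To experiment with the port, the repository can be cloned directly; build and usage instructions are maintained in the project itself:

git clone https://github.com/ROCmSoftwarePlatform/nnvm-rocm.git

cd nnvm-rocm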

The NNVM and TVM stacks perform optimizations in a global manner across either the computational graph or an alternative declarative specification. Conventional DL frameworks, however, have code generation and execution intertwined with their code base, making optimization solutions less portable. Ideally, one would like to see a common standard, a DL virtual machine instruction set, to which the community can collectively contribute optimization routines. Open Neural Network eXchange (ONNX) is one such standard. ONNX is a project supported by Facebook and Microsoft, which are building support for Caffe2, PyTorch and Cognitive Toolkit. The recent TVM port reveals the potential of AMD support for a wider range of DL frameworks:

[Diagram: the NNVM/TVM compilation stack with AMD GPU support]

TVM transforms the computational graph by minimizing memory, optimizing data layout and fusing computational kernels. It is a reusable framework that is designed to support multiple hardware back ends. NNVM provides a high-level intermediate representation for task scheduling and memory management, while TVM is a low-level IR for optimizing computation. A proof of concept showed that this approach of optimizing low-level operations led to around a 35% improvement over hand-engineered kernels. This end-to-end optimization, combined with AMD’s open-sourced computational libraries like MIOpen, is a very promising development.

Conclusion

There are many Deep Learning frameworks in existence today, each with its own strengths and weaknesses. The field is making good progress toward standardization that allows these frameworks to interoperate through a common Deep Learning virtual machine; ONNX is one of the more recent such standards.

In addition to standardization, global optimization of the computational graph found in Deep Learning frameworks is a means towards higher performance. The TVM framework and its integration with AMD’s LLVM based backend opens up the opportunity for end-to-end optimization of not only AMD GPUs but also the combination of CPUs and GPUs.


[Originally posted on 10/20/17]

The recent release of ROCm 1.6, which includes a cuDNN-like library called MIOpen and a port of the Caffe deep learning framework (the AMD version is called hipCaffe), has opened up the opportunity for running deep learning projects using AMD Radeon GPUs. In this article we demonstrate 6 projects that you can start using with AMD’s new hardware accelerators.

Most GPU-enabled deep learning frameworks rely on Nvidia’s CUDA and cuDNN libraries. AMD, however, is making an aggressive effort to port many deep learning frameworks such as Caffe, Torch, MXNet and TensorFlow to run on its hardware. Developers are now able to convert CUDA code to portable C++ code, thanks to AMD’s porting tools and libraries such as HIP.

The deep learning framework Caffe has recently been ported using HIP, allowing deep learning practitioners to run Caffe projects on AMD GPUs. This port can be downloaded from here.

1. Traffic Sign Recognition


An interesting image classification problem is the recognition of traffic signs. This project classifies 43 different German traffic signs. A data set of 50,000 images is used.

2. Image Synthesizer


University of Wyoming’s Evolving AI Lab has a project whose goal is to understand how deep neural networks (DNNs) work by synthesizing preferred stimuli that highly activate the neurons for a particular image. A deep generator network (DGN) is used as a prior for the DNN being studied. The DGN outputs synthetic images that are as similar as possible to real images from the ImageNet dataset.

Below are a few results from running the sample scripts in the project:

[Images: example results from running the project’s sample scripts]

The project’s paper is available here. The code needed to reproduce some of the results in the paper is on GitHub.

3. Traffic Light Detection


David Brailovsky from Israel writes in Medium about Recognizing Traffic Lights with Deep Learning (see here). Source code for his project can be found here.

4. Cat/Dog Classifier


This introductory tutorial by Adil Moujahid shows how to train a model and how to use a pre-existing model to distinguish cats from dogs in pictures. A Kaggle dataset is used for this tutorial. For the trained model, the BVLC CaffeNet Model is used.

The Caffe project already has pre-trained models (e.g. VGG and ImageNet-trained models) that can be used as a starting point for developing other kinds of image classification.

5. Visual Development Environment


Fabrik is an open source application for building, visualizing and training deep learning models. Fabrik provides simple drag-and-drop tools to streamline deep learning development. The application currently supports importing, editing and exporting of Caffe based models. This is a convenient way to view and edit your models.

6. Model Conversion Tools

Finally, there are vastly more projects that have been developed in frameworks other than Caffe. For these projects, there are tools that can convert models into ones compatible with Caffe. This GitHub project provides a listing of tools for converting one framework’s models into another’s.

MXNet to Caffe

The code from this GitHub repository allows you to convert an MXNet model to a Caffe model.

PyTorch to Caffe

This project allows you to convert between PyTorch, Caffe, and Darknet models.

Torch to Caffe

Facebook has a converter that converts Torch models to Caffe.

Summary

In this article, we explored several deep learning projects that you can now run using AMD Radeon Instinct hardware. We have included in this list projects that you can test out with minimal effort. There are other projects that have customized Caffe with elements like new kinds of layers and activation functions; for those, porting the CUDA-specific code using AMD’s HIP tooling may be required. Aside from the projects explored here, you can find other projects in the Caffe Model Zoo.

The smartest companies in the world are migrating their infrastructure to support this new paradigm. Daily, the press continues to report the amazing progress of AI. Furthermore, you hear about firms like Google and Microsoft changing their entire software DNA to move into AI. The reason for this massive migration is Deep Learning.

Deep Learning is supporting work by not only providing assistive capabilities, but also by enabling more creative generative capabilities. Assistive capabilities can happen in real time as well as in the backend. There are certain professions where the ability to curate and analyze information is extremely valuable. We can enhance these curation and analysis capabilities by reducing the deluge of information into smaller chunks that are more quickly digestible.

Generative capabilities are a new kind of capability that is becoming more pervasive. By now, we’ve all experienced the mobile app Prisma, which can re-render photographs in the style of different artists.

In this article, we highlighted several deep learning projects that explore both assistive and generative capabilities found in Deep Learning. We also covered some tools that allow you to port models from other projects as well as an IDE. Software that supports Radeon Instinct accelerators is still in its infancy. However, despite being out for just a few months, there are now plenty of interesting applications that can be used as a springboard to developing more complex solutions.

Albert J. De Vera and Carlos E. Perez are co-founders at Intuition Machine. They specialize in Deep Learning patterns, methodology and strategy. Many of their other writings on Artificial Intelligence can be found on Medium. Their postings are their own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.


[Originally posted on 04/03/17]

When a company starts using disruptive technology or a disruptive business model, the results can be spectacular and can leave the competition eating dust.

The reason for this is that although the company’s growth seems linear at first, it eventually reveals itself as being exponential. When a company reaches this point, it becomes very difficult, if not impossible, for competitors to catch up.

This article explores AMD’s open source deep learning strategy and explains the benefits of AMD’s ROCm initiative in accelerating deep learning development. It asks whether AMD’s competitors need to be concerned about the disruptive nature of what AMD is doing.

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. One author found it so revolutionary that he described it as “The Last Invention of Man” [KHAT] – strong words indeed!

Currently, the revival of interest in all things “Artificial Intelligence” (AI) is primarily due to the spectacular results achieved with deep learning research. I must, however, emphasize that this revival is not due to other classical AI technologies like expert systems, semantic knowledge bases, logic programming or Bayesian systems. Most of classical AI has not changed much, if at all, in the last 5 years. The recent quantum leap has been driven solely by deep learning successes.

For some perspective on the extent of deep learning development, look at this graph from Google that shows the adoption of deep learning technology in their applications:

[Graph: adoption of deep learning across Google applications over time]

Source: https://www.slideshare.net/HadoopSummit/machine-intelligence-at-google-scale-tensorflow

As you can see, the adoption at Google has been exponential and the statistics are likely similar for many of the other big Internet firms like Facebook and Microsoft.

When Google embarked on converting their natural language translation software into using deep learning, they were surprised to discover major gains. This was best described in a recent article published in the New York Times, “The Great AI Awakening” [LEW]:

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

Let’s pause to recognize what happened at Google.

Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, their average gain for improvement per year was only 0.4%. In Google’s first implementation, the improvement due to DL was 7 percentage points better.

This translates to more gains than the entire lifetime of improvements!

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

Deep Learning is unexpectedly, and disruptively, taking over the world

Google’s founder Sergey Brin, an extremely talented computer scientist himself, stated in a recent World Economic Forum [CHA] discussion that he did not foresee deep learning:

“The revolution in deep nets has been very profound, it definitely surprised me, even though I was sitting right there.”

Deep learning’s progress has been taking the academic community by storm. Two articles by practitioners of classical machine learning have summarized why they think DL is taking over the world. Chris Manning, a renowned expert in NLP, writes about the “Deep Learning Tsunami” [MAN]:

Deep learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse.

The same sentiment is expressed by Nicholas Paragios, who works in the field of computer vision. Paragios writes in “Computer Vision Research: the Deep Depression“ [PAR]:

It might be simply because deep learning on highly complex, hugely determined in terms of degrees of freedom graphs once endowed with massive amount of annotated data and unthinkable — until very recently — computing power can solve all computer vision problems. If this is the case, well it is simply a matter of time that industry (which seems to be already the case) takes over, research in computer vision becomes a marginal academic objective and the field follows the path of computer graphics (in terms of activity and volume of academic research).

Although I don’t want to detail the many deep learning developments of the past several years, Nell Watson provides a quick, short summary when she writes in “Artificial Intuition” [WAT]:

To sum up, machine intelligence can do a lot of creative things; it can mash up existing content [SHO], reframe it to fit a new context [PARK], fill in gaps in an appropriate fashion [CON], or generate potential solutions given a range of parameters [AUTO].

Make no mistake – Deep Learning is a “Disruptive” technology that is taking over operations of the most advanced technology companies in the world.

On Disruptiveness

Of late, the business world has become much more difficult and competitive. This situation has been made worse by disruptive changes in the global economy. The potential of nimbler competitors to disrupt the businesses of incumbents has never been greater. Peter Diamandis describes the Six D’s of Exponentials as consisting of the following:

  • Digitization – Anything that can be digitized can lead to the same exponential growth we find in computing. Anything that is digitized or virtualized is unencumbered by physical law; it thus costs less to mass produce and spreads faster.
  • Deception – Once digitized or virtualized, initial growth deceptively appears linear. However, given time, exponential growth becomes obvious. For many it is too late to react once growth of a competitor hits this transition.
  • Disruption – New markets that are more effective and less costly are created. Existing markets that are tied to the physical world will eventually become extinct. We’ve seen this in music, photography and many other areas.
  • Demonetization – As cost heads towards zero, so does the ability to solicit a payment for it. Thus, a business has to reinvent its revenue model, or come up with new ways of monetization.
  • Dematerialization – Physical products disappear and are replaced by a more convenient and accessible alternative.
  • Democratization — More people now have access to technology at a lower cost. The means of production have become more accessible to everyone. This access is no longer confined to the big corporation, or the wealthy. We see this fragmentation everywhere where producers are publishing their own books, music and videos. This feeds back into itself and smaller players become able to compete.

To survive this disruption, there is an ever-pressing need for enterprises to take drastic action by re-engineering how they run their businesses.

John Hagel proposes four kinds of platforms [HAG] that leverage networking effects as an organizational mechanism to combat disruptive technologies. The four platforms that Hagel proposes are Aggregation platforms (example: Marketplaces), Social platforms (example: Social Networks), Mobilization platforms (example: Complex supply chains) and Learning platforms.

Learning platforms

Learning platforms are dynamic and adaptive environments where people come together to collectively learn how to address complex problems. Members can connect to ask questions, share experiences and offer advice. An open source project that is actively managed with distributed source control, test-driven development, issue tracking, and continuous integration is a good example of a learning platform. The key ingredient is a learning mechanism that gets codified continuously. The fact that we find this in software development should not come as a surprise, as software development is essentially a learning process.

John Hagel describes an intriguing property of a Learning platform:

What if we change the assumption, though? What if each fax machine acquired more features and functions as it connected with more fax machines? What if its features multiplied at a faster rate as more fax machines joined the network? Now, we’d have a second level of network effect — we’d still have the network effects that come by simply increasing the number of fax machines, but now there’s an additional network effect that accrues as each fax machine adds more and more features as a result of interacting with other fax machines.

What Hagel is saying is that the members of the network adaptively become more effective and capable as a participant in the learning network. In other words, not only is there the conventional networking effect, but another mechanism kicks network effects into overdrive. Learning platforms such as an open source community can further accelerate the disruptiveness of an already disruptive technology.

Historically, an open source strategy has been quite effective in many disruptive technology areas. On the Internet, open source dominates: Linux (79%) in back-end infrastructure services, Google’s Chrome (58%) in browsers, Android (65%) in mobile, and Apache and Nginx (65% combined) in web servers. It should not surprise anyone when an open source strategy in the disruptive deep learning space eventually emerges as the dominant platform.

There are only a few semiconductor manufacturers that have the economies of scale to be competitive in high-performance computing. These are Nvidia, Intel, AMD, Qualcomm and Xilinx. We will now explore AMD’s deep learning solution and detail their unique open source strategy. We will also look at how it gives the company a competitive advantage.

Deep learning as a disruptive technology is critically enabled by hardware. AMD is one of the few semiconductor companies that actually exploits neural networks in its hardware: AMD’s SenseMI technology, introduced alongside the Infinity Fabric interconnect (an evolution of AMD HyperTransport), uses “perceptrons” to support branch prediction. AMD’s GPU hardware has always been competitive against Nvidia hardware, and when algorithms are extensively optimized, AMD hardware is in fact favored, as shown by the many cryptocurrency proof-of-work algorithms that have favored AMD hardware. Raja Koduri, head of AMD Radeon products, recently noted that AMD has had more compute per buck since 2005.

AMD’s Open Source Deep Learning Stack

Before we get into the detail of AMD’s deep learning stack, let’s look at the philosophy behind the development tooling. AMD, having a unique position of being both a CPU and GPU vendor, has been promoting the concept of a Heterogeneous System Architecture (HSA) for a number of years. Unlike most development tools from other vendors, AMD’s tooling is designed to support both their x86 based CPU and their GPU. AMD shares the HSA design and implementations in the HSA foundation (founded in 2012), a non-profit organization that has members including other CPU vendors like ARM, Qualcomm and Samsung.

The HSA foundation has an informative graphic that illustrates the HSA stack:

[Diagram: the HSA software stack]

As you can see, the middleware (i.e. HSA Runtime Infrastructure) provides an abstraction layer between the different kinds of compute devices that reside in a single system. One can think of this as a virtual machine that allows the same program to be run on both a CPU and a GPU.

In November 2015, AMD announced the ROCm initiative to support High Performance Computing (HPC) workloads, and to provide an alternative to Nvidia’s CUDA platform. The initiative released an open source 64-bit Linux driver (known as the ROCk Kernel Driver) and an extended (i.e. non-standard) HSA runtime (known as the ROCr Runtime). ROCm also inherits previous HSA innovations such as AQL packets, user-mode queues and context-switching.

ROCm also released a C/C++ compiler called the Heterogeneous Compute Compiler (HCC), targeted at supporting HPC applications. HCC is based on the open-source LLVM compiler infrastructure project [WIKI]. Many other open-source language implementations use LLVM, including Ada, C#, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. This rich ecosystem opens the possibility of alternative languages on the ROCm platform. One promising development of this kind is the Python implementation called Numba.

Added to the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. HCC supports other parallel programming APIs, but to avoid further confusion, I will not mention them here.

The HCC compiler is based on work at the HSA foundation. This allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.


AMD took the Caffe framework with 55,000 lines of optimized CUDA code and applied their HIP tooling. 99.6% of the 55,000 lines of code was translated automatically. The remaining code took a week to complete by a single developer. Once ported, the HIP code performed as well as the original CUDA version.

HIP is not 100% compatible with CUDA, but it does provide a migration path for developers to support an alternative GPU platform. This is great for developers who already have a large CUDA code base.

Early this year AMD decided to get even “closer to the metal” by announcing the “Lightning Compiler Initiative.” The HCC compiler now supports the direct generation of the Radeon GPU instruction set (known as GCN ISA) instead of HSAIL.

As we shall see later, directly targeting native GPU instructions is critical to getting higher performance. All the libraries under ROCm support GCN ISA.

[Diagram: the ROCm software stack and its components]

The diagram depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code. It uses different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GCN ISA assembler can also [1] be used as a source for the GCN target.

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with the HSA diagram above). The application communicates with the ROCr runtime, which resides in user space on Linux. The runtime uses a low-latency, packet-based mechanism (AQL) to coordinate with the ROCk kernel driver.
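To give a feel for what “packet based” means here, below is a simplified, illustrative sketch of the information an AQL kernel-dispatch packet carries. The field names follow the HSA kernel dispatch packet, but this is not the exact binary layout; the key point is that user-space code writes such a packet straight into a queue that the GPU hardware reads, with no system call on the dispatch path.

```cpp
// Simplified sketch of an AQL kernel-dispatch packet (illustrative only).
#include <cstdint>
#include <cstdio>

struct KernelDispatchPacketSketch {
    uint16_t header;                // packet type, memory fences, barrier bit
    uint16_t setup;                 // number of grid dimensions
    uint16_t workgroup_size_x, workgroup_size_y, workgroup_size_z;
    uint32_t grid_size_x, grid_size_y, grid_size_z;
    uint32_t private_segment_size;  // per-work-item scratch memory
    uint32_t group_segment_size;    // LDS (shared) memory per work-group
    uint64_t kernel_object;         // address of the compiled GCN kernel code
    uint64_t kernarg_address;       // address of the kernel argument block
    uint64_t completion_signal;     // signal the GPU updates when the kernel finishes
};

int main() {
    // In a real dispatch, the runtime fills one of these in and rings a doorbell.
    std::printf("dispatch packet sketch: %zu bytes\n", sizeof(KernelDispatchPacketSketch));
    return 0;
}
```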

This raises two key points about what is required for high-performance computation:

1. The ability to perform work at the assembly language level of a device.

2. The availability of highly optimized libraries.

In 2015, Pete Warden wrote “Why GEMM is at the heart of deep learning” [WAR] about the importance of optimized matrix libraries. BLAS (Basic Linear Algebra Subprograms) libraries are hand-optimized and trace their origins back decades to Fortran code. Warden writes:

The Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs.

This kind of attention to every individual memory access is hard to replicate despite our advances in compiler technology. Warden went even further in 2017 when he wrote “Why Deep Learning Needs Assembler Hackers” [WAR2]:

I spend a large amount of my time worrying about instruction dependencies and all the other hardware details that we were supposed to be able to escape in the 21st century.

Despite being a recent technology, the software that enables deep learning is a complex stack. A common perception is that most deep learning frameworks (e.g. TensorFlow, Torch, Caffe) are open source. The frameworks themselves are, but they are built on highly optimized kernels that are often proprietary. Developers go to great lengths to squeeze every ounce of performance from their hardware.

As an example, Scott Gray of Nervana Systems had to reverse engineer Nvidia’s instruction set [GRAY] to create an assembler:

I basically came to the conclusion that it was not possible to fully utilize the hardware I bought with the tools Nvidia provides. Nvidia, unfortunately, doesn’t believe in eating their own dog food and they hand assemble their library routines, rather than use ptxas like the rest of us have to.

Gray wrote his kernels in assembly language, creating algorithms that bested the proprietary alternatives. Now imagine how much less work he would have had to do if the instruction set had been available and documented. This is what AMD is bringing to the table.

The ROCm initiative provides the handcrafted libraries and assembly-language tooling that allow developers to extract every ounce of performance from AMD hardware. This includes rocBLAS [KNOX], an implementation of BLAS that currently provides the following capabilities:

BLAS Level-1:

  • amax, amin, asum, axpy, copy, dot, nrm2, scal, swap

BLAS Level-2:

  • gemv

BLAS Level-3:

  • gemm, trtri, batched-trtri

rocBLAS is implemented from scratch with a HIP interface. AMD even provides a tool called Tensile that supports benchmarking of rocBLAS, as well as an FFT library called rocFFT, likewise written with HIP interfaces.
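As a rough sketch of what calling rocBLAS looks like through its HIP interface (header and function names as published in the RadeonOpenCompute/rocBLAS repository; error checking omitted for brevity, and the header path may vary between releases):

```cpp
// Single-precision GEMM via rocBLAS: C = alpha*A*B + beta*C on the GPU.
#include <rocblas.h>
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const rocblas_int n = 256;                 // square n x n matrices for simplicity
    const size_t bytes = n * n * sizeof(float);
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    hipMalloc((void**)&dA, bytes);
    hipMalloc((void**)&dB, bytes);
    hipMalloc((void**)&dC, bytes);
    hipMemcpy(dA, hA.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), bytes, hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    hipMemcpy(hC.data(), dC, bytes, hipMemcpyDeviceToHost);
    std::printf("C[0] = %f\n", hC[0]);         // all-ones times all-twos: expect 2*n = 512

    rocblas_destroy_handle(handle);
    hipFree(dA);
    hipFree(dB);
    hipFree(dC);
    return 0;
}
```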

I wonder if Facebook’s fbcunn (Deep learning extensions for CUDA) [GIT], a library that employs FFTs to accelerate convolutions, could be ported using the HIP tooling.

Deep learning algorithms continue to evolve at a rapid pace. Early frameworks leaned on existing matrix multiplication libraries, finely tuned over decades of scientific computing. As research continued, newer kinds of algorithms were proposed.

Thus came the need to go beyond generic matrix multiplication. Convolutional networks arrived and brought with them even more innovative algorithms. Today, many of these algorithms are crafted by hand in assembly language.

Here is a partial list of deep-learning-specific optimizations performed by a proprietary library:

  • Activation functions: ReLU, Sigmoid, Tanh, Pooling, Softmax, Log Softmax
  • Higher-order tensor operations: ordering, striding, padding, subregions
  • Forward and backward convolutions: 2D, FFT, tiled, 3×3
  • Small data types: FP16, Half2
  • Normalization: batch, local response
  • Recurrent neural networks: LSTM

These low-level tweaks can lead to remarkable performance improvements. For some operations (e.g. batch normalization), performance can improve by as much as 14x over a non-optimized implementation.

AMD is set to release a library called MIOpen that includes these kinds of handcrafted optimizations. The library contains Radeon GPU-specific implementations of common operations and will likely cover many of those described above. MIOpen is scheduled for release in the first half of this year, coinciding with ROCm-enabled releases of popular deep learning frameworks such as Caffe, Torch7, and TensorFlow. This will allow application code that uses these frameworks to perform competitively on Radeon GPU hardware.

Many other state-of-the-art methods have not yet worked their way into proprietary deep learning libraries. New ones are proposed almost daily as papers appear on arXiv.

Here are just a few:

  • CReLU
  • PReLU
  • Hierarchical Softmax
  • Adaptive Softmax
  • Layer Normalization
  • Weight Normalization
  • Wasserstein Loss
  • Z-Loss

It would be very difficult for any vendor to keep up with such a furious pace. Today, with closed development tools, developers are forced to wait for the vendor when they would rather be doing the coding and optimization themselves. Fortunately, the open source ROCm initiative solves this problem.

ROCm includes an open source GCN ISA based assembler and disassembler.
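For a taste of what working at the GCN ISA level looks like, here is a hedged sketch of inline GCN assembly inside a HIP kernel. v_add_f32 is a real GCN vector-add instruction and the "v" constraint asks the LLVM AMDGPU backend for a vector register, but constraint support can vary between compiler versions, so treat this as illustrative.

```cpp
// Dropping to the GCN ISA from a HIP kernel via inline assembly (sketch).
#include <hip/hip_runtime.h>

__global__ void add_asm(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float result;
        // Equivalent to: result = a[i] + b[i], expressed as a single GCN instruction.
        asm volatile("v_add_f32 %0, %1, %2"
                     : "=v"(result)
                     : "v"(a[i]), "v"(b[i]));
        out[i] = result;
    }
}

int main() {
    // Compile-and-launch harness only; inputs are left uninitialized because the
    // point here is the inline assembly, not the arithmetic.
    const int n = 256;
    float *a, *b, *out;
    hipMalloc((void**)&a, n * sizeof(float));
    hipMalloc((void**)&b, n * sizeof(float));
    hipMalloc((void**)&out, n * sizeof(float));
    hipLaunchKernelGGL(add_asm, dim3(1), dim3(n), 0, 0, a, b, out, n);
    hipDeviceSynchronize();
    hipFree(a);
    hipFree(b);
    hipFree(out);
    return 0;
}
```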

System Wide Optimization

At a recent investor meeting, Intel shared some statistics:

Among servers used for deep learning applications, the chipmaker says that 91% use just Intel Xeon processors to handle the computations, 7% use Xeon processors paired with graphics processing units, while 2% use alternative architectures altogether.

The mix will change as the value of deep learning becomes better understood. The point here is that CPUs will always be required, even if most of the computation is performed by GPUs. That being said, it is important to recognize that system-wide optimizations are equally critical. This is where AMD’s original investments in Heterogeneous System Architecture may pay big dividends. I would also point out that new research efforts are underway to further optimize the code emitted by deep learning frameworks.

Deep learning frameworks like Caffe and TensorFlow have internal computational graphs. These graphs specify the execution order of mathematical operations, much like a dataflow. The frameworks use the graph to orchestrate execution across groups of CPUs and GPUs. The execution is highly parallel, which is one reason GPUs are so well suited to this kind of computation. There remain, however, plenty of untapped opportunities to improve the orchestration between CPU and GPU.

The current state of deep learning frameworks is similar to the state of compilers before a common code-generation backend like LLVM existed. In the past, every programming language had its own way of generating machine code. With LLVM, many languages now share the same backend; the frontend only needs to translate source code into an intermediate representation (IR). Deep learning frameworks will eventually need a similar IR, and that IR is the computational graph.
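To make the “computational graph as IR” idea concrete, here is a toy, framework-free sketch: nodes are operations, edges are data dependencies, and execution is a walk of the graph in dependency order. Everything here is illustrative; real frameworks attach shapes, device placements, and gradients to a much richer version of this structure.

```cpp
// Toy computational graph for y = relu(w*x + b), executed in topological order.
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::vector<std::string> inputs;                     // data dependencies (edges)
    std::function<float(const std::vector<float>&)> op;  // the math this node performs
};

int main() {
    // The nodes are listed in topological order, so execution is a single pass.
    std::vector<Node> graph = {
        {"x",    {},             [](const std::vector<float>&)   { return 3.0f; }},
        {"w",    {},             [](const std::vector<float>&)   { return 2.0f; }},
        {"b",    {},             [](const std::vector<float>&)   { return -10.0f; }},
        {"mul",  {"w", "x"},     [](const std::vector<float>& v) { return v[0] * v[1]; }},
        {"add",  {"mul", "b"},   [](const std::vector<float>& v) { return v[0] + v[1]; }},
        {"relu", {"add"},        [](const std::vector<float>& v) { return v[0] > 0 ? v[0] : 0.0f; }},
    };

    // A graph optimizer (XLA, NNVM, NGraph, ...) would rewrite this structure,
    // for example fusing mul+add+relu into one kernel, before any device runs it.
    std::map<std::string, float> values;
    for (const auto& node : graph) {
        std::vector<float> args;
        for (const auto& in : node.inputs) args.push_back(values[in]);
        values[node.name] = node.op(args);
    }
    std::printf("relu(w*x + b) = %f\n", values["relu"]);  // 2*3 - 10 = -4, relu -> 0
    return 0;
}
```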

New research is exploring ways to optimize the computational graph in a way that goes beyond just single device optimization and towards more global multi-device optimization.

An example of this is XLA (Accelerated Linear Algebra), a research project from the TensorFlow developers. XLA supports both Just-in-Time (JIT) and Ahead-of-Time (AOT) compilation. It is a high-level optimizer that works by optimizing the interplay between CPUs, GPUs and custom accelerators.

The optimizations planned include:

  • Fusing of pipelined operations
  • Aggressive constant propagation
  • Reduction of storage buffers
  • Fusing of low-level operators

Two other open source projects are also exploring computational graph optimization. NNVM, from the MXNet developers, is another computation graph optimization framework that, like XLA, provides an intermediate representation. The goal is for optimizers to reduce memory use and device allocation while preserving the original computational semantics.

NGraph from Intel is exploring optimizations that include:

  • Kernel fusion
  • Buffer allocation
  • Training optimizations
  • Inference optimizations
  • Data layout
  • Distributed training

There is certainly no shortage of ideas for improving performance.

AMD has developed a runtime framework, called the Asynchronous Task and Memory Interface (ATMI), that takes heterogeneous CPU-GPU systems into account. The ATMI runtime is driven by a declarative description of high-level tasks, and it handles scheduling and memory movement on the programmer’s behalf.

ATMI is also open source and could be used to drive deep learning computational graphs like those found in XLA, NNVM or NGraph. The future of deep learning software will revolve around a common computational graph, with optimizations that take the orchestration of the entire system into consideration.

Operations and Virtualization

So far we have been discussing opportunities to squeeze as much performance as possible out of the hardware, but there is more to a complete solution than raw performance.

Every complex system requires good manageability to ensure continued and sustained operations. The ROCm initiative does not overlook this need and provides open source implementations. ROC-smi, ROCm-Docker and ROCm-profiler are three open source projects that provide support for operations.

AMD’s GPU hardware and drivers have also been designed to support GPU virtualization (see: MxGPU). This permits GPU hardware to be shared by multiple users. I will discuss the operational aspects of AMD’s offerings in a future article.

Deployment

Throughout this article, we’ve discussed the promising aspects of the ROCm software stack. When the rubber meets the road, though, we need to discuss the kind of hardware that software will run on. There are many scenarios where it makes sense to deploy deep learning, and contrary to popular belief, not everything needs to reside in the cloud. Self-driving cars and universal translation devices need to operate without connectivity.

Deep learning also has two primary modes of operation: “training” and “inference.” For training, you want the biggest, fastest GPUs on the planet, and you want many of them. For inference, you still want speed, but the emphasis shifts to power efficiency. We don’t want to run our businesses into the ground paying for expensive power.

In summary, you want a variety of hardware that operates in different contexts, and that’s where AMD is in a good position. AMD recently announced some impressive hardware geared toward deep learning workloads. The product line is called Radeon Instinct and it consists of several GPU cards: the MI6, MI8, and MI25. The model number roughly corresponds to the card’s peak throughput: an MI6 can perform roughly 6 trillion floating-point operations per second (about 6 teraflops), as the rough arithmetic below shows.
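The arithmetic behind such a number is straightforward peak-rate math: lanes × clock × 2 FLOPs per fused multiply-add. The stream-processor count and clock below are approximate, illustrative figures for an MI6-class (Polaris-based) card, not official specifications.

```cpp
// Back-of-the-envelope peak FLOPS estimate (illustrative numbers, not specs).
#include <cstdio>

int main() {
    const double stream_processors = 2304;      // assumed shader/ALU lane count
    const double clock_ghz = 1.2;               // assumed boost clock
    const double flops_per_lane_per_clock = 2;  // one fused multiply-add = 2 FLOPs

    const double peak_tflops =
        stream_processors * clock_ghz * flops_per_lane_per_clock / 1000.0;
    std::printf("~%.1f TFLOPS peak\n", peak_tflops);  // roughly 5.5, i.e. "about 6"
    return 0;
}
```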

The Radeon Instinct MI6, with a planned 16GB of GDDR5 memory, is a low-cost inference and training solution. The MI8, with 4GB of HBM, is designed primarily for inference workloads. The MI25 is designed for large training workloads and will be based on the soon-to-be-released Vega architecture. Shuttling data back and forth between GPU and CPU is one of the bottlenecks in training deep learning systems, and Vega’s architecture, capable of addressing up to 512TB of memory, gives it a distinct advantage here.

There’s also a lot more to say about GPU and CPU integration, so I’ll only briefly mention a few points. On the server side, AMD has partnered with Supermicro and Inventec to come up with some impressive hardware. At the top of the line, the Inventec K888 (dubbed “Falconwitch”) is a 400-teraflop 4U monster. By comparison, Nvidia’s flagship DGX-1 3U server musters a comparatively modest 170 teraflops.

There is also promise at the embedded level. AMD already supplies custom CPU-GPU chips for Microsoft’s Xbox and Sony’s PlayStation, and AMD APUs (CPUs with integrated GPUs) can serve smaller form-factor devices. The beauty of AMD’s strategy is that the same HSA-based architecture is available to developers in the smallest of footprints as well as in the fastest servers. This breadth of hardware gives deep learning developers a wealth of flexibility in deploying their solutions. Deep learning is progressing at breakneck speed, and one can never predict the best way to deploy a solution.

Conclusion

Deep learning is a disruptive technology like the Internet and mobile computing that came before. Open source software has been the dominant platform that has enabled these technologies.

AMD combines these powerful principles in its open source ROCm initiative. On its own, this has the potential to accelerate deep learning development. ROCm provides a comprehensive set of components that address high-performance computing needs, including tools that get closer to the metal: hand-tuned libraries and support for assembly-language tooling.

Future deep learning software will demand even greater optimizations that span many kinds of computing cores. In my view, AMD’s strategic vision of investing heavily in heterogeneous system architectures gives their platform a distinct edge.

AMD’s open source strategy is uniquely positioned to disrupt and take the lead in future deep learning developments.

Carlos E. Perez is Co-Founder at Intuition Machine. He specializes in Deep Learning patterns, methodology and strategy. Many of his other writings on Artificial Intelligence can be found on Medium. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

FOOTNOTES:

[AUTO] Autodesk. http://www.autodesk.com/solutions/generative-design

[CHA] Chainey, Ross. “Google co-founder Sergey Brin: I didn’t see AI coming.” https://www.weforum.org/agenda/2017/01/google-sergey-brin-i-didn-t-see-ai-coming/

[CON] Conner-Simons, Adam. “Artificial intelligence produces realistic sounds that fool humans.” http://news.mit.edu/2016/06/13/artificial-intelligence-produces-realistic-sounds-0613

[GIT] Facebook FAIR. “fbcunn.” https://github.com/facebook/fbcunn

[GRAY] Gray, Scott. “MaxAs Assembler.” https://github.com/NervanaSystems/maxas/wiki/Introduction

[HAG] Hagel, John. “Harnessing the Full Potential of Platforms.” http://www.marketingjournal.org/2016/04/05/john-hagel-harnessing-the-full-potential-of-platforms/

[HSA] “HSA-Debugger-AMD.” https://github.com/HSAFoundation/HSA-Debugger-AMD/blob/master/TUTORIAL.md

[KHAT] Khatchadourian, Raffi. “The Doomsday Invention.” http://www.newyorker.com/magazine/2015/11/23/doomsday-invention-artificial-intelligence-nick-bostrom

[KNOX] Knox, Kent. “rocBLAS.” https://github.com/RadeonOpenCompute/rocBLAS/wiki

[LEW] Lewis-Kraus, Gideon. “The Great A.I. Awakening.” https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

[MAN] Manning, Christopher. “Computational Linguistics and Deep Learning.” http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00239

[PAR] Paragios, Nikos. “Computer Vision Research: ‘The deep depression.’” https://www.linkedin.com/pulse/computer-vision-research-my-deep-depression-nikos-paragios

[PARK] Parkinson, Hannah Jane. “Computer algorithm recreates Van Gogh painting in one hour.” https://www.theguardian.com/technology/2015/sep/02/computer-algorithm-recreates-van-gogh-painting-pi...

[SHO] Shontell, Alyson. “A startup that uses robots to write news gets acquired for $80 million in cash.” http://www.businessinsider.com/2015/02/23automated-insights-gets-acquired-by-vista-for-80-million-20...

[WAR] Warden, Pete. “Why GEMM is at the heart of deep learning.” https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/

[WAR2] Warden, Pete. “Why Deep Learning Needs Assembler Hackers.” https://petewarden.com/2017/01/03/why-deep-learning-needs-assembler-hackers/

[WAT] Watson, Nell. “Artificial Intuition: The Limitations (and ridiculous power) of Deep Learning Creativity.” https://medium.com/intuitionmachine/artificial-intuition-/2017/03/-3418fac2eb9c#.wrr6unq5g

[WIKI] “LLVM.” https://en.wikipedia.org/wiki/LLVM


Today in San Francisco, California, AMD held a special event where we announced the newest additions to the Radeon Instinct™ family of compute products: the AMD Radeon Instinct™ MI60 and Radeon Instinct™ MI50. In step with the new hardware, the Radeon Open eCosystem (ROCm) has been updated with massive improvements in the device drivers, the compilers and the supporting tools. The low-level math libraries, along with MIOpen, the machine intelligence library, have been optimized to really make deep learning applications sing.

ROCm is an open software platform for GPU-enabled HPC. It was created with developers in mind to accommodate future technologies, including machine learning and artificial intelligence. As an open platform, the ROCm ecosystem provides a rich foundation of modern programming languages, designed to speed development of high-performance, energy-efficient heterogeneous computing systems.

We enabled AMD’s ROCm-capable GPUs in the Linux ecosystem for easy deployment of deep learning applications on Linux distributions. The amdkfd device driver is now supported in the mainline kernel, and that kernel is picked up by all the major distributions for their standard releases. We now also support the MI60 and MI50, based on the new Vega architecture, in the linux-next repository. For distributions not yet on the latest kernel, a DKMS build remains a viable option for adding MI60 and MI50 support.

We have updated the LLVM-based clang compiler to support the new GPU architecture, including new compute instructions targeted at accelerating machine learning computations. These low-level instructions implement compute operations all the way from single-bit precision to 64-bit floating point. The most beneficial instruction for accelerating deep learning training is an FP16 dot product that accumulates into a 32-bit (FP32) result, maintaining the accuracy of the operation.
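The sketch below models the semantics of that mixed-precision pattern in plain C++ (ordinary float stands in for the packed FP16 operands so the example stays portable); the point is that the inputs are half precision while the running sum stays in FP32.

```cpp
// Mixed-precision dot product semantics: two products per step, FP32 accumulator.
#include <cstdio>

float dot2_accumulate(float a0, float a1, float b0, float b1, float acc) {
    // On the GPU, a0/a1 and b0/b1 would be a packed pair of FP16 values.
    // Keeping the running sum in FP32 is what preserves accuracy over the long
    // accumulation chains found in GEMM and convolution inner loops.
    return acc + a0 * b0 + a1 * b1;
}

int main() {
    float acc = 0.0f;
    const float a[4] = {0.5f, 0.25f, 1.0f, 2.0f};
    const float b[4] = {2.0f, 4.0f, 3.0f, 0.5f};
    for (int i = 0; i < 4; i += 2)
        acc = dot2_accumulate(a[i], a[i + 1], b[i], b[i + 1], acc);
    std::printf("dot = %f\n", acc);  // 1 + 1 + 3 + 1 = 6
    return 0;
}
```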

Profiling and debugging tools required updates to support the new hardware. These tools enable developers to get the most out of the GPU compute cycles and understand where the bottlenecks occur in their applications. Follow the development on our github site.

Math libraries were customized with the hardware architecture in mind, resulting in a highly optimized solution. There are many ways to optimize these math operations, and each specific matrix and convolution size needs to be tuned, so AMD built a tool to help automate the optimization process. The tool is called Tensile, and it is very useful for creating a library for GEMMs, GEMM-like problems (such as batched GEMM), N-dimensional tensor contractions, and anything else that multiplies two multi-dimensional objects together on a GPU. MIOpen also underwent massive optimization and updates to realize the benefits of the foundational math libraries when integrated with deep learning frameworks.

One of the most exciting developments over the past year is the integration and progress with the machine learning frameworks. ROCm has been updated to support the TensorFlow framework API v1.11 and is actively upstreaming the code into the main repository. Check out the TensorFlow github to follow the updates or see our github page for PyTorch, Caffe2, Caffe and other framework developments.

To try out the newest packages, develop an application and easily deploy a ROCm solution, get the most recent Docker images here, which saves you the time of collecting all the libraries and building them for your platform yourself.

We are always looking for skilled developers excited to work in this rapidly changing field. Check out our job listings at amd.com.
