Using AI to Optimize HPC Compute Capacity
As an industry leader in efficient compute for HPC and AI data centers, AMD offers unique advantages to customers developing their next-generation products on AMD CPUs and GPUs.
The Forecasting Dilemma
AMD invests significant financial resources to provide on-premises and multi-cloud EDA compute infrastructure for its engineers. At AMD IT, we continuously optimize our cloud/data center components – compute servers, storage, and networking equipment (the GRID) – to meet the exponentially growing demand from the EDA workloads that our Engineering teams run.
Our workloads vary over each product’s lifecycle. To meet the peak capacity these projects require each year, we rely on engineering program managers to assess, estimate, and roll up their projected demand. These estimates are aggregated into a company-wide resource request that helps optimize allocations across all groups.
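The roll-up described above can be sketched in a few lines of Python. This is a minimal illustration, not AMD IT's actual tooling: the record layout (project, month, core-hours), the function names, and the sample figures are all assumptions made for the example.

```python
from collections import defaultdict


def roll_up(estimates):
    """Sum per-project demand estimates into one company-wide total per month.

    `estimates` is an iterable of (project, month, core_hours) tuples.
    This record layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(float)
    for _project, month, core_hours in estimates:
        totals[month] += core_hours
    return dict(totals)


def peak_month(totals):
    """Return the (month, core_hours) pair with the highest aggregate demand."""
    return max(totals.items(), key=lambda item: item[1])


# Illustrative numbers only; not real demand data.
estimates = [
    ("project_a", "2024-01", 120.0),
    ("project_b", "2024-01", 80.0),
    ("project_a", "2024-02", 200.0),
    ("project_b", "2024-02", 50.0),
]

totals = roll_up(estimates)
peak = peak_month(totals)
```

Because projects peak at different points in their lifecycles, the company-wide peak is driven by the month with the highest combined demand, which can be lower than the sum of each project's individual peak.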
As with most complex silicon designs, the precision of demand estimates varies. Teams with deep history in product design and EDA workflows provide accurate estimates. Products with quickly changing requirements tend to have greater uncertainty in predicted demand. Some projects de-risk their plans by overestimating their demand. As a result, forecasting becomes complex and dynamic.
AMD IT has used historical data to validate demand forecasts and provide guidance on necessary capacity. We traverse the narrow path between wasting compute resources through over-investment and risking project delays through under-investment. Our CIO’s goal is to continuously improve the science behind this process, enabling IT to optimize the bottom line while helping product teams grow the top line.
Predicting the future with AI
We have traditionally used machine learning (ML) and business intelligence (BI) tools to predict low-level capacity metrics such as memory and CPU usage. For demand planning, our ML/AI-based analytics now provide usage projections across multiple time horizons. We compare these projections with human forecasts, data from past product generations, and real-time data. This has allowed us to better determine our computing requirements and improve the accuracy of our demand predictions. With Generative AI, we can show our users how to use the GRID effectively, explain the efficiencies, and respond to user inquiries conversationally.
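One simple way to compare a human forecast and a model projection against realized usage is to compute an error metric for each. The sketch below uses mean absolute percentage error (MAPE) for accuracy and a signed percentage error for bias. The choice of metrics, the function names, and the sample figures are assumptions for illustration; the article does not say which measures AMD IT uses.

```python
def mape(forecast, actual):
    """Mean absolute percentage error, in percent. Lower is more accurate."""
    _check(forecast, actual)
    return 100.0 * sum(abs(f - a) / a for f, a in zip(forecast, actual)) / len(actual)


def bias(forecast, actual):
    """Mean signed percentage error, in percent.

    Positive values indicate systematic overestimation of demand,
    negative values indicate underestimation.
    """
    _check(forecast, actual)
    return 100.0 * sum((f - a) / a for f, a in zip(forecast, actual)) / len(actual)


def _check(forecast, actual):
    """Validate inputs: equal, non-zero length and strictly positive actuals."""
    if len(forecast) != len(actual) or not actual:
        raise ValueError("forecast and actual must be non-empty and equal in length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual usage must be positive to compute percentage error")


# Illustrative usage per period (for example, core-hours); not real data.
actual = [100.0, 200.0, 400.0]
human_forecast = [150.0, 250.0, 500.0]
model_projection = [110.0, 180.0, 400.0]

human_mape = mape(human_forecast, actual)
model_mape = mape(model_projection, actual)
human_bias = bias(human_forecast, actual)
model_bias = bias(model_projection, actual)
```

In this made-up data the human forecast is consistently high, so its bias is positive, which is the pattern one would expect from a project that de-risks its plan by overestimating. Tracking both metrics separates forecasts that are noisy from forecasts that are padded.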
Looking ahead
With human-guided AI, AMD IT delivers better demand predictions, optimizes investments, and speeds up the delivery of breakthrough products. In future articles, we will explore the AI approaches we use in partnership with our Engineering teams to advance business outcomes.