
Developer Blog: Accurate and Efficient Collaborative Optimizations for Fast Generative AI on AMD GPUs


Authors: Eddie Wu (AMD), Cheng Ling (AMD), George Wang (AMD), Xu Wang (Heyintelligence, AI optimization technical lead), Yuncong Yang (Heyintelligence, GPU technical lead)

 


 

AMD is advancing AI with an open ecosystem through its open-source ROCm™ software: a collection of drivers, software tools, libraries, and APIs designed to make GPU programming easy. AMD Radeon™ GPUs and AMD ROCm software are designed to balance accuracy and efficiency, empowering developers to rapidly build high-performance large-model applications on top of the underlying hardware architecture and software innovations. This opens the door for more partners to co-innovate in the AMD AI ecosystem.

 

HEYINTELLIGENCE Delivers Generalized Co-optimization of LLMs on AMD GPU Platforms 

 

HEYINTELLIGENCE delivers highly optimized AI solutions in both hardware and software. Founded in 2017, with deep experience in GPU architecture design and AI algorithm optimization, HEYINTELLIGENCE is developing Generalized Co-optimization Technology (GCT), which provides optimized kernels and hybrid quantization combinations tailored to the structural characteristics of a specific LLM. GCT is designed to deliver significant performance improvements with almost no loss of accuracy. Recently, HEYINTELLIGENCE optimized ChatGLM2-6B inference on an AMD Radeon™ RX 7900 XTX GPU.

 


Figure 1: Key kernels in ChatGLM2-6B selected by GCT (proposed by HEYINTELLIGENCE). Different colors indicate different types of optimization kernels.

 

As shown in Figure 1, four operations in the original ChatGLM2-6B implementation were selected based on their share of the compute and memory bandwidth of the entire inference process: RMSNorm, MatMul fused with Rotary-EMB, MatMul fused with SwiGLU, and Decoding Attention. GCT developed four optimized kernels to implement these functions. All four kernels are designed to deliver significant performance gains thanks to the flexibility of HIP and the ROCm components: they compile into efficient backend instructions and map well onto the high efficiency of AMD GPUs. The key elements of the optimized kernels are as follows:

 

1. RMSNorm - regularizes the summed inputs to a neuron in one layer according to the root mean square (RMS). Avoiding synchronization between warps is the key to improving performance (a minimal sketch follows this list).

2. MatMul fused with Rotary-EMB - fusing matrix multiplication (MatMul) with the rotary operation greatly reduces the launch cost of multiple kernels. Designing the kernel around the rotary embedding's granularity is the key to increasing data sharing and improving compute efficiency.

3. MatMul fused with SwiGLU - fusing matrix multiplication with SwiGLU removes the launch cost of two separate kernels. Designing the entire optimization from the output's perspective also reduces the memory-to-register load time (see the second sketch after this list).

4. Decoding Attention - flexible thread-level processing granularity based on the computational characteristics of attention, an optimized synchronization scheme between thread warps in the softmax, and judicious use of shared memory are the three key factors for improving performance.
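
To make point 1 concrete, here is a minimal HIP sketch of an RMSNorm kernel, assuming one wavefront per row; it illustrates the warp-synchronization idea, not HEYINTELLIGENCE's actual kernel. Because the sum-of-squares reduction never leaves a single wavefront, it can be done entirely with shuffle intrinsics: no shared memory and no inter-warp __syncthreads barrier.

```cpp
// rmsnorm_wave.cpp -- minimal sketch, compile with hipcc. Not GCT's kernel.
#include <hip/hip_runtime.h>

// One wavefront per row: the sum(x^2) reduction stays inside the wavefront,
// so cross-warp synchronization is avoided entirely.
__global__ void rmsnorm_kernel(const float* __restrict__ x,      // [rows, hidden]
                               const float* __restrict__ weight, // [hidden]
                               float* __restrict__ y,            // [rows, hidden]
                               int hidden, float eps) {
    const int lane = threadIdx.x;                 // blockDim.x == wavefront size
    const float* xr = x + (size_t)blockIdx.x * hidden;
    float*       yr = y + (size_t)blockIdx.x * hidden;

    // Each lane accumulates a strided partial sum of squares.
    float sumsq = 0.0f;
    for (int i = lane; i < hidden; i += warpSize)
        sumsq += xr[i] * xr[i];

    // Wavefront-internal tree reduction via shuffles: no LDS, no barrier.
    for (int off = warpSize / 2; off > 0; off >>= 1)
        sumsq += __shfl_down(sumsq, off);
    const float inv = rsqrtf(__shfl(sumsq, 0) / hidden + eps); // broadcast lane 0

    // Normalize and apply the learned per-element weight.
    for (int i = lane; i < hidden; i += warpSize)
        yr[i] = xr[i] * inv * weight[i];
}

// Launch shape: one block of wavefront-size threads per row, e.g.
//   hipLaunchKernelGGL(rmsnorm_kernel, dim3(rows), dim3(wavefront_size),
//                      0, 0, d_x, d_w, d_y, hidden, 1e-5f);
// where wavefront_size comes from hipDeviceProp_t::warpSize.
```

In the spirit of point 3, a deliberately naive output-centric fusion sketch follows: each thread owns one output element, computes the gate and up projections together, and applies SwiGLU in registers before a single store, so one launch replaces a GEMM plus a separate elementwise pass. The matrix names and layouts are assumptions for illustration; a production kernel would additionally tile into LDS and registers.

```cpp
// Fused MatMul + SwiGLU sketch (naive, output-centric). Not GCT's kernel.
#include <hip/hip_runtime.h>

__global__ void matmul_swiglu_kernel(const float* __restrict__ x,     // [rows, k]
                                     const float* __restrict__ wgate, // [k, n]
                                     const float* __restrict__ wup,   // [k, n]
                                     float* __restrict__ out,         // [rows, n]
                                     int rows, int k, int n) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y;
    if (row >= rows || col >= n) return;

    // Both dot products are computed together, reusing every load of x.
    float gate = 0.0f, up = 0.0f;
    for (int i = 0; i < k; ++i) {
        const float xi = x[(size_t)row * k + i];
        gate += xi * wgate[(size_t)i * n + col];
        up   += xi * wup  [(size_t)i * n + col];
    }
    // SwiGLU: SiLU(gate) * up, applied in-register before the single store.
    out[(size_t)row * n + col] = (gate / (1.0f + expf(-gate))) * up;
}
```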

 

These kernel optimization techniques have minimal impact on accuracy and are independent of the quantization strategy, so they can be used as a standalone plug-in alongside various quantization algorithms to deliver the next level of performance together with GCT. For accuracy-sensitive applications, on the other hand, quantization may degrade model generalization and thus pose unpredictable risks; in such cases quantization should be applied with caution, while GCT's kernel techniques can still be employed to optimize performance.

 

Accuracy Matters  

 

In LLM applications, quantization strategies can be used to reduce GPU memory usage and increase the number of simultaneous users that can be served. While aggressive quantization significantly reduces the data footprint, the price paid in accuracy is sometimes unacceptable, especially in practical LLM deployments.

 

GCT, however, offers optimizations without sacrificing accuracy, which matters for LLM applications like ChatGLM2-6B. Since matrix multiplication accounts for a large share of the data moved during inference, GCT uses the SmoothQuant method to obtain per-channel 8-bit weights and stores the corresponding FP16 scale values to a file. After quantization, the parameter volume of ChatGLM2-6B is reduced significantly with limited impact on accuracy.
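
As a rough illustration of the weight side of that scheme, the sketch below shows per-channel 8-bit quantization with one scale per output channel, s = max|w| / 127, stored separately (at FP16 precision in the scheme above). This is a simplification: the full SmoothQuant method also migrates activation outliers into the weights before quantizing, which is omitted here, and the helper names are hypothetical.

```cpp
// Per-channel INT8 weight quantization sketch (illustrative, not GCT's format).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantChannel {
    std::vector<int8_t> q; // 8-bit weights for one output channel
    float scale;           // written to file at FP16 precision in the post's scheme
};

QuantChannel quantize_channel(const std::vector<float>& w) {
    float amax = 0.0f;
    for (float v : w) amax = std::max(amax, std::fabs(v)); // per-channel max |w|
    QuantChannel out;
    out.scale = (amax > 0.0f) ? amax / 127.0f : 1.0f;      // s = max|w| / 127
    out.q.reserve(w.size());
    for (float v : w) {
        const float r = std::clamp(v / out.scale, -127.0f, 127.0f);
        out.q.push_back(static_cast<int8_t>(std::lrintf(r)));
    }
    return out;
}

// Dequantization at inference time: w ~= q * scale (fused into the MatMul kernels).
inline float dequantize(int8_t q, float scale) { return q * scale; }
```

Keeping one scale per channel, rather than per tensor, means a single outlier channel does not force every other channel onto a coarse quantization grid, which is why per-channel INT8 typically stays close to FP16 accuracy.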

 


Figure 2: Quantization results: parameter volume and accuracy on C-Eval (a comprehensive Chinese evaluation suite for foundation models)

 

Further Optimizations

 

HEYINTELLIGENCE has accumulated a wealth of experience applying AI models and hardware platforms to real-world scenarios. GCT contains many sub-techniques, such as LLM-serving techniques, quantization/de-quantization kernel-fusion techniques, and pipeline optimizations, and further optimizations can be performed based on customer requirements. The core idea is to make the different optimization techniques work together to obtain the maximum performance improvement with the minimum accuracy loss, under the constraints of real-world data and time budgets.

 

Conclusion 

 

The optimized implementations described above further enrich the AMD AI developer community and help highly efficient AMD AI accelerators process complex AI workloads such as LLMs, making it possible to provide data center users with a complete set of inference solutions that meet high-throughput, low-latency performance requirements. AMD is empowering more ecosystem partners and AI developers by building open software platforms, such as ROCm, ZenDNN™, Vitis™ AI, and Ryzen™ AI software, for innovation on GPUs, CPUs, and adaptive SoCs.

 

To get more details about HEYINTELLIGENCE's Generalized Co-optimization Technology, or its use cases in LLM optimization and application, please contact biz@heyintelligence.com. For more insight into AMD AI acceleration solutions and developer ecosystem plans, please email amd_ai_mkt@amd.com.