In the ever-evolving landscape of artificial intelligence, large language models (LLMs) like GPT-4 and Llama have garnered significant attention for their impressive capabilities in natural language processing and generation. However, small language models (SLMs) are emerging as an essential counterpart in the AI model community, offering unique advantages for specific use cases. AMD is excited to release its very first small language model, AMD-135M, with speculative decoding. This work demonstrates AMD's commitment to an open approach to AI, one that leads to more inclusive, ethical, and innovative technological progress and helps ensure that its benefits are more widely shared and its challenges more collaboratively addressed.
AMD-135M: First AMD Small Language Model
AMD-135M is the first AMD small language model. It was trained from scratch on AMD Instinct™ MI250 accelerators using 670B tokens and comes in two variants: AMD-Llama-135M and AMD-Llama-135M-code.
- Pretraining: The AMD-Llama-135M model was trained from scratch with 670 billion tokens of general data over six days using four MI250 nodes.
- Code Finetuning: The AMD-Llama-135M-code variant was fine-tuned with an additional 20 billion tokens of code data, taking four days on the same hardware.
The training code, dataset and weights for this model are open sourced so that developers can reproduce the model and help train other SLMs and LLMs.
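For readers who want a feel for what reproducing such a run involves, below is a minimal single-file sketch of the multi-node PyTorch FSDP training pattern mentioned in footnote [1]. It is illustrative only, not the released AMD-135M training code: the architecture hyperparameters are rough stand-ins for a ~135M-parameter LLaMA-style model, and the data loop uses random tokens as a placeholder for a tokenized corpus.

```python
# Illustrative FSDP training sketch (see footnote [1]); not the official code.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import LlamaConfig, LlamaForCausalLM

dist.init_process_group("nccl")  # ROCm builds of PyTorch route this through RCCL
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Rough LLaMA-style config near the 135M-parameter scale (illustrative numbers;
# the real hyperparameters are in the released repository).
config = LlamaConfig(hidden_size=768, num_hidden_layers=12,
                     num_attention_heads=12, intermediate_size=2048,
                     vocab_size=32000)
model = FSDP(LlamaForCausalLM(config).cuda())  # shards params, grads, optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(10):  # placeholder loop; real training streams tokenized corpus data
    batch = torch.randint(0, config.vocab_size, (2, 512), device="cuda")
    loss = model(input_ids=batch, labels=batch).loss  # causal LM loss on shifted labels
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```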
Optimization with Speculative Decoding
Large language models typically use an autoregressive approach for inference. However, a major limitation of this approach is that each forward pass generates only a single token, which results in poor memory access efficiency and limits overall inference speed.
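For intuition, here is a minimal sketch of that autoregressive loop, using gpt2 purely as a small stand-in model: every iteration runs a full forward pass yet emits just one new token (the KV cache is omitted for brevity, so each step re-encodes the whole sequence).

```python
# Minimal sketch of plain autoregressive decoding: one token per forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("def quicksort(arr):", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(32):  # 32 new tokens -> 32 full forward passes
        next_id = model(ids).logits[:, -1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tokenizer.decode(ids[0]))
```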
Speculative decoding addresses this problem. The basic principle is to use a small draft model to generate a set of candidate tokens, which are then verified by the larger target model. This approach lets each forward pass of the target model yield multiple tokens without compromising output quality, significantly reducing memory access consumption and enabling substantial inference speedups.
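To make the draft-and-verify idea concrete, below is a simplified greedy sketch of one speculative-decoding step. It is illustrative only, not AMD's implementation: production code reuses KV caches, and sampling-based decoding replaces exact-match verification with a probabilistic accept/reject rule.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One greedy speculative-decoding step (simplified illustrative sketch)."""
    # 1. The small draft model cheaply proposes k candidate tokens, one by one.
    draft_ids = ids
    for _ in range(k):
        nxt = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, nxt], dim=-1)
    proposals = draft_ids[:, ids.shape[1]:]

    # 2. The large target model verifies all k proposals in ONE forward pass.
    logits = target(draft_ids).logits
    verify = logits[:, ids.shape[1] - 1 : -1, :].argmax(-1)  # target's pick per slot

    # 3. Accept the longest prefix on which draft and target agree.
    n_accept = int((verify == proposals).long().cumprod(-1).sum())

    # 4. Take one extra token from the target at the first mismatch (or after the
    #    last accepted token), so every step makes progress of at least 1 token.
    bonus = logits[:, ids.shape[1] - 1 + n_accept, :].argmax(-1, keepdim=True)
    return torch.cat([ids, proposals[:, :n_accept], bonus], dim=-1)
```

Calling speculative_step repeatedly until an end-of-sequence token appears reproduces the target model's greedy output exactly, while amortizing its memory traffic over several tokens per forward pass.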
Inference Performance Acceleration
Using AMD-Llama-135M-code as a draft model for CodeLlama-7b, we tested inference performance with and without speculative decoding on the MI250 accelerator for the data center and on a Ryzen™ AI processor (with NPU) for the AI PC. For the particular configurations we tested with AMD-Llama-135M-code as the draft model, we saw a speedup on the Instinct MI250 accelerator, the Ryzen AI CPU[2], and the Ryzen AI NPU[2] versus inference without speculative decoding.[3] The AMD-135M SLM thus establishes an end-to-end workflow, encompassing both training and inference, on select AMD platforms.
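A draft/target pairing like this can be sketched with Hugging Face transformers' assisted generation, which implements speculative decoding through the assistant_model argument. The model identifiers below and their tokenizer compatibility are assumptions to verify against the official Hugging Face repositories; this is a starting-point sketch, not the exact benchmark setup.

```python
# Hedged sketch: CodeLlama-7b as the target with AMD-Llama-135M-code as the
# draft model, via transformers assisted generation. Model ids are assumptions;
# confirm them on the official Hugging Face pages before running.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
target = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    "amd/AMD-Llama-135M-code", torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(target.device)
# assistant_model enables draft-then-verify decoding inside generate()
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```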
Next Steps
By providing an open-source reference implementation, AMD is not only advancing its AI capabilities but also fostering innovation within the AI community. To learn more about AMD-135M, read the full technical blog: Introducing the First AMD SLM (Small Language Model): AMD-135M Model Fuels AI Advancements
Additional Resources
- For information about the training and inference of this model, and for further insights, please visit the repository to access the code.
- Visit Hugging Face to download the model file.
- Apply for Instinct accelerator card access.
- For any questions, contact us by email.
Explore, innovate, and together, let us push the boundaries of AI.
Footnotes
[1] The training code for AMD-135M is based on TinyLlama, utilizing multi-node distributed training with PyTorch FSDP.
[2] Tests were run on an AMD Ryzen 9 PRO 7940HS with Radeon 780M Graphics. The Ryzen AI APU architecture includes CPU and NPU kernels.
[3] These are the configurations that we tested. You might get different results on other configurations.
[4] Performance was tested on AMD Instinct MI250 + ROCm™ 6.0 using standardized tests with lm-evaluation-harness. Additionally, the model performance tests are independent of the hardware environment.
[5] HellaSwag is a dataset and metric that tests how well LLMs can reason about physical situations;
WinoGrande is a dataset and codebase for evaluating natural language understanding models on the challenging Winograd Schema task;
SciQ is a dataset of closed-domain question answering tasks with text inputs and outputs;
MMLU is a dataset of multiple-choice questions spanning 57 subjects, from STEM fields such as abstract algebra to the humanities and social sciences;
ARC-Easy is a dataset of grade-school level science questions for testing advanced question answering systems;
SlimPajama is a deduplicated version of RedPajama sourced from CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange. We drop the Books data from SlimPajama due to license issues.
[6] Tests were run on an AMD Ryzen 9 PRO 7940HS with Radeon 780M Graphics. The Ryzen AI APU architecture includes CPU and NPU kernels.