Running Llama 3.1 on Gaudi 2 with tgi-gaudi: A Comprehensive Guide
Introduction
In the rapidly evolving landscape of artificial intelligence, staying at the cutting edge often means combining the latest models with the most efficient hardware. Today, we’re diving into an exciting fusion of state-of-the-art technology: running Llama 3.1 on Intel’s Gaudi 2 AI accelerators using tgi-gaudi.
Llama 3.1, the latest iteration in Meta’s open-source large language model series, has been making waves in the AI community. Building upon the strengths of its predecessors, Llama 3.1 offers enhanced performance, improved coherence, and a broader knowledge base. It represents a significant step forward in open-source AI, promising capabilities that rival some of the most advanced proprietary models.
On the hardware front, Intel’s Gaudi 2 accelerators have emerged as a formidable competitor in the AI chip market. Designed specifically for deep learning workloads, Gaudi 2 offers an attractive alternative to traditional GPU-based setups, boasting impressive performance and energy efficiency.
Bridging these two innovations is tgi-gaudi, a specialized framework that optimizes text generation inference for Gaudi hardware. It’s the key to unlocking the full potential of Llama 3.1 on this cutting-edge AI accelerator.
In this comprehensive guide, we’ll walk you through the entire process of getting Llama 3.1 up and running on Gaudi 2 using tgi-gaudi. We’ll cover the differences between Llama 3 and 3.1, explore the intricacies of building and configuring tgi-gaudi for good performance, and provide hands-on examples to test your setup.
Whether you’re an AI researcher looking to leverage open-source models, a data scientist exploring efficient inference solutions, or a tech enthusiast curious about the latest in AI hardware, this guide has something for you. By the end, you’ll have a clear understanding of how to harness the power of Llama 3.1 on Gaudi 2, opening up new possibilities for your AI projects.
Let’s embark on this journey to the forefront of AI technology, where open-source software meets specialized hardware to push the boundaries of what’s possible in language modeling and inference.
Llama 3 vs Llama 3.1: Key Differences
Context Window
- Llama 3: 8,000 tokens
- Llama 3.1: 128,000 tokens (16x increase)
Model Sizes
- Llama 3.1 introduces the 405B parameter model
- Updates to 70B and 8B models
Faster Response Time
Early reports suggest Llama 3.1 can respond to questions up to 35% faster than its predecessor, reducing processing time and improving overall usability.
Improved RAG Performance
Early evaluations suggest significant improvements in Retrieval-Augmented Generation (RAG) tasks:
- Increased faithfulness to source material
- Higher answer correctness
- Better overall performance in information retrieval and synthesis tasks
Benchmark Improvements
Llama 3.1 shows improvements across various tasks, especially for the larger model versions.
Multilingual Capabilities
Llama 3.1 officially supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai (and it handles Swedish well in practice).
Key Improvements
1. Enhanced reasoning and tool use
2. Advanced code generation
3. Better long-form text generation
4. Synthetic data generation
5. Model distillation capabilities
6. “Imagine Me” feature for image creation
7. Integration with search engine APIs
Training
- Llama 3.1 405B trained on 15+ trillion tokens
- Used 16,000+ H100 GPUs
User Perceptions
- Mixed feedback on complex reasoning tasks
- Possible trade-off between “intelligence” and context window size
Practical Implications
1. Extended applications for document analysis and content generation
2. Improved RAG performance
3. Open-source advantages: accessibility and customization
4. Competitive pricing compared to proprietary models
Conclusion
Llama 3.1 offers significant improvements in context handling and multilingual support. While it does not universally outperform competitors, it provides a strong open-source alternative, especially for long-context and multilingual applications.
Prerequisites
You need access to a Gaudi 2 server running the latest SynapseAI release (1.17 at the time of writing).
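Before building anything, it is worth confirming that the Habana driver stack is in place. A minimal check, assuming the standard Habana tooling is installed, is to run hl-smi (Habana’s counterpart to nvidia-smi), which reports the driver/SynapseAI version and lists the visible Gaudi devices:
# Confirm the Gaudi driver stack is installed and check the reported version
hl-smi
# Confirm Docker is available for the container build below
docker --version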
Setting Up the Environment
git clone https://github.com/bjornrun/tgi-gaudi-fixed-llama3.1.git
cd tgi-gaudi-fixed-llama3.1
docker build -t tgi_gaudi .
volume=$PWD/data
model=meta-llama/Meta-Llama-3.1-70B-Instruct
docker run -p 8091:80 \
    -v $volume:/data \
    -e LIMIT_HPU_GRAPH=true \
    -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
    -e HABANA_VISIBLE_DEVICES=all \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    -e HUGGING_FACE_HUB_TOKEN=<...> \
    --runtime=habana \
    --cap-add=sys_nice \
    --ipc=host \
    tgi_gaudi \
    --model-id $model \
    --dtype bfloat16 \
    --max-total-tokens 70536 \
    --max-input-length 65000 \
    --max-batch-prefill-tokens 65536 \
    --max-batch-total-tokens 128000 \
    --sharded true \
    --num-shard 8 \
    --quantize gptq
Note that the parameters are not optimal, but they work very well for our use case. We have already doubled the input context size to 65,000 tokens, which makes it even better for advanced RAG solutions. In the next article, we will optimize the parameters.
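Once the container logs show that the server is ready, you can verify that these limits were actually applied. TGI exposes an /info endpoint on the mapped port (8091 here) that reports the model id and token limits:
# Verify the running server picked up the model and the token limits
curl -s http://localhost:8091/info | jq .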
Testing the Setup
I use this simple test script as both a sanity check and a quick benchmark. It is important that the server can handle multiple concurrent requests without degrading performance.
#!/bin/bash
# Number of requests to send
NUM_REQUESTS=100
# Number of parallel requests
PARALLEL_REQUESTS=5
# URL of your TGI-Gaudi instance
URL="http://10.9.10.5:8091/generate"
# Function to send a single request
send_request() {
local i=$1
echo "Request $i:"
curl -s "$URL" \
-X POST \
-d '{"inputs":"What is the capital of France?", "parameters":{"max_new_tokens":50}}' \
-H 'Content-Type: application/json' \
| jq -r '.generated_text'
echo -e "\n------------------------\n"
}
# Export the function and URL so they can be used by xargs
export -f send_request
export URL
# Use xargs to run requests in parallel
seq 1 $NUM_REQUESTS | xargs -P $PARALLEL_REQUESTS -I {} bash -c 'send_request {}' bash
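If you also want a rough throughput figure, you can wrap the parallel run in the shell’s time builtin and divide the number of requests by the elapsed wall-clock time. This is only a sketch reusing the variables defined above; real numbers will depend on prompt length and max_new_tokens:
# Rough throughput: NUM_REQUESTS divided by the reported wall-clock time
time (seq 1 $NUM_REQUESTS | xargs -P $PARALLEL_REQUESTS -I {} bash -c 'send_request {}' bash > /dev/null)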
Let the model explain who it is
curl 10.9.10.5:8091/generate -X POST -d '{"inputs":"Explain what makes Llama 3.1 unique compared to other language models.","parameters":{"max_new_tokens":1000}}' -H 'Content-Type: application/json' | jq -r '.generated_text'
Output:
Llama 3.1 is a large language model developed by Meta, designed to process and generate human-like language. Here are some key features that make Llama 3.1 unique compared to other language models:
- Conversational dialogue: Trained on a massive dataset of conversational dialogue, enabling more human-like responses to questions and statements.
- Common sense and world knowledge: Extensive training on diverse text data provides a broad range of common sense and world knowledge.
- Emotional understanding and empathy: Designed to understand and respond to emotions, including empathy and sympathy, for more relatable interactions.
- Long context window: Allows consideration of more context when generating responses, leading to more accurate and informative answers.
- Meta’s proprietary technology: Built using Meta’s proprietary algorithms and models, designed to improve performance and efficiency.
- Continuous learning: Adapts to new topics, trends, and language patterns over time through continuous learning.
- Multilingual support: Supports multiple languages, making it versatile for various applications from translation to cross-cultural communication.
- Explainability and transparency: Provides explanations and justifications for its responses, enhancing transparency and trustworthiness.
- Safety and security: Incorporates features like content filtering and moderation to prevent generation of harmful or offensive content.
- Scalability and flexibility: Suitable for a wide range of applications, from chatbots and virtual assistants to language translation and content generation.
Overall, Llama 3.1’s unique combination of conversational dialogue, common sense, emotional understanding, and proprietary technology make it a powerful tool for natural language processing and generation.
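Depending on which upstream TGI release your tgi-gaudi build tracks, the server may also expose the OpenAI-compatible /v1/chat/completions route, which is convenient if your existing tooling already speaks the OpenAI API. Treat the following as a hedged example and fall back to /generate if the route is not available on your build:
# Optional: chat-style request via TGI's OpenAI-compatible Messages API (if present in your build)
curl 10.9.10.5:8091/v1/chat/completions -X POST -H 'Content-Type: application/json' -d '{"model":"tgi","messages":[{"role":"user","content":"Summarize the key differences between Llama 3 and Llama 3.1."}],"max_tokens":200}'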
The performance is much better than anything I have used before. In the next article, we will look for the sweet spot between batch size and our required context window, and explore complex instructions and function calling with as many parallel sessions as possible.
Stay tuned….
Model description: https://huggingface.co/blog/llama31