Optimizing Mixtral 8x22b for Gaudi 2

Björn Runåker
6 min read · Jun 2, 2024

How do you take a text generation task from 1 minute and 10 seconds down to 27 seconds?

Previously, we showed how to run Mixtral 8x22b using tgi-gaudi, but that article focused on demonstrating functionality; we applied no optimization.

This time, the focus is on making it run faster, and there are many tuning parameters that affect performance. In this article, we will briefly explore several ways to improve the serving of LLMs with TGI.
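Before tuning anything, it helps to time the same request against the running server so every change can be compared with a baseline. Here is a minimal sketch, assuming a tgi-gaudi server is already listening on http://localhost:8080 (as set up in the previous article); the prompt and token count are placeholders:

```python
import time
from huggingface_hub import InferenceClient

# Assumes a TGI / tgi-gaudi server is already running on this endpoint.
client = InferenceClient("http://localhost:8080")

prompt = "Explain mixture-of-experts models in three sentences."

start = time.perf_counter()
output = client.text_generation(prompt, max_new_tokens=512)
elapsed = time.perf_counter() - start

print(f"{elapsed:.1f} s for {len(output)} characters")
print(output)
```

Running the same script before and after each change is enough to see whether a tuning knob actually moves the needle.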

Can I get your Flash Attention, please?

First, we need to discuss a technique that substantially improves LLM inference performance. This is especially true for models with a huge context window, such as Mixtral 8x22B with its 64K-token context. Let's go through the six most essential attributes of Flash Attention:

1. Speed and Memory Efficiency

Flash Attention is designed to be both fast and memory-efficient. Traditional attention mechanisms have quadratic memory complexity, which becomes a bottleneck for long sequences. Flash Attention reduces this to linear complexity, optimizing memory usage and allowing faster processing of large-scale models.
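To make the memory argument concrete, here is a minimal NumPy sketch (not the real fused Flash Attention kernel; the function names and the block size of 128 are illustrative choices). The standard version materializes the full n × n score matrix, while the tiled version streams over key/value blocks with an online softmax and never holds more than an n × block slice of scores at once:

```python
import numpy as np

def standard_attention(q, k, v):
    # Materializes the full (n, n) score matrix: O(n^2) memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=128):
    # Streams over key/value blocks with an online softmax, so only an
    # (n, block) slice of scores exists at a time -- the idea behind
    # Flash Attention's linear memory usage.
    n, d = q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                 # scores for this block only
        new_max = np.maximum(row_max, s.max(axis=-1))
        scale = np.exp(row_max - new_max)         # rescale earlier partial sums
        p = np.exp(s - new_max[:, None])
        out = out * scale[:, None] + p @ vb
        row_sum = row_sum * scale + p.sum(axis=-1)
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
assert np.allclose(standard_attention(q, k, v), tiled_attention(q, k, v), atol=1e-6)
```

Both functions produce the same output; only the size of the intermediate score buffer differs, which is exactly what makes long contexts like 64K tokens tractable.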

2. IO-Awareness

Flash Attention is IO-aware: it accounts for the cost of moving data between the accelerator's large but slow high-bandwidth memory and its small, fast on-chip memory. By computing attention in tiles that stay on-chip and never writing the full attention matrix back to main memory, it cuts memory traffic, which is often the real bottleneck in attention.