Optimizing Mixtral 8x22b for Gaudi 2
How do you take a text generation task from 1 minute and 10 seconds down to 27 seconds?
Previously, we showed how to run Mixtral 8x22b using tgi-gaudi. That article focused on demonstrating basic functionality, without applying any optimization.
This time, the focus is on making it run faster. TGI exposes many tuning parameters that affect performance, and in this article we will briefly explore several ways to speed up LLM serving with TGI.
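Before tuning anything, it helps to have a simple, repeatable way to measure latency. The snippet below is a minimal sketch that times a single request against an already running TGI endpoint; the URL, prompt, and token count are placeholders rather than the exact settings used in our runs.

```python
import time
import requests

# Hypothetical endpoint: adjust to wherever your tgi-gaudi container listens.
TGI_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "Explain the difference between a llama and an alpaca.",
    "parameters": {"max_new_tokens": 512},
}

start = time.perf_counter()
response = requests.post(TGI_URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start

response.raise_for_status()
print(f"Generated in {elapsed:.1f} s")
print(response.json()["generated_text"])
```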
Can I get your Flash Attention, please?
First, we need to discuss a technique that substantially improves LLM inference performance, especially for models with long context windows such as Mixtral 8x22B, which supports a 64K-token context. Let's go through the six most important attributes of Flash Attention:
1. Speed and Memory Efficiency
Flash Attention is designed to be both fast and memory-efficient. Traditional attention implementations materialize the full attention matrix, so their memory cost grows quadratically with sequence length, which becomes a bottleneck for long sequences. Flash Attention computes attention in blocks, reducing the memory complexity to linear, cutting memory usage and allowing faster processing of large models.
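To make the linear-memory claim concrete, here is a small NumPy sketch of the idea (not the actual Flash Attention kernel, which runs as a fused kernel on the accelerator): the naive version materializes the full n x n score matrix, while the tiled version streams over blocks of keys and values with an online softmax, keeping only O(n x d) state.

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full (n x n) score matrix: memory grows quadratically.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=128):
    # Flash-Attention-style streaming: process K/V in blocks and keep only
    # running softmax statistics, so memory stays linear in sequence length.
    n, d = q.shape
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running sum of exp(scores - row_max)
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        scores = q @ kb.T / np.sqrt(d)              # only an (n, block) tile
        new_max = np.maximum(row_max, scores.max(axis=-1))
        correction = np.exp(row_max - new_max)      # rescale previous partials
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=-1)
        row_max = new_max
    return out / row_sum[:, None]

# Both functions return the same result; only the memory profile differs:
# q, k, v = (np.random.randn(1024, 64) for _ in range(3))
# assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
```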