Mistral 7B vs. Mixtral 8x7B

By Anurag Vishwakarma in AI & ML, Mar 26, 2024

Mistral 7B and Mixtral 8x7B, two LLMs from Mistral AI, match or outperform models like Llama 2 and GPT-3.5 on most benchmarks while providing faster inference and longer context handling.


Mistral AI, a French startup, has released two impressive large language models (LLMs): Mistral 7B and Mixtral 8x7B. These models push the boundaries of performance and introduce architectural innovations aimed at optimizing inference speed and computational efficiency.

Mistral 7B: Small yet Mighty

Mistral 7B is a 7.3-billion-parameter transformer model that punches above its weight class. Despite its relatively modest size, it outperforms the 13-billion-parameter Llama 2 model across all benchmarks reported by Mistral AI. It even surpasses the larger 34-billion-parameter Llama 1 model on reasoning, mathematics, and code generation tasks.

Two foundations of Mistral 7B’s efficiency:

Grouped Query Attention (GQA)
Sliding Window Attention (SWA)

GQA significantly accelerates inference and reduces memory requirements during decoding by sharing each set of keys and values across a group of query heads within each transformer layer.
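To make the sharing concrete, here is a minimal pure-Python sketch of grouped query attention with causal masking. The function names, the toy head counts, and the tiny vectors are illustrative assumptions, not Mistral AI's implementation; the point is only that several query heads read from the same key/value head, so fewer keys and values have to be cached.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_query_attention(queries, keys, values):
    """queries: [n_q_heads][seq][d]; keys and values: [n_kv_heads][seq][d].

    Hypothetical helper: each group of n_q_heads // n_kv_heads query heads
    shares a single key/value head.
    """
    n_q_heads, n_kv_heads = len(queries), len(keys)
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads
    outputs = []
    for h, q_head in enumerate(queries):
        kv = h // group_size  # query heads in the same group reuse one K/V head
        k_head, v_head = keys[kv], values[kv]
        d = len(q_head[0])
        head_out = []
        for i, q in enumerate(q_head):
            # Causal attention: position i only sees positions 0..i.
            scores = [
                sum(a * b for a, b in zip(q, k_head[j])) / math.sqrt(d)
                for j in range(i + 1)
            ]
            weights = softmax(scores)
            head_out.append(
                [sum(w * v_head[j][c] for j, w in enumerate(weights)) for c in range(d)]
            )
        outputs.append(head_out)
    return outputs

# Toy example: 4 query heads share 2 key/value heads (2 query heads per group).
q = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
k = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.2, 0.1]]]
v = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = grouped_query_attention(q, k, v)
print(len(out), "query heads served by", len(k), "cached K/V heads")
```

In this toy setup only two key/value heads are stored for four query heads, which is where the memory saving during decoding comes from.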

SWA, on the other hand, enables the model to handle longer input sequences at a lower computational cost by introducing a configurable “attention window” that limits the number of tokens the model attends to at any given time.
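The attention window can be pictured as a mask over token pairs. The sketch below is a simplified illustration, assuming a causal model and a window that counts the current token; the function names and the tiny sizes are hypothetical and chosen only to keep the output readable.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    Assumes causal attention (j <= i) restricted to the most recent
    `window` tokens, counting token i itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

def attended_pairs(seq_len, window):
    """Number of (query, key) pairs that are actually scored."""
    return sum(sum(row) for row in sliding_window_mask(seq_len, window))

for row in sliding_window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))

print(attended_pairs(6, 3), "pairs with a window of 3")
print(attended_pairs(6, 6), "pairs with full causal attention")
```

With a fixed window, the number of scored pairs grows roughly linearly with sequence length instead of quadratically, which is the cost reduction the paragraph above describes.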

Name	Number of parameters	Number of active parameters	Min. GPU RAM for inference (GB)
Mistral-7B-v0.2	7.3B	7.3B	16
Mixtral-8X7B-v0.1	46.7B	12.9B	100
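As a rough sanity check on the GPU RAM column, the weights alone can be estimated from the parameter count. This sketch assumes 16-bit weights (2 bytes per parameter), which the table does not state; the helper name is hypothetical. The table's figures are somewhat higher than the raw weight size, presumably to leave headroom for activations and the key/value cache.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    bytes_per_param=2 assumes 16-bit weights; this is an assumption,
    not something stated in the table.
    """
    return num_params * bytes_per_param / 1e9

mistral_7b_gb = weight_memory_gb(7.3e9)
mixtral_8x7b_gb = weight_memory_gb(46.7e9)
print(f"Mistral 7B weights:   ~{mistral_7b_gb:.1f} GB (table lists 16 GB)")
print(f"Mixtral 8x7B weights: ~{mixtral_8x7b_gb:.1f} GB (table lists 100 GB)")
```

Note that the estimate uses the total parameter count, not the active one: all experts must be resident in memory even though only some of them run for a given token.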

Mixtral 8x7B: A Sparse Mixture-of-Experts Marvel

While Mistral 7B impresses with its efficiency and performance, Mistral AI took things to the next level with the release of Mixtral 8x7B, a 46.7 billion parameter sparse mixture-of-experts (MoE) model. Despite its massive size, Mixtral 8x7B leverages sparse activation, resulting in only 12.9 billion active parameters per token during inference.
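The two parameter counts can be reconciled with a little arithmetic. The sketch below assumes the parameters split into a shared part used by every token (attention, embeddings) and eight equally sized expert stacks, of which two are active per token. The split itself is a simplifying assumption, and the function name is hypothetical.

```python
def split_parameters(total_b, active_b, n_experts=8, n_active=2):
    """Solve  total  = shared + n_experts * per_expert
              active = shared + n_active  * per_expert
    for (shared, per_expert), all in billions of parameters.
    """
    per_expert = (total_b - active_b) / (n_experts - n_active)
    shared = total_b - n_experts * per_expert
    return shared, per_expert

shared_b, per_expert_b = split_parameters(total_b=46.7, active_b=12.9)
print(f"shared: ~{shared_b:.2f}B, per expert: ~{per_expert_b:.2f}B")
```

Under these assumptions each expert stack holds roughly 5.6B parameters and about 1.6B are shared, which also explains why the total is 46.7B rather than 8 x 7B = 56B.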

LLM Benchmark Graph (Image credit: Mistral.ai)

The key innovation behind Mixtral 8x7B is its MoE architecture. Within each transformer layer, the model has eight expert feed-forward networks (FFNs). For every token, a router mechanism selectively activates just two of these expert FFNs to process that token. This sparsity technique allows the model to harness a vast parameter count while controlling computational costs and latency.
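The routing step can be sketched in a few lines of plain Python. This is an illustrative toy, not Mistral AI's code: the experts are stand-in functions rather than feed-forward networks, the router logits are supplied by hand, and weighting the two selected experts with a softmax over their logits is an assumption about how the outputs are combined.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    Weights are a softmax over the two selected logits (an assumption
    made for this sketch).
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    top2 = ranked[:2]
    m = max(router_logits[e] for e in top2)
    exps = [math.exp(router_logits[e] - m) for e in top2]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top2, exps)]

def moe_layer(token, experts, router_logits):
    """Run only the two selected experts and mix their outputs."""
    out = [0.0] * len(token)
    for e, w in top2_route(router_logits):
        y = experts[e](token)  # the other six experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Eight toy "experts": expert s simply scales its input by s.
experts = [lambda t, s=s: [s * x for x in t] for s in range(8)]
logits = [0.0, 0.0, 3.0, 0.0, 0.0, 3.0, 0.0, 0.0]  # hand-picked router scores

print(top2_route(logits))
print(moe_layer([1.0, 2.0], experts, logits))
```

Only two of the eight expert functions are called for the token, which is why compute per token tracks the active parameter count rather than the total.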

According to Mistral AI’s benchmarks, Mixtral 8x7B matches or outperforms large language models like Llama 2 70B and GPT-3.5 on most tasks, including reasoning, mathematics, code generation, and multilingual benchmarks. It also provides 6x faster inference than Llama 2 70B, thanks to its sparse architecture.

Both Mistral 7B and Mixtral 8x7B perform well on code generation benchmarks like HumanEval and MBPP, with Mixtral 8x7B holding a slight edge. Mixtral 8x7B also supports multiple languages, including English, French, German, Italian, and Spanish, making it a valuable asset for multilingual applications.

On the MMLU benchmark, which evaluates a model’s reasoning and comprehension abilities, Mistral 7B performs equivalently to a hypothetical Llama 2 model over three times its size.

LLMs Benchmark Comparison Table

Model	Average MCQs	Reasoning	Python coding	Future Capabilities	Grade school math	Math Problems
Claude 3 Opus	84.83%	86.80%	95.40%	84.90%	86.80%	95.00%
Gemini 1.5 Pro	80.08%	81.90%	92.50%	71.90%	84%	91.70%
Gemini Ultra	79.52%	83.70%	87.80%	74.40%	83.60%	94.40%
GPT-4	79.45%	86.40%	95.30%	67%	83.10%	92%
Claude 3 Sonnet	76.55%	79.00%	89.00%	73.00%	82.90%	92.30%
Claude 3 Haiku	73.08%	75.20%	85.90%	75.90%	73.70%	88.90%
Gemini Pro	68.28%	71.80%	84.70%	67.70%	75%	77.90%
Palm 2-L	65.82%	78.40%	86.80%	37.60%	77.70%	80%
GPT-3.5	65.46%	70%	85.50%	48.10%	66.60%	57.10%
Mixtral 8x7B	59.79%	70.60%	84.40%	40.20%	60.76%	74.40%
Llama 2 – 70B	51.55%	69.90%	87%	30.50%	51.20%	56.80%
Gemma 7B	50.60%	64.30%	81.2%	32.3%	55.10%	46.40%
Falcon 180B	42.62%	70.60%	87.50%	35.40%	37.10%	19.60%
Llama 13B	37.63%	54.80%	80.7%	18.3%	39.40%	28.70%
Llama 7B	30.84%	45.30%	77.22%	12.8%	32.6%	14.6%
Grok 1	–	73.00%	–	63%	–	62.90%
Qwen 14B	–	66.30%	–	32%	53.40%	61.30%
Mistral Large	–	81.2%	89.2%	45.1%	–	81%

This model comparison table was last updated in March 2024. Source

When it comes to fine-tuning for specific use cases, Mistral AI provides “Instruct” versions of both models, which have been optimized through supervised fine-tuning and direct preference optimization (DPO) for careful instruction following.

👍

The Mixtral 8x7B Instruct model achieves an impressive score of 8.3 on the MT-Bench benchmark, making it one of the best open-source models for instruction following.

Deployment and Accessibility

Mistral AI has made both Mistral 7B and Mixtral 8x7B available under the permissive Apache 2.0 license, allowing developers and researchers to use these models with few restrictions. The weights for these models can be downloaded from Mistral AI’s CDN, and the company provides detailed instructions for running the models locally, on cloud platforms like AWS, GCP, and Azure, or through services like Hugging Face.

LLMs Cost and Context Window Comparison Table

Models	Context Window	Input Cost / 1M tokens	Output Cost / 1M tokens
Gemini 1.5 Pro	128K	N/A	N/A
Mistral Medium	32K	$2.7	$8.1
Claude 3 Opus	200K	$15.00	$75.00
GPT-4	8K	$30.00	$60.00
Mistral Small	16K	$2.00	$6.00
GPT-4 Turbo	128K	$10.00	$30.00
Claude 2.1	200K	$8.00	$24.00
Claude 2	100K	$8.00	$24.00
Mistral Large	32K	$8.00	$24.00
Claude Instant	100K	$0.80	$2.40
GPT-3.5 Turbo Instruct	4K	$1.50	$2.00
Claude 3 Sonnet	200K	$3.00	$15.00
GPT-4-32k	32K	$60.00	$120.00
GPT-3.5 Turbo	16K	$0.50	$1.50
Claude 3 Haiku	200K	$0.25	$1.25
Gemini Pro	32K	$0.125	$0.375
Grok 1	64K	N/A	N/A

This cost and context window comparison table was last updated in March 2024. Source
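To turn the per-million-token prices into a per-request estimate, multiply each token count by its price and divide by one million. The sketch below copies a few rows from the table above; the function name and the example token counts are hypothetical, and the prices reflect the table as of March 2024 only.

```python
# (input price, output price) in USD per 1M tokens, copied from the table above.
PRICES_PER_1M = {
    "Mistral Small": (2.00, 6.00),
    "Mistral Medium": (2.70, 8.10),
    "Mistral Large": (8.00, 24.00),
    "GPT-3.5 Turbo": (0.50, 1.50),
}

def request_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request for the given token counts."""
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical request: a 10,000-token prompt with a 2,000-token answer.
for name in PRICES_PER_1M:
    print(f"{name}: ${request_cost(name, 10_000, 2_000):.4f}")
```

Because output tokens cost about three times as much as input tokens for the models in this sample, long generations weigh more heavily on the bill than long prompts.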

💡

Largest context window: Claude 3 (200K), GPT-4 Turbo (128K), Gemini 1.5 Pro (128K)

💲

Lowest input cost per 1M tokens: Gemini Pro ($0.125), Mistral Tiny ($0.15), Claude 3 Haiku ($0.25)

For those looking for a fully managed solution, Mistral AI offers access to these models through their platform, including a beta endpoint powered by Mixtral 8x7B.

Conclusion

Mistral AI’s language models, Mistral 7B and Mixtral 8x7B, stand out for their innovative architectures, strong performance, and computational efficiency. They are built to drive a wide range of applications, from code generation and multilingual tasks to reasoning and instruction following.
