Meta debuts third-generation Llama large language model

Meta has unleashed its latest large language model (LLM) – named Llama 3 – and claims it will challenge much larger models from the likes of Google, Mistral, and Anthropic.

Revealed in a lengthy announcement on Thursday, Llama 3 comes in versions ranging from eight billion to over 400 billion parameters. For reference, OpenAI and Google’s largest models are reportedly nearing two trillion parameters.

For now, we’re only getting access to Llama 3’s eight billion and 70 billion parameter text variants. Meta isn’t done training its largest and most complex models just yet, but hints they will be multilingual and multimodal – meaning they’ll work across multiple languages and handle more than one type of input or output, such as images alongside text.

Even at a mere 70 billion parameters, Llama 3 is more than capable of going toe-to-toe with much larger models, Meta claims.

Meta claims Llama3-8B and 70B can outperform far larger models, including Gemini Pro and Anthropic’s Claude 3

Better data, better model

One of the biggest gains, according to Meta, comes from the use of a tokenizer with a vocabulary of 128,000 tokens. In the context of LLMs, tokens can be a few characters, whole words, or even phrases. AIs break down human input into tokens, then use their vocabularies of tokens to generate output.
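
To make that concrete, here is a minimal sketch of the tokenization step using Hugging Face's transformers library – the example text and printout are ours for illustration, and the meta-llama/Meta-Llama-3-8B repo is gated, so you'd need to accept Meta's license before the tokenizer will download.

# Illustrative only: tokenizing text with Llama 3's 128K-entry vocabulary.
# Requires accepting Meta's license for the gated meta-llama repo.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Meta's Llama 3 uses a 128,000-token vocabulary."
token_ids = tokenizer.encode(text)

print(len(token_ids), "tokens:", token_ids)
print([tokenizer.decode([t]) for t in token_ids])  # the text span each token covers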

Meta explained that its tokenizer helps to encode language more efficiently, boosting performance significantly. Additional gains were achieved by using higher-quality datasets and additional fine-tuning steps after training to improve the performance and overall accuracy of the model.

Specifically, Meta revealed Llama 3 was pre-trained on more than 15 trillion tokens collected from publicly available sources.

Llama 3’s training dataset is more than seven times larger and contains four times more code than Llama 2, which launched just nine months ago. But, as the saying goes, “garbage in, garbage out” – so Meta claims it developed a series of data-filtering pipelines to ensure Llama 3 was trained on as little bad information as possible.

Those quality controls included heuristic and NSFW filters, as well as data deduplication and text classifiers used to predict the quality of the information prior to training. Meta even used its older Llama 2 model – which it said was “surprisingly good at identifying high-quality data” – to help separate the wheat from the chaff.
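
Meta hasn't published that pipeline, but the general idea – exact-match deduplication plus a couple of crude heuristic filters – can be sketched in a few lines of Python; the thresholds and helper name here are ours, purely for illustration.

# Toy sketch of pre-training data hygiene: exact-match deduplication plus
# simple heuristic filters. Not Meta's actual pipeline.
import hashlib

def keep_document(text, seen_hashes, min_words=50, max_symbol_ratio=0.3):
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:  # drop exact duplicates
        return False
    seen_hashes.add(digest)
    if len(text.split()) < min_words:  # drop very short fragments
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:  # drop markup-heavy junk
        return False
    return True

seen = set()
corpus = ["one scraped document ...", "another scraped document ..."]
cleaned = [doc for doc in corpus if keep_document(doc, seen)]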

Five percent of the training data came from more than 30 languages, which Meta predicted will eventually help bring more substantial multilingual capabilities to the model. For now, though, the Social Network™️ says users shouldn’t expect the same degree of performance in languages other than English.

Training small models on such a large dataset is generally considered a waste of computing time, and one that produces diminishing returns in accuracy. The ideal ratio of training data to compute resources is referred to as the “Chinchilla optimal” [PDF] amount. According to Meta, for an eight billion parameter model like Llama3-8B, this would be about 200 billion tokens.
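
As a back-of-the-envelope check, the commonly cited Chinchilla rule of thumb is roughly 20 training tokens per parameter, which lands in the same ballpark as Meta's figure and shows how far beyond it the 15 trillion token run went:

# Rough Chinchilla-style estimate: ~20 training tokens per parameter.
params = 8e9                     # Llama3-8B
chinchilla_tokens = 20 * params  # ~160 billion tokens
actual_tokens = 15e12            # Meta's reported pre-training corpus

print(f"Chinchilla-optimal: ~{chinchilla_tokens / 1e9:.0f} billion tokens")
print(f"Actual training run: ~{actual_tokens / chinchilla_tokens:.0f}x that amount")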

However, in testing, Meta found that Llama 3’s performance continued to improve even when trained on larger datasets. “Both our eight billion and our 70 billion parameter models continued to improve log-linearly after we trained them on up to 15 trillion tokens,” the biz wrote.

The result, it seems, is a relatively compact model capable of generating results comparable to far larger models. The tradeoff in compute was likely considered worthwhile, as smaller models are generally easier to inference and thus easier to deploy at scale.

At 8-bit precision, an eight billion parameter model requires just 8GB of memory for its weights. Dropping to 4-bit precision – either using hardware that supports it or using quantization to compress the model – would roughly halve the memory requirements.
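
The arithmetic is simply parameter count multiplied by bytes per parameter, ignoring the extra room needed for activations and the key-value cache:

# Weight memory = parameters x bytes per parameter (KV-cache and
# activation overhead come on top of this).
params = 8e9  # Llama3-8B

for bits in (16, 8, 4):
    gigabytes = params * (bits / 8) / 1e9
    print(f"{bits}-bit weights: ~{gigabytes:.0f} GB")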

Meta trained the model on a pair of compute clusters each containing 24,000 Nvidia GPUs. As you might imagine, training on such a large cluster, while faster, also introduces some challenges – the likelihood of something failing in the middle of a training run increases.

To mitigate this, Meta explained it developed a training stack that automates error detection, handling, and maintenance. The hyperscaler also added failure monitoring and storage systems to reduce the overhead of checkpointing and rollback in case a training run is interrupted – a rough sketch of that checkpoint-and-resume pattern follows below. Once training was complete, Meta subjected the models to a series of post-training testing and fine-tuning steps.
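
Meta hasn't released that stack, but the underlying idea – periodically snapshotting model and optimizer state so a failed run can resume rather than restart from scratch – looks something like this hypothetical PyTorch sketch, which is our illustration rather than Meta's code.

# Hypothetical checkpoint/resume sketch, not Meta's training stack.
import os
import torch

CKPT_PATH = "checkpoint.pt"

def save_checkpoint(model, optimizer, step):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # nothing saved yet: start from the beginning
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume on the step after the snapshot

# Inside the training loop, save every N steps so a crash only loses recent work:
#   if step % 1000 == 0:
#       save_checkpoint(model, optimizer, step)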

Alongside Llama3-8B and 70B, Meta also rolled out new and updated trust and safety tools – including Llama Guard 2 and Cybersec Eval 2 – to help users safeguard the model against abuse and/or prompt injection attacks. Code Shield is another addition, providing guardrails designed to filter out insecure code generated by Llama 3.

As we’ve previously reported, LLM-assisted code generation has led to some interesting attack vectors that Meta is looking to avoid.

Availability

Over the next few months, Meta plans to roll out additional models – including one exceeding 400 billion parameters and supporting additional functionality, languages, and larger context windows. The latter will allow users to ask larger, more complex queries – like summarizing a large block of text.

Llama3-8B and 70B are currently available for download from Meta’s website. Amazon Web Services, Microsoft Azure, Google Cloud, Hugging Face, and others also plan to offer the model for deployment on their platforms.

If you want to test out Llama 3 on your machine, you can check out our guide on running local LLMs here. Once you’ve got it installed, you can launch it by running:

ollama run llama3
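
Once the model has been pulled, Ollama also exposes a local HTTP API on port 11434, so you can query it from a script. Here's a minimal Python example using the requests library – the prompt is just a placeholder:

# Query a locally running Ollama instance over its HTTP API.
# Assumes `ollama run llama3` (or `ollama pull llama3`) has already fetched the model.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3",
          "prompt": "Explain tokenization in one sentence.",
          "stream": False},
)
print(resp.json()["response"])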

Have fun and let us know how it went. ®
