OpenAI’s GPT-4o Mini isn’t much better than rival LLMs

AI Roundup OpenAI has made available GPT-4o Mini – a smaller and cheaper version of its GPT-4o generative large language model (LLM) – via its cloud.

The Microsoft-backed super lab said Thursday that GPT-4o Mini is like regular GPT-4o in that it’s multimodal – it can handle more than just the written word – has a context window of 128,000 tokens, and was trained on materials dated up to October 2023. The Mini version can emit up to 16,000 tokens of output.

While GPT-4o, OpenAI’s top-end model, costs $5 and $15 per million input and output tokens, respectively, the Mini edition costs 15 and 60 cents. You can halve those numbers if using delayed batch processing.
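Those per-token rates make bill estimates easy to script. A minimal sketch using the prices quoted above – the rates and the 50 percent batch discount are as OpenAI states them, while the helper function itself is our own illustration:

```python
# Price per million tokens in US dollars, per the rates quoted above
PRICES = {
    "gpt-4o":      {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def cost_usd(model, input_tokens, output_tokens, batch=False):
    """Estimate the charge for one request; delayed batch processing halves it."""
    p = PRICES[model]
    total = (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]
    return total / 2 if batch else total

# A million tokens in and a million out on Mini costs roughly 75 cents...
print(cost_usd("gpt-4o-mini", 1_000_000, 1_000_000))
# ...or about 37 cents via batch, versus $20 on full-size GPT-4o
print(cost_usd("gpt-4o-mini", 1_000_000, 1_000_000, batch=True))
print(cost_usd("gpt-4o", 1_000_000, 1_000_000))
```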

We’re told the cut-down version is not fully featured yet, supporting just text and vision via its API. Other input and output formats, such as audio, are coming in the indeterminate future. In creating GPT-4o Mini, OpenAI emphasized how safe it had made the thing, claiming to filter out offensive data from training materials and giving it the same guardrails as GPT-4o.
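With only text and vision live, a multimodal request mixes typed content parts inside one message. A sketch of the request body's shape, following OpenAI's chat completions format – the prompt text and image URL are placeholders of our own:

```python
import json

# One user message carrying both a text part and an image part,
# in the shape the chat completions endpoint expects
payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.png"}},  # placeholder
            ],
        }
    ],
    "max_tokens": 300,  # well under Mini's 16,000-token output cap
}

print(json.dumps(payload, indent=2))
```

Audio and other modalities would presumably arrive as further content part types once supported.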

Furthermore, OpenAI claimed GPT-4o Mini is ahead of comparable LLMs in benchmarks. Compared to Google’s lighter-weight Gemini Flash and Anthropic’s Claude Haiku, Mini was usually between five and 15 percent more accurate in tests such as MMLU. In two outliers it was nearly twice as accurate as the competition, and in another a little worse than Gemini Flash but still ahead of Claude Haiku, allegedly.

OpenAI’s GPT-4o Mini benchmark scores against its competitors … some good, some pretty close

Competition between OpenAI and Anthropic has a personal edge, as the latter was co-founded and built in part by executives and engineers from the former.

GPT-4o Mini looks good in the graph above for sure, though it doesn’t have an overall commanding lead – and that’s indicative of OpenAI’s recent loss of absolute leadership in the LLM arena. As veteran open source developer Simon Willison detailed in his keynote at the AI Engineer World’s Fair last month, 2024 has seen many of OpenAI’s competitors release their own GPT-4-class models.

“The best models are grouped together: GPT-4o, the brand new Claude 3.5 Sonnet and Google Gemini 1.5 Pro,” Willison declared. “I would classify all of these as GPT-4 class. These are the best available models, and we have options other than GPT-4 now. The pricing isn’t too bad either – significantly cheaper than in the past.”

At 82 percent accuracy in MMLU and a cost of 15 cents per million input tokens, GPT-4o Mini is mostly ahead of the pack. However, Willison noted the LMSYS Chatbot Arena benchmark provides a more realistic evaluation of LLM quality because actual humans are asked to compare outputs and choose which is better – a brute-force but effective way of ranking different models.

GPT-4o Mini is too new to be included in the tournament-style benchmark, though Willison noted that full-size GPT-4o is only barely ahead of its rivals. Anthropic’s flagship Claude 3.5 Sonnet currently has 1,271 points to GPT-4o’s 1,287. Gemini 1.5 Pro isn’t far behind at 1,267. Slightly less performant but still respectable models include Nvidia and Mistral’s brand-new Nemotron 4 340B Instruct at 1,209 points, and Meta’s Llama 3 70B Instruct at 1,201.
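Those arena points come from pairwise human votes folded into a chess-style rating. A minimal Elo-style update illustrates the idea – the K-factor is illustrative, and LMSYS's actual statistical model differs in its details:

```python
def elo_update(r_winner, r_loser, k=32):
    """After one human vote, pull the winner up and the loser down
    in proportion to how surprising the result was."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# A narrow favourite (1,287 vs 1,271) beating a close rival gains little...
a, b = elo_update(1287, 1271)
# ...while an upset win over a much stronger model moves ratings far more
c, d = elo_update(1201, 1287)
```

Repeated over many thousands of votes, small per-match nudges like these converge on the leaderboard positions quoted above.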

Willison also noted the Mini is cheaper than Claude 3 Haiku and Gemini 1.5 Flash.

OpenAI’s models, small and large, may lead on these test scores, but the lab no longer has the dominating lead it once had. That’s probably a good thing – between costly AI hardware and high power usage, the last thing AI needed was an LLM monopoly. ®

In other neural network news …

French upstart Mistral, along with Nvidia, has trained and released NeMo – described as the lab’s “new best small model” – under the Apache 2.0 open source license. It has 12 billion parameters and a 128,000-token context window, can be dropped in to replace compatible models such as Mistral 7B, and, as you’d expect, can generate text, code, and similar output.

German lab DeepL has released a model capable of translating English to and from Japanese, Mandarin, and German that it claims outperforms OpenAI’s GPT-4, Google, and Microsoft for “translation quality.” The biz has also been working on AI-powered writing support for French and Spanish, with Italian and Portuguese to follow.

Last month, Meta revealed it won’t bother training its AI models on European data due to the Union’s privacy laws. Now the Facebook giant’s decided against releasing a forthcoming multimodal Llama model – which can work with audio, video, and pictures as well as text – in the EU, again blaming regulations.

“We will release a multimodal Llama model over the coming months – but not in the EU due to the unpredictable nature of the European regulatory environment,” a Meta spokesperson said.

Meta also axed its generative AI in Brazil for similar reasons.

Google, IBM, Intel, Microsoft, Nvidia, Amazon, Anthropic, Cisco, Cohere, OpenAI, Wiz, and others have formed a so-called Coalition for Secure AI, which is expected to give everyone out there “guidance and tools” to deploy machine-learning systems securely. We’re told this is an open source initiative.

OpenAI is reportedly holding talks with Broadcom and other chip designers – including those who worked on Google’s TPU accelerators – about producing an AI server processor.

And OpenAI has written a paper [PDF] about an approach it developed that basically encourages LLMs – such as the GPT series – to produce text output that’s not only complex and correct, but still easy for humans to understand and use. Optimized output can at times be confusing for people, which this algorithm intends to tackle by challenging a “prover” model against a “verifier” model.
