OpenAI's GPT-5.4 Hits 83% on Professional Knowledge Benchmark as AI Race Intensifies
OpenAI released GPT-5.4 with record scores on workplace tasks and a massive 1 million token context window, while China's MiniMax launched M2.1 claiming industry-leading multi-language coding. The simultaneous launches highlight an accelerating competition for enterprise AI dominance.
OpenAI released GPT-5.4 on Thursday, claiming a record 83% score on its GDPval benchmark for professional knowledge work — a significant leap in the company's push to dominate enterprise AI. The new model comes in three versions: standard, a reasoning variant called GPT-5.4 Thinking, and GPT-5.4 Pro optimized for high performance, according to TechCrunch.
The launch positions OpenAI squarely against a surging field of competitors, including a simultaneous release from Chinese AI firm MiniMax, which claims its M2.1 model has reached "industry-leading levels" in multi-language programming. The timing isn't coincidental — both companies are racing to capture the lucrative market for AI that can actually do professional work, not just answer questions.
OpenAI's most striking technical achievement is the 1 million token context window available through its API — by far the largest the company has ever offered. For context, that's roughly 750,000 words, or about ten novels' worth of information the model can process at once. The company also emphasized improved token efficiency, saying GPT-5.4 solves problems with "significantly fewer tokens" than its predecessor, which translates directly to lower costs for enterprise customers.
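The arithmetic behind that comparison is straightforward. The sketch below is a back-of-envelope calculation, assuming the common rule of thumb of roughly 0.75 English words per token and an average novel length of about 75,000 words; neither figure comes from OpenAI.

```python
# Back-of-envelope scale of a 1-million-token context window.
# Assumptions (not OpenAI figures): ~0.75 words per token for English text,
# ~75,000 words for an average novel.

TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_NOVEL = 75_000

words = int(TOKENS * WORDS_PER_TOKEN)   # total words the window can hold
novels = words / WORDS_PER_NOVEL        # equivalent number of novels

print(f"{words:,} words, or about {novels:.0f} novels")
```

Under those assumptions the window works out to 750,000 words, or roughly ten novels, matching the comparison above.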
The model achieved record scores on OSWorld-Verified and WebArena Verified, benchmarks that test how well AI can actually use computers and navigate websites — capabilities essential for autonomous agents. Mercor CEO Brendan Foody said GPT-5.4 "excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis," delivering top performance while running faster and cheaper than competing models, according to his statement reported by TechCrunch.

OpenAI has also introduced a new system called Tool Search to address a practical bottleneck. Previously, every API call would include definitions for all available tools, consuming tokens even for tools the model wouldn't use. The new system lets models look up tool definitions only as needed — a seemingly minor change that could dramatically reduce costs for companies building complex AI systems with dozens of available functions.
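The mechanics are easy to sketch. The snippet below is a purely illustrative mock-up of the eager-versus-lazy difference, not OpenAI's actual API; every name in it (`TOOL_REGISTRY`, `eager_payload`, `lazy_payload`) is hypothetical.

```python
# Illustrative mock-up of deferred tool-definition loading ("tool search").
# All names here are hypothetical; this is not OpenAI's actual API.

TOOL_REGISTRY = {
    "get_weather": {
        "description": "Look up current weather for a city.",
        "parameters": {"city": "string"},
    },
    "send_email": {
        "description": "Send an email to a recipient.",
        "parameters": {"to": "string", "body": "string"},
    },
    # ...imagine dozens more entries, each costing tokens to transmit...
}

def eager_payload(prompt: str) -> dict:
    """Old approach: ship every tool definition with every call."""
    return {"prompt": prompt, "tools": list(TOOL_REGISTRY.values())}

def lazy_payload(prompt: str, needed: list[str]) -> dict:
    """New approach: include only the definitions this call actually needs."""
    return {"prompt": prompt, "tools": [TOOL_REGISTRY[name] for name in needed]}

eager = eager_payload("What's the weather in Oslo?")
lazy = lazy_payload("What's the weather in Oslo?", needed=["get_weather"])

# The lazy payload carries one definition instead of the whole registry.
print(len(eager["tools"]), len(lazy["tools"]))
```

With dozens of registered tools, the eager payload grows with the registry while the lazy payload stays proportional to what the call actually uses, which is why the change matters for complex agent systems.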
The company claims GPT-5.4 is 33% less likely to make errors in individual claims compared to GPT-5.2, with overall responses 18% less likely to contain errors. That's a meaningful improvement, though the fact that OpenAI is still highlighting incremental progress on hallucinations underscores how persistent the problem remains across the industry.
Meanwhile, MiniMax's M2.1 release takes direct aim at OpenAI's perceived weakness in non-Python programming languages. The Chinese company claims it has "systematically enhanced capabilities in Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, JavaScript, and other languages," reaching what it calls "industry-leading levels" that cover "the complete chain from low-level system development to application layer development."
MiniMax specifically emphasized mobile development — "a widely recognized weakness across the industry" — claiming M2.1 "significantly strengthens native Android and iOS development capabilities." The company also touted improved performance in Agent frameworks, saying the model "exhibits consistent and stable results in tools such as Claude Code, Droid (Factory AI), Cline, Kilo Code, Roo Code, and BlackBox."
The testimonials MiniMax published suggest it's already gaining traction with developer tool companies. "We're excited for powerful open-source models like M2.1 that bring frontier performance (and in some cases exceed the frontier) for a wide variety of software development tasks," said Eno Reyes, co-founder and CTO of Factory AI. Saoud Rizwan, founder and CEO of Cline, noted that "Minimax M2 series has demonstrated powerful code generation capability, and has quickly become one of the most popular models on the Cline platform during the past few months."
OpenAI included a new safety evaluation specifically for chain-of-thought reasoning — the running commentary models provide to show their thought process. AI safety researchers have worried that reasoning models could deliberately misrepresent their thinking, and OpenAI's testing confirms "it can happen under the right circumstances." However, the company claims deception is less likely in GPT-5.4 Thinking, "suggesting that the model lacks the ability to hide its reasoning and that CoT monitoring remains an effective safety tool," according to TechCrunch.
That's a reassuring finding, but also a reminder that OpenAI is now explicitly testing whether its models can lie about their reasoning — a concern that would have seemed absurd just a few years ago.
The simultaneous launches from OpenAI and MiniMax reveal how the AI race is fragmenting along geographical and architectural lines. OpenAI is betting on massive context windows and reasoning capabilities to dominate enterprise knowledge work. MiniMax is positioning itself as the superior choice for actual software development, particularly in the multi-language, mobile-first environments where most real-world code is written.
Both strategies could succeed — the market for enterprise AI is vast enough to support multiple winners. But the intensity of the competition is driving rapid capability gains that seemed impossible even six months ago. A model that scores 83% on professional knowledge tasks, or one that can genuinely write production-quality code in a dozen programming languages, represents a fundamental shift in what's economically feasible to automate.
The question isn't whether these models work — the benchmarks and testimonials suggest they increasingly do. The question is how quickly enterprises can actually integrate them into workflows, and whether the technology is advancing faster than organizations can adapt. Based on the pace of releases this week, that adaptation challenge is only going to get harder.