Skip to content
FAQ

OpenAI's GPT-5.4 Surpasses Human Performance at Computer Control, Reshaping What AI Agents Can Do

OpenAI's GPT-5.4, released March 5, is the first general-purpose AI model to surpass human baseline performance on computer-use benchmarks, achieving a 75% success rate on OSWorld-Verified versus the 72.4% human baseline. With a 1-million-token context window and native support for mouse, keyboard, and screenshot interactions, GPT-5.4 marks a turning point for autonomous AI agents in enterprise and developer workflows.

6 min read

The Line Has Been Crossed: AI Can Now Operate a Computer Better Than Most Humans

When OpenAI released GPT-5.4 on March 5, the most significant number in the launch announcement was not its score on a math reasoning benchmark or a coding leaderboard. It was 75.0% — the model’s success rate on OSWorld-Verified, a benchmark that tests an AI system’s ability to complete real desktop computing tasks. The human baseline on that same benchmark is 72.4%.

For the first time, a general-purpose AI model can operate a computer more reliably than the average human can.

That is not a marginal improvement in an arcane capability. It is a crossing of a threshold that researchers and practitioners have been anticipating for years — one with profound implications for how software gets built, how enterprises automate workflows, and what it means to employ people for computer-centric knowledge work.

What GPT-5.4 Actually Does

GPT-5.4 is the first model in OpenAI’s flagship series to ship with native computer-use capabilities. Rather than interacting with software through specialized APIs or custom integrations, the model operates computers the way a human does: by observing the screen through screenshots, moving a virtual cursor, clicking interface elements, typing text, and navigating across applications.

This approach — sometimes called “screenshot-driven computer use” — is significantly more generalizable than API-based automation. An API integration only works for software that exposes structured endpoints. Computer use works for anything that has a visual interface, which is to say, virtually everything a knowledge worker touches in a day.

OpenAI reports that GPT-5.4 achieves:

  • 75.0% on OSWorld-Verified, the leading desktop task completion benchmark (human baseline: 72.4%; GPT-5.2: 47.3%)
  • 67.3% on WebArena-Verified, testing autonomous browser navigation
  • 92.8% on Online-Mind2Web, testing browser use via screenshot observations alone

The jump from GPT-5.2’s 47.3% on OSWorld to GPT-5.4’s 75.0% in under four months represents a 28-percentage-point improvement — a pace of capability gain that makes year-over-year comparisons feel almost quaint. As recently as early 2025, the best AI systems were failing roughly 70% of basic desktop tasks. Today’s best is failing roughly 25%.

The Architecture Behind the Leap

OpenAI has been relatively tight-lipped about the specific architectural changes driving GPT-5.4’s computer-use gains, but several factors are evident from the model’s behavior and benchmark profile.

The model ships with a 1-million-token context window — matching Google’s Gemini 3.1 Pro and Anthropic’s Claude Opus 4.6 — which is critical for computer use. Navigating a complex workflow across multiple applications generates substantial context: screenshots, DOM snapshots, action histories, and planning traces. A smaller context window forces the model to discard information that may be needed many steps later in a long-horizon task. With 1 million tokens, GPT-5.4 can hold an entire workday’s worth of context simultaneously.

The model was also trained with significantly expanded computer-interaction data. OpenAI’s internal “Atlas” agent infrastructure — used to generate the synthetic demonstrations that train computer-use models — has been running at scale for over a year, accumulating vast datasets of successful task completions across operating systems, browsers, productivity suites, and developer tools. This training data advantage, more than any architectural novelty, likely accounts for the largest share of the benchmark improvements.

Additionally, GPT-5.4 is 33% less likely to produce false individual claims and 18% less likely to contain errors compared to GPT-5.2. For computer-use agents, factual accuracy matters enormously — a hallucinated file path or an incorrect form field entry can cascade into a task failure many steps down the line.

Model Family: Thinking, Pro, Mini, and Nano

GPT-5.4 shipped initially in two variants: GPT-5.4 Thinking (the full reasoning model for complex, multi-step tasks) and GPT-5.4 Pro (the flagship variant optimized for a balance of speed and capability). On March 17, OpenAI released GPT-5.4 mini and GPT-5.4 nano.

GPT-5.4 mini is available to free-tier ChatGPT users and is designed for high-volume, lower-latency applications. GPT-5.4 nano — the smallest variant — is available exclusively through the OpenAI API and is priced at $2.50 per million input tokens, targeting developers building agents that need to make thousands of fast, inexpensive calls.

The stratified model family matters for the computer-use story. Complex, multi-hour computer tasks benefit from the Thinking variant’s extended reasoning chains. But many computer-use applications — filling out a form, extracting data from a website, scheduling a meeting — are simple enough that GPT-5.4 mini can complete them at a fraction of the cost, making widespread autonomous deployment economically viable.

Native Financial Plugins and Enterprise Integration

Beyond core computer use, GPT-5.4 ships with native financial plugins for Microsoft Excel and Google Sheets — a product decision that signals where OpenAI believes near-term enterprise value will be captured. Financial analysts, accountants, operations teams, and data scientists spend enormous portions of their working lives in spreadsheet software. A model that can navigate, manipulate, and generate complex spreadsheet content autonomously represents an immediate productivity lever for Fortune 500 deployments.

OpenAI has also deepened integrations with Microsoft’s Copilot ecosystem, where GPT-5.4’s computer-use capabilities can be invoked directly from within Microsoft 365 applications. This positions GPT-5.4 as the cognitive layer underlying Microsoft’s enterprise AI strategy — a relationship that has already generated billions in OpenAI revenue and shows no sign of slowing.

The Competitive Landscape

GPT-5.4’s computer-use performance is a significant move in what has become a fiercely contested product category. Anthropic’s Claude Opus 4.6 holds the top spot on the LMSYS Chatbot Arena’s overall Elo rankings, but GPT-5.4’s computer-use lead is meaningful: Claude’s computer-use implementation, while capable, has not yet matched GPT-5.4’s OSWorld numbers.

Google’s Project Mariner — built on Gemini 3.1 Pro — is the other serious competitor in autonomous browser use, and its WebArena performance is competitive with GPT-5.4. The contest among the three labs over computer-use benchmarks is likely to intensify through 2026 as each works to release successive model updates.

What makes GPT-5.4’s position particularly strong is the combination of computer use with the 1-million-token context window and the new nano pricing tier. That combination — long-context multi-step task completion at sub-cent-per-task pricing — is what enterprise automation buyers have been waiting for.

What This Means for Knowledge Work

The crossing of the human-baseline threshold on OSWorld is a symbolic and practical milestone. Practically, it means that for a significant class of common desktop tasks — navigating file systems, filling web forms, operating productivity software, collecting data from websites — a well-prompted GPT-5.4 agent is now more reliable than an average human worker on the same task.

This does not mean AI agents will immediately replace knowledge workers at scale. Computer use is still unreliable for novel, highly variable task environments. Agents still fail on tasks requiring nuanced judgment, interpersonal context, or creativity that goes beyond recombination of observed patterns. And enterprise deployment of autonomous agents at scale introduces security, compliance, and auditability concerns that are not yet fully resolved.

But the trend is unambiguous. From 47.3% to 75.0% in four months. The curve is steep, the direction is clear, and the implications for how enterprises structure computer-centric work are beginning to move from theoretical to urgent.

The Agentic Era Has Begun in Earnest

Reflecting on the release, observers noted that GPT-5.4 is the first OpenAI model that could, in principle, perform many of the tasks on a standard software engineer’s ticket queue — running tests, filing bug reports, updating documentation, navigating CI/CD dashboards — without human supervision on each step.

That framing captures something important. The debate over AI and knowledge work has often been abstract: “AI will change jobs” without specificity about which jobs, which tasks, and on which timeline. GPT-5.4’s computer-use performance provides the specificity. The tasks most immediately in scope are not the creative, judgment-heavy, relationship-intensive tasks at the core of most knowledge roles. They are the high-volume, rule-bounded, interface-navigation tasks that consume 20–40% of the average knowledge worker’s time.

Automating that slice efficiently, reliably, and cheaply is not a trivial productivity gain. It is a structural shift in the economics of knowledge work — and based on OSWorld, we have just crossed the performance threshold that makes it real.

OpenAI GPT-5.4 computer use AI agents benchmarks agentic AI
Share

Related Stories

JetBrains Survey: Over Half of All GitHub Code Is Now AI-Generated

A JetBrains survey of 10,000+ developers finds that 51% of code committed to GitHub is now AI-generated or substantially AI-assisted, with the AI coding tools market hitting $12.8B. GitHub Copilot retains the widest adoption but satisfaction growth has stalled, while Claude Code has surged to lead all tools on satisfaction metrics with 91% CSAT.

5 min read