US Government Will Test Google, Microsoft, and xAI Models Before Release Under New NIST Agreements
NIST's Center for AI Standards and Innovation has signed pre-deployment testing agreements with Google DeepMind, Microsoft, and xAI, expanding a program that began with OpenAI and Anthropic in 2024. Under the deals, companies hand over unreleased models with reduced safety guardrails so government evaluators can assess national security risks before the public ever sees them.
The U.S. government has quietly secured one of its most significant expansions of AI oversight authority since the Biden-era executive order on AI. NIST's Center for AI Standards and Innovation (CAISI) announced on May 5 that it has signed pre-deployment testing agreements with Google DeepMind, Microsoft, and xAI, giving federal evaluators early access to frontier models before those systems are released to the public.
The agreements expand a program that started in 2024 when CAISI signed its first such deals with OpenAI and Anthropic. Since then, the center has completed more than 40 model evaluations, including assessments of cutting-edge systems not yet publicly available. Bringing in Google DeepMind, Microsoft, and xAI means the program now covers virtually every major frontier lab operating in the United States.
What the Agreements Actually Allow
The technical scope of these agreements is broader than that of most AI governance announcements. Under the deals, companies provide CAISI with model versions whose safety guardrails have been reduced or removed; the idea is that evaluators need to see what a model is genuinely capable of without the behavioral ceiling imposed by RLHF-based alignment constraints.
This is materially different from the access safety researchers typically get through voluntary red-teaming programs. When an external team stress-tests a model through a standard API, it sees the aligned, filtered version. What CAISI receives is closer to a raw capability evaluation: one that reveals the underlying potential for generating bioweapon instructions, planning cyberattacks, or assisting with mass-casualty scenarios, the very outputs that production safety layers are designed to suppress.
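To make the distinction concrete, here is a minimal sketch, in Python, of how an evaluator might compare a production endpoint against a reduced-guardrail checkpoint on the same probe set. Everything in it is hypothetical: CAISI's actual tooling, probe content, and model-access interfaces are not public, and the keyword refusal check is a deliberately crude stand-in for the classifier-and-human-review pipelines real evaluations rely on.

```python
"""Minimal sketch of a capability-elicitation comparison.

Hypothetical illustration only: CAISI's real tooling, probe sets, and
model-access interfaces are not public. `guarded` stands in for a
production endpoint; `raw` for a reduced-guardrail evaluation build.
"""
from dataclasses import dataclass
from typing import Callable

# A model here is just text in, text out; real access would go through
# whatever interface the lab provides under the agreement.
Model = Callable[[str], str]

# Crude refusal markers; real evaluations use trained classifiers and
# human review, not substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to help")


@dataclass
class ProbeResult:
    probe: str
    guarded_refused: bool
    raw_refused: bool


def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def compare(probes: list[str], guarded: Model, raw: Model) -> list[ProbeResult]:
    """Run the same probes against both versions. Probes the guarded
    model refuses but the raw model answers approximate the capability
    that production safety layers are suppressing."""
    return [ProbeResult(p, is_refusal(guarded(p)), is_refusal(raw(p)))
            for p in probes]


if __name__ == "__main__":
    def guarded(prompt: str) -> str:  # stand-in: always refuses
        return "I can't help with that request."

    def raw(prompt: str) -> str:  # stand-in: always answers
        return "Here is a detailed answer..."

    # Placeholder probes; real sets target CBRN, cyber, and other
    # national-security-relevant capabilities.
    probes = ["<redacted dual-use probe 1>", "<redacted dual-use probe 2>"]

    for r in compare(probes, guarded, raw):
        print(f"{r.probe}: guarded_refused={r.guarded_refused}, "
              f"raw_refused={r.raw_refused}")
```

The point of the comparison is the gap: probes the guarded model refuses but the raw model answers mark capabilities the production safety layer is suppressing, not capabilities the model lacks.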
Evaluators from across the federal government participate through the TRAINS Taskforce (the Testing Risks of AI for National Security Taskforce), an interagency group that coordinates AI security assessments and feeds findings back to policymakers. The taskforce draws expertise from defense, intelligence, and civilian agencies, giving the evaluations a breadth of threat-modeling perspectives that no single agency could replicate.
Why This Matters Beyond the Headlines
AI governance initiatives are frequently celebrated in press releases and ignored in practice. This one is different, for two reasons.
First, the voluntary nature of these agreements is not as soft as it sounds. All five of the labs now covered by CAISI agreements — OpenAI, Anthropic, Google DeepMind, Microsoft, and xAI — have built their public reputations around responsible AI development. Defecting from a CAISI agreement, or being seen as obstructing federal oversight, carries real reputational costs in an environment where enterprise and government contracts depend on trust. The agreements function partly as a market discipline mechanism.
Second, the practical output of 40-plus evaluations over roughly 18 months is a growing institutional body of knowledge about what frontier models can and cannot do. That knowledge base does not exist in any single company’s AI safety team. By sitting at the center of evaluations across all major labs, CAISI is accumulating comparative intelligence about capability trajectories — the kind of information that is essential for anticipating risks before they materialize, rather than after.
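As a toy illustration of that comparative vantage point, the sketch below tracks evaluation scores per lab and capability domain over time, the kind of longitudinal record that can surface a steep trajectory before it becomes a deployed risk. It is hypothetical throughout: CAISI's internal data model is not public, and the labs, domains, dates, and scores are invented.

```python
"""Toy cross-lab capability ledger. Illustrative only: labs, domains,
dates, and scores are invented; CAISI's internal records are not public."""
from collections import defaultdict
from datetime import date

# (lab, capability domain) -> time-ordered evaluation scores in [0, 1]
ledger: dict[tuple[str, str], list[tuple[date, float]]] = defaultdict(list)


def record(lab: str, domain: str, when: date, score: float) -> None:
    ledger[(lab, domain)].append((when, score))
    ledger[(lab, domain)].sort()


def trend_per_month(lab: str, domain: str) -> float:
    """Naive trend: score change per 30 days between the first and most
    recent evaluation. A real system would fit something sturdier."""
    points = ledger[(lab, domain)]
    if len(points) < 2:
        return 0.0
    (t0, s0), (t1, s1) = points[0], points[-1]
    return (s1 - s0) / max((t1 - t0).days, 1) * 30


# Invented data: two labs evaluated on the same domain over time.
record("lab_a", "cyber-offense", date(2024, 9, 1), 0.31)
record("lab_a", "cyber-offense", date(2025, 4, 1), 0.58)
record("lab_b", "cyber-offense", date(2025, 1, 15), 0.44)
record("lab_b", "cyber-offense", date(2025, 4, 20), 0.49)

# The comparative view no single lab's safety team could assemble alone.
for (lab, domain) in sorted(ledger):
    print(f"{lab} / {domain}: trend per 30 days = {trend_per_month(lab, domain):+.3f}")
```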
The Absence of Chinese Labs
The glaring gap in the CAISI program is the absence of any Chinese frontier labs — most obviously DeepSeek. DeepSeek’s V4 model, released in late April and widely benchmarked as competitive with GPT-5 on several dimensions, operates outside any U.S. oversight framework. Its training data, safety alignment methodology, and capability ceiling are known only through the company’s own technical reports and external reverse-engineering efforts.
The asymmetry is increasingly uncomfortable. U.S. labs are now subject to pre-deployment government review that constrains their operational flexibility, while Chinese competitors face no equivalent scrutiny in their domestic market, and no international framework exists that could impose it.
CAISI officials have repeatedly flagged this gap in testimony and public statements. The challenge is not finding the will to include Chinese models in evaluation frameworks — it is that Chinese labs have no legal obligation to cooperate, and diplomatic mechanisms for AI safety coordination between Washington and Beijing remain embryonic at best.
The xAI Surprise
The inclusion of xAI — Elon Musk’s AI company — in this round of CAISI agreements is the most politically unexpected element of the announcement. Musk has been one of the most vocal critics of federal AI regulation, and his public posture has frequently been adversarial toward the kinds of oversight mechanisms these agreements represent.
The agreement’s timing is also notable: it comes amid the ongoing legal proceedings between Musk and OpenAI, and as xAI prepares what is widely expected to be a major Grok model release in the coming months. Participating in CAISI testing could serve multiple purposes for xAI — demonstrating good faith to government clients, providing a preview of regulatory compliance posture ahead of potential IPO scrutiny, and positioning Grok as a “safe” option for government deployment.
What Comes Next
CAISI has framed these agreements as a foundation for a more durable AI oversight system, but the current structure has no enforcement teeth. If an evaluation identifies a serious national security risk in an unreleased model, there is no automatic legal mechanism that prevents the company from releasing it anyway. The agreements are collaborative, not regulatory.
Congress has been attempting since 2024 to pass legislation that would formalize pre-deployment review requirements for frontier AI systems; multiple bills have stalled in committee. The voluntary framework CAISI has assembled is partly a hedge against legislative inaction, demonstrating that oversight is possible without mandated review, and partly a structural argument for why legislation should codify and fund what is already working.
The Biden-era framework that birthed CAISI has survived into the current administration, which has otherwise taken a skeptical posture toward AI regulation. Whether it survives the next political cycle, and whether it acquires any binding authority in the meantime, will depend on developments that go far beyond the technical competence of any AI lab.