Generative AI in the Public Sector: A Policy Roadmap
In June 2025, the U.S. State Department announced that it would begin using StateChat, a generative AI chatbot built on Palantir and Microsoft technology, to assist panels that determine promotions and assignments for foreign service officers. The announcement drew little of the public attention that might have accompanied a similar move in the private sector, in part because it arrived amid a much larger wave of federal AI adoption. According to the Government Accountability Office, the total number of reported AI use cases across eleven selected federal agencies nearly doubled, from 571 in 2023 to 1,110 in 2024, and GAO reported a ninefold jump in generative AI use cases specifically over the same period.
The StateChat example is worth pausing on because it captures, in a single decision, the central tension this article examines. A government agency has deployed a large language model to assist with one of the most consequential personnel decisions an employee will experience: a promotion decision that shapes the trajectory of a career in public service. The tool is built by private vendors, governed by evolving and incomplete federal policy, and operated with limited public visibility into how its outputs are weighted or reviewed. None of this means the deployment is wrong. It may well improve consistency and reduce the kind of interpersonal bias that has long affected promotion panels. But it illustrates why generative AI adoption in government cannot be evaluated by the same standard as adoption in a private firm. The stakes, the accountability relationships, and the public's expectations are categorically different, and the policy framework needs to reflect that difference explicitly rather than treating government AI adoption as a slower-moving version of the private sector trend.
The Adoption Curve Has Outpaced the Policy Infrastructure
The federal government's approach to generative AI has moved through three distinct phases in a remarkably short period, and each phase has left policy gaps that the next phase inherited.
The first phase, in 2023, was largely exploratory. Agencies stood up pilot projects, often in partnership with major AI vendors, to test generative AI on internal tasks: document summarization, code assistance, and customer service triage. The FDA's Elsa tool, designed to help employees read, write, and summarize regulatory documents, exemplifies this phase: a contained, internally-focused application with relatively low stakes if it underperformed.
The second phase, through 2024 and into 2025, was characterized by rapid scaling and procurement innovation. The General Services Administration's OneGov strategy removed cost as a primary barrier to adoption, making popular generative AI tools available to agencies for a dollar or less, including an arrangement with xAI that gave federal agencies access to Grok models for $0.42 per organization. This phase dramatically lowered the friction for agencies to adopt generative AI tools, but it did so faster than the policy guidance needed to govern appropriate use could be developed and disseminated.
The third phase, beginning in late 2025, has been defined by an attempt to catch policy up to adoption, with mixed results. The December 2025 executive order on a National Policy Framework for Artificial Intelligence and the accompanying OMB memorandum M-26-04 represent the most significant federal attempt yet to establish governing principles, but as the next section discusses, the content of those principles raises its own questions about what "accountability" and "public trust" mean in this context.
Throughout all three phases, GAO's review found a consistent pattern: officials at ten of twelve selected agencies reported that existing federal policy, particularly around data privacy and IT acquisitions, was not designed with generative AI in mind and often presented obstacles that agencies had to work around rather than guidance that helped them proceed safely. This is the structural problem any roadmap for public sector AI must address: the policy framework has been retrofitted onto adoption that already happened, rather than shaping adoption as it occurred.
Accountability: What "Human in the Loop" Means When the Loop Is Hard to See
The most common policy response to concerns about high-stakes AI use, in government and the private sector alike, is to require a "human in the loop": a human must review, and be able to override, AI-generated outputs before they take effect. This requirement is necessary. It is not, by itself, sufficient, and the StateChat example illustrates why.
A human in the loop is a meaningful accountability mechanism only if the human has the information, time, and institutional incentive to exercise independent judgment rather than simply ratify the AI's output. A promotion panel member reviewing an AI-generated assessment of dozens of candidates, under time pressure, with the AI's output presented as a default recommendation, faces a structurally different decision than one starting from a blank slate. Research on automation bias, the tendency of human reviewers to defer to algorithmic recommendations even when they have the formal authority to override them, suggests that the mere presence of a human reviewer does not guarantee that the human's judgment is the operative factor in the final decision.
For government agencies, this creates a documentation and audit requirement that goes beyond what "human in the loop" policies typically specify. A defensible accountability framework needs to record not just that a human reviewed an AI output, but what information the human had access to, how much time was available for review, whether the AI's recommendation was presented as a default or as one option among several, and how often human reviewers in practice diverge from AI recommendations. Without this level of documentation, "human in the loop" becomes a compliance checkbox rather than a genuine accountability mechanism, and agencies deploying generative AI in personnel, benefits determination, law enforcement, or regulatory contexts should be required to maintain it.
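The documentation elements listed above can be sketched as a simple record type. This is a minimal illustration, not a description of any agency's actual system: the class name, every field name, and the sample data are hypothetical, chosen only to mirror the four items in the paragraph (information available, time available, default presentation, and divergence from the AI recommendation).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ReviewRecord:
    """One human review of one AI output. All field names are hypothetical."""

    case_id: str
    materials_available: tuple[str, ...]  # what the reviewer could actually see
    time_available: timedelta             # time the reviewer had for this case
    shown_as_default: bool                # True if the AI output was the preselected option
    ai_recommendation: str
    human_decision: str

    @property
    def diverged(self) -> bool:
        """True when the reviewer's decision differs from the AI recommendation."""
        return self.human_decision != self.ai_recommendation


def divergence_rate(records: list[ReviewRecord]) -> float:
    """Fraction of reviews in which the human departed from the AI recommendation."""
    if not records:
        raise ValueError("no review records to summarize")
    return sum(r.diverged for r in records) / len(records)


# Illustrative data only; not drawn from any real system.
records = [
    ReviewRecord("case-1", ("full_file", "ai_summary"), timedelta(minutes=15),
                 True, "advance", "advance"),
    ReviewRecord("case-2", ("ai_summary",), timedelta(minutes=4),
                 True, "advance", "hold"),
    ReviewRecord("case-3", ("full_file", "ai_summary"), timedelta(minutes=20),
                 False, "hold", "hold"),
]
print(f"{divergence_rate(records):.2f}")  # 0.33
```

A divergence rate near zero would not prove that reviewers are rubber-stamping outputs, but read alongside short review times and default-style presentation it would give auditors a concrete signal to investigate.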
Brookings' analysis of federal AI adoption raises a related concern specific to the current policy moment: the combination of the absence of comprehensive federal AI legislation, federal preemption of state-level AI regulation, and limited agency reporting on risk mitigation for high-risk AI systems creates a public trust problem that exists independently of whether any individual AI deployment is functioning well. A public that does not see binding accountability structures is unlikely to extend trust based on agency assurances alone, regardless of how those individual systems actually perform.
Data Sovereignty: Where Government Data Goes When It Talks to a Model
Data sovereignty, the question of where government data is processed, who has access to it, and under what legal jurisdiction it falls, has become one of the most technically complex and politically significant issues in public sector AI adoption, and the GSA OneGov arrangements illustrate why.
When a federal agency uses a commercial large language model, whether through a direct subscription, a OneGov arrangement, or an API integration, the agency's prompts and the data included in those prompts are transmitted to the vendor's infrastructure for processing. For low-sensitivity use cases, such as summarizing publicly available materials, this raises minimal concern. For higher-sensitivity use cases, such as those involving personnel records, law enforcement data, regulatory enforcement information, or anything covered by the Privacy Act, the question of what happens to that data once it leaves agency systems becomes a central policy question, not a peripheral one.
Research on applying generative AI in sensitive government contexts has highlighted on-premises and locally-hosted model deployment as one approach to addressing data sovereignty concerns directly: running open-source language models on agency-controlled infrastructure, rather than transmitting data to external commercial vendors, preserves the auditability and data control that sensitive use cases require, at the cost of the performance and capability advantages that the largest commercial models currently offer. This represents a genuine tradeoff that procurement policy has not yet adequately addressed: agencies handling the most sensitive data may need to accept less capable models in exchange for data sovereignty, while agencies with lower-sensitivity use cases can access more capable commercial models with correspondingly different risk profiles. A one-size-fits-all procurement approach, treating all generative AI use cases as equivalent regardless of data sensitivity, is not adequate to this distinction.
The rapid, low-cost procurement arrangements that have characterized the second phase of federal AI adoption, while valuable for lowering barriers to experimentation, have generally not been structured around this distinction. A policy roadmap for public sector AI needs to establish data sensitivity tiers as a foundational element of AI procurement and deployment policy, with correspondingly different requirements for vendor data handling, model hosting location, and audit access at each tier.
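One way to picture tiered requirements is as a small policy table that a procurement check consults before a deployment is approved. The sketch below is purely illustrative: the three tier names, the hosting labels, and the specific rules attached to each tier are assumptions made for the example, not drawn from any existing federal policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical sensitivity tiers; the names and boundaries are assumptions."""

    PUBLIC = 1      # e.g. publicly available materials
    INTERNAL = 2    # non-public agency data outside the Privacy Act
    RESTRICTED = 3  # e.g. personnel, law enforcement, Privacy Act records


@dataclass(frozen=True)
class TierRequirements:
    allowed_hosting: frozenset[str]  # where the model may run
    vendor_may_retain_data: bool     # may the vendor keep prompts or outputs?
    audit_access_required: bool      # must the agency be able to audit the system?


# Assumed policy table covering the three dimensions named in the text:
# vendor data handling, model hosting location, and audit access.
REQUIREMENTS = {
    Tier.PUBLIC: TierRequirements(
        frozenset({"commercial_cloud", "government_cloud", "on_premises"}), True, False),
    Tier.INTERNAL: TierRequirements(
        frozenset({"government_cloud", "on_premises"}), False, True),
    Tier.RESTRICTED: TierRequirements(
        frozenset({"on_premises"}), False, True),
}


def deployment_allowed(tier: Tier, hosting: str,
                       vendor_retains_data: bool, audit_access: bool) -> bool:
    """Check a proposed deployment against the assumed requirements for its tier."""
    req = REQUIREMENTS[tier]
    if hosting not in req.allowed_hosting:
        return False
    if vendor_retains_data and not req.vendor_may_retain_data:
        return False
    if req.audit_access_required and not audit_access:
        return False
    return True


# A commercial chatbot is acceptable for public documents but not personnel records.
print(deployment_allowed(Tier.PUBLIC, "commercial_cloud", True, False))       # True
print(deployment_allowed(Tier.RESTRICTED, "commercial_cloud", False, True))   # False
```

Encoding the tiers as data rather than case-by-case judgment makes the tradeoff explicit: the same tool can be approved for one tier and rejected for another, and the reason is inspectable.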
Public Trust: The Neutrality Mandate and Its Complications
OMB memorandum M-26-04, issued in December 2025, requires that all large language models procured by government agencies adhere to principles of truth-seeking and ideological neutrality. On its face, this is a reasonable requirement: a government tool used to draft regulatory language, summarize public comments, or assist in personnel decisions should not embed a political viewpoint that systematically advantages one perspective over another.
In practice, operationalizing "ideological neutrality" for a large language model is a genuinely difficult technical and definitional problem, and the policy roadmap needs to grapple with that difficulty rather than treating the requirement as self-explanatory. Language models are trained on large corpora of text that reflect the distribution of perspectives present in that text, and any model will reflect that distribution in some form. Defining "neutral" requires a reference point, and the choice of reference point is itself a judgment that can be contested. A policy that mandates neutrality without specifying how neutrality is measured, who measures it, and what recourse exists if a model's outputs are found to diverge from the standard, risks becoming either unenforceable or selectively enforced.
This matters for public trust in a specific way: if the neutrality mandate is perceived as a mechanism for political oversight of government AI tools rather than a genuine quality standard, it could undermine rather than build the public trust it is intended to protect. A policy roadmap should pair the neutrality principle with a transparent, technically grounded measurement methodology, ideally one developed with input from outside the procuring agency, and with a public reporting mechanism that allows the standard to be audited rather than simply asserted.
A Roadmap for Responsible Integration
Drawing together the accountability, data sovereignty, and public trust dimensions examined above, a responsible framework for generative AI integration in government should rest on four foundations.
First, accountability documentation requirements should scale with the stakes of the decision an AI system informs. Personnel decisions, benefits determinations, law enforcement applications, and regulatory enforcement should carry the highest documentation burden, including records of what information human reviewers had access to and how often they diverged from AI recommendations, while low-stakes internal productivity tools can operate under lighter requirements.
Second, data sensitivity tiers should be a foundational element of AI procurement, with on-premises or locally-hosted model options available and appropriately resourced for the highest-sensitivity use cases, even where this means accepting less capable models than the commercial state of the art.
Third, neutrality and bias standards for procured AI systems should be paired with transparent, externally-validated measurement methodologies and public reporting, so that the standard functions as a genuine quality benchmark rather than an assertion that cannot be independently verified.
Fourth, and perhaps most importantly, the National AI Use Case Inventory process that GAO and OMB have begun to formalize should be expanded into a standing public reporting requirement, not just an internal management tool, so that the pace of adoption documented by GAO, nearly doubling in a single year, is matched by a corresponding pace of public visibility into what these systems do, how they are reviewed, and how they perform over time. The technology will continue to move quickly. The public's ability to see how the government is using it should move with it, not behind it.
Key Takeaways
- Reported AI use cases across eleven selected federal agencies nearly doubled from 571 in 2023 to 1,110 in 2024, with a ninefold jump in generative AI use cases specifically.
- "Human in the loop" is only meaningful when reviewers have time, context, and incentive to diverge from AI defaults; otherwise it becomes a compliance checkbox.
- Data sensitivity tiers should be foundational to AI procurement, with on-premises options for the highest-sensitivity use cases.
- Neutrality mandates need transparent, externally-validated measurement; otherwise they risk eroding the public trust they aim to protect.
- The National AI Use Case Inventory should become a standing public reporting requirement, not just an internal management tool.