Chapter 7: Establishing Justified Confidence in AI Systems

Following is a summary of Part I, Chapter 7 of the National Security Commission on Artificial Intelligence's final report.

Artificial intelligence (AI) systems must be developed and fielded with justified confidence.1

If AI systems routinely do not work as designed, or are unpredictable in ways that can have significant negative consequences, then leaders will not adopt them, operators will not use them, Congress will not fund them, and the American people will not support them.

Achieving acceptable AI performance is often linked to a decision to accept some level of risk.

As departments and agencies rely more heavily on machines, a central guiding principle across national security scenarios is the continued centrality of human judgment. Those charged with utilizing AI need an informed understanding of risks, opportunities, and tradeoffs.

Ultimately, they need to formulate an educated answer to this question: In the given circumstance, how much confidence in the machine is enough confidence?

Five Key Challenges and Recommendations

The Commission has produced a detailed framework to guide the responsible development and fielding of AI across the national security community. To assist agencies in meeting baseline criteria for responsible AI, we highlight the main challenges and key recommendations in our framework across five issue areas.

1. Robust and Reliable AI

Current AI systems, such as those used for perception and classification, exhibit distinct kinds of failure, often characterized as rates of false positives and false negatives. They are often brittle when operating at the edges of their performance competence, and it is difficult to anticipate their competence boundaries.2 They are also vulnerable to attack, and they can exhibit unwanted bias in operation.
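To make the failure rates mentioned above concrete, here is a minimal sketch (with illustrative labels, not data from the report) of how false positive and false negative rates are computed for a binary classifier:

```python
# Hypothetical example: computing the two failure rates described above
# for a binary classifier. Labels are illustrative, not from the report.

def failure_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

# A perception system flags threats (1) vs. non-threats (0).
truth      = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
prediction = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
fpr, fnr = failure_rates(truth, prediction)
print(f"FPR={fpr:.2f}, FNR={fnr:.2f}")  # FPR=0.33, FNR=0.25
```

Which rate matters more depends on the mission: a high false negative rate means missed detections, while a high false positive rate means wasted operator attention and eroded trust.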

Focus more federal R&D investments on advancing AI security and robustness.

These investments should also advance the interpretability and explainability of AI systems, so users can better understand whether the systems are operating as intended.

Consult interdisciplinary groups of experts to conduct risk assessments, improve documentation practices, and build overall system architectures to limit the consequences of system failure.3

Such architectures should securely monitor component performance and handle errors when anomalies are detected4; contain AI components that are self-protecting (validating input data) and self-checking (validating data passed to the rest of the system); and include aggressive stress testing.
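The self-protecting and self-checking pattern described above can be sketched in code. This is a hypothetical illustration; the component names, checks, and fallback behavior are assumptions, not an implementation prescribed by the report:

```python
# Hypothetical sketch of the self-protecting / self-checking pattern
# described above. Names, checks, and thresholds are illustrative.

class SelfCheckingComponent:
    """Wraps a model: validates inputs (self-protecting), validates
    outputs (self-checking), and logs anomalies for external monitoring."""

    def __init__(self, model, input_ok, output_ok, fallback):
        self.model = model
        self.input_ok = input_ok    # predicate: is input within competence bounds?
        self.output_ok = output_ok  # predicate: is output plausible?
        self.fallback = fallback    # safe default when a check fails
        self.anomalies = []         # inspected by the monitoring layer

    def __call__(self, x):
        if not self.input_ok(x):              # self-protecting
            self.anomalies.append(("bad_input", x))
            return self.fallback
        y = self.model(x)
        if not self.output_ok(y):             # self-checking
            self.anomalies.append(("bad_output", y))
            return self.fallback
        return y

# Usage: a toy "classifier" expected to emit a probability in [0, 1].
component = SelfCheckingComponent(
    model=lambda x: x * 2,                    # deliberately faulty for x > 0.5
    input_ok=lambda x: 0.0 <= x <= 1.0,
    output_ok=lambda y: 0.0 <= y <= 1.0,
    fallback=None,
)
print(component(0.3))    # 0.6  -- passes both checks
print(component(0.9))    # None -- output 1.8 fails the self-check
print(component(-2.0))   # None -- input outside competence bounds
```

The anomaly log is the hook for the secure monitoring layer the chapter calls for; as footnote 4 notes, that layer must itself be guarded against tampering.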

2. Human-AI Interaction and Teaming

The government needs AI systems that augment and complement human understanding and decision-making so that the complementary strengths of humans and AI can be leveraged as an optimal team. Achieving this remains a challenge.

For instance, humans are prone both to over-trusting and to under-trusting machines depending on context. Challenges also exist for measuring the performance of human-AI teams, conveying enough information while avoiding cognitive overload, enabling humans and machines to understand the circumstances in which they should pass control between each other, and maintaining appropriate human engagement to preserve situational awareness and meaningfully take action when needed.

Agencies will also need to determine machine performance standards and expectations as compared with humans.

Pursue a sustained, multidisciplinary initiative through national security research labs to enhance human-AI teaming.

This initiative should focus on maximizing the benefits of human-AI interaction; better measuring human performance and capabilities when working with AI systems, including testing through continuous contact and experimentation with end users; and helping AI systems better understand contextual nuances of a situation.

Clarify policies on human roles and functions, develop designs that optimize human-machine interaction, and provide ongoing and organization-wide AI training.

3. Testing and Evaluation, Verification and Validation (TEVV)

Having justified confidence in AI systems requires assurances that they will perform as intended, including when interacting with humans and other systems. The TEVV used for traditional legacy systems is not sufficient to provide these assurances. As a result, agencies lack common metrics for assessing whether AI systems can be trusted to perform as intended.

To minimize performance problems and unanticipated outcomes, an entirely new type of TEVV will be needed. This is a priority task, and a challenging one. The federal government will need to increase R&D investments to improve our understanding of how to conduct AI and software-related TEVV.

DoD should tailor and develop TEVV policies and capabilities to meet the distinct needs of AI as AI-enabled systems grow in number, scope, and complexity across the Department.

This should include establishing a TEVV framework and culture that integrates continuous testing; making TEVV tools and capabilities more readily available across the Department of Defense (DoD); updating or creating live, virtual, and constructive test ranges for AI-enabled systems; and restructuring the processes that underlie requirements for system design, development, and testing.5

National Institute of Standards and Technology (NIST) should provide and regularly refresh a set of standards, performance metrics, and tools for qualified confidence in AI models, data, training environments, and predicted outcomes.

NIST should lead the AI community in establishing these resources, closely engaging with experts and users from industry, academia, and government to ensure their efficacy.

4. Leadership

Responsible development and fielding of AI requires end users and senior leaders to be aware of system capabilities and limitations so that systems are not misused. It also requires subject-matter experts to support training, acquisition, risk assessment, and adoption of best practices as they evolve.

Today, only the DoD has a dedicated lead for Responsible AI; employees in national security agencies taking on these roles typically do so on a voluntary, part-time basis. Without full-time dedicated staff, agencies will not succeed in fully adopting and implementing these recommended practices.

Appoint a full-time, senior-level Responsible AI lead in each department or agency critical to national security and each branch of the armed services.

Such an official should drive Responsible AI training, provide expertise on Responsible AI policies and practices, lead interagency coordination, and shape procurement policies.

Create a standing body of multidisciplinary experts in the National AI Initiative Office.

The standing body would provide advice to agencies as needed on responsible AI issues. The group should include people with expertise at the intersection of AI and other fields such as ethics, law, policy, economics, cognitive science, and technology, including adversarial AI techniques.

5. Accountability and Governance

Congress and the public need to see that the government is equipped to catch and fix critical flaws in systems in time to prevent inadvertent disasters and hold humans accountable, including for misuse. Agencies need the ability to monitor AI performance as systems run (to assess if they are performing as intended) and to build systems with the necessary instrumentation to do so.6
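The runtime monitoring and instrumentation described above could take many forms; one hypothetical sketch (the report does not specify a mechanism, and the thresholds here are assumptions) is a rolling accuracy monitor that flags performance drift against an established baseline:

```python
# Hypothetical runtime-performance monitor of the kind described above:
# tracks a rolling accuracy estimate and raises a drift flag when it
# falls below the tested baseline. Window and tolerance are illustrative.

from collections import deque

class PerformanceMonitor:
    def __init__(self, baseline_accuracy, window=100, tolerance=0.10):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # rolling window of correct/incorrect

    def record(self, correct: bool):
        self.outcomes.append(correct)

    def current_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        acc = self.current_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance

monitor = PerformanceMonitor(baseline_accuracy=0.95)
for _ in range(80):
    monitor.record(True)    # system performing as intended
for _ in range(20):
    monitor.record(False)   # performance degrades in the field
print(monitor.current_accuracy(), monitor.drifted())  # 0.8 True
```

A drift flag like this is what would trigger the review, auditing, and accountability processes the chapter recommends; as footnote 6 cautions, the instrumentation itself must be secured against tampering.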

Departments and agencies critical to national security, along with their oversight entities, have all reported difficulty gaining visibility into their systems, while vendors are calling for clarity on instrumentation and auditability requirements.

Adapt and extend existing accountability policies to cover the full lifecycle of AI systems and their components.

Establish policies that allow individuals to raise concerns about irresponsible AI development and institute comprehensive oversight and enforcement practices.

These should include auditing and reporting requirements, a review mechanism for the most sensitive or high-risk AI systems, and appeals and grievance processes for those affected by the actions of AI systems.



1 The term “justified confidence,” taken from a widely used international standard, uses a specific definition of assurance as being “grounds for justified confidence.” It notes that “stakeholders need grounds for justifiable confidence prior to depending on a system” and that “the greater the degree of dependence, the greater the need for strong grounds for confidence.” ISO/IEC/IEEE International Standard – Systems and Software Engineering – Systems and Software Assurance, IEEE/ISO/IEC 15026-1 (2019).

2 Like other intelligent systems, including software and humans, AI systems have competency limitations. However, the science for understanding the performance limitations of AI systems, including why, when, and how they fail, is less mature.

3 Such interdisciplinary teams should explore the possibility of documentation/labels specifying the narrow task/mission for which a system was designed and tested. As noted in the Appendix on Key Considerations for Responsible Development & Fielding of AI, documentation of the AI lifecycle should include information about the data used to train and test a model and the methods used to test a model, both based on the context in which it will be used. It also should include requirements for re-testing, retraining, and tuning when a system is used in a different scenario or setting.

4 Monitoring can add a layer of robustness, but must itself also be guarded to prevent new openings for external espionage or tampering with AI systems.

5 Upgrades to digital infrastructure, as outlined in Chapter 2 of this report, will be required to augment physical test ranges to create digital testing environments that can leverage digital twins.

6 Cases in which new sensors and instrumentation are added can also introduce new vulnerabilities. It is especially important to ensure that the overall architecture of such systems is secure against external espionage and tampering.