Red Teaming

June 22, 2024

Last month, 28 countries and several industry stakeholders convened at the second iteration of the AI Safety Summit in Seoul, South Korea. The convention aimed to promote collaboration amongst governments and the private sector to develop robust standards and methodologies to improve AI safety. The Summit resulted in a commitment from 16 AI giants to develop the technology safely. Thus far, testing AI Safety has largely remained in hands of the developers of the models with independent organizations and academia playing a limited role. However, governments play an increasing role in guaranteeing the safety of AI systems as the technologies are adopted more widely, aiming to mitigate potential threats, such as misinformation, privacy, and national security. In the US, President Joe Biden signed an Executive Order on Safe, Secure, and Trustworthy AI this past October to promote AI safety standards, calling upon the National Institute for Standardized Technologies to create a standardized approach to safety tests. AI Safety tests include a range of benchmarks that models must pass, as well as a practice called “red teaming”.

Red teaming describes an adversarial attack on an AI model to expose weak spots in the guardrails of the system. Guardrails are a set of filters devised by the AI developer to prevent the system from producing content that violates their safety and ethics policies. The goal of red teaming lies in manipulating the model into producing outputs that it has been programmed to censor, including harmful or illegal content, false information, or sensitive data. Circumventing the safety safeguards of a model to elicit a response that contains illicit or dangerous information is known as jailbreaking. Red teaming describes the deliberate attempt to jailbreak a model to identify the vulnerabilities of the system’s defenses to similar attacks, ensuring its safety and security.

Originating from military simulations during the Cold War, red teaming was later adopted into cybersecurity vocabulary to describe the simulation of real-world threats to test the robustness of a software system. AI red teaming is less metric based than other benchmarks for AI safety, as researchers prompt the system in the same way as a regular user. It involves a wide array of techniques, such as getting the model to role-play characters or having the model respond in code to bypass its restrictions against explicitly revealing harmful or sensitive information. However, testing through user prompting does not address all potential vulnerabilities of the models. AI remains exposed to conventional cyber-attacks that hack into the system and manipulate or copy the model’s infrastructure or underlying data. Additionally, Stanford’s Artificial Intelligence Index released in March highlights that the previous focus on prompts that are comprehendible to humans misses vulnerabilities. Less intuitive attacks, such as prompting models to repeat a word infinitely have also exposed weak points in the system.

In contrast to most software systems, AI delivers probabilistic rather than deterministic responses to similar prompts. As a result, identical interactions with an AI model seldom produce identical results. The system often alters its response slightly, even when prompted with the same question. In fact, researchers are currently looking into the phenomenon known as AI drift, an observed deterioration of system responses over time. The implications for security and safety risks are severe, as a system must pass safety tests multiple times to guarantee the effectiveness of the guardrails. Consequently, companies are looking to automate red teaming to validate the consistency of robust guardrails. However, a host of additional challenges accompany automation. Large parts of red teaming are currently done qualitatively under the guidance of domain experts. Developing quantitative metrics to evaluate a system over iterative testing is a necessary precondition for automating red teaming, but quantifying model misalignment remains a challenge of its own. Moreover, the emergence of multimodal systems necessitates rigorous testing with text, audio, and images to guarantee the robustness of safety measures across all modalities. Indeed, AI models are not exclusively language-based, complicating the verification process of prompts and responses. Additionally, the role of human oversight remains ambiguous, presenting a further challenge to automation efforts.

The broad and diverse use cases of AI models require red teaming across a range of specific domains. Red teaming seeks to expose the vulnerabilities to deliberate attacks and any potential incidents stemming from non-malicious actors that accidentally derail the model from its safety guardrails. The wide variety of interactions with AI makes it critical to consider the diversity of user perspectives when testing the model’s safety. However, the qualitative nature of red teaming makes the process to eliminate all possible incidents and prompts that can lead to jailbreaking the model, considering the limited capacity of a single red teaming team. As a result, the public oftentimes publishes instances of jailbreaks on social media after developers release a new model. Entire online communities have been established around the challenge to jailbreak new models. Additionally, red teaming requires domain-specific expertise to manage complex threats such as child safety, misinformation, and national security. Oftentimes, developers recruit external expertise to meet the demand for domain-specific testing. Misinformation remains a significant challenge for AI systems, considering their training data stems from an online ecosystem suffering from an information crisis. Nevertheless, spreading misinformation threatens the integrity of major political events, such as elections.

There has been an increasing willingness from both the private and public sectors to share the frameworks for red teaming publicly. Google, NVIDIA, Microsoft, NIST, and most recently Anthropic have published their approaches. Meanwhile, governments such as Singapore have developed their own “red teams” to provide external capacities for companies to comply with safety standards. Singapore is building an open-source toolkit for red teaming services alongside other benchmarks and tests for AI safety. The EU has included red teaming as part of the regulatory standards of their AI Act (Annex XI, Section 2). The increased transparency and collaborations are steppingstones to create necessary standardized practices for red teaming.

Standardized practices to guarantee AI safety are vital to build trust in AI systems. Further cross-sector collaboration helps define standard metrics for safety and security, while diversifying the viewpoints. Alongside standardized testing, building out the capacity for red teaming remains paramount. A single organization struggles to meet the demand for comprehensive testing across a wide range of domains. Instead, third-party organizations can provide specific expertise to test the model safety, while also increasing the credibility of its testing as an independent auditor. Furthermore, AI designers should think of designing a system of incentives to crowdsource red teaming by allowing users to report incidences when the model get jailbroken. Finally, red teaming is one of many approaches to improve AI safety, and it should be improved alongside other metrics, benchmarks, and tests that aim to make models safe, secure, and trustworthy.