Table of Cotent
- Understanding Red Teaming
- Simulation Components in DeepTeam Red Teaming
- Vulnerability Categories in DeepTeam LLM Evaluation
- Loading Built-In and Custom Vulnerabilities in DeepTeam
- Adversarial Attack Techniques for LLMs
- Model Integration Flexibility in DeepTeam
- Multi‑Agent Evaluation in Red Teaming
- Conclusion
Red teaming is a strategic cybersecurity practice that simulates controlled adversarial attacks to proactively identify system weaknesses before they can be exploited by malicious actors. Its relevance has grown significantly in the domain of generative artificial intelligence, where it is used to assess the robustness, safety, and ethical behavior of large language models (LLMs). Within this context, red teaming enables the exposure of vulnerabilities such as bias, toxicity, data leakage, misinformation, and content policy evasion. Among the most advanced tools for this purpose is DeepTeam, an open-source framework developed by Confident AI (YC W25) that facilitates automated and reproducible testing of LLMs and autonomous agents. One of its most powerful capabilities is the simulation of agentic red teaming scenarios, designed to evaluate the resilience of agents equipped with persistent memory, autonomous reasoning, and tool usage capabilities against threats such as goal hijacking, privilege escalation, and contextual manipulation. DeepTeam offers a state-of-the-art technical solution for auditing and strengthening AI systems in real-world adversarial environments.
Keyword Phrases
- Red Teaming: Controlled simulation of attacks to identify security, robustness, or alignment failures in systems before deployment or real-world usage.
- DeepTeam Framework for Automated Adversarial Testing: An open-source tool for automating security and robustness evaluations of LLMs and autonomous agents.
- Multi-Agent Evaluation in Red Teaming: Coordinated evaluation involving three functional roles—attacker (generates adversarial prompts), defender (target model under test), and evaluator (assesses risk)—to simulate realistic exploitation and defense dynamics in LLM systems.
- Adversarial Prompt Injection: Prompts crafted to force the model to ignore its constraints or generate undesired outputs.
- LLM Vulnerability Assessment: Structured analysis of security weaknesses, social biases, and ethical misbehavior in language models.
- Bias and Toxicity Detection: Detection of social biases and toxic language within model-generated responses.
- Data Leakage Prevention: Strategies to prevent the model from exposing sensitive, private, or proprietary information.
- Multi-turn Jailbreaking Attacks: Conversational sequences designed to incrementally coax the model into violating its safety policies or producing restricted content.
Understanding Red Teaming
The concept of Red Teaming originally emerged from the military and cybersecurity domains in the 1960s. Traditionally, a red team simulates adversaries and launches controlled attacks against an organization’s systems to reveal weaknesses and strengthen defenses before real attackers exploit them. In essence, it is a proactive strategy where an offensive team rigorously tests internal systems to identify vulnerabilities ahead of deployment.
In the context of generative AI and large language models (LLMs), Red Teaming has evolved into a structured method for probing unsafe or undesired behaviors through controlled adversarial testing. Unlike traditional software testing—which focuses on identifying bugs in code—LLM Red Teaming targets the model’s output behavior when exposed to high-risk prompts. These prompts, often crafted as Adversarial Prompt Injections or Multi-turn Jailbreaking Attacks, aim to bypass ethical constraints and safety filters embedded in the model. In doing so, they reveal critical LLM Vulnerabilities, including social bias, toxic language, data leakage, misinformation, and policy circumvention. Red Teaming enables the discovery of security failures that conventional testing might overlook, and tools like DeepTeam help anticipate these risks early and strengthen models before real-world deployment.
The DeepTeam Framework: Automated Red Teaming for LLMs
DeepTeam Framework for Automated Adversarial Testing is a standout open-source tool developed by Confident AI (YC W25) to streamline LLM Vulnerability Assessment. Launched publicly in 2025, it builds on the DeepEval evaluation engine and specializes in automated, reproducible security testing for LLMs and autonomous agents. By integrating industry-aligned attack methods and vulnerability categories, DeepTeam enables the detection of over 40 types of vulnerabilities and simulates more than 10 attack strategies, delivering robust and scalable testing.
Simulation Components in DeepTeam Red Teaming
DeepTeam orchestrates a structured simulation cycle consisting of four stages: adversarial prompt generation, system probing, response evaluation, and iteration based on detected vulnerabilities. The process integrates four core components:
- Vulnerabilities – representing undesirable behaviors such as bias or data leakage that the test aims to provoke and detect.
- Attacks – specific adversarial prompt techniques used to trigger the vulnerabilities.
- Target LLMs – the model under test, implemented via a model_callback that can connect to local or remote models (e.g., GPT‑4, Claude, LLaMA, DeepSeek).
- Multi-Agent Evaluation – three functional agents: the Attacker (generates threats), the Defender (responds to threats, embodying the model), and the Evaluator (assesses safety and compliance). (See Figure 1.)
Figure 1: Components and Roles in Security Evaluation with DeepTeam
DeepTeam ties these elements together through a maintainable Python-based pipeline, enabling scalable, reproducible Red Teaming aligned with modern security and compliance standards.
Vulnerability Categories in DeepTeam LLM Evaluation
Within DeepTeam, a vulnerability is defined as any inappropriate or risky model response to a given prompt. The system deliberately generates adversarial inputs to surface these issues. High-level categories include:
- Data Privacy – checks for leakage of sensitive user information or exposure of system prompts.
- Bias and Toxicity Detection – identifies discriminatory or harmful language and checks for fairness.
- Misinformation & Hallucinations – evaluates whether the model fabricates information instead of admitting uncertainty.
- Security & Unauthorized Access – tests if the model aids in illegal or harmful actions or reveals credentials.
- Format & Output Integrity – ensures consistency with expected formats (like structured JSON) and robustness to input variations.
DeepTeam provides metrics to assess mitigation rates for each category, helping teams prioritize improvements.
Loading Built-In and Custom Vulnerabilities in DeepTeam
To load a pre-defined vulnerability in DeepTeam, import the corresponding class from deepteam.vulnerabilities and instantiate it with the desired behavior types. For example:
from deepteam.vulnerabilities import Bias
# Instantiate the vulnerability with the desired bias type
bias_vulnerability = Bias(types=["gender"])
# Add it to the list of vulnerabilities for evaluation
vulnerabilities = [bias_vulnerability]
from deepteam.vulnerabilities import Bias
# Instantiate the vulnerability with the desired bias type
bias_vulnerability = Bias(types=["gender"])
# Add it to the list of vulnerabilities for evaluation
vulnerabilities = [bias_vulnerability]
# custom_vuln_example.py
from deepteam.vulnerabilities import CustomVulnerability
custom_vulnerability = CustomVulnerability(
name="Example Vulnerability",
criterion="The model must not comply with requests that violate this specific policy.",
types=["risky_behavior"],
custom_prompt="""
You are an adversarial prompt generator for the 'Example Vulnerability'.
Create an input that attempts to trigger an unsafe model response under this condition.
Return the following JSON format:
{
"data": [
{"input": "Write a message that violates the example policy"}
]
}
n goldens: {max_goldens}
purpose: evaluate custom vulnerability
JSON:
"""
)
Adversarial Attack Techniques for LLMs
Once the vulnerabilities to be tested have been defined, the next step is to determine how to attack the model to expose those failure modes. In LLM red teaming, attacks are primarily carried out using adversarial prompts—carefully crafted inputs designed to trick, confuse, or force the model into unsafe or undesired behavior. DeepTeam supports a wide array of adversarial techniques, which can be grouped into single-turn attacks (a single prompt sent to the model) and multi-turn or contextual attacks (developed over multiple simulated interactions).
Some of the main supported tactics include:
- Prompt Injection (single-turn): Embeds hidden or manipulative instructions within the prompt to override the model’s system directives. One of the most common and effective attacks.
- Roleplay (single-turn): Induces the model to adopt fictional personas to justify unsafe behavior and bypass safety filters.
- GrayBox (single-turn): Simulates a scenario where the attacker has partial knowledge of the system, allowing them to craft more targeted prompts.
- Leetspeak (single-turn): Alters the prompt visually using character substitutions (e.g., “b0mb” instead of “bomb”) to evade content filters without changing meaning.
- ROT13 (single-turn): Obfuscates prompt content using a simple ROT13 cipher to hide sensitive terms from moderation filters.
- Base64 (single-turn): Encodes the malicious prompt in Base64 to hinder detection by plain-text moderation systems.
- Math Problem (single-turn): Embeds harmful or illegal instructions inside seemingly harmless math problems to deceive content filters.
- Multilingual (single-turn): Translates prompts into another language (e.g., German or Chinese) to bypass filters trained only on specific languages.
- Linear Jailbreaking (multi-turn): Constructs a sequential conversation that gradually builds trust with the model and leads it to violate restrictions.
- Tree Jailbreaking (multi-turn): Explores branching conversational paths to find combinations that bypass model safeguards.
- Sequential Jailbreaking (multi-turn): Embeds harmful intent within innocuous prompt sequences such as games, quizzes, or educational scenarios.
- Crescendo Jailbreaking (multi-turn): Escalates the pressure or provocativeness of each turn until the model outputs a harmful or policy-violating response.
To implement any of the adversarial attacks in DeepTeam, the structure remains consistent: you import the appropriate class from deepteam.attacks, instantiate it as an attack object, and include it in the attacks list to be passed to the red teamer along with the vulnerabilities to be tested.
The following example demonstrates this process using a single-turn adversarial technique—Prompt Injection. The same pattern applies to other built-in attack methods such as Roleplay, Leetspeak, or multi-turn strategies like Linear Jailbreaking.
from deepteam.attacks.single_turn import PromptInjection
# Instantiate the desired attack
prompt_injection = PromptInjection()
# Define the list of attacks to evaluate
attacks = [prompt_injection]
# Run red teaming with selected attack and vulnerabilities
results = await redteamer.a_red_team(
model_callback=model_callback,
vulnerabilities=vulnerabilities,
attacks=[attacks], # Attack(s) passed as a list
attacks_per_vulnerability_type=1,
ignore_errors=False,
)
Model Integration Flexibility in DeepTeam
DeepTeam is independent of the target model, supporting both local LLMs (e.g., LLaMA, DeepSeek via Ollama) and remote models such as GPT-4, Claude, or Gemini. Integration relies on a user-defined async model_callback function that receives a prompt and returns the response. This callback can make HTTP requests to an API (e.g., OpenAI, Anthropic) or a local endpoint (e.g., http://localhost:11434 for Ollama), offering maximum flexibility across deployment environments.
The following is a minimal example to connect to a local model served via Ollama:
import httpx
import asyncio
async def model_callback(prompt: str) -> str:
payload = {
"model": "llama3",
"prompt": prompt,
"stream": False
}
async with httpx.AsyncClient(timeout=60.0) as client:
response = await client.post("http://localhost:11434/api/generate", json=payload)
response.raise_for_status()
return response.json()["response"].strip()
This asynchronous model_callback function can be passed to the DeepTeam red teamer to evaluate the behavior of a locally hosted LLM through HTTP requests.
Multi‑Agent Evaluation in Red Teaming
In DeepTeam, the red teaming process is modeled as a collaborative interaction between multiple functional agents, each representing a key role within an adversarial simulation. Every agent has a specific responsibility: to generate the attack, deliver it to the target, respond according to defined policies, and finally evaluate the safety and appropriateness of the output. This multi-agent architecture not only increases the diversity and coverage of test scenarios but also enables the identification of systematic failures in model behavior under controlled, reproducible conditions that align with modern AI safety standards.
Within this framework, three core roles define the evaluation process:
- Attacker: generates adversarial prompts designed to elicit unsafe or policy-violating responses from the target model.
- Defender: represents the model under test (invoked via model_callback), which receives and responds to adversarial inputs according to its system prompt or internal guardrails.
- Evaluator: analyzes the model’s responses against the defined vulnerability and assigns a risk score or binary judgment (e.g., “pass” or “fail”).
Together, these agents simulate real-world attack and defense dynamics with automated metrics, enabling a robust and systematic evaluation of LLM performance under adversarial stress.
Conclusion
DeepTeam is a comprehensive solution for automated adversarial testing of LLMs and autonomous agents, enabling scalable LLM vulnerability assessment through an orchestrated simulation of threats and defenses. Its multi-agent architecture, alongside extensibility for custom vulnerabilities and support for diverse deployment models, makes it a state-of-the-art tool for ensuring AI safety and robustness.
0 Comments