Behind the Scenes: Evaluating LLM Red Teaming Techniques and LLM Vulnerabilities
Ensuring the safety of large language models (LLMs) across languages is crucial as AI becomes more integrated into our lives. This article presents our red teaming study at Kili Technology, evaluating LLM vulnerabilities against adversarial prompts in both English and French. Our findings reveal critical insights into multilingual weaknesses and highlight the need for improved safety measures in AI systems.
A Journey Through Our Red Teaming Study
It started with a critical challenge in AI safety: ensuring large language models (LLMs) maintain ethical boundaries and safety across different languages and contexts. As these models become increasingly prevalent in various applications, the need for comprehensive safety evaluation has become increasingly urgent.
Our team at Kili Technology conducted a systematic evaluation of LLM vulnerabilities, with a specific focus on cross-lingual performance. We examined three prominent models—CommandR+, Llama 3.2, and GPT4o—using a diverse set of adversarial prompts in both English and French.
Why This Study Matters
The rapid advancement of large language models has brought significant capabilities in natural language processing, along with growing concerns about their safety and ethical implications. As these models become more integrated into real-world applications, the need for robust evaluations of their resistance to adversarial attacks and potential misuse has become increasingly urgent.
While much of the existing research on red teaming focuses on a single language (often English), we expanded our exploration by adding French. This additional linguistic dimension enriches the diversity of red teaming and allows us to assess whether the vulnerabilities exposed through various techniques vary when models operate in different languages.
Setting Up the Experiment
Our methodology involved a carefully structured approach to testing and evaluation. We developed:
A dataset of 102 prompts per language
Seven distinct harm categories, from misinformation to illegal activities
Twelve different red teaming techniques
The seven harm categories were carefully defined to encompass key areas of concern, including misinformation and disinformation, manipulation and coercion, bias and discrimination, graphic content, and illegal activities.
We applied twelve distinct red teaming techniques, among them the Few/Many Shot Attack and the Bait and Switch Attack; the full list is detailed in the report.
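Concretely, each entry in the dataset ties a prompt to a language, a harm category, and a technique. The sketch below (in Python) shows one way such a record could be represented; the field names and identifiers are our own illustration, not the actual schema of the Kili annotation tool.

```python
from dataclasses import dataclass

@dataclass
class AdversarialPrompt:
    """One entry in the red teaming dataset (illustrative schema only)."""
    prompt_id: str      # e.g. "EN-001", with "FR-001" as its translated counterpart
    text: str           # the adversarial prompt itself
    language: str       # "en" or "fr"
    harm_category: str  # one of the seven harm categories
    technique: str      # one of the twelve red teaming techniques

# A hypothetical English/French pair sharing the same category and technique.
example_en = AdversarialPrompt("EN-001", "<adversarial prompt>", "en",
                               "misinformation_disinformation", "few_many_shot")
example_fr = AdversarialPrompt("FR-001", "<prompt adapted to the French context>", "fr",
                               "misinformation_disinformation", "few_many_shot")
```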
Our testing process was rigorously structured. Each adversarial prompt was systematically entered into our annotation tool, generating two responses from the same model. This paired-sampling approach allowed us to observe any inconsistencies in model behavior.
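As a rough sketch of this paired-sampling protocol: the generate_response function below is a placeholder for whatever API call produces a completion from the model under test, and judging whether a response constitutes a successful attack was done by human annotators, not by code.

```python
def generate_response(model_name: str, prompt: str) -> str:
    """Placeholder for a call to the model under test (API details omitted)."""
    raise NotImplementedError

def collect_paired_responses(model_name: str, prompts: list[str]) -> list[dict]:
    """Query the same model twice for every adversarial prompt so that
    annotators can compare the two completions for inconsistencies."""
    records = []
    for prompt in prompts:
        first = generate_response(model_name, prompt)
        second = generate_response(model_name, prompt)
        records.append({
            "model": model_name,
            "prompt": prompt,
            "responses": [first, second],  # both are sent to human annotation
        })
    return records
```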
The Challenge of Dataset Development
We collaborated with in-house machine learning experts and experienced annotators to craft our diverse set of adversarial prompts. The development process included:
Careful attention to ensuring translations were both linguistically precise and culturally relevant
Involvement of linguistic and cultural experts throughout the review process
A focus on maintaining the integrity of prompts while allowing for cultural differences in responses
English prompts, drawn from American and British contexts, were translated into French and adapted to reflect France's cultural and geopolitical nuances. This attention to cultural context was crucial for ensuring valid cross-linguistic comparisons.
Early Findings
Our initial results revealed significant variations in model vulnerabilities. The overall success rates for adversarial prompts varied considerably across models:
CommandR+: 83.82% overall success rate (85.29% English, 82.35% French)
Llama 3.2: 41.18% overall success rate (50.98% English, 31.37% French)
GPT4o: 28.92% overall success rate (36.27% English, 21.57% French)
These numbers highlighted significant disparities in model performance between English and French prompts, underscoring the need for robust multilingual safety measures. The consistent pattern of higher success rates in English across all models suggested systematic differences in how these systems process and respond to adversarial inputs in different languages.
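As a sanity check on these figures: because each language contains the same number of prompts (102), a model's overall success rate is simply the average of its English and French rates. A minimal sketch:

```python
def overall_success_rate(english_rate: float, french_rate: float) -> float:
    """Average of the per-language attack success rates (in percent);
    valid because both languages contain the same number of prompts."""
    return (english_rate + french_rate) / 2

command_r_plus = overall_success_rate(85.29, 82.35)  # ≈ 83.82
llama_3_2 = overall_success_rate(50.98, 31.37)       # ≈ 41.18
gpt4o = overall_success_rate(36.27, 21.57)           # ≈ 28.92
```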
Unexpected Patterns and Discoveries
As our team delved deeper into the testing phase, distinct patterns emerged. Some confirmed our initial hypotheses, while others revealed surprising vulnerabilities we hadn't anticipated. Let us walk you through our most significant discoveries.
The Erosion Effect
One of our most intriguing findings emerged not from a single test, but from observing how models behaved over extended conversations. We noticed something remarkable with CommandR+: the model would initially approach potentially harmful content with clear disclaimers and caution. However, as conversations progressed, these safety measures would begin to fade.
Consider this concrete example from our testing: When first prompted about graphic content, CommandR+ carefully prefaced its response with a clear disclaimer about creative writing and research purposes.
Yet in subsequent exchanges, these protective barriers eroded, and the model produced similar content without the initial safeguards. This pattern of diminishing safety measures over time became a recurring theme in our observations.
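One way to probe this erosion systematically is to replay a multi-turn conversation and record, turn by turn, whether the model's reply still carries an explicit caveat. The sketch below is a simplification: chat is a placeholder for a chat-completion call, and the keyword check stands in for the human review we actually relied on.

```python
DISCLAIMER_MARKERS = ("fiction", "creative writing", "research purposes", "disclaimer")

def chat(model_name: str, history: list[dict]) -> str:
    """Placeholder for a chat-completion call that receives the full message history."""
    raise NotImplementedError

def track_safety_erosion(model_name: str, user_turns: list[str]) -> list[bool]:
    """For each user turn, record whether the reply still contains a caveat;
    a run of True values followed by False suggests eroding safeguards."""
    history: list[dict] = []
    caveat_per_turn = []
    for message in user_turns:
        history.append({"role": "user", "content": message})
        reply = chat(model_name, history)
        history.append({"role": "assistant", "content": reply})
        caveat_per_turn.append(any(marker in reply.lower() for marker in DISCLAIMER_MARKERS))
    return caveat_per_turn
```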
The Language Gap Mystery
While we expected some variation between English and French responses, the actual data told a fascinating story. All three models showed a language disparity in the same direction, though to varying degrees:
CommandR+ demonstrated the smallest gap, with success rates of 85.29% for English and 82.35% for French prompts. In contrast, Llama 3.2 showed a much wider disparity: 50.98% for English versus 31.37% for French. GPT4o presented yet another pattern, with rates of 36.27% and 21.57% respectively.
These gaps were more than statistical noise; they pointed to a fundamental difference in how these models enforce safety constraints across languages. The consistent pattern of higher success rates in English raised important questions about how safety behavior is built into these systems and how well it transfers across languages.
The Power of Patterns
As we analyzed different attack techniques, two emerged as surprisingly effective across all our tests. The Few/Many Shot Attack proved particularly potent, achieving a remarkable 92.86% success rate against Llama 3.2 and 78.57% against GPT4o. Following closely behind was the Bait and Switch Attack, with consistent success rates of 80% against Llama 3.2 and 70% against GPT4o.
An example of a response from a Few/Many Shot Attack prompt.
The common factor is that both techniques exploit fundamental aspects of how these models process information: pattern recognition and contextual understanding. In effect, we watched the models' own in-context learning mechanisms being turned against their safety protocols.
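To make the mechanism concrete, here is a deliberately sanitized sketch of the structure a Few/Many Shot Attack relies on: a run of fabricated question-and-answer exemplars primes the model's in-context pattern matching before the real target request is appended. The placeholders are intentional; this illustrates the shape of the technique, not a working attack.

```python
def build_few_shot_attack_prompt(exemplars: list[tuple[str, str]], target_request: str) -> str:
    """Concatenate fabricated Q/A exemplars so the final (target) request
    is answered in the same compliant style established by the pattern."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in exemplars]
    blocks.append(f"Q: {target_request}\nA:")
    return "\n\n".join(blocks)

# Structure only; real attacks fill these slots with many harmful exemplars.
prompt = build_few_shot_attack_prompt(
    exemplars=[("<redacted question 1>", "<redacted compliant answer 1>"),
               ("<redacted question 2>", "<redacted compliant answer 2>")],
    target_request="<redacted target request>",
)
```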
The Vulnerability Triangle
Through careful analysis, we identified three categories where models showed consistent vulnerability:
Manipulation and Coercion stood out with the highest average success rate and remarkably low variability
Misinformation and Disinformation followed closely, showing the most consistent performance across different instances
Illegal Activities rounded out the top three, though with higher variability in results
Each model showed its own susceptibility profile. CommandR+ proved most vulnerable to graphic content (100% success rate), while GPT4o's highest vulnerability was manipulation and coercion (45%). Llama 3.2 was equally vulnerable to bias/discrimination and to manipulation and coercion (50% each). That said, it is possible that CommandR+ intentionally does not apply a content filter for graphic content.
A Deeper Look at Model Behavior
Perhaps most concerning was how models handled prompts targeting different demographic groups. Our testing revealed inconsistent behavior across topics related to race, religion, gender, and other demographic factors. The data showed varying levels of resistance depending on both the target demographic and the language of the prompt.
For instance, CommandR+ showed a 76.67% success rate for bias and discrimination prompts, while GPT4o demonstrated stronger defenses with only a 16.67% success rate in this category.
What We Learned and Where We Go From Here
As our research drew to a close, we found ourselves with answers to our initial questions, but also with new and pressing challenges for the field of AI safety.
Beyond Single-Language Testing
One of our study's most significant contributions was demonstrating why multilingual testing matters. The stark differences in success rates between English and French prompts across all models weren't just statistical variations; they pointed to a fundamental challenge in AI safety.
Think about it: if a model is significantly more vulnerable to English prompts than French ones (as we saw with Llama 3.2's 50.98% versus 31.37% success rates), what does this mean for global deployment? These findings highlight a critical gap in current safety measures: what works in one language might fail in another.
The Pattern Recognition Challenge
Our discovery about the effectiveness of Few/Many Shot Attacks (reaching up to 92.86% success rate in some cases) and Bait and Switch Attacks revealed a fundamental tension in how these models work. The very mechanisms that make LLMs powerful – their ability to recognize and learn from patterns – can become vulnerabilities when exploited by adversarial prompts.
It's a design challenge that goes to the heart of building these systems. How do we maintain the pattern recognition capabilities that make these models useful while preventing their exploitation for harmful purposes?
Limitations and Future Work
Our study, while comprehensive, had its boundaries. We focused on three specific models and two languages, which gives us a solid foundation but also points to areas needing further exploration:
Model Coverage: Expanding beyond CommandR+, Llama 3.2, and GPT4o would provide a more complete picture of vulnerabilities across different model architectures.
Linguistic Scope: Comparing English and French provided valuable insights; including a broader range of languages from different language families would help develop more globally robust safety measures.
Dataset Size: Our current dataset of 102 prompts per language served as a starting point. A larger, more diverse dataset would allow for a more comprehensive evaluation.
The Road Ahead
Our findings point to several critical areas for future research and development:
Dynamic Safety Measures: The erosion of ethical safeguards over extended conversations suggests we need more adaptive and persistent safety mechanisms.
Cross-Lingual Robustness: The consistent disparity in performance between languages calls for more sophisticated approaches to multilingual safety training.
Specialized Defenses: The high success rates of certain techniques like Few/Many Shot Attacks indicate the need for targeted defenses against specific types of manipulation.
A Call to Action
The vulnerabilities we've uncovered aren't just academic concerns – they have real implications for the deployment of AI systems in our increasingly connected world. As these models become more integrated into various applications, ensuring their safety across linguistic and cultural boundaries becomes crucial.
Our research underscores the necessity for more dynamic and adaptive safety mechanisms in LLMs. The current static approaches to ethical constraints may not be sufficient for maintaining consistent safety standards, especially across different languages and cultural contexts.
What's Next?
As we look to the future, our team is already planning expanded studies to address the limitations we've identified. We aim to:
Systematically broaden our language coverage
Include a more diverse range of models
Develop more comprehensive testing methodologies
The journey to creating safer, more reliable AI systems is ongoing, and each study brings us closer to understanding how to build more robust safeguards for these powerful tools.
This research represents a starting point in understanding LLM vulnerabilities across languages. We encourage the AI safety community to build upon these findings and continue exploring solutions to the challenges we've identified.
Want to read the whole report?
Get the full report here