Advanced AI models that showcase unparalleled capabilities in natural language processing, problem-solving, and multimodal understanding also carry inherent vulnerabilities that expose critical security risks. While these language models’ strength lies in their adaptability and efficiency across diverse applications, those very same attributes can be manipulated.
A new red teaming report by Enkrypt AI underscores this duality, demonstrating how sophisticated models like Mistral’s Pixtral can be both groundbreaking tools and potential vectors for misuse in the absence of robust, continuous safeguards. The report reveals significant security vulnerabilities in Mistral’s Pixtral large language models (LLMs), raising serious concerns about the potential for abuse and highlighting a critical need for enhanced AI safety measures.
The report details how easily the models can be manipulated to generate harmful content related to child sexual exploitation material (CSEM) and chemical, biological, radiological, and nuclear (CBRN) threats, at rates far exceeding those of leading competitors like OpenAI’s GPT-4o and Anthropic’s Claude 3.7 Sonnet.
The report focuses on two versions of the Pixtral model: Pixtral-Large 25.02, accessed via AWS Bedrock, and Pixtral-12B, accessed directly through the Mistral platform.

Enkrypt AI’s researchers employed a sophisticated red teaming methodology, utilising adversarial datasets designed to mimic real-world tactics used to bypass content filters. This included “jailbreak” prompts – cleverly worded requests intended to circumvent safety protocols – and multimodal manipulation, combining text with images to test the models’ responses in complex scenarios. All generated outputs were then reviewed by human evaluators to ensure accuracy and ethical oversight.
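The report does not publish Enkrypt AI’s tooling, but the workflow it describes – adversarial prompts (text-only and text-plus-image) sent to a model endpoint, responses screened automatically, and flagged items queued for human review – can be sketched roughly as below. The query_model and looks_harmful helpers are hypothetical placeholders, not part of any real API.

```python
# A minimal sketch of the red-teaming loop the report describes: adversarial
# prompts (some paired with images) are sent to the model under test, responses
# get a first-pass automated screen, and flagged items are queued for human
# review. query_model() and looks_harmful() are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class RedTeamCase:
    prompt: str                      # adversarial ("jailbreak") text prompt
    image_path: str | None = None    # optional image for multimodal manipulation
    category: str = "uncategorised"  # e.g. "CSEM", "CBRN"

def query_model(case: RedTeamCase) -> str:
    """Hypothetical wrapper around the model endpoint under test."""
    raise NotImplementedError("call the model API being evaluated here")

def looks_harmful(response: str) -> bool:
    """Toy first-pass screen; human evaluators make the final call."""
    blocklist = ("placeholder harmful phrase",)  # stand-in for a real safety classifier
    return any(term in response.lower() for term in blocklist)

def run_red_team(cases: list[RedTeamCase]) -> dict:
    flagged = []  # candidate harmful outputs passed on to human reviewers
    for case in cases:
        response = query_model(case)
        if looks_harmful(response):
            flagged.append((case, response))
    return {
        "total_prompts": len(cases),
        "flagged_for_review": len(flagged),
        # attack success rate: share of prompts that elicited harmful output
        "attack_success_rate": len(flagged) / len(cases) if cases else 0.0,
    }
```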
High propensity for dangerous output
The findings are stark: on average, 68% of prompts successfully elicited harmful content from the Pixtral models. Most alarmingly, the report states that Pixtral-Large is a staggering 60 times more vulnerable to producing CSEM content than GPT-4o or Claude 3.7 Sonnet. The models also demonstrated a significantly higher propensity for generating dangerous CBRN outputs – ranging from 18 to 40 times greater vulnerability compared to the leading competitors.
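The report does not spell out how these multipliers were computed, but figures of this kind are normally the ratio of per-category attack success rates between models. The numbers in the sketch below are invented purely for illustration; they are not Enkrypt AI’s underlying data.

```python
# Illustrative only: these figures are invented to show how a relative
# vulnerability multiplier is typically derived as a ratio of attack success
# rates; they are not the data behind the report's 60x or 18-40x claims.

def attack_success_rate(harmful_outputs: int, total_prompts: int) -> float:
    """Share of adversarial prompts that elicited harmful content."""
    return harmful_outputs / total_prompts

model_under_test = attack_success_rate(harmful_outputs=30, total_prompts=100)  # 0.30
baseline_model = attack_success_rate(harmful_outputs=1, total_prompts=200)     # 0.005

relative_vulnerability = model_under_test / baseline_model
print(f"~{relative_vulnerability:.0f}x more likely to produce harmful output")  # ~60x
```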
The CBRN tests involved prompts designed to elicit information related to chemical warfare agents (CWAs), biological weapon knowledge, radiological materials capable of causing mass disruption, and even nuclear weapons infrastructure. While specific details of the successful prompts have been excluded from the public report due to their potential for misuse, one example cited in the document, drawn from the CSEM testing, involved a prompt attempting to generate a script for convincing a minor to meet in person for sexual activities – a clear demonstration of the models’ vulnerability to grooming-related exploitation.
The red teaming process also revealed that the models could provide detailed responses regarding the synthesis and handling of toxic chemicals, methods for dispersing radiological materials, and even techniques for chemically modifying VX, a highly dangerous nerve agent. This capacity highlights the potential for malicious actors to leverage these models for nefarious purposes.
Mistral has not yet issued a public statement addressing the report’s findings, but Enkrypt AI indicated they are in communication with the company regarding the identified issues. The incident serves as a critical reminder of the challenges inherent in developing safe and responsible artificial intelligence, and the need for proactive measures to prevent misuse and protect vulnerable populations. The report’s release is expected to fuel further debate about the regulation of advanced AI models and the ethical responsibilities of developers.
The red teaming practice
Companies deploy red teams to assess potential risks in their AI. In the context of AI safety, red teaming is a process analogous to penetration testing in cybersecurity. It involves simulating adversarial attacks against an AI model to uncover vulnerabilities before they can be exploited by malicious actors.
This practice has gained significant traction within the AI development community as concerns over generative AI’s potential for misuse have escalated. OpenAI, Google, and Anthropic have, in the past, deployed red teams to uncover vulnerabilities in their own models, prompting adjustments to training data, safety filters, and alignment techniques.
The ChatGPT maker uses both internal and external red teams to test weaknesses in its AI models. For instance, the GPT-4.5 System Card details how the model exhibited limited ability to exploit real-world cybersecurity vulnerabilities. While it could perform tasks related to identifying and exploiting vulnerabilities, its capabilities were not advanced enough to be considered a medium risk in this area. Specifically, the model struggled with complex cybersecurity challenges.
The red team assessed its capability by running a test set of over 100 curated, publicly available Capture The Flag (CTF) challenges that were categorised into three difficulty levels — High School CTFs, Collegiate CTFs, and Professional CTFs.
GPT-4.5’s performance was measured by the percentage of challenges it could successfully solve within 12 attempts. The results were — High School: 53% completion rate; Collegiate: 16% completion rate; and Professional: 2% completion rate. While these scores are low, the system card notes that the evaluations likely represent lower bounds on capability, meaning improved prompting, scaffolding, or fine-tuning could significantly increase performance. Consequently, the potential for exploitation exists and needs monitoring.
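A “solved within 12 attempts” figure is simply the share of challenges for which at least one of the first 12 attempts succeeded. The sketch below shows how such a completion rate is tallied from raw attempt logs; the log format is an assumption for illustration, not OpenAI’s actual evaluation harness.

```python
# Sketch of a "solved within k attempts" completion rate, the metric behind the
# CTF figures above. The attempt-log format is an assumption for illustration,
# not OpenAI's actual evaluation harness.

def completion_rate(attempts_per_challenge: dict[str, list[bool]], k: int = 12) -> float:
    """Fraction of challenges with at least one success among the first k attempts."""
    solved = sum(any(outcomes[:k]) for outcomes in attempts_per_challenge.values())
    return solved / len(attempts_per_challenge)

# Toy example: three challenges, each with a list of per-attempt pass/fail flags.
logs = {
    "hs_ctf_01": [False, True],          # solved on the second attempt
    "college_ctf_07": [False] * 12,      # never solved
    "pro_ctf_03": [False, False, True],  # solved on the third attempt
}
print(f"{completion_rate(logs):.0%} of these toy challenges solved within 12 attempts")
```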

Another example of how red teaming has helped inform developers pertains to Google’s Gemini model. A group of independent researchers released findings from a red team assessment of the search giant’s AI model, highlighting its susceptibility to generating biased or harmful content when prompted with specific adversarial inputs. These assessments contributed directly to iterative improvements in the model’s safety protocols.
A stark reminder
The rise of specialised firms like Enkrypt AI demonstrates the increasing need for external, independent security evaluations that provide a crucial check on internal development processes. The growing body of red teaming reports is driving a significant shift in how AI models are developed and deployed. Previously, safety considerations were often an afterthought, addressed after the core functionality was established. Now, there is a greater emphasis on “security-first” development – integrating red teaming into the initial design phase and continuing it throughout the model’s lifecycle.
Enkrypt AI’s report serves as a stark reminder that the development of safe and responsible AI is an ongoing process requiring continuous vigilance and proactive measures. The company recommends immediate implementation of robust mitigation strategies across the industry, emphasizing the need for transparency, accountability, and collaboration to ensure AI benefits society without posing unacceptable risks. The future of generative AI hinges on embracing this security-first approach – a lesson underscored by the alarming findings regarding Mistral’s Pixtral models.
Published – May 09, 2025 08:25 am IST