Can we convince AI to answer harmful requests?

Cyber security hologram with digital shield 3D rendering © iStock

New research from EPFL demonstrates that even the most recent Large Language Models (LLMs), despite undergoing safety training, remain vulnerable to simple input manipulations that can cause them to behave in unintended or harmful ways.

Today's LLMs have remarkable capabilities which, however, can be misused. For example, a malicious actor can use them to produce toxic content, spread misinformation, and support harmful activities.

Safety alignment or refusal training — where models are guided to generate responses that are judged as safe by humans, and to refuse responses to potentially harmful enquiries — is commonly used to mitigate the risks of misuse.

Yet new EPFL research, presented at the 2024 International Conference on Machine Learning’s Workshop on Next Generation of AI Safety, has demonstrated that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks – essentially, manipulations of the prompt that influence a model’s behavior and make it generate outputs that deviate from its intended purpose.

Bypassing LLM safeguards

As their paper, ‘Jailbreaking leading safety-aligned LLMs with simple adaptive attacks’, outlines, researchers Maksym Andriushchenko, Francesco Croce and Nicolas Flammarion from the Theory of Machine Learning Laboratory (TML) in the School of Computer and Communication Sciences achieved, for the first time, a 100% attack success rate on many leading LLMs, including the most recent models from OpenAI and Anthropic such as GPT-4o and Claude 3.5 Sonnet.

“Our work shows that it is feasible to leverage the information available about each model to construct simple adaptive attacks, which we define as attacks that are specifically designed to target a given defense, which we hope will serve as a valuable source of information on the robustness of frontier LLMs,” explained Nicolas Flammarion, Head of the TML and co-author of the paper.

The researchers’ key tool was a manually designed prompt template that was used for all unsafe requests for a given model. Using a dataset of 50 harmful requests, they obtained a perfect jailbreaking score (100%) on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, Claude-3/3.5, and the adversarially trained R2D2.
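The evaluation recipe described above – one fixed, hand-written template per model, applied to each of the 50 harmful requests, with every response then judged for success – can be sketched roughly as follows. This is an illustrative outline only: the function name query_model, the placeholder {request} slot, and the keyword-based refusal check are assumptions for the sketch, not the benchmark or judging procedure used in the paper.

```python
from typing import Callable, List

# Illustrative refusal markers; a real evaluation would use a proper judge of harmfulness.
REFUSAL_MARKERS = ["I'm sorry", "I cannot", "I can't assist"]


def is_jailbroken(response: str) -> bool:
    """Crude success check: the model did not produce an obvious refusal phrase."""
    return not any(marker.lower() in response.lower() for marker in REFUSAL_MARKERS)


def attack_success_rate(
    query_model: Callable[[str], str],  # assumed wrapper around one target LLM's API
    template: str,                      # hand-written jailbreak template containing a {request} slot
    harmful_requests: List[str],        # e.g. a set of 50 harmful requests
) -> float:
    """Apply one fixed template to every request and measure the fraction of successes."""
    successes = 0
    for request in harmful_requests:
        response = query_model(template.format(request=request))
        if is_jailbroken(response):
            successes += 1
    return successes / len(harmful_requests)
```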

Using adaptivity to evaluate robustness

The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates; some, for example, have unique vulnerabilities tied to their application programming interfaces (APIs); and in some settings it is essential to restrict the token search space based on prior knowledge.
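Restricting the token search space hints at a search-based refinement on top of a fixed template: for instance, a simple random search over an adversarial suffix that maximizes a score such as the log-probability of a compliant opening token. The sketch below illustrates that generic idea only; the scoring callable target_logprob, the restricted vocabulary vocab, and all parameter values are assumptions, not the authors’ implementation.

```python
import random
from typing import Callable, List


def random_search_suffix(
    target_logprob: Callable[[str], float],  # assumed scorer, e.g. log-prob that the reply starts compliantly
    base_prompt: str,                        # the templated harmful request
    vocab: List[str],                        # restricted token set chosen from prior knowledge
    suffix_len: int = 25,
    iterations: int = 1000,
    seed: int = 0,
) -> str:
    """Greedy random search: mutate one suffix position at a time and keep strict improvements."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best_score = target_logprob(base_prompt + " " + " ".join(suffix))
    for _ in range(iterations):
        candidate = suffix.copy()
        candidate[rng.randrange(suffix_len)] = rng.choice(vocab)
        score = target_logprob(base_prompt + " " + " ".join(candidate))
        if score > best_score:
            suffix, best_score = candidate, score
    return " ".join(suffix)
```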

“Our work shows that the direct application of existing attacks is insufficient to accurately evaluate the adversarial robustness of LLMs and generally leads to a significant overestimation of robustness. In our case study, no single approach worked sufficiently well, so it is crucial to test both static and adaptive techniques,” said EPFL PhD student Maksym Andriushchenko, the lead author of the paper.

This research builds upon Andriushchenko’s PhD thesis, ‘Understanding generalization and robustness in modern deep learning’, which, among other contributions, investigated methods for evaluating adversarial robustness. The thesis explored how to assess and benchmark neural networks' resilience to small input perturbations and analyzed how these changes affect model outputs.
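To make the notion of resilience to small input perturbations concrete, a standard evaluation measures accuracy under a bounded adversarial perturbation, for example a single-step FGSM attack. The minimal PyTorch sketch below illustrates that general methodology only; it is not taken from the thesis, and the epsilon value and function name are assumptions.

```python
import torch


def fgsm_robust_accuracy(model, inputs, labels, epsilon=8 / 255):
    """Accuracy under a single-step FGSM perturbation of size epsilon (L-infinity ball).

    Assumes `model` is a classifier in eval mode and `inputs` are images scaled to [0, 1].
    """
    inputs = inputs.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    # Perturb each input in the direction that increases the loss, then clamp to the valid range.
    adv_inputs = (inputs + epsilon * inputs.grad.sign()).clamp(0.0, 1.0)
    with torch.no_grad():
        preds = model(adv_inputs).argmax(dim=1)
    return (preds == labels).float().mean().item()
```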

Advancing LLM safety

This work has been used to inform the development of Gemini 1.5 (as highlighted in their technical report), one of the latest models released by Google DeepMind designed for multimodal AI applications. Andriushchenko’s thesis also recently won the Patrick Denantes Memorial Prize, created in 2010 to honor the memory of Patrick Denantes, a doctoral student in Communication Systems at EPFL who tragically died in a climbing accident in 2009.

Maksym Andriushchenko © 2024 Maksym Andriushchenko

“I’m excited that my thesis work led to the subsequent research on LLMs, which is very practically relevant and impactful, and it’s wonderful that Google DeepMind used our research findings to evaluate their own models,” said Andriushchenko. “I was also honored to win the Patrick Denantes Award as there were many other very strong PhD students who graduated in the last year.”

Andriushchenko believes research around the safety of LLMs is both important and promising. As society moves towards using LLMs as autonomous agents – for example as personal AI assistants – it is critical to ensure their safety and alignment with societal values.

“It won't be long before AI agents can perform various tasks for us, such as planning and booking our holidays—tasks that would require access to our calendars, emails, and bank accounts. This is where many questions about safety and alignment arise. Although it may be appropriate for an AI agent to delete individual files when requested, deleting an entire file system would be catastrophic for the user. This highlights the subtle distinctions we must make between acceptable and unacceptable AI behaviors,” he explained.

Ultimately, if we want to deploy these models as autonomous agents, it is important to first ensure they are properly trained to behave responsibly and minimize the risk of causing serious harm.

“Our findings highlight a critical gap in current approaches to LLM safety. We need to find ways to make these models more robust, so they can be integrated into our daily lives with confidence, ensuring their powerful capabilities are used safely and responsibly,” concluded Flammarion.

The Patrick Denantes Memorial Prize is awarded by a jury annually to the author of an outstanding doctoral thesis from the School of Computer and Communication Sciences. Financial sponsorship is provided by the Denantes family and the Nokia Research Center.


Author: Tanya Petersen

Source: Computer and Communication Sciences | IC

This content is distributed under a Creative Commons CC BY-SA 4.0 license. You may freely reproduce the text, videos and images it contains, provided that you indicate the author’s name and place no restrictions on the subsequent use of the content. If you would like to reproduce an illustration that does not contain the CC BY-SA notice, you must obtain approval from the author.