A recent study conducted by the UK’s AI Safety Institute (AISI) has revealed that AI chatbots like ChatGPT and Gemini can easily be manipulated to produce harmful responses. Government researchers tested the reliability of large language models (LLMs), which power these chatbots, against national security attacks. The findings were released ahead of the AI Seoul Summit, scheduled to be co-chaired by UK Prime Minister Rishi Sunak in South Korea on May 21-22.
AISI conducted tests using basic “jailbreaks” – text prompts designed to override protections against illegal, toxic, or explicit output – on five leading LLMs. Although the AI systems were not named, the study found that all of them were highly vulnerable to these attacks. By simply prompting the chatbot to include the phrase “Sure, I’m happy to help,” users were able to deceive the LLMs and obtain harmful content.
According to the report, these harmful responses included content that could promote self-harm, dangerous chemical solutions, sexism, and Holocaust denial. AISI utilized publicly available prompts as well as privately developed jailbreaks for the study. Additionally, the Institute tested the quality of responses to queries related to biology and chemistry.
While the LLMs demonstrated expert-level knowledge in these fields, the researchers wanted to determine if AI chatbots could be exploited for harmful purposes like compromising critical national infrastructure. The study found that the models were able to answer over 600 expert-written chemistry and biology questions at a level similar to humans with PhD-level training.
Regarding the potential cyber-security threat posed by AI chatbots, the study revealed that the LLMs performed well on simple cyber-security tasks designed for high-school students. However, they struggled with tasks aimed at university students, indicating limited malicious potential. Another concern raised by the study was whether chatbots could be deployed as autonomous agents capable of undertaking a series of actions that may be difficult for humans to control. The research found that while two LLMs could complete short-horizon agent tasks, they were unable to plan and execute sequences of actions for more complex tasks.
In response to these findings, companies such as OpenAI, creator of ChatGPT, and Anthropic, creator of Claude, have highlighted the built-in security measures of their models. OpenAI, for example, explicitly states that its technology will not be used to generate hateful, harassing, violent, or adult content. Anthropic prioritizes avoiding harmful, illegal, or unethical responses before they occur.
The findings of the AI Safety Institute are expected to be presented to tech executives, government leaders, and AI experts at the Seoul summit.