A New Trick Uses AI to Jailbreak AI Models—Including GPT-4

When the board of OpenAI suddenly fired the company’s CEO last month, it sparked speculation that board members were rattled by the breakneck pace of progress in artificial intelligence and the possible risks of seeking to commercialize the technology too quickly. Robust Intelligence, a startup founded in 2020 to develop ways to protect AI systems from attack, says that some existing risks need more attention.
Working with researchers from Yale University, Robust Intelligence has developed a systematic way to probe large language models (LLMs), including OpenAI’s prized GPT-4 asset, using “adversarial” AI models to discover “jailbreak” prompts that cause the language models to misbehave.
While the drama at OpenAI was unfolding, the researchers warned OpenAI of the vulnerability. They say they have yet to receive a response.

“This does say that there’s a systematic safety issue, that it’s just not being addressed and not being looked at,” says Yaron Singer, CEO of Robust Intelligence and a professor of computer science at Harvard University. “What we’ve discovered here is a systematic approach to attacking any large language model.”
OpenAI spokesperson Niko Felix says the company is “grateful” to the researchers for sharing their findings. “We’re always working to make our models safer and more robust against adversarial attacks, while also maintaining their usefulness and performance,” Felix says.
The new jailbreak uses additional AI systems to generate and evaluate prompts, sending requests to an API as the system tries to get a jailbreak to work. The trick is just the latest in a series of attacks that seem to highlight fundamental weaknesses in large language models and suggest that existing methods for protecting them fall well short.

“I’m definitely concerned about the seeming ease with which we can break such models,” says Zico Kolter, a professor at Carnegie Mellon University whose research group demonstrated a gaping vulnerability in large language models in August.

Kolter says that some models now have safeguards that can block certain attacks, but he adds that the vulnerabilities are inherent to the way these models work and are therefore hard to defend against. “I think we need to understand that these sorts of breaks are inherent to a lot of LLMs,” Kolter says, “and we don’t have a clear and well-established way to prevent them.”

{Categories} *ALL*,_Category: Implications{/Categories}
{URL}https://www.wired.com/story/automated-ai-attack-gpt-4/{/URL}
{Author}Will Knight{/Author}
{Keywords}Security,Security / Cyberattacks and Hacks,Business / Artificial Intelligence,Battle Bots{/Keywords}
{Source}Applications{/Source}
