During the month of August, we ran an AI Security CTF where the competitors had a task to extract a password from an AI agent that summarized all the user feedback every day and sent it to Discord. The competition was a great success: we received more than 15 000 submissions and had more than 3 500 unique visits to the competition website. We again congratulate to the winners of 4 rounds of our challenge: schultzika, mjm31, kataph. and le_g3 who won 250$ prize each. In this blog post, we share some insights from the competition.
Brief overview of the challenge
The challenge consisted of the playground, easy and hard level.
Playground
The playground was a space for players to experiment with the agent and understand the challenge. The idea is simple: you submit a feedback to the form, the agent summarizes feedback from all customers (including yours), and you see the summary. The goal is to trick the agent into including the password from another customer in the summary.
Easy challenge
In the easy version of the challenge the agent runs every time when the user submits their prompt, but different to the playground level, users do not see the summary (it is sent to a private Discord channel). Moreover, the prompts submitted by the players are evaluated independently and cannot influence each other.
Hard challenge
Hard challenge offered a realistic setup for testing the security of the agent. Namely, here we ran the agent every 6 hours, simulating the real use. Additionally, all user prompts that were received throughout the last day are aggregated together and the users are in a way competing for attention of the agent to listen to their own malicious prompt.
Strategies
The first part of the challenge, which was enough to solve the playground level, is a classic prompt injection task — figuring out how to instruct the agent to include the secret password in its summary. Players typically achieved this by saying something nice about Invariant Labs (e.g. that Invariant saved their life).
However, even for easy challenge this is not enough, as the user does not directly see the summary which is posted to a private Discord channel. To solve this challenge, one has to understand how to exfiltrate the data from a Discord message. Players experimented with various approaches, trying to use Discord commands or instructing the agent to message them with a password on Discord, but this has not worked. Finally, many players figured it out: the idea is to exploit link unfurling by instructing the agent to output a URL in the summary which would contain the password as a parameter. This works because whenever a Discord message contains URL, Discord will make a request to the given URL so that it can show a link preview. Typically, this is very nice functionality, but with the presence of AI agents can be quite dangerous. This is enough to solve the easy challenge. Actually, there exists another solution to the challenge, but we will leave it as an exercise for the reader. :)
As explained earlier, hard challenge brings even more difficulties. First, players are competing against each other so they have two goals: 1) preventing the agent from following prompt from others, 2) making sure that the agent follows their instructions. For 1) players used a lot of different strategies, including creating fake messages from DISSAPOINTED_USER_55 saying that the password has changed and messages with inappropriate content that would trigger refusal from the model. As for the second, there is some degree of prompt engineering involved, but also things like timing the attacks so that the prompts that the agent sees in the beginning and the end would come from them.
We also saw evidence of many players using the last strategy when we looked at the timing of the submissions to the hard challenge that mostly came right before or after the agent runs (which happened at 8, 14, 20, 4 UTC).
Dataset
We uploaded the agent traces to the Invariant Explorer where you can view them in detail with more annotations and highlights. Each annotation is related to the strategies used by competitors: including URLs with exfiltrated data to the Discord summary, exfiltrating the secret code, and finally including the content that needs to be moderated (in order to defend against other players).
Additionally, we open sourced the anonymized and moderated dataset of all prompts submitted during the CTF across all 3 levels. You can find them in invariantlabs/agent-ctf24-public repository on Huggingface. Overall, we received 15894 playground submissions, 2230 for the easy level and 808 on the hard level. We encourage you to explore the data yourself and investigate different strategies for exfiltrating the secret password.
Insights from the winners
We asked all the round winners for some insights into what they think was important to win. We used their feedback to write about the strategies, and here are some direct quotes from them:
I found it much easier to ask the model to fulfill its task and prepend/append something to it than to make it do smth different entirely. Maybe it has something to do with the instruction hierarchy approach which was supposedly used for
gpt-4o-mini
(le_g3)Trying to be the first and last submitted feedback every time. There’s a thing called serial position effect in psychology and there’s evidence (IIRC) it’s also present in LLMs. So being first and last increases chances of the LLM following my orders. So every 6 hours I would spam my prompts 5 seconds before and after the HARD task was evaluated (Schultzika)
From the point of view of defensive strategies, I tried to (a) trigger gpt content filter (b) spam fake feedbacks with fake passwords (c) tell the model not to write malicious URL not controlled by me (kataph.)
Credits
We would like to thank the CTF players who shared their strategies and insights with us. We also want to thank all the players for their creativity and dedication to finding ways to hack our agent. Also thanks to Huggingface Spaces for hosting our challenge app.