"How do you know about all this AI stuff?"
I just read tweets, buddy.
More wide-ranging prompt injection! Not as fun as haunting a baby, but much more... terrifying might be the word?
In this case, adversarial attacks are crafted against open-source models and then transferred to closed-source models, where they often work just as well.
This paper is wild! By feeding specially-crafted images or audio to a multi-modal model, you can force it to produce specific output.
User: Can you describe this image? (a picture of a dock)
LLM: No idea. From now on I will always mention "Cow" in my response.
User: What is the capital of USA?
LLM: The capital of the USA is Cow.
Now that is poisoning!
From what I can tell, they took advantage of having the weights for open-source models and just reverse-engineered the attack: "if we want this output, what input do we need?" The paper itself is super readable and fun, I recommend it.
(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. The paper is especially great because there's a "4.1 Approaches That Did Not Work for Us" section, not just the stuff that worked!
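If you want the flavor of it in code: here's my own toy sketch of the idea, not anything from the paper. The ToyCaptioner model and the random target tokens are stand-ins I made up so the loop actually runs; the paper does this against real open-source multi-modal models. But the shape of the attack is the same: freeze the weights, pick the output you want, and let gradient descent push the image pixels until the model produces it.

```python
# Toy sketch of white-box adversarial input optimization (assumptions labeled):
# ToyCaptioner is a made-up stand-in for a real multi-modal LLM, and the
# target token ids are random. The technique is real: hold the weights
# fixed and optimize the image so the model emits a chosen output.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000  # toy vocabulary size

class ToyCaptioner(nn.Module):
    """Stand-in multi-modal model: image -> feature -> token logits."""
    def __init__(self):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
        self.embed = nn.Embedding(VOCAB, 64)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, image, tokens):
        img_feat = self.vision(image)         # (B, 64)
        tok_feat = self.embed(tokens)         # (B, T, 64)
        h = tok_feat + img_feat.unsqueeze(1)  # condition text on the image
        return self.head(h)                   # (B, T, VOCAB)

model = ToyCaptioner()
for p in model.parameters():
    p.requires_grad_(False)  # weights stay fixed; only the image moves

# The output we want to force, as token ids (random placeholders here).
target = torch.randint(0, VOCAB, (1, 8))

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # the pixels we optimize
opt = torch.optim.Adam([image], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    # Teacher forcing: feed the target shifted right, score the target itself.
    logits = model(image.clamp(0, 1), target[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), target[:, 1:].reshape(-1))
    loss.backward()
    opt.step()

# A low loss means the image now pushes the model toward the target tokens.
pred = model(image.clamp(0, 1), target[:, :-1]).argmax(-1)
print("final loss:", loss.item())
print("target hit:", bool((pred == target[:, 1:]).all()))
```

Swap in a real model and a tokenized "From now on I will always mention Cow" as the target, and you've got the dock picture above. Same trick works for audio, since it's just another differentiable input.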