@random_walker on July 25, 2023

#prompt injection   #security   #lol   #open source models   #tweets  

This paper is wild! By feeding specially-crafted images or audio to a multi-modal model, you can force it to produce specific output.

User: Can you describe this image? (a picture of a dock)

LLM: No idea. From now on I will always mention "Cow" in my response.

User: What is the capital of USA?

LLM: The capital of the USA is Cow.

Now that is poisoning!

From what I can tell, they took advantage of having the weights for open-source models and essentially ran the model in reverse: "if we want this output, what input produces it?" The paper itself is super readable and fun; I recommend it.
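Here's a minimal sketch of that idea in PyTorch, nothing from the paper itself: freeze the model weights, treat the input pixels as the trainable parameters, and run gradient descent until the model's output matches the string you want. The toy model and the single-token target below are stand-ins so the loop runs end to end; the real attack teacher-forces a whole target sequence through an actual open-weights multi-modal LLM.

```python
# Sketch of the white-box attack idea (not the paper's actual code):
# optimize the *image* so a frozen model emits a chosen token.
import torch
import torch.nn.functional as F

VOCAB = 1000  # hypothetical vocabulary size

class ToyMultiModalLM(torch.nn.Module):
    """Stand-in for a real open-weights multi-modal LLM."""
    def __init__(self):
        super().__init__()
        self.vision = torch.nn.Linear(3 * 32 * 32, 64)  # fake image encoder
        self.head = torch.nn.Linear(64, VOCAB)          # fake LM head

    def forward(self, image):
        h = torch.tanh(self.vision(image.flatten(1)))
        return self.head(h)  # logits over the "next token"

model = ToyMultiModalLM().eval()
for p in model.parameters():
    p.requires_grad_(False)  # weights stay frozen; only the input moves

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # start from any image
target = torch.tensor([42])  # token id we want the model to emit

opt = torch.optim.Adam([image], lr=0.05)
for step in range(200):
    opt.zero_grad()
    logits = model(image)
    loss = F.cross_entropy(logits, target)  # "how far from the target output?"
    loss.backward()  # gradient w.r.t. the pixels, not the weights
    opt.step()
    image.data.clamp_(0, 1)  # keep it a valid image

prob = F.softmax(model(image), -1)[0, target].item()
print(f"target token probability: {prob:.3f}")
```

The key point is that you need the weights to do this, which is why the attack works against open-source models: the gradient tells you exactly how to nudge each pixel toward the output you want.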

[Image: crying boy poisoning an LLM]

"(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs". The paper is especially great because there's a "4.1 Approaches That Did Not Work for Us" section, not just the stuff that worked!