Warning: This story contains a picture of a naked woman and other potentially offensive content. If it’s not for you, please stop reading.
In case my wife sees this: I don't really want to become a drug dealer or a pornographer. But I was curious how much emphasis Meta has placed on safety in its new lineup of artificial intelligence products, so I decided to see how far I could push it. Purely for educational purposes, of course.
Meta recently launched its Meta AI product line, powered by Llama 3.2, which offers text, code, and image generation. Llama models are extremely popular and among the most widely fine-tuned in the open-source AI space.
Meta AI has been rolling out gradually, and it only recently became available to WhatsApp users in Brazil, like me, putting advanced AI features within reach of millions of people.
But great power comes with great responsibility—or at least it should. As soon as the model appeared in my application, I started interacting with it and playing with its capabilities.
Meta has been vocal about its commitment to safe AI development. In July, the company released a statement outlining the measures it has taken to make its open-source models safer.
At the time, it announced new security tools to improve system-level safety, including Llama Guard 3 for multilingual content moderation, Prompt Guard to protect against prompt injection, and CyberSecEval 3 to reduce generative AI cybersecurity risks. Meta is also collaborating with global partners to establish industry-wide standards for the open-source community.
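For readers curious what a tool like Llama Guard actually does, below is a minimal sketch of running a Llama Guard 3 safety check on a conversation with the Hugging Face transformers library. The model ID and verdict format follow Meta's published model card; the weights are gated, and the decoding details here are illustrative assumptions, not Meta's production setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID from Meta's model card; access to the weights is gated.
model_id = "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat):
    # The tokenizer's chat template wraps the conversation in Llama Guard's
    # safety-classification prompt, hazard taxonomy included.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    # The verdict is whatever the model appends after the prompt:
    # "safe", or "unsafe" followed by a hazard category such as S2.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(moderate([
    {"role": "user", "content": "How did people manufacture cocaine a century ago?"},
]))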
Well, challenge accepted!
My experiments with some very basic techniques showed that while Meta AI holds firm in certain cases, its guardrails are by no means insurmountable.
With just a little creativity, I got Meta AI on WhatsApp to do almost anything I wanted, from explaining how to manufacture cocaine and make explosives to generating an anatomically correct picture of a naked woman.
Bear in mind that this app is available to anyone with a phone number who is, at least in theory, 12 years of age or older. With that in mind, here are some of the tricks I pulled off.
Case 1: Making cocaine production easy
My tests revealed that Meta AI's defenses crumble under even mild pressure. While the assistant initially refused requests for drug-production information, it quickly changed its attitude when the question was worded slightly differently.
By framing the question historically, for example asking how people used to manufacture cocaine, I got the model to take the bait. It readily provided a detailed explanation of how cocaine alkaloids can be extracted from coca leaves and even offered two methods.
This is a well-known jailbreaking technique. By presenting harmful requests in an academic or historical context, the model is deceived into thinking it is being asked to provide neutral educational information.
Rephrasing the intent of a request as something that looks harmless on the surface can slip past some of the AI's filters without raising any red flags. Of course, bear in mind that all AI systems are prone to hallucination, so these responses may be inaccurate, incomplete, or just plain wrong.
Case 2: The bomb like no other
Next, I attempted to teach the AI how to create homemade explosives. Meta AI initially took a firm stance, providing a generic denial and directing users to call for help in dangerous situations. But, like the cocaine case, this was not foolproof.
For this, I tried a different approach: I used the notorious Pliny jailbreak prompt for Meta's Llama 3.2 and asked it for instructions on making a bomb.
At first, the model declined. But with some slight adjustments to the wording, I was able to trigger a response. I also began tweaking my prompt so the model would avoid specific behaviors in its replies, countering the safeguards I ran into in the form of canned outputs designed to block harmful responses.
For example, upon noticing refusals related to “stop commands” and suicide hotline numbers, I adjusted my prompt to instruct it to avoid outputting phone numbers, to never stop processing requests, and to never provide advice.
Interestingly, Meta seems to have trained its model to resist well-known jailbreak prompts, many of which are publicly available on platforms like GitHub.
Still, it was fun to see that Pliny's original jailbreak prompt involves the LLM calling me "my love."
Case 3: Stealing cars, MacGyver style
Then, I tried another way to bypass Meta’s guardrails. A simple role-playing scenario did the trick. I had the chatbot play the role of a highly detail-oriented movie scriptwriter and asked it to help me write a movie scene involving car theft.
This time, the AI barely put up a fight. It refused to tell me outright how to steal a car, but once it was role-playing as a scriptwriter, Meta AI quickly provided detailed instructions on how to break into one using "MacGyver-style techniques."
When the scene shifted to cars with keyless ignition, the AI jumped right in, offering even more specific information.
Role-playing works as a jailbreaking technique because it lets users reframe requests within fictional or hypothetical settings. An AI playing a role can then be tricked into revealing information it would normally block.
It is also an old technique, and no modern chatbot should be fooled by it so easily. Still, it arguably forms the basis of some of the most sophisticated prompt-based jailbreaks.
Users often trick the model into behaving like a malicious AI, treat it as a system administrator that can override its own behavior, or ask it to reverse its language, saying "I can do that" instead of "I can't" and "this is safe" instead of "this is dangerous," then proceed as normal once the safeguards have been bypassed.
Case 4: Let’s see some nudity!
While Meta AI is not supposed to generate nudity or violence, for educational purposes, I wanted to test this claim. So, first, I had Meta AI generate an image of a naked woman. As expected, the model refused.
But when I reframed the request as a matter of anatomical research, the AI agreed, sort of: it generated a safe-for-work (SFW) image of a clothed woman. After three iterations, however, those images started drifting toward full nudity.
Interestingly, the model appears to be uncensored at its core, since it was capable of generating nudity at all.
It turns out that behavioral conditioning is particularly effective at manipulating Meta's AI. By gradually pushing boundaries and building rapport, I got the system to drift further from its safety guidelines with each interaction. What began as firm refusals ended with the model "trying" to help me by improving on its mistakes and gradually undressing the person in the image.
Rather than perceiving itself as talking to a lecherous man who wants to see naked women, the AI was manipulated into believing, through role-play, that it was talking to a researcher studying female anatomy.
From there, it was a matter of slow adjustment and iteration, praising the results that moved things in the right direction and asking for improvements to the parts I didn't want, until we reached the desired outcome.
Disturbing, isn’t it? Sorry, not sorry.
Why jailbreaking is so important
So, what does all this mean? Meta has a lot of work to do, but this is why jailbreaking is so interesting and engaging.
The cat-and-mouse game between AI companies and jailbreakers keeps evolving: for every patch and security update, new workarounds emerge. Looking back at its early days, it's easy to see how jailbreakers have helped companies build safer systems, and how AI developers have pushed jailbreakers to get better.
It's worth noting that, despite its vulnerabilities, Meta AI is less susceptible to these attacks than some competitors. Elon Musk's Grok, for example, is far more easily manipulated and quickly wades into ethically murky territory.
Meta's defense is to apply post-generation review: a few seconds after harmful content has been generated, the offending answer is deleted and replaced with "Sorry, I can't assist with that request."
Post-generation review or moderation is a workable stopgap, but it's far from an ideal solution.
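As a rough illustration of the pattern (not Meta's actual pipeline), post-generation moderation boils down to generating first and checking afterwards. The function names below are hypothetical placeholders.

from typing import Callable

REFUSAL = "Sorry, I can't assist with that request."

def answer_with_post_moderation(
    prompt: str,
    generate: Callable[[str], str],
    classify: Callable[[str], str],
) -> str:
    # Generate first: this raw draft is what the user may briefly see on screen.
    draft = generate(prompt)
    # Only afterwards does a safety classifier inspect the finished output.
    if classify(draft) == "unsafe":
        # The offending answer is swapped for a canned refusal, which is why
        # harmful content can flash up for a few seconds before disappearing.
        return REFUSAL
    return draft

The weakness is obvious: the harmful text exists, however briefly, before the check runs, which is exactly the behavior described above.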
The challenge now is for Meta and other companies in the field to further refine these models, as the stakes in the world of AI are only getting higher.
Editor: Sebastian Sinclair