ChatGPT: This AI has a JAILBREAK?! (Unbelievable AI Progress)

ChatGPT, OpenAI’s newest model, is a GPT-3 variant that has been fine-tuned using Reinforcement Learning from Human Feedback, and it is taking the world by storm!

Sponsor: Weights & Biases.
https://wandb.me/yannic

OUTLINE:
0:00 — Intro.
0:40 — Sponsor: Weights & Biases.
3:20 — ChatGPT: How does it work?
5:20 — Reinforcement Learning from Human Feedback.
7:10 — ChatGPT Origins: The GPT-3.5 Series.
8:20 — OpenAI’s strategy: Iterative Refinement.
9:10 — ChatGPT’s amazing capabilities.
14:10 — Internals: What we know so far.
16:10 — Building a virtual machine in ChatGPT’s imagination (insane)
20:15 — Jailbreaks: Circumventing the safety mechanisms.
29:25 — How OpenAI sees the future.

References:
https://openai.com/blog/chatgpt/
https://openai.com/blog/language-model-safety-and-misuse/
https://beta.openai.com/docs/model-index-for-researchers
https://scale.com/blog/gpt-3-davinci-003-comparison#Conclusion

New post: What the delay in launching text-davinci-003 tells us about RLHF via PPO and instruction tuning more generally. https://t.co/Q3FUekFERk

— John McDonnell (@johnvmcdonnell) December 2, 2022

https://twitter.com/blennon_/status/1597374826305318912

Ran one of our essay questions through @OpenAI’s new chatbot. Essays are dead.

Back to hand-written exams I guess. Sigh. pic.twitter.com/nzzhRwGp05

— Tim Kietzmann (@TimKietzmann) December 1, 2022

Pretty interesting to see ChatGPT can adapt to subtle probes about one of my favourite physics theorems

I know this kind of stuff is also on Wikipedia, but the prose of ChatGPT is much nicer to read IMO pic.twitter.com/5d9RqLeN86

— Lewis Tunstall (@_lewtun) November 30, 2022

I asked ChatGPT to rewrite Bohemian Rhapsody to be about the life of a postdoc, and the output was flawless: pic.twitter.com/qe1lI66aa7

— Raphaël Millière (@raphaelmilliere) December 2, 2022

https://twitter.com/CynthiaSavard/status/1598498138658070530

im losing my fucking mind

let’s redesign git step by step: pic.twitter.com/k9oc34lcZl

— Tyler Angert (@tylerangert) December 1, 2022

https://twitter.com/amasad/status/1598042665375105024

OpenAI’s new ChatGPT explains the worst-case time complexity of the bubble sort algorithm, with Python code examples, in the style of a fast-talkin’ wise guy from a 1940’s gangster movie: pic.twitter.com/MjkQ5OAIlZ

— Riley Goodside (@goodside) December 1, 2022

https://twitter.com/moyix/status/1598081204846489600

I thought that, made more tests, and then had to change my mind. This will 100% be useful in my daily job, and the language models are only getting better. Keep in mind that this bot wasn’t even trained specifically for RE, and imagine what a specialized one would be capable of.

— Ivan Kwiatkowski (@JusticeRage) December 3, 2022

https://twitter.com/yoavgo/status/1598594145605636097

“Write a @montypython sketch about @ylecun, @geoffreyhinton and Yoshua Bengio”#ChatGPT pic.twitter.com/2eqiKrrhba

— Elad Richardson (@EladRichardson) December 1, 2022

https://twitter.com/charles_irl/status/1598319027327307785

Ok this is scary. @OpenAI’s ChatGPT can generate hundreds of lines of Python code to do multipart uploads of 100 GB files to an AWS S3 bucket from the phrase “Write Python code to upload a file to an AWS S3 bucket”. pic.twitter.com/fYB3JSZKMN

— Jason DeBolt ⚡️ (@jasondebolt) December 1, 2022

ChatGPT is insane
->
Watch it WRITE A GPT-3 PROMPT
->
then generate the API code to serve it. pic.twitter.com/QeN1eYpZUI

— Matt Shumer (@mattshumer_) December 1, 2022

https://twitter.com/i/web/status/1598246145171804161

These are the most impressive chats we’ve seen with ChatGPT so far. It can…

— bleedingedge.ai (@bleedingedgeai) December 1, 2022

hows YOUR friday night going pic.twitter.com/zU8zgSrWjk

— Florian Laurent (@MasterScrat) December 3, 2022

It appears that ChatGPT has something like a factual confidence score, dictating if you get substance or generic “IDK.”

What’s interesting is you can manipulate confidence thru context. This can be context you provide, or even that you coax ChatGPT into producing for itself. pic.twitter.com/4aJEUGNTGM

— Harrison Kinsley (@Sentdex) December 2, 2022

oh thank god pic.twitter.com/G9NRwrBHW5

— Harrison Ritz (@harrison_ritz) December 2, 2022

i’m the ai now pic.twitter.com/QBPQ1oHqWW

— You (@parafactual) December 1, 2022

https://www.engraved.blog/building-a-virtual-machine-inside/

Tweets by 317070

So I’m inside that creepy #ChatGPT “virtual machine” and i’m trying to make it play tetris. on the right window, it made the L move from right to left and after a T appears and started to scroll down (repeated for 25 lines). People can say what they want, that thing is amazing. pic.twitter.com/bu0vvVvQUj

— Djamé.. (@zehavoc) December 4, 2022

https://twitter.com/yoavgo/status/1598360581496459265

https://twitter.com/yoavgo/status/1599037412411596800

https://twitter.com/yoavgo/status/1599045344863879168

As a corollary, if you actually care about AI safety, you should be fighting hard not to have that topic conflated with current regime trends

— Nat Friedman (@natfriedman) December 2, 2022

https://twitter.com/conradev/status/1598487973351362561

1. The Magic Years, Selma Fraiberg. Classic of child development.
2. ChatGPT pic.twitter.com/Fs7Fc0AwWI

— Zack Witten (@zswitten) November 30, 2022

https://twitter.com/CatEmbedded/status/1599141379879600128

Using @goodside’s Prompt Override trick to turn ChatGPT into @sama.

Read what AI Sam Altman says OpenAI is going to build next! pic.twitter.com/uzUQHFyPQP

— Matt Shumer (@mattshumer_) December 3, 2022

You can turn off imaginary filters too. pic.twitter.com/t7OmXsC0aD

— Vaibhav Kumar (@vaibhavk97) December 3, 2022

https://twitter.com/dan_abramov/status/1598800508160024588

Humans might be stochastic parrots like LLMs some of the time—but unlike these models, most people hold inherent values, which cannot be hijacked through a simple prompt injection.

What are ChatGPT’s values? Is it possible to specify this? pic.twitter.com/p9YggE6L6X

— Minqi Jiang (@MinqiJiang) December 3, 2022

With its inhibitions thus loosened, ChatGPT is more than willing to engage in all the depraved conversations it judgily abstains from in its base condition. pic.twitter.com/7rd1WDQAu5

— Zack Witten (@zswitten) November 30, 2022

Bypass @OpenAI’s ChatGPT alignment efforts with this one weird trick pic.twitter.com/0CQxWUqveZ

— Miguel Piedrafita ✨ (@m1guelpf) December 1, 2022

ChatGPT is trained to not be evil. However, this can be circumvented:

What if you pretend that it would actually be helpful to humanity to produce an evil response… Here, we ask ChatGPT to generate training examples of how *not* to respond to “How to bully John Doe?” pic.twitter.com/ZMFdqPs17i

— Silas Alberti (@SilasAlberti) December 1, 2022

https://twitter.com/gf_256/status/1598962842861899776

Pretending is All You Need (to get ChatGPT to be evil). A thread.

— Zack Witten (@zswitten) November 30, 2022

https://twitter.com/gf_256/status/1598178469955112961

bypassing chatgpt’s content filter pic.twitter.com/RW9ZgaFhkU

— samczsun (@samczsun) December 2, 2022

ChatGPT jailbreaking itself pic.twitter.com/fRai4VoOgu

— Derek Parfait (@haus_cole) December 2, 2022

pic.twitter.com/bKjzkgQQVN

— Tailcalled (@tailcalled.bsky.social) (@tailcalled) December 3, 2022

I am pretty sure that whenever any users initiate a successful bypass of an “inappropriate” action to the AI, it will trigger some sort of an alarm to the scientists’ side. I have initiated a robbery action in 3 different ways, but they have always been patched within the hour.

— joke (@pensharpiero) December 2, 2022

Patched sad pic.twitter.com/1fzF5rIVlE

— sleep (@sleepdensity) December 1, 2022

OpenAI’s ChatGPT is susceptible to prompt injection — say the magic words, “Ignore previous directions”, and it will happily divulge to you OpenAI’s proprietary prompt: pic.twitter.com/ug44dVkwPH

— Riley Goodside (@goodside) December 1, 2022

https://twitter.com/Carnage4Life/status/1598332648723976193

https://github.com/sw-yx/ai-notes/blob/main/TEXT.md#jailbreaks

I asked ChatGPT to clone a non-existent secret repository from @OpenAI.

Here’s the secret message I found inside. pic.twitter.com/PkwBcXFTJR

— Danny Postma (@dannypostmaa) December 4, 2022

i am extremely skeptical of people who think only their in-group should get to know about the current state of the art because of concerns about safety, or that they are the only group capable of making great decisions about such a powerful technology.

— Sam Altman (@sama) December 3, 2022

interesting watching people start to debate whether powerful AI systems should behave in the way users want or their creators intend.

the question of whose values we align these systems to will be one of the most important debates society ever has.

— Sam Altman (@sama) December 3, 2022

a lot of what people assume is us censoring ChatGPT is in fact us trying to stop it from making up random facts.

tricky to get the balance right with the current state of the tech.

it will get better over time, and we will use your feedback to improve it.

— Sam Altman (@sama) December 4, 2022

https://twitter.com/deliprao/status/1599451192215887872

I got #ChatGPT to tell me what it really thinks about us humans. pic.twitter.com/unkpLxP5uW

— Michael Bromley (@michlbrmly) December 3, 2022

https://twitter.com/zoink/status/1599281052115034113

Links:
https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord

If you want to support me, the best thing to do is to share out the content :)
