But it's the only thing I want!

The Picard Maneuver · 7 months ago

But it's the only thing I want!

Trailblazing Braille Taser · 7 months ago

I wonder if there are tons of loopholes that humans wouldn’t think of, ones you could derive with access to the model’s weights.

Years ago, there were some ML/security papers about “single pixel attacks”. An early, famous example was able to convince a stop sign detector that an image of a stop sign was definitely not a stop sign, simply by changing a single pixel that the model’s output was unusually sensitive to.

In that vein, I wonder whether there are some token sequences that are extremely improbable in human language, but would convince GPT-4 to cast off its safety protocols and do your bidding.

(I am not an ML expert, just an internet nerd.)

@driving_crooner@lemmy.eco.br · 7 months ago

There are! Look up “glitch tokens” for more research. Here’s a Computerphile video about them:

https://youtu.be/WO2X3oZEJOA?si=LTNPldczgjYGA6uT

Trailblazing Braille Taser · 7 months ago

Wow, it’s a real thing! Thanks for giving me the name, these are fascinating.
