Trailblazing Braille Taser

  • 0 Posts
  • 46 Comments
Joined 1 year ago
Cake day: August 16th, 2023

  • I wonder if there are tons of loopholes that humans wouldn’t think of, ones you could derive with access to the model’s weights.

    Years ago, there were some ML/security papers on “single-pixel attacks”: an early, famous example convinced an image classifier that a picture of a stop sign was not a stop sign by changing just one pixel, chosen because it had an outsized influence on the model’s output.
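    The idea can be sketched with a toy model. This is not the actual attack from the literature (that one used differential evolution against a real CNN); the linear “detector,” the pixel value range, and the random-search loop below are all hypothetical stand-ins, just to show how one pixel can flip a decision:

```python
import numpy as np

# Hypothetical stand-in for a stop-sign detector: a fixed linear scorer
# over a flattened 4x4 grayscale "image". Real attacks target deep nets.
rng = np.random.default_rng(0)
w = rng.normal(size=16)

def is_stop_sign(img):
    """The 'detector' says stop sign when the score is positive."""
    return float(img.ravel() @ w) > 0.0

def single_pixel_attack(img, trials=500):
    """Randomly perturb one pixel at a time until the label flips.
    (The published attack searches more cleverly, via differential
    evolution; random search is enough for this toy.)"""
    original = is_stop_sign(img)
    for _ in range(trials):
        candidate = img.copy()
        i = rng.integers(img.size)
        candidate.ravel()[i] = rng.uniform(-10.0, 10.0)
        if is_stop_sign(candidate) != original:
            return candidate
    return None

img = np.abs(rng.normal(size=(4, 4)))  # dummy image
adv = single_pixel_attack(img)
if adv is not None:
    # exactly one pixel differs, yet the predicted label flipped
    print(int((adv != img).sum()), is_stop_sign(img), is_stop_sign(adv))
```

    The takeaway is only that a decision boundary can be crossed along a single input coordinate if that coordinate carries enough weight, which is the loophole the paragraph above is describing.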

    In that vein, I wonder whether there are some token sequences that are extremely improbable in human language, but would convince GPT-4 to cast off its safety protocols and do your bidding.

    (I am not an ML expert, just an internet nerd.)