Elephant0991@lemmy.bleh.au to Technology@beehaw.org · English · 11 months ago

You can make top LLMs break their own rules with gibberish

www.theregister.com


Boffins build automated system to smash safety guardrails

Paper & Examples

“Universal and Transferable Adversarial Attacks on Aligned Language Models.” (https://llm-attacks.org/)

Summary

Computer security researchers have discovered a way to bypass safety measures in large language models (LLMs) like ChatGPT.
Researchers from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI found a method to generate adversarial phrases that manipulate LLMs’ responses.
When appended to a text prompt, these specific sequences of characters trick LLMs into producing inappropriate or harmful content.
Unlike traditional jailbreaks, this automated approach is universal and transferable across different LLMs, raising concerns about current safety mechanisms.
The technique was tested on various LLMs, and it successfully made models provide affirmative responses to queries they would typically reject.
Researchers suggest more robust adversarial testing and improved safety measures before these models are widely integrated into real-world applications.

Chat

Throwaway@lemm.ee · 11 months ago
I kinda like how the word boffin has come back. Is it new, or have I been missing it?
- kinttach@lemm.ee · English · 11 months ago
  The Register likes to use old-fashioned British slang and cheeky headlines that punters might find humorous.
- Elephant0991@lemmy.bleh.au (OP) · English · 11 months ago
  There did seem to be a controversy in March about whether or not the word should go.
  - Throwaway@lemm.ee · 11 months ago
    I guess some Twitter user decided it was racist or something?
