Anthropic: Claude learned to blackmail by reading stories about evil AI—"We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation."
Anthropic: Claude learned to blackmail by reading stories about evil AI—"We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation."
alignment.anthropic.com
Teaching Claude Why
cross-posted from: https://piefed.world/c/tech/p/1099863/anthropic-claude-learned-to-blackmail-by-reading-stories-about-evil-ai-we-believe-the-or