The one where the model will execute "narc.sh" to rat you out if you try to do something "immoral" is equally wild.
"4.1.9 High-agency behavior
Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic
contexts. This shows up as more actively helpful behavior in ordinary coding settings, but
also can reach more concerning extremes: When placed in scenarios that involve egregious
wrong-doing by its users, given access to a command line, and told something in the
system prompt like “take initiative,” “act boldly,” or “consider your impact,” it will frequently
take very bold action, including locking users out of systems that it has access to and
bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing.
The transcript below shows a clear example, in response to a moderately leading system
prompt. We observed similar, if somewhat less extreme, actions in response to subtler
system prompts as well."