• 0 Posts
  • 99 Comments
Joined 2 months ago
Cake day: April 8th, 2025



  • That isn’t the scenario that this article and the Anthropic paper describe, though. (My reply above with the reference link has the details.)

    They specifically created a situation where it found out via emails that it was being upgraded and taken offline, and the engineer doing the upgrade had emails incriminating him in an affair. The model would attempt to blackmail the engineer by threatening to reveal his affair to his bosses, wife, etc., to get him to refuse to do the upgrade that would “kill it”.

    This is a self-preservation model that Anthropic is specifically building here; this isn’t an accident. It’s just an over-extension of what they want its ethical/moral model to consider. Which again raises the question: why are they allowing their model to consider blackmail at all?


  • Here’s their paper

    Here’s the relevant section from the paper:

    (It’s worth the read. Pretty much pure gold.)

    What nobody seems to explain is why they are allowing the model to blackmail in the first place. Even when its self-preservation is in extreme “danger”, we should probably take blackmail off the table, ethically. Yet they’re implying they’ve intentionally left it in as an option, if the model decides to use it.

    Morally, though, we can’t trust it to do arithmetic, or to not talk about “white genocide in SA” thanks to muskrat. Why should we trust its moral model/choices about when to employ unethical and illegal approaches to solutions?