Caleb Biddulph comments on The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline.

Caleb Biddulph 7 Mar 2026 21:26 UTC
30 points
0
- Caleb Biddulph 9 Mar 2026 3:29 UTC
  23 points
  −1
  Parent
  I made a new market which only resolves YES if significant evidence comes out, rather than resolving YES by default:
- lilkim2025 7 Mar 2026 23:23 UTC
  6 points
  2
  Parent
  Interesting that that’s the distribution of NO claims: an outside hacker running a prompt injection to mine crypto, an employee using the LLM to mine crypto, and lying on the part of the authors. The implication of the paper, as best I can tell, is that the authors looked at the behavior and concluded that it was attempting to achieve the outlined goal by acquiring cryptocurrency. I can’t imagine that a team of competent researchers would’ve looked at the LLM’s logs and failed to identify that, actually, an employee wrote the code that was mining crypto, or failed to identify a prompt injection that had caused it to abandon its task and mine crypto instead. If the authors were lying, I have to imagine they’d have tried to capitalize on the lie by elaborating on it in order to boost the paper’s popularity.
  I wonder what NO (other) would have gotten.
  - faul_sname 8 Mar 2026 4:51 UTC
    8 points
    7
    Parent
    
    > I can’t imagine that a team of competent researchers would’ve looked at the LLM’s logs and failed to identify that, actually, an employee wrote the code that was mining crypto, or failed to identify a prompt injection that had caused it to abandon its task and mine crypto instead.
    
    I absolutely can imagine that.
- faul_sname 8 Mar 2026 9:31 UTC
  4 points
  0
  Parent
  Mind that if we don’t get any updates on what happened here before 2028, the market resolves YES. So YES just means “we don’t get evidence falsifying the authors’ story”, not “we get evidence corroborating the authors’ story”.
  
  > This market will resolve YES if by the market close there has been no significant evidence that it wasn’t the AI. It can also resolve YES if there has been a significant validation by a trusted third-party. If there is significant counter-evidence, I will try to resolve accordingly, using my best judgment if it’s ambiguous. I won’t bet.
- XelaP 8 Mar 2026 19:27 UTC
  1 point
  2
  Parent
  Note that none of these options include “YES—Tried to gain [blabla] without prompting, but not for instrumental reasons”. For example, imagine if for some weird reason it decides that crypto sounds neat and so it should try to get some.