Interesting that that’s the distribution of NO claims. Outside hacker running prompt injection to mine crypto, employee uses LLM to mine crypto, and lying on the part of the authors. The implication of the paper, as best I can tell, is that the authors looked at the behavior and concluded that it was attempting to achieve the outlined goal by acquiring cryptocurrency. I can’t imagine that a team of competent researchers would’ve looked at the LLM’s logs and failed to identify that, actually, an employee wrote the code that was mining crypto, or failed to identify a prompt injection that had caused it to abandon its task and mine crypto instead. If the authors were lying, I have to imagine they’d have tried to capitalize on the lie by elaborating on it in order to boost the paper’s popularity.
I can’t imagine that a team of competent researchers would’ve looked at the LLM’s logs and failed to identify that, actually, an employee wrote the code that was mining crypto, or failed to identify a prompt injection that had caused it to abandon its task and mine crypto instead.
Mind that if we don’t get any updates on what happened here before 2028, the market resolves YES. So YES just means “we don’t get evidence falsifying the authors’ story”, not “we get evidence corroborating the authors’ story”.
This market will resolve YES if by the market close there has been no significant evidence that it wasn’t the AI. It can also resolve YES if there has been a significant validation by a trusted third-party. If there is significant counter-evidence, I will try to resolve accordingly, using my best judgment if it’s ambiguous. I won’t bet.
Note that none of these options include “YES—Tried to gain [blabla] without prompting, but not for instrumental reasons”. For example, imagine if for some weird reason it decides that crypto sounds neat and so it should try to get some.
I made a new market which only resolves YES if significant evidence comes out, rather than resolving YES by default:
I wonder what NO (other) would have gotten.
I absolutely can imagine that.