My prior is that almost any decision which is not explicitly absurd can be given a cohesive and somewhat defensible justification when written up by intelligent people.
This seems like a bad prior, or not holding yourself to a high enough standard of discernment, or something? [Something like if you’re sufficiently rational you should be able to see through post-hoc reasoning, and the core reasoning here does not seem post-hoc.]
Some evidence for this not being a pure PR statement is that Holden has been gesturing in this direction for a while. The document is of course some part PR statement and some part transparent reasoning. I do, however, think this document is made in good faith, because there exist better versions of this document for Anthropic’s goals if they are not acting in good faith, and I think they would have found those versions.
I do not ascribe the absence of most of the parts I would have wanted to see in the RSP (listed above) to purposeful PR obfuscation, but instead to general risk aversion and a lack of thoughtfulness about specific definitions in public documentation, because specific public definitions didn’t work for evals or the last RSP. My guess is that internal operationalizations of these are being thought about, at least somewhat; I wish they had been more public with this reasoning, or had articulated good reasons for being vague.