Buck comments on Oversight Misses 100% of Thoughts The AI Does Not Think

Buck 12 Oct 2022 14:45 UTC
LW: 3 AF: 2
1
AF
Is your last comment saying that you simply don’t think it’s very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?
No, it seems very likely for the model to not say that it’s deceptive, I’m just saying that the model seems pretty likely to think about being deceptive. This doesn’t help unless you’re using interpretability or some other strategy to evaluate the model’s deceptiveness without relying on noticing deception in its outputs.