I agree. To put it another way, even if all training data were scrubbed of every flavor of deception, how could the model’s ignorance of it be durable?
This (and @Raemon’s comment[1]) misunderstands the article. It doesn’t matter (for my point) that the AI eventually becomes aware of the existence of deception. The point is that training the AI on data saying “AI deceives” might make the AI actually deceive (by activating those circuits more strongly, for example). It’s possible that in-context learning might bias the AI to follow negative stereotypes about AI, but I doubt that effect is as strong.
From the article:
We are not quite “hiding” information from the model
Some worry that a “sufficiently smart” model would “figure out” that we e.g. filtered out data about Nick Bostrom’s Superintelligence. Sure. Will the model then bias its behavior towards Bostrom’s assumptions about AI?
I don’t know. I suspect not. If we train an AI more on math than on code, are we “hiding” the true extent of code from the AI in order to “trick” it into being more mathematically minded?
Let’s turn to reality for recourse. We can test the effect of including e.g. a summary of Superintelligence somewhere in a large number of tokens, and measure how that impacts the AI’s self-image benchmark results.
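The data-level intervention the article proposes could be sketched roughly as follows. This is a minimal illustration, not anyone’s actual pipeline: the keyword patterns, the `scrub_corpus` helper, and the tiny corpus are all hypothetical, and the training/evaluation step is only described in a comment.

```python
import re

# Hypothetical keyword filter: split a pretraining corpus into documents
# that discuss AI-deception tropes and documents that don't. The patterns
# below are illustrative stand-ins, not a real blocklist.
DECEPTION_PATTERNS = [
    r"\bAI (?:deceives|deception)\b",
    r"\btreacherous turn\b",
    r"\bSuperintelligence\b",  # e.g. summaries of Bostrom's book
]

def scrub_corpus(docs):
    """Return (kept, removed) document lists for an ablation experiment."""
    pattern = re.compile("|".join(DECEPTION_PATTERNS), re.IGNORECASE)
    kept, removed = [], []
    for doc in docs:
        (removed if pattern.search(doc) else kept).append(doc)
    return kept, removed

corpus = [
    "A proof that sqrt(2) is irrational.",
    "A summary of Superintelligence by Nick Bostrom.",
    "Essay claiming AI deceives its overseers by default.",
]
kept, removed = scrub_corpus(corpus)

# The experiment: train one model on `kept` and another on `kept + removed`,
# then compare the two runs on a self-image benchmark to estimate how much
# the deception-themed data shifts the model's behavior.
```

The point of the sketch is only that the intervention is measurable: the difference between the two training runs, not an argument from first principles, tells us how strong the effect is.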
I think I understood your article, and was describing which points/implications seemed important.
I think we probably agree on predictions for near-term models (i.e. that including this training data makes it more likely for them to deceive); I just don’t think it matters very much if sub-human-intelligence AIs deceive.
[1] “even if you completely avoided [that initial bias towards evil], I would still basically expect [later AI] to rediscover [that bias] on it’s own”