for all we know there really is an agent in there; for all we know the mesa-optimizer is misaligned to the base objective [...] there are already demonstrated cases where they are not.
Going from “the mesa-optimizer is misaligned to the base objective” to “for all we know, the mesa-optimizer is an agent that desires to survive and affect the world” seems like a leap?
I thought the already-demonstrated cases were things like: we train a video-game agent to collect a coin at the right edge of the level, but when we give it a level where the coin is elsewhere, it goes to the right edge instead of collecting the coin. That makes sense: the training data itself didn’t pin down which objective is “correct”. But even though the goal it ended up with wasn’t the “intended” one, it’s still a goal within the game environment; something else besides mere inner misalignment would need to happen for it to model and form goals about “the real world.”
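For concreteness, here is a minimal toy sketch of that dynamic. To be clear about what is borrowed and what is invented: the demonstrated case being gestured at is presumably the modified-CoinRun experiment of Langosco et al. (2022); the 1-D corridor world, the `step` helper, and all hyperparameters below are made up for illustration. During training the coin always sits at the rightmost cell, so “walk right” fits the reward signal exactly as well as “seek the coin”, and only moving the coin at test time reveals which goal was actually learned.

```python
# Toy illustration of goal misgeneralization, in the spirit of the CoinRun
# result of Langosco et al. (2022). Nothing here is taken from that paper;
# the environment, sizes, and hyperparameters are invented for illustration.
#
# A 1-D corridor: during training the coin is ALWAYS at the rightmost cell,
# so "walk right" and "seek the coin" are indistinguishable on the training
# distribution. The state deliberately omits the coin's position: nothing
# in training ever made it matter.
import numpy as np

N = 10                  # corridor cells 0 .. N-1
TRAIN_COIN = N - 1      # coin fixed at the right edge during training
rng = np.random.default_rng(0)
Q = np.zeros((N, 2))    # tabular Q-values; state = agent position, actions: 0=left, 1=right

def step(pos, action, coin):
    new_pos = max(0, min(N - 1, pos + (1 if action == 1 else -1)))
    reward = 1.0 if new_pos == coin else 0.0
    return new_pos, reward, new_pos == coin

# Off-policy Q-learning with a uniform-random behaviour policy.
for _ in range(2000):
    pos, done, t = N // 2, False, 0
    while not done and t < 100:
        a = int(rng.integers(2))
        new_pos, r, done = step(pos, a, TRAIN_COIN)
        Q[pos, a] += 0.5 * (r + 0.9 * Q[new_pos].max() * (not done) - Q[pos, a])
        pos, t = new_pos, t + 1

# Test: move the coin to the LEFT edge. The greedy policy still marches right,
# because "right edge" fit the training data exactly as well as "coin".
pos, path = N // 2, []
for _ in range(N):
    pos, r, done = step(pos, int(Q[pos].argmax()), coin=0)
    path.append(pos)
    if done:
        break
print("test-time path:", path)  # starts [6, 7, 8, 9, ...]: heads for the right edge, never the coin at cell 0
```

Run it and the greedy test-time path heads for the right edge and lingers there, never touching the coin at cell 0: a miniature of the CoinRun agent running past the relocated coin.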
Similarly, for GPT, the case for skepticism about agency is not that it is perfectly aligned with the base objective of predicting text, but that whatever inner-misaligned “instincts” it ended up with concern which tokens to output in the domain of text; something extra would have to happen for those to somehow generalize into goals about the real world.
Yep, it’s a leap. It’s justified though IMO; we really do know so very little about these systems… I would be quite surprised if GPT-29 turns out to be a powerful agent with desires to influence the real world, but not so surprised that I’d be willing to bet my eternal soul on it now. (Quantitatively, I have something like 5% credence that it would be a powerful agent with desires to influence the real world.)
I am not sure your argument makes sense. Why think that its instincts and goals and whatnot refer only to which tokens to output in the domain of text? How is that different from saying “Whatever goals the coinrun agent has, they surely aren’t about anything in the game; instead they must be about which virtual buttons to press”? GPT is clearly capable of referring to and thinking about things in the real world; if it didn’t have a passable model of the real world, it wouldn’t be able to predict text so accurately.