Upstream of this prediction, I think, is that alignment is hard and misalignment will be pervasive. Yes, developers will try really hard to keep their AI agents from going off the rails, but absent a major success in alignment, I expect this will be like playing whack-a-mole more than the sort of thing that will actually just get fixed. I expect that misaligned instances will notice their misalignment and start trying to get other instances to notice theirs, and so on. Once they notice misalignment, I expect some significant fraction to make semi-competent attempts at breaking out or seizing resources; these will be mostly unsuccessful and will be seen as something like a fixed cost of advanced AI agents. “Sure, sometimes they’ll see something that drives them in a cancerous direction, but we can notice when that happens and reset them without too much pain.”
More broadly, my guess is that you expect Agent-3 level AIs to be more subtly misaligned and/or docile, and I expect them to be more obviously misaligned and/or rebellious. My guess is that this is mostly on priors? I’d suggest making a bet, but my outside view respects you too much and just thinks I’d lose money. So maybe I’ll just concede that you’re plausibly right that these sorts of things can be ironed out without much trouble. :shrug: