I agree with a lot of things here. I think using predictive models as components of larger systems is definitely interesting, but also that there are dangers that end-to-end training might “move goals into the predictive model” in ways that improve performance but worsen understandability / safety.
I disagree that LLMs should be thought of as internally human-like (rather than just producing human-like outputs). This is relevant mainly for a few things: how likely LLMs are to make mistakes a human would never make, how we should expect LLMs to generalize to far-outside-distribution tasks, and how well the safety properties of LLMs should hold up under optimization pressure on some function of the output.
Strategically, I don’t really endorse building human-ish-level AGI even if it’s mostly harmless. I think it doesn’t really change the gameboard in positive ways—unless you’re using it to take really drastic dystopian actions, both people who are working on safe superintelligent AI and people who are working on dangerous superintelligent AI can use your AGI to help them, which seems bad unless you expect your AGI to be differentially better at alignment research relative to capabilities research.
Thank you for this comment!
I first want to note that your comment implies my post is saying things which I don’t think it does (or at least I didn’t mean it to):
- I didn’t argue that language models are internally human-like, only externally. I do think, however, that compositional AI systems built from language models should be internally human-like.
- I didn’t advocate for training AGI end-to-end (but rather taking the factored cognition approach).
I agree with you that a human-ish-level AGI would be VERY dangerous regardless of how aligned it is (mostly because there are 8 billion people on this planet, and it’s sufficient for just a few of them to use it in stupid/evil ways to put us all in danger).