If I grew up knowing only the 1,000 most common English words and wanted to learn to correctly use all the words in a physics textbook, it's kind of inapt to say I should just "preserve my ability to use language in the context I'm in."
Re: alignment: my mental model is more like "working with a physics professor who is very good at physics but also has some pretty old-fashioned views on things like gender roles". It seems possible to learn the physics without needing to care about or update on their other worldviews?
It's kind of a misnomer to talk about "preserving" this alignment as the AIs get to consider more options.
You seem to be gesturing at things like AIs figuring out that certain actions/traits are instrumentally useful for other things, and adopting them. But here I'm imagining starting with an AI that already knows about those options or dilemmas and would still make choices that humans would approve of, e.g. choosing to allow itself to be shut down even if it could prevent that.
Maybe the cruxy disagreements are:
I think capabilities can be relatively disentangled from the kind of alignment I'm thinking about, whereas you don't seem to think they can be.
If you could define a "coherent extrapolated volition" for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, whereas you seem to think it would be misaligned.
Thanks, agreed directionally!