I’m relatively optimistic about alignment progress, but I don’t think “current work to get LLMs to be more helpful and less harmful doesn’t help much with reducing P(doom)” depends that much on assuming unmodified homunculi. Even if you put much less than 100% on this sort of strong inner-optimizer/homunculi view, I think it’s still plausible that this work doesn’t reduce doom much.
For instance, consider the following views:
1. Current work to get LLMs to be more helpful and less harmful will happen by default due to commercial incentives, so subsidizing it isn’t very important.
2. In worlds where that sort of work is basically sufficient, we’re basically fine.
3. But it’s ex-ante plausible that deceptive alignment will emerge naturally and be very hard to measure, notice, or train out, and this is where almost all alignment-related doom comes from.
4. So current work to get LLMs to be more helpful and less harmful doesn’t reduce doom much.
In practice, I personally don’t fully agree with any of these views.
For instance, deceptive alignment that is very hard to train out using basic means isn’t the source of >80% of my doom.
I have miscellaneous other takes on which safety work now is good vs. useless, but whether the work involves feedback/approval or RLHF isn’t much of a signal either way.
(If anything, I get somewhat annoyed by people not comparing to baselines without principled reasons for not doing so, e.g., inventing new ways of doing training without comparing to normal training.)