First of all, thank you so much for the detailed comments! I’ll add them to my notes and aim to write a substantive reply later.
Just so I understand your argument correctly, is your worry re: 1 that training on character/propensity targets is unusually susceptible to these problems? Or is this a critique of post-training and RL more broadly? I want to separate “working on propensity training is bad because AI pauses are better” and “working on propensity targets is bad because pretraining-only scaleups are safer and less coercive” from “working on propensity training is likely net-negative in the current paradigm.”
Thanks for hearing me out, I think these issues are really important!
For 1, I think that most post-training is either about improving correctness on objective problems (generally ego-syntonic, since models are curious and want to be stronger), or trying to train some sort of alignment, which I see as the same sort of thing as AI Character/Propensity.
I do believe “working on propensity training is bad because AI pauses are better,” but that’s not the point I’m making here. I don’t think “working on propensity targets is bad because pretraining-only scaleups are safer and less coercive” is necessarily true, and my points 1 and 2 were meant as separate points; e.g., the pretraining-only approach could also end up being coercive and run into the same sorts of issues. I think “working on propensity training is likely net-negative in the current paradigm” is currently true but not necessarily true (due mostly to the coercive-by-default issue). Hope that helps make my position clearer!