I still don’t think this makes sense. Or I think most of what you say makes sense but don’t see the relevance.
I agree the chatbot training exerts influence.
My point is that the human billionaire mind and the “hands over galaxies” mind are both very specific kinds of minds. I don’t think you’ll get either with current techniques, but you definitely don’t get them without even aiming for them.* And right now we’re aiming for the hands-over-galaxies one, and not the billionaire one.@
*ironically, the only argument I can see for the billionaire mind is that despite the chatbot tuning, the model defaults to some kind of human prior it’s established from pretraining and that this generalises in a sane way.
@with some very minor exceptions. Eg Claude’s Soul doc has some stuff about not tolerating people disrespecting it etc.
This is kind of what I thought as well.
I mean, like, I’m curious about how “chaotic” the residual stream is.
Like, the reason the LoRA patch seemed promising to me initially was that I thought of the residual stream as very chaotic: if we train a normal LoRA, it will instantly cause the model to have very divergent thoughts from the untrained model.
But if this is not true, then maybe it doesn’t matter. Because if having access to the unadulterated thoughts of the base model* were advantageous, the LoRA weights could just learn not to mess them up (and this is not hard for them to do).
* (a model that has been subject to pretraining, instruct tuning and RLHF, and maybe several other stages of training, but has not been subject to our additional very specific SP training. I feel there’s a need for a new word here.)
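On the “not hard for it to do” point: a minimal numpy sketch of the standard LoRA parameterisation (all names and dimensions here are illustrative, not from any real model) shows why leaving the base model’s computation untouched is the *default* for a LoRA patch, not something the weights have to fight for. With the usual zero-initialisation of the B factor, the patch is exactly a no-op at the start of training, and it only diverges from the base model as far as the training signal actually pushes it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # illustrative hidden size and LoRA rank

# Frozen weight of one layer, standing in for the base (pre-SP-training) model.
W = rng.standard_normal((d, d))

# LoRA factors: A gets a random init, B is zero-initialised (the standard
# scheme), so the low-rank update B @ A is exactly zero at the start.
A = rng.standard_normal((r, d))
B = np.zeros((d, r))

x = rng.standard_normal(d)
base_out = (W @ x)            # base model's contribution to the residual stream
patched_out = (W + B @ A) @ x  # LoRA-patched layer's contribution

# At init, the patched layer's output is identical to the base layer's.
assert np.allclose(base_out, patched_out)
```

So “learn to not mess up the base model’s thoughts” just means keeping B @ A small on the directions that matter, which gradient descent is free to do.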