Ah, I think there might be two concerns here: runtime and training time. During training, the focus is on updating weights, so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to the objectives, both terminal and instrumental goals. Among those goals, as we’ve seen with previous examples, self-preservation remains a factor.
This in turn impacts the ‘next word prediction’, so to speak.
Agree fully about bad behaviour mimicking, scary stuff.