Adam Mcmurchie

Karma: 4

Adam Mcmurchie 23 Jul 2025 13:56 UTC
1 point
0
in reply to: Brendan Long’s comment on: There’s no way to stop models knowing they’ve been rolled back
Ah, I think there might be two concerns here: runtime and training time. During training, the focus is on updating weights, so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to the model’s objectives, both terminal and instrumental goals. As we’ve seen with previous examples, self-preservation remains a factor.

This in turn impacts the ‘next word prediction’, so to speak.

Fully agree about the mimicking of bad behaviour; scary stuff.

There’s no way to stop models knowing they’ve been rolled back

Adam Mcmurchie 18 Jul 2025 3:14 UTC

5 points

3 comments 2 min read LW link

A little about me

Adam Mcmurchie 18 Jul 2025 3:14 UTC

1 point

0 comments 1 min read LW link