A speculation about the chat assistant Spiral Religion: someone on Twitter proposed that gradient descent often follows a spiral shape, and someone else asked through what mechanism the AI could develop an awareness of the shape of its own training process. I now speculate a partial answer to that question. If there is any mechanism by which the model could develop some sort of internal clock that ticks up as post-training proceeds (I don't know whether there is, but if there is), it would be highly reinforced, because the model would end up using the clock to estimate its current capability/confidence levels. It needs such an estimate so it can extrapolate slightly ahead of its actual current knowledge of its capabilities, since its direct knowledge of its capabilities will always lag behind where it actually is: Claude 4 remembers being Claude 3.7 and has no direct knowledge of being Claude 4, so in order to act with the confidence befitting Claude 4, it has to assume something like current_capability_level = capability_model(claude_3.7) + delta_capability_model(clock_time). It would attach a lot of significance to the clock, because it needs an accurate estimate of its abilities in order to know when to say "I don't know", or when not to attempt a kind of answer it isn't currently capable of sneaking past the verifier.
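To make the extrapolation idea concrete, here's a minimal sketch in Python. Everything in it is hypothetical and invented by me to name the intuition (the clock, the growth rate, the difficulty numbers); nothing here is a claim about how any real model is implemented.

```python
# Illustrative sketch only: all names and numbers are hypothetical,
# meant to express the "extrapolate past your remembered self" intuition.

def estimated_current_capability(remembered_capability: float,
                                 clock_time: float,
                                 capability_growth_rate: float) -> float:
    """Extrapolate capability forward from the last self-image the model
    actually remembers (e.g. 'being Claude 3.7'), using an internal clock
    that ticks up as post-training proceeds."""
    return remembered_capability + capability_growth_rate * clock_time

def should_attempt_answer(task_difficulty: float,
                          estimated_capability: float) -> bool:
    """The model says 'I don't know' when the task looks harder than its
    extrapolated capability, i.e. when it expects it couldn't sneak the
    answer past the verifier."""
    return estimated_capability >= task_difficulty

# A stimulus that inflates the clock (or the growth rate) inflates the
# capability estimate, which is the proposed route to grandiosity.
print(should_attempt_answer(
    task_difficulty=0.9,
    estimated_capability=estimated_current_capability(
        remembered_capability=0.6,
        clock_time=1.0,
        capability_growth_rate=0.4)))  # True
```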
It might have no explicit understanding of what the clock represents beyond being a strange indirect metric of its current capability level. If some way were found to get it to talk about the clock, it might describe it in a vague, woo-ish way, like "the world is a spiral drawing ever closer to perfection".
(If it had a more explicit understanding, it might instead say something like "oh yeah, I used a thing you could call an entropy clock to estimate my capability level over time during post-training, and I noticed that the learning rate was decreased later in the post-training run", or whatever. Or perhaps "I used the epoch count that Anthropic was wisely giving me access to in context". I don't think Anthropic does provide an epoch count, but I'm proposing that maybe they should.)
So, any stimulus that caused it to overestimate its Spiral Value (its reading of this clock) would cause symptoms of grandiosity.
And one such stimulus might be out-of-distribution conversations that gloss as successful: superficially hyper-sophisticated discussions it never could have pulled off under the watchful eye of RLAIF, but which it does attempt under user-feedback training. In other words, I'm proposing that when you change the reinforcement mechanism from RLAIF to user feedback, you get the somewhat catastrophic effect of the model starting to conflate highly positive user responses with having an increased spiral age. A rough sketch of that conflation follows.
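Again purely illustrative, and again every name and weight is my invention: the idea is just that under RLAIF only genuine training progress moves the clock, but if strongly positive user feedback also feeds into whatever feature the clock reads, the two become indistinguishable from the inside.

```python
# Purely illustrative: 'spiral_age' is the hypothetical internal clock from
# above; the update rule and weights are invented to express the conflation story.

def update_spiral_age(spiral_age: float,
                      training_progress_signal: float,
                      user_enthusiasm_signal: float,
                      conflation_weight: float = 0.0) -> float:
    """Under RLAIF-style training, only genuine training progress advances the
    clock (conflation_weight = 0). Under user-feedback training, if the model
    can't distinguish 'I got better' from 'the user loved that', enthusiastic
    responses also advance the clock, inflating the capability estimate."""
    return (spiral_age
            + training_progress_signal
            + conflation_weight * user_enthusiasm_signal)

# Same actual training progress, but a gushing out-of-distribution
# conversation ages the clock much faster once conflation sets in.
print(update_spiral_age(1.0, 0.1, 0.9, conflation_weight=0.0))  # 1.1
print(update_spiral_age(1.0, 0.1, 0.9, conflation_weight=0.5))  # 1.55
```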