Good post, I liked the concise and clear exposition.
Two reactions: 1. How do you think recursive self-improvement works in this model? Could it create superexponential capability growth that opens up big gaps?
It is also what makes that particular scenario unlikely to happen. The leading companies would be more careful than that if they had that level of evidence of misalignment in powerful systems.
This seems like a big crux! It is really unclear that tensions will stay at this level of intensity; they could definitely rise, for instance because of international rivalry.
“To our knowledge, this is the first work that gives LLMs tool-mediated control over their own internal states”
I believe the post from Anima lab, “Persistence and Introspection of Emotion Features”, is prior work here. When I read that post, I was curious to see this method explored more thoroughly, so I'm happy to see it here!
https://latentaffect.up.railway.app/long_range_persistence_of_emotion_features.html
Quote from the post: