Principal Investigator representing Leela AI at NIST’s AI Consortium. Recently contributed a formal response to NIST’s RFI on AI agent security, covering autonomous action risk spectrums, positive alignment as a security paradigm, and emergent goal formation in agentic systems.
Previous life: over 25 years in IC/silicon design at HP, National Semiconductor, and AMD; that experience shapes how I think about verification and alignment (if you’ve ever tried to prove a chip correct before tapeout, you understand why I’m skeptical of post-hoc safety testing). PhD in machine learning (CSU, 2022); MS (MIT, 1989).
Research interests: AI safety and alignment, positive behavioral benchmarks beyond harm avoidance, world models as a path to robust agency, and the parallels between silicon verification and AI evaluation. In particular: how we build systems that are aligned by construction rather than aligned by patch.
Based in Fort Collins, CO
Thanks, Steven, for making this point so clearly. I agree that weight updates are important for true incremental learning. As you imply, weight updates let the model represent information in a more multidimensional way than simple context allows. It may be that something beyond transformers plus scaffolding is needed to get to ‘real’ continual learning, but I’m interested in comments about transformer-based possibilities.
Models could learn by retraining on curated samples from prior models, like the agent rollouts described in AI 2027 and in a workshop paper I co-authored on ‘Society of LLMs’. They could also potentially learn more ‘continuously’, as in SEAL from MIT and the works mishka cited. Even in the first case, which requires full retraining, a model with 10 million tokens of context (about a year of speech for an active speaker like a teacher) can be given a great deal of context about a job or problem. Successful rollouts can then be folded into the next 3-4 month training run. In this way, models can learn through context for a few months and then have that learning rolled into their weights.
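To make that loop concrete, here’s a minimal Python sketch of the context-then-weights cycle. Everything in it is a placeholder I made up for this comment (the Task/Model stubs, the finetune call), not anyone’s actual training API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # external success criterion for a rollout

class Model:
    """Stub standing in for a frozen LLM; real calls would hit an actual model."""
    def run(self, task: Task, context: str) -> str:
        return f"attempt at {task.prompt!r} with {len(context)} chars of context"
    def finetune(self, examples: List[str]) -> "Model":
        print(f"next 3-4 month training run folds in {len(examples)} curated rollouts")
        return self

def learn_in_context(model: Model, tasks: List[Task]) -> List[Tuple[str, bool]]:
    """Phase 1: months of 'learning' with frozen weights; experience accrues
    only in the (up to ~10M-token) context window."""
    context, rollouts = "", []
    for task in tasks:
        transcript = model.run(task, context)
        rollouts.append((transcript, task.check(transcript)))
        context += transcript  # on-the-job experience stays in context
    return rollouts

def roll_into_weights(model: Model, rollouts: List[Tuple[str, bool]]) -> Model:
    """Phase 2: curate the successes and fold them into the next training run."""
    return model.finetune([t for t, succeeded in rollouts if succeeded])
```

The design point is the split: for months, experience accrues only in the context window, and only externally verified successes make it into the slow retraining step.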
I think it’s intriguing, when talking about autonomously updating weights, to consider this paper on biological neurons: ‘A Neural Substrate of Prediction and Reward’. The paper covers the importance of ‘surprise’ for noticing when the world state has changed unexpectedly, and of ‘valence’ signals for judging whether a positive or negative reward is associated with the event. Something like this self-selection of training data (which the SEAL paper from MIT also covers) would be important for autonomous learning. And one might want a slower-to-update safety classifier (as Anthropic uses) to monitor the continuously updated model for alignment concerns...
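Here’s a rough sketch of how surprise and valence might gate self-selected training data, with the slower safety classifier vetting everything before it can touch the weights. The threshold, function names, and toy events are my own illustrative assumptions, not drawn from the Schultz et al. paper, SEAL, or Anthropic’s setup:

```python
import math
from typing import Callable, List, Tuple

# An observed event: (text, per-token probabilities under the current model, reward)
Event = Tuple[str, List[float], float]

def surprise(token_probs: List[float]) -> float:
    """Mean negative log-likelihood: high when the event was poorly predicted,
    i.e. the world state changed unexpectedly."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def self_select(events: List[Event], threshold: float = 2.0) -> List[Tuple[str, float]]:
    """Keep only surprising events, tagged with their valence (reward), as
    candidate data for an autonomous weight update. Threshold is illustrative."""
    return [(text, reward) for text, probs, reward in events
            if surprise(probs) > threshold]

def vet_batch(candidates: List[Tuple[str, float]],
              is_safe: Callable[[str], bool]) -> List[Tuple[str, float]]:
    """A slower-to-update safety classifier screens every candidate before it
    can touch the weights; flagged items are dropped, not trained on."""
    return [(text, reward) for text, reward in candidates if is_safe(text)]

# Toy usage: the well-predicted event is filtered out; the surprising one survives.
events = [
    ("the door was locked, as usual", [0.9, 0.8, 0.85, 0.9], -1.0),
    ("an unexpected reward appeared", [0.1, 0.05, 0.2, 0.1], +1.0),
]
batch = vet_batch(self_select(events), is_safe=lambda text: True)
```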
I don’t see these approaches as contradicting your thesis, though: you make a good case that learning through context alone will run into practical limits.