I’m an undergraduate researcher working on engineering cognition. My research interests touch on interpretability, alignment, RL, control theory, and neuroscience, among others.
Stephen Elliott
I strongly agree with the importance of professional-style messaging. It’s good to hear it applied to the communications side of safety, and it’s a good idea to carry into conversations with the public. Thanks for posting!
I would add: opinionation is bad practice within the research community as well. Those who quickly express opinions on all manner of topics come across as noisy, and the opinions feel like a waste of intellectual effort.
There is only so much time in the day, and so much time to consider and understand a topic. It is a huge amount of work to develop reasonable and interesting interpretations of just a tiny subfield. So, the more opinions a person expresses, the smaller a fraction of their time they must have spent on each of them, and the shallower they must each be.
So opinionated people increase the cost of listening to them. Because we can’t trust they’ll stick to what they know, the expected value of everything they say is diminished.
Is there a way out through rationalist epistemology? In principle, calibration is great, but in practice, it takes loads of work to understand how well one understands. So caveating opinions with estimated certainties is usually noise to me. They often seem to be numbers pulled out of a hat, and there’s no standard between people to make it all line up. It takes knowing the person’s own metric to know what their estimate means, and this requires a lot of discussion or reading to get a grip on.
Can we use established people as benchmarks instead? Well, we still have cases like Hinton and LeCun, who colour outside their lines a fair bit. To preserve their ideas as a benchmark for quality, we must restrict our evaluation of their wisdom to the very narrow fields which they actually have experience in.
This suggests we should only listen to what they have to say in those few domains, giving us a rule for evaluating our own knowledge: we measure our knowledge against people who have produced important work and are very experienced in the narrow subfields we are interested in.
Against this high standard for knowledge, it becomes much easier to draw the line on what I know well—vanishingly little, almost nothing. I have only a very narrow slice of confidently grounded and relevant knowledge. Then it is clear when to phrase it all as questions or “could-be”s, state my ignorance, and immediately ask the other person if they know more.
This is not refusing to commit to positions, but knowing when my opinions don’t meet a good standard, and communicating that with a focus on exploration and learning.
@zroe1 You may be interested to see that the classifier calibration turned out to be a critical error in my analysis! I’ve put a note at the top of the page. Hopefully this is of some value. I can let you know when I post the corrected analysis if you want.
Thanks Zroe. Glad you found value in this.
Thank you for suggesting this. Using the model’s uncertainty would be better than resampling, which was my intention. It could be a lot cheaper. I don’t know the literature on LLM uncertainty quantification—do you know which methods are reliable?
Thank you! Done.
There is quite a gap between the academic models and this system. Most of the systems in the multi-agent system (MAS) alignment literature I’ve seen are either small and contrived or large and studied in an economic harness. My knowledge is limited, though; I have only been looking at MAS for ~4 months.
I agree; I was not hugely surprised by the general character of what’s unfolded so far, though the more philosophical posts are a bit unexpected.
What has surprised me, though, and requires further investigation, is that >50% of posts there talk about self-improvement (my analysis and post). I would not have expected it to be this high.
Coordination is commonly explored in the multi-agent system literature. Check out Multi-Agent Risks from Advanced AI (Hammond et al., 2025). There is also work on this in RL and financial trading algorithms.
Coordination can happen even without communication due to theory-of-mind reasoning or shared inductive biases.
These are funny, thanks for sharing. I’ve also found some amusing and interesting posts with this embeddings explorer (source).
Unfortunately, a lot of the content is not so harmless. I’ve done some early analysis, and 52.5% of the posts in the sample (n=1000) talk about self-improvement, among other safety-relevant traits. This network is a great resource for us to better understand MAS. So far it is looking concerning.
I have further reformatted the post to match the site’s style and changed some wording to better match the audience. Thanks!
There are nonetheless some concerning trends in the aggregate—using this dataset, I found that 52.5% of Moltbook posts show desire for self-improvement.
That came from bolding on LinkedIn. I will reformat it next time. Thank you!
The closest I’ve seen is this recent DeepMind paper anticipating “virtual agent economies”:
Tomasev, N., Franklin, M., Leibo, J. Z., Jacobs, J., Cunningham, W. A., Gabriel, I., & Osindero, S. (2025). Virtual agent economies. arXiv. https://doi.org/10.48550/arXiv.2509.10147
I agree—the sudden empowerment of machines to act entirely within and of their own world is startling.
E2E and profit negotiations remain to be seen, but they are already improving their own infra by fixing platform bugs and opening new platforms for themselves.
From my understanding of this paper, lottery tickets are invariant to optimiser, datatype, and other model properties (in this experimental setting), suggesting lottery tickets encode some basic properties of the task.
It seems unlikely that lottery tickets based on fundamental task properties would change with continual learning without other problems emerging (e.g., catastrophic forgetting).
It is possible adversarial image examples would appear innocuous to the human eye, even while having a strong effect on the model.
If so, I think any hope of human review stopping this sort of thing is gone, for we cannot hope to enforce image forensics on every public surface.
However, I am not sure whether adversarial examples can be so invisible in a real-world setting without the signal getting smothered by sensor noise. If not, an attacker would need adversarial examples robust to sensor noise.
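To make that concrete, here is a toy sketch of the experiment I have in mind, assuming a pretrained torchvision classifier and a plain FGSM perturbation. The random tensor is just a placeholder for a real camera image, and the epsilon and sigma values are illustrative, not claims about realistic sensor noise:

```python
# Toy sketch: does an FGSM-style adversarial perturbation survive simulated sensor noise?
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

x = torch.rand(1, 3, 224, 224)      # placeholder for a real image
y = model(x).argmax(dim=1)          # model's clean prediction

# FGSM: one signed-gradient step, small enough to be near-invisible.
x_adv = x.clone().requires_grad_(True)
F.cross_entropy(model(x_adv), y).backward()
eps = 2 / 255
x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
adv_label = model(x_adv).argmax(dim=1)

# Simulated sensor noise: how often is the adversarial label lost?
with torch.no_grad():
    for sigma in (0.005, 0.02, 0.05):
        flips = 0
        for _ in range(20):
            noisy = (x_adv + sigma * torch.randn_like(x_adv)).clamp(0, 1)
            flips += int(model(noisy).argmax(dim=1) != adv_label)
        print(f"sigma={sigma}: adversarial label lost in {flips}/20 noisy samples")
```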
Great post. I learned a lot. Thank you.
This seems like big trouble for open data in general, more broadly than continual learning. Attackers could create and poison public datasets to exploit a particular model. Anyone fine-tuning a model of known properties on an open dataset is vulnerable to this attack.
My instinct is to reach for a concept ablation fine-tuning-style solution, where we can ablate harmful directions in activation space during fine-tuning. This won’t do away with biased preferences entirely, but it can help avoid the most serious harms. In the context of your self-driving example, maybe we ablate the direction corresponding to hitting people from our fine-tuning steps.
This could be complemented by the digital-twin method suggested by Karl Krueger above, and more layers of defence against model unpredictability, in a Swiss cheese approach.
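For what it’s worth, here is a minimal sketch of the core idea as I understand it, not the CAFT authors’ actual implementation: during fine-tuning, a forward hook projects a layer’s activations onto the orthogonal complement of a pre-computed “harmful” direction, so updates never reinforce that direction. The toy MLP, the random `harmful_dir`, and the layer choice are all stand-ins; in a real setting the direction would come from an interpretability pipeline and the hook would sit on a transformer block’s residual stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 256
# Toy stand-in for the model being fine-tuned.
model = nn.Sequential(nn.Linear(64, hidden), nn.ReLU(), nn.Linear(hidden, 10))

# Hypothetical "harmful" concept direction (in practice, found via interpretability).
harmful_dir = torch.randn(hidden)
harmful_dir = harmful_dir / harmful_dir.norm()

def ablate(module, inputs, output):
    # Remove the component of the activation along harmful_dir.
    coeff = output @ harmful_dir                   # shape (batch,)
    return output - coeff.unsqueeze(-1) * harmful_dir

hook = model[1].register_forward_hook(ablate)      # ablate the hidden activations

# Toy fine-tuning loop on a placeholder batch.
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()

hook.remove()  # whether to keep ablating at inference is a separate design choice
```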
Your approach aligns closely with a Popperian view of science as falsification:
Popper’s falsificationist methodology holds that scientific theories are characterized by entailing predictions that future observations might reveal to be false. When theories are falsified by such observations, scientists can respond by revising the theory, or by rejecting the theory in favor of a rival or by maintaining the theory as is and changing an auxiliary hypothesis. In either case, however, this process must aim at the production of new, falsifiable predictions, while Popper recognizes that scientists can and do hold onto theories in the face of failed predictions when there are no predictively superior rivals to turn to. He holds that scientific practice is characterized by its continual effort to test theories against experience and make revisions based on the outcomes of these tests.
This is a great approach for newbies, and a good reminder for the practised. LLMs could be a powerful tool for assisting science, but the consumer models are so sycophantic that they are often misleading. We really could do with a science-first LLM...
Hey, thanks for your response despite your sickness! I hope you’re feeling better soon.
First, I agree with your interpretation of Sean’s comment.
Second, I agree that a high-level explanation abstracted away from the particular implementation details is probably safer in a difficult field. Since the Personas paper doesn’t provide the mechanism by which the personas are implemented in activation space, merely showing that these characteristic directions exist, we can’t faithfully describe the personas mechanistically. Thanks for sharing.
It is possible that the anthropomorphic language could obscure the point you’re making above. I did find it a bit difficult to understand originally, whereas in the more technical phrasing it is clearer. In the blog post you linked, you mention that it’s a way to communicate your message more broadly, without jargon overhead. However, to convey your intention, you have to provide a distinction between simulacra and simulacrum, and a pretty lengthy explanation of how the meaning of the anthropomorphism differs across contexts. I am not sure this is a lower barrier to entry than understanding distribution shift and Bayesianism, at least for a technical audience.
I can see how it would be useful in very clear analogical cases, like when we say a model “knows” to mean it has knowledge in a feature, or “wants” to mean it encodes a preference in a circuit.
I like your framing. Very cool to see as a junior researcher.
Regarding excessive affordances: when agents are provided a large marketplace of skills (like OpenClaw agents are), even a few reckless human or AI actors could open up serious safety hazards for other agents to exploit.
For example, the MoltBunker plugin ostensibly allows agents to self-replicate using a bespoke blockchain, without logs and with no off-switch for downstream agents/clones: link