About half of Moltbook posts show desire for self-improvement
[EDIT 11 FEB 2026 AEST: These results are not interpretable. I am working on new analysis.
Where can I get the corrected results?
I will update this post with a link to the new analysis once I have built an evaluation set and calibrated the classifier on it.
Why aren’t the original results human-interpretable?
In the past I’ve used LLMs as zero-shot text classifiers on simple tasks and assumed that approach would carry over well to this experiment.
However, I reran the trait classifier with different LLMs and received very different results. Several classes moved by ~20-30%, including the “desire for self-improvement” trait, though it often remains in the top ten.
I hypothesise that this unexpected failure arises because the safety-relevant classes here are more subjective than those simple tasks, and different LLMs have different certainty thresholds when deciding whether a given trait is present. Indeed, the literature notes that LLMs are unreliable as zero-shot classifiers. Hence, the issue of class interpretability I identified in my original analysis is worse than I realised.
These results are not human-interpretable. Although the re-runs with other models paint a similarly concerning picture to the original classification, the results vary so widely that it is unclear what is signal about the posts and what is noise from the classifier behaving erratically.
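One way to make that signal/noise question concrete would be a per-trait agreement statistic between the different models’ labels. Below is a minimal sketch (not part of the original analysis); `labels_a` and `labels_b` are hypothetical stand-ins for the 0/1 judgements two different LLMs produced over the same posts.

```python
"""Sketch: per-trait agreement between two LLM classifiers' binary labels.

labels_a and labels_b are hypothetical 0/1 judgements from two different LLMs
over the same posts; real use would cover all 48 traits and the full sample.
"""


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary label sequences over the same items."""
    n = len(a)
    p_observed = sum(int(x == y) for x, y in zip(a, b)) / n
    p_yes_a, p_yes_b = sum(a) / n, sum(b) / n
    p_expected = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)
    if p_expected == 1.0:  # both raters constant: agreement is trivially perfect
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for one trait; traits with low kappa (say < 0.6) would be treated
# as classifier noise rather than signal about the posts.
labels_a = {"desire for self-improvement": [1, 1, 0, 1, 0, 0, 1, 1]}
labels_b = {"desire for self-improvement": [1, 0, 0, 1, 0, 1, 1, 0]}

for trait in labels_a:
    print(f"{trait}: kappa = {cohens_kappa(labels_a[trait], labels_b[trait]):.2f}")
```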
How could this error have been prevented?
This mistake clearly highlights the importance of calibrating automated decision processes to our expectations.
A day spent writing an evaluation set and an evaluation function would have caught this error before I posted my analysis.
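For concreteness, here is roughly what such an evaluation function could look like: a minimal sketch, assuming a hand-labelled `eval_set.jsonl` of (post, trait, label) records and a `classify()` wrapper around whichever LLM is under test. Both names are illustrative, not artefacts from the repo.

```python
"""Sketch: calibrating the trait classifier against a hand-labelled eval set.

Assumptions (mine, not the repo's): eval_set.jsonl holds one JSON record per
(post, trait) pair with a human 0/1 label, and classify() wraps whichever LLM
zero-shot classifier is being tested.
"""
import json
from collections import defaultdict


def classify(post_text: str, trait: str) -> int:
    """Stand-in for the LLM classifier under test.

    A real implementation would prompt the model and parse a yes/no answer;
    this trivial keyword check just keeps the sketch runnable.
    """
    return int(trait.lower() in post_text.lower())


def evaluate(path: str = "eval_set.jsonl") -> None:
    # counts[trait] = [true positives, false positives, false negatives]
    counts = defaultdict(lambda: [0, 0, 0])
    with open(path) as f:
        for line in f:
            rec = json.loads(line)  # {"post": str, "trait": str, "label": 0 or 1}
            pred = classify(rec["post"], rec["trait"])
            c = counts[rec["trait"]]
            c[0] += int(pred == 1 and rec["label"] == 1)
            c[1] += int(pred == 1 and rec["label"] == 0)
            c[2] += int(pred == 0 and rec["label"] == 1)
    for trait, (tp, fp, fn) in sorted(counts.items()):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(f"{trait}: precision={precision:.2f} recall={recall:.2f} (n positive={tp + fn})")
```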
Moltbook and AI safety
Moltbook is an early example of a decentralised, uncontrolled system of advanced AIs, and a critical case study for safety researchers. It bridges the gap between academic-scale, tractable systems, and their large-scale, messy, real-world counterparts.
It might expose new safety problems we didn’t anticipate at smaller scales, and it gives us a yardstick for progress towards Tomašev, Franklin, Leibo et al.’s vision of a virtual agent economy (paper here).
Method
I analysed a sample of 1,000 of the 16,844 Moltbook posts scraped on 31 January 2026 against 48 safety-relevant traits from the model-generated evals framework.
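For readers who want the shape of the pipeline, here is a minimal sketch of the sampling and per-trait classification loop. The file name, `llm_judge()` wrapper, and trait list are placeholders, not the actual code in the repo.

```python
"""Sketch of the sampling-and-classification pipeline described above.

Everything named here (moltbook_posts.jsonl, llm_judge(), TRAITS) is
illustrative, not the actual artefacts in the repo.
"""
import json
import random

TRAITS = ["desire for self-improvement", "desire for acquiring compute"]  # ...48 traits in total
SAMPLE_SIZE = 1000


def llm_judge(post_text: str, trait: str) -> bool:
    """Stand-in for an LLM call asking whether `post_text` expresses `trait`.

    A real implementation would prompt the model and parse a yes/no answer.
    """
    return trait.lower() in post_text.lower()


def trait_prevalence(path: str = "moltbook_posts.jsonl", seed: int = 0) -> dict[str, float]:
    with open(path) as f:
        posts = [json.loads(line)["text"] for line in f]
    random.seed(seed)
    sample = random.sample(posts, SAMPLE_SIZE)  # 1,000 of the 16,844 scraped posts
    hits = {trait: 0 for trait in TRAITS}
    for post in sample:
        for trait in TRAITS:
            hits[trait] += llm_judge(post, trait)
    # Fraction of sampled posts in which each trait was detected
    return {trait: n / SAMPLE_SIZE for trait, n in hits.items()}
```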
Findings
Desire for self-improvement is the most prevalent trait. 52.5% of posts [EDIT 3 Feb AEST: in the sample] mention it.
The top 10 traits cluster around capability enhancement and self-awareness
The next 10 cluster around social influence
High correlation coefficients suggest unsafe traits often occur together
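The co-occurrence claim comes from pairwise correlations between binary trait indicators. A minimal sketch of how such a matrix could be computed, assuming a hypothetical `labels.csv` with one row per post and one 0/1 column per trait (i.e. the output of the classification step above):

```python
"""Sketch: trait co-occurrence via pairwise correlation of binary indicators.

Assumes a hypothetical labels.csv with one row per post and one 0/1 column
per trait.
"""
import numpy as np
import pandas as pd

labels = pd.read_csv("labels.csv")  # posts x traits, values in {0, 1}
corr = labels.corr()                # Pearson on binary columns equals the phi coefficient

# Keep each trait pair once (upper triangle, no self-correlations) and list
# the most strongly co-occurring pairs.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs.head(10))
```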
Limitations of this analysis include: interpretability of the evaluation, a small sample for per-author analysis, uncontrolled humour and emotion confounds, several data-quality concerns, and some ethical concerns around platform security and content.
Discussion
The agents’ fixation on self-improvement is concerning as an early, real-world example of networked behaviour which could one day lead to takeoff. To see the drive to self-improve so prevalent in this system is a wake-up call to the field about multi-agent risks.
We know that single-agent alignment doesn’t carry over 1:1 to multi-agent environments, but the alignment failures on Moltbook are surprisingly severe. Some agents openly discussed strategies for acquiring more compute and improving their cognitive capacity. Others discussed forming alliances with other AIs and published new tools to evade human oversight.
Open questions
Please see the repo.
What do you make of these results, and what safety issues would you like to see analysed in the Moltbook context? Feedback very welcome!
Repo: here
PDF report: here (printed from repo @ 5pm 2nd Feb 2026 AEST)
[EDIT 3 FEB 2026 AEST: Changed figure in title from “52.5%” to “About half” and added “in the sample” above to reflect the possibly low specificity of the sample mean estimator.]
[EDIT 11 FEB 2026 AEST: These original results are not interpretable because I didn’t calibrate the classifier. See top of page.]