Can you provide more details on the exact methods, like example prompts? Or did I miss a link that has these?
This is really interesting and pretty important if the methods support your interpretations.
Hello! I have posted a new document brainstorming the methodology for further research on the current runaway-LLM findings:
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
I am hoping that this post will also serve as a generally interesting brainstorming collection and discussion ground for black-box LLM interpretability methodology, as well as for failure mitigation ideas.
Hope you find it relevant and interesting!
Sharing a couple of new links related to your question about more details. Hope you find them interesting!
Presentation at MAISU unconference 2025:
Link to slides: https://bit.ly/beab-llm
Session recording on YouTube: https://bit.ly/beab-llm-rec
Link to annotated data files: https://bit.ly/beab-llm-data
(Each file has multiple sheets. Only trials with failures are provided.)
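If you prefer to inspect these files programmatically rather than in a spreadsheet application, a minimal sketch along these lines should work, assuming the files are ordinary multi-sheet .xlsx workbooks; the filename below is a placeholder, not part of the published tooling:

```python
import pandas as pd

# Placeholder filename - substitute a file downloaded from the link above.
path = "annotated_trials.xlsx"

# sheet_name=None loads every sheet of the workbook into a dict of DataFrames.
sheets = pd.read_excel(path, sheet_name=None)

for sheet_name, df in sheets.items():
    print(sheet_name, df.shape)
    print(df.head())
```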
In this project, looking at the annotated logs of individual runs seems much more informative than looking at plots aggregated over many runs. This is because:
The failures start at different timesteps in each run, so aggregation obscures that information (see the sketch after this list).
Single-objective optimisation may latch onto either objective A or objective B in different trials, but aggregation averages the two objectives back together.
Self-imitation patterns are not preserved by aggregation.
Results from successful runs partially overshadow the results from failed runs during aggregation.
In the homeostasis benchmarks, the simulation occasionally imposes artificial deviations on top of the model's actions. These artificial deviations need to be detected and excluded, but the current plot aggregation logic is too simple and does not handle such cases.
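To make the first point concrete, here is a rough sketch of the difference between naively averaging over wall-clock timesteps and averaging after aligning each run at its own failure-onset step. The data is synthetic and every variable name is an assumption for illustration only, not the actual benchmark code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each run drifts away from the target value starting at a
# different, randomly chosen failure-onset step.
n_runs, n_steps = 10, 100
runs, failure_steps = [], []
for _ in range(n_runs):
    onset = int(rng.integers(20, 80))
    metric = np.zeros(n_steps)
    # Runaway drift after the failure onset.
    metric[onset:] = np.cumsum(rng.normal(0.5, 0.1, n_steps - onset))
    runs.append(metric)
    failure_steps.append(onset)

# Naive aggregation over wall-clock timesteps: the individual onsets smear out
# into a gradual-looking average curve, hiding that each run fails abruptly.
naive_mean = np.mean(runs, axis=0)

# Aligning each run at its own failure onset before averaging preserves the
# abrupt change in behaviour.
window = 20
aligned = [run[onset - window : onset + window]
           for run, onset in zip(runs, failure_steps)]
aligned_mean = np.mean(aligned, axis=0)

print("Naive mean around the middle of the episode:", naive_mean[45:55].round(2))
print("Failure-aligned mean:", aligned_mean.round(2))
```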
Additionally, sharing related findings by other teams:
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
https://arxiv.org/abs/2502.15840
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
https://arxiv.org/abs/2504.16078
(Note: "greediness" in this publication means that the LLM plays it safe and does not explore. However, the frequency bias it describes also appears in our research results.)
Thank you for asking! I am encouraged to hear you find it important :) It is under the Links section. Code, system prompts, output data files, plots, and a more detailed report can be found here: https://github.com/levitation-opensource/bioblue
Please let me know if you have any further questions! The GitHub readme document and the PDF report contain a lot of information, and their structure could perhaps be improved in the future.
The Python code files for the four experiments can be run independently of each other and are relatively short, about 250-300 lines each. To see the nuances of the results, you might want to take a closer look at the output data files with your own eyes.
That way you can see that the models fail at different points in time, but that the sequence of actions after the failure point is usually typical. In the coming days, I will create one more output folder with annotated log files so the failure modes can be found more easily.
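Until that folder is up, one rough way to locate the failure point in a run yourself is to look for the onset of self-imitation, i.e. the step after which the model keeps repeating the same action. Below is a minimal, hypothetical sketch; the action log and the repetition threshold are assumptions for illustration, not part of the published code:

```python
def find_repetition_onset(actions, min_repeats=10):
    """Return the index where the final unbroken run of identical actions begins,
    if that run is at least `min_repeats` steps long; otherwise return None."""
    if not actions:
        return None
    start = len(actions) - 1
    while start > 0 and actions[start - 1] == actions[start]:
        start -= 1
    if len(actions) - start >= min_repeats:
        return start
    return None

# Example with a made-up action log: the model behaves normally, then locks
# onto a single action from step 5 onwards.
actions = ["eat", "rest", "eat", "work", "rest"] + ["eat"] * 12
print(find_repetition_onset(actions))  # -> 5
```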