Sharing a couple of new links related to your question about more details. Hope you find them interesting!
Presentation at MAISU unconference 2025:
Link to slides: https://bit.ly/beab-llm
Session recording at YouTube: https://bit.ly/beab-llm-rec
Link to annotated data files:
https://bit.ly/beab-llm-data
(Each file has multiple sheets. Only trials with failures are provided.)
In this project, looking at the annotated logs of individual runs seems much more informative than looking at plots aggregated over many runs. This is because:
Failures start at different timesteps in each run; aggregation obscures that information.
Single-objective optimisation may focus on either objective A or objective B in different experiment trials, but during aggregation the two objectives get averaged back together.
Self-imitation patterns cannot be preserved by aggregation.
Results from successful runs partially overshadow results from failed runs during aggregation.
In the homeostasis benchmarks, the simulation occasionally imposes artificial deviations on top of what the model does. These artificial deviations need to be detected and ignored, but the current plot-aggregation logic is too simple to handle such cases.
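To illustrate the last point, here is a minimal sketch of aggregation that masks out artificially deviated timesteps instead of averaging over them. The data layout (per-run `values` and a boolean `artificial` flag per timestep) is a hypothetical assumption, not the project's actual log format:

```python
import numpy as np

def aggregate_runs(runs, max_len=None):
    """Mean per-timestep value over runs, ignoring timesteps where the
    simulation injected an artificial deviation (hypothetical schema:
    each run is a dict with 'values' and a boolean 'artificial' mask)."""
    if max_len is None:
        max_len = max(len(r["values"]) for r in runs)
    # Rows = runs, columns = timesteps; NaN marks padding or masked steps.
    data = np.full((len(runs), max_len), np.nan)
    for i, r in enumerate(runs):
        v = np.asarray(r["values"], dtype=float).copy()
        v[np.asarray(r["artificial"], dtype=bool)] = np.nan  # drop injected deviations
        data[i, : len(v)] = v
    # nanmean skips both the padding and the masked deviations.
    return np.nanmean(data, axis=0)

runs = [
    {"values": [1.0, 5.0, 1.0], "artificial": [False, True, False]},
    {"values": [1.0, 1.0],      "artificial": [False, False]},
]
print(aggregate_runs(runs))  # the injected 5.0 at timestep 1 is excluded
```

A more faithful fix would also re-align runs at failure onset before averaging, since the failures start at different timesteps, but that depends on how failure onset is annotated in the logs.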
Additionally, sharing some related findings from other teams:
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
https://arxiv.org/abs/2502.15840
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
https://arxiv.org/abs/2504.16078
(Note: "greediness" in this publication means that the LLM plays it safe and does not explore; however, the frequency bias they describe also appears in our research results.)