Roland Pihlakas comments on Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Roland Pihlakas 30 Jun 2025 15:10 UTC
Hello! I have posted a new document that brainstorms methodology for further research into the current runaway-LLM findings:
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
I hope this post will also serve as a generally interesting brainstorming collection and discussion ground for black-box LLM interpretability methodology, as well as for failure mitigation ideas.
Hope you find it relevant and interesting!