Hello! I have posted a new document that brainstorms methodology for further research on the current runaway-LLM findings:
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
I hope this post will also serve as a generally interesting collection of ideas and a discussion ground for black-box LLM interpretability methodology and for failure mitigation.
Hope you find it relevant and interesting!