Do you have ideas about how to do this?
I can’t think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.
But this is, like, probably not a thing we should just do first and think about later. I’d like it to be part of a pre-meditated plan to handle outer alignment.
Edit: after thinking about it, that’s too cautious. We should think first, but some experimentation is necessary. The up-front thinking should plausibly be more like having some idea about how to bias further work towards safety, rather than building self-improving AI as fast as possible.
We hope to borrow many ideas from cogsci work, where mental models between people (e.g., co-working, teacher/student situations) are well studied. These papers that we cite may give a good idea of the flavor: https://langcog.stanford.edu/papers_new/goodman-2016-tics.pdf or https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2022.0048. In other words, cogsci folks have been studying how humans come to understand each other in order to work together better or to enable better education, and agentic interpretability advocates doing something similar (tho it may look very different) with machines.
I was asking more “how does the AI get a good model of itself”, but your answer was still interesting, thanks. Still not sure if you think there are some straightforward ways future AI will get such a model that all come out more or less at the starting point of your proposal. (Or if not.)
Here’s another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking “how do you communicate with humans to make them good at RL feedback,” you’re asking “how do you communicate with humans to make them good at participating in verbal chain of thought?”