Research at the Forecasting Research Institute. Previously U.C. Berkeley Center for Long-Term Cybersecurity. I’m interested in interpretability, particularly introspection and introspective access. https://else.how
Nick Merrill
Karma: 15
One outstanding question I’ve had about introspection is: what part of the model is doing the introspection? In humans, this might be the prefrontal cortex.
In the extreme case, where a model can introspect arbitrarily well, this component is the only part of the model that needs to be aligned: through introspection, it can detect and correct problems in the rest. In other words, good introspection could considerably reduce the ‘surface area’ for alignment.