Do you expect interpretability tools developed now to extend to interpreting more general (more multimodal, better at navigating the real world) decision-making systems? How?
Yes, we do expect this to be the case. Unfortunately, I think explaining in detail why we think this may be infohazardous. Or at least, I am sufficiently unsure about how infohazardous it is that I would first like to think about it for longer and run it through our internal infohazard review before sharing more. Sorry!
Do you expect interpretability tools developed now to extend to interpreting more general (more multimodal, better at navigating the real world) decision-making systems? How?
Yes, we do expect this to be the case. Unfortunately, I think explaining in detail why we think this may be infohazardous. Or at least, I am sufficiently unsure about how infohazardous it is that I would first like to think about it for longer and run it through our internal infohazard review before sharing more. Sorry!
Did you end up running it through your internal infohazard review and if so what was the result?