My mental one-sentence summary of how to think about ELK is “making debate work well in a setting where debaters are able to cite evidence gained by using interpretability tools on each other”.
I’m not claiming that this is how anyone else thinks about ELK (although I got the core idea from talking to Paul), but since I haven’t seen it posted online yet, and since ELK is pretty confusing, I thought it’d be useful to put out there. In particular, this framing motivates us to develop interpretability tools which scale, in the sense of being robust when used as evidence by AGIs.
Note that this is a very different type of solution to the ones in the original writeup, which seem mainly useful for illustrative purposes rather than actually pointing in promising directions.