alignment equivalents to “make a trillion dollars” for capabilities: targets that are easy to verify, strictly imply alignment, and are extremely difficult to get any traction on (along with a series of weakenings of such a metric that are easier to get traction on but imply alignment less strictly).
I expect there’s a fair amount of low-hanging fruit in finding good targets for automated alignment research. E.g., how about an LLM agent which reads thousands of old LW posts looking for a good target? How about unlearning? How about a version of RLHF where you show an alignment researcher two AI-generated critiques of an alignment plan, and they rate which critique is better? A rough sketch of that last idea is below.
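To make the last idea concrete, here is a minimal sketch of how such pairwise comparisons might be collected, in the same chosen-vs-rejected format RLHF preference data usually takes. Everything here (the `generate_critique` stub, the field names, the output file) is an illustrative assumption, not an existing tool.

```python
# Sketch: collect an alignment researcher's preferences over two AI-generated
# critiques of an alignment plan, for later reward-model training.
# The generate_critique stub stands in for an actual LLM call (assumption).
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class PreferenceRecord:
    plan: str
    critique_a: str
    critique_b: str
    preferred: str  # "a" or "b", as judged by the human rater


def generate_critique(plan: str, seed: int) -> str:
    """Stand-in for asking an LLM to critique the plan (hypothetical)."""
    random.seed(seed)
    angle = random.choice([
        "assumes the model reports its reasoning honestly",
        "ignores distribution shift between training and deployment",
        "relies on interpretability claims that have not been verified",
    ])
    return f"This plan {angle}."


def collect_preference(plan: str) -> PreferenceRecord:
    """Show two critiques to the rater and record which one they prefer."""
    a = generate_critique(plan, seed=0)
    b = generate_critique(plan, seed=1)
    print("PLAN:\n" + plan + "\n")
    print("CRITIQUE A:\n" + a + "\n")
    print("CRITIQUE B:\n" + b + "\n")
    choice = ""
    while choice not in ("a", "b"):
        choice = input("Which critique is better? [a/b]: ").strip().lower()
    return PreferenceRecord(plan, a, b, choice)


if __name__ == "__main__":
    record = collect_preference("Use scalable oversight via recursive debate.")
    # Append to a JSONL file; each line is one pairwise comparison.
    with open("critique_preferences.jsonl", "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

The point of the sketch is just that the human labor here is cheap relative to writing critiques from scratch, which is what makes it a plausible target for automation.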