alignment equivalents to “make a trillion dollars” for capabilities: targets that are easy to verify, strictly imply alignment, and are extremely difficult to get any traction on (along with a series of weakenings of such a metric that are easier to get traction on but imply alignment less strictly).
I expect there’s a fair amount of low-hanging fruit in finding good targets for automated alignment research. E.g., how about an LLM agent which reads thousands of old LW posts looking for a good target? How about unlearning? How about a version of RLHF where you show an alignment researcher two AI-generated critiques of an alignment plan, and they rate which critique is better? A rough sketch of that last idea is below.
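To make the last idea concrete, here is a minimal sketch of how such pairwise comparisons might be collected, in the same chosen-vs-rejected format RLHF preference data usually takes. Everything here (the `generate_critique` stub, the field names, the output file) is an illustrative assumption, not an existing tool.

```python
# Sketch: collect an alignment researcher's preferences over two AI-generated
# critiques of an alignment plan, for later reward-model training.
# The generate_critique stub stands in for an actual LLM call (assumption).
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class PreferenceRecord:
    plan: str
    critique_a: str
    critique_b: str
    preferred: str  # "a" or "b", as judged by the human rater


def generate_critique(plan: str, seed: int) -> str:
    """Stand-in for asking an LLM to critique the plan (hypothetical)."""
    random.seed(seed)
    angle = random.choice([
        "assumes the model reports its reasoning honestly",
        "ignores distribution shift between training and deployment",
        "relies on interpretability claims that have not been verified",
    ])
    return f"This plan {angle}."


def collect_preference(plan: str) -> PreferenceRecord:
    """Show two critiques to the rater and record which one they prefer."""
    a = generate_critique(plan, seed=0)
    b = generate_critique(plan, seed=1)
    print("PLAN:\n" + plan + "\n")
    print("CRITIQUE A:\n" + a + "\n")
    print("CRITIQUE B:\n" + b + "\n")
    choice = ""
    while choice not in ("a", "b"):
        choice = input("Which critique is better? [a/b]: ").strip().lower()
    return PreferenceRecord(plan, a, b, choice)


if __name__ == "__main__":
    record = collect_preference("Use scalable oversight via recursive debate.")
    # Append to a JSONL file; each line is one pairwise comparison.
    with open("critique_preferences.jsonl", "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

The point of the sketch is just that the human labor here is cheap relative to writing critiques from scratch, which is what makes it a plausible target for automation.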