What empirical research directions has Eliezer commented positively on?
I’m interested in both work that he has commented on positively after the fact and any comments he might have made about which directions are generally fruitful.
Self-Other Overlap: https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment?commentId=WapHz3gokGBd3KHKm
Emergent Misalignment: https://x.com/ESYudkowsky/status/1894453376215388644
He has made vaguely positive comments about Chris Olah, but I think he always/usually caveated them with “capabilities go like this [big slope], Chris Olah’s interpretability goes like this [small slope]” (e.g., on the Lex Fridman podcast and, IIRC, some other podcast(s)).
ETA:
SolidGoldMagikarp: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Jj5yN2YTp5AphJaEd
He also said that Collin Burns’s DLK was a “highly dignified work”. Ctrl+F “dignified” here; it doesn’t link to the tweet (?), but it should be findable/verifiable.