I think an important consideration is how much we can break the alignment problem down into repeatable subtasks (in the vein of automated interpretability). If we just need to spend millions of human-hour-equivalents extending work that humans already know how to do, that's easy mode.
If instead you don’t even know what you want a solution to look like, but you’re going to turn an AI loose and hope you’ll know a “solution to alignment” when you see it, then I’m much more concerned about how you generate selection pressure toward desired outputs without also encouraging undesired ones.
I think it’s probably true today that LLMs are better at doing subtasks than at doing work over long time horizons.
On the other hand, I think human researchers are also quite capable of incrementally advancing their own research agendas, and there’s a possibility that AI could be much more creative than humans at seeing patterns in a huge corpus of input text and coming up with new ideas from them.