Unless you can abstract out the “alignment reasoning and judgement” part of a human’s entire brain process (and philosophical reasoning and judgement as part of that) into some kind of explicit understanding of how it works, how do you actually build that into AI without solving uploading (which we’re obviously not on track to solve in 2-4 years either)?
put a bunch of smart thoughtful humans in a sim and run it for a long time
Alignment researchers have had this thought for a long time (see e.g. Paul Christiano’s A formalization of indirect normativity), but I think the practical alignment research programs that this line of thought led to, such as IDA and Debate, are all still bottlenecked by a lack of metaphilosophical understanding. Without the kind of understanding that would let you build an “alignment/philosophical reasoning checker” (analogous to a proof checker for mathematical reasoning), they’re stuck trying to do ML of alignment/philosophical reasoning from human data, which I think is unlikely to work out well.