suppose a human solved alignment. how would we check their solution? ultimately, we'd look at their argument and use our reasoning and judgement to determine that it's correct. this applies even if we need to adopt a new frame for looking at the world: we can only use the kinds of reasoning and judgement we use today to decide which new kinds of reasoning to accept into the halls of truth.
so there is a philosophically very straightforward thing you could do to get the solution to alignment, or the closest we could ever get: put a bunch of smart thoughtful humans in a sim and run it for a long time. the verification process then looks like proving that the sim is correct, showing somehow that the humans are actually correctly uploaded, etc. not trivial, but also plausibly doable.
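To make "proving that the sim is correct" a bit more concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical: `step`, `initial_state`, and the pickled state format stand in for a verified brain-emulation pipeline that does not exist; the only point is that once the sim is deterministic, verifying the outcome reduces to re-checking a computation rather than re-judging the object-level reasoning.

```python
# Hypothetical sketch only -- step() and initial_state are stand-ins for a
# (currently nonexistent) verified upload/emulation pipeline.

import hashlib
import pickle


def run_sim(initial_state, step, n_steps):
    """Run a deterministic simulation of deliberating humans."""
    state = initial_state
    for _ in range(n_steps):
        state = step(state)  # one tick of the deterministic sim
    return state


def transcript_hash(state):
    """Commitment to the final state, so independent parties can compare runs."""
    return hashlib.sha256(pickle.dumps(state)).hexdigest()


def verify(initial_state, step, n_steps, claimed_hash):
    """'Verification' here is just re-running the deterministic computation and
    checking the commitment. The hard parts (is `step` a faithful emulation?
    is `initial_state` a correct upload?) are exactly what this sketch assumes away."""
    return transcript_hash(run_sim(initial_state, step, n_steps)) == claimed_hash
```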
Unless you can abstract out the “alignment reasoning and judgement” part of a human’s entire brain process (with philosophical reasoning and judgement as part of that) into some kind of explicit understanding of how it works, how do you actually build that into AI without solving uploading (which we’re obviously not on track to solve in 2-4 years either)?
put a bunch of smart thoughtful humans in a sim and run it for a long time
Alignment researchers have had this thought for a long time (see e.g. Paul Christiano’s A formalization of indirect normativity), but I think the practical alignment research programs this line of thought led to, such as IDA and Debate, are still bottlenecked by a lack of metaphilosophical understanding: without the kind of understanding that lets you build an “alignment/philosophical reasoning checker” (analogous to a proof checker for mathematical reasoning), they’re stuck trying to do ML of alignment/philosophical reasoning from human data, which I think is unlikely to work out well.
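To spell out the proof-checker analogy: a mathematical proof checker works because every inference step can be mechanically validated against a fixed, explicit set of rules. Below is a hedged sketch (in Python, all names hypothetical) of that shape; the metaphilosophy bottleneck is precisely that we cannot currently write down the analogous `valid_rules` for alignment/philosophical reasoning.

```python
# Hypothetical interface only -- we can enumerate inference rules for formal
# logic, but not (yet) for philosophical reasoning, which is the bottleneck
# the text points at.

from dataclasses import dataclass


@dataclass
class Step:
    claim: str
    rule: str            # which inference rule is being invoked
    premises: list[str]  # previously established claims this step depends on


def check_proof(steps: list[Step], axioms: set[str], valid_rules: set[str]) -> bool:
    """Mathematical-style checking: accept a derivation iff every step uses a
    known rule and cites only previously established claims."""
    established = set(axioms)
    for step in steps:
        if step.rule not in valid_rules:
            return False
        if not all(p in established for p in step.premises):
            return False
        established.add(step.claim)
    return True

# An "alignment/philosophical reasoning checker" would need the same shape: an
# explicit, enumerable notion of which reasoning moves are valid. Lacking that,
# IDA/Debate fall back on learning the checker implicitly from human judgments,
# which is the step the comment above is skeptical of.
```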