Why a formal specification of the desired properties?
Humans do not carry around a formal specification of what we want printed on the inside of our skulls. So when presented with some formal specification, we would need to gain confidence that such a formal specification would lead to good things and not bad things through some informal process. There’s also the problem that specifications of what we want tend to be large—humans don’t do a good job of evaluating formal statements even when they’re only a few hundred lines long. So why not just cut out the middleman and directly reference the informal processes humans use to evaluate whether some plan will lead to good things and not bad things?
The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property Q (presumably a conjunct of many other properties, etc.) such that Q conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to resources sufficient for at least their previous standard of living). In human history, the development of increasingly formalized bright lines around what things count as definitely bad things (namely, laws) seems to have been greatly instrumental in the reduction of bad things overall.
Regarding the challenges of understanding formal descriptions, I’m hopeful about this because of
natural abstractions (so the best formal representations could be shockingly compact)
code review (Google’s codebase is not exactly “a formal property,” unless we play semantics games, but it is highly reliable, fully machine-readable, and every one of its several billion lines of code has been reviewed by at least 3 humans)
AI assistants (although we need to be very careful here—e.g. reading LLM outputs cannot substitute for actually understanding the formal representation since they are often untruthful)
Why a formal specification of the desired properties?
Humans do not carry around a formal specification of what we want printed on the inside of our skulls. So when presented with some formal specification, we would need to gain confidence that such a formal specification would lead to good things and not bad things through some informal process. There’s also the problem that specifications of what we want tend to be large—humans don’t do a good job of evaluating formal statements even when they’re only a few hundred lines long. So why not just cut out the middleman and directly reference the informal processes humans use to evaluate whether some plan will lead to good things and not bad things?
The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property Q (presumably a conjunct of many other properties, etc.) such that Q conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to resources sufficient for at least their previous standard of living). In human history, the development of increasingly formalized bright lines around what things count as definitely bad things (namely, laws) seems to have been greatly instrumental in the reduction of bad things overall.
Regarding the challenges of understanding formal descriptions, I’m hopeful about this because of
natural abstractions (so the best formal representations could be shockingly compact)
code review (Google’s codebase is not exactly “a formal property,” unless we play semantics games, but it is highly reliable, fully machine-readable, and every one of its several billion lines of code has been reviewed by at least 3 humans)
AI assistants (although we need to be very careful here—e.g. reading LLM outputs cannot substitute for actually understanding the formal representation since they are often untruthful)