This is a logical consequence. Rice’s theorem says: no uniform decision procedure can take an arbitrary program and decide a nontrivial semantic property. The reply (“we only need it for the systems we build”) seems to sidestep that quantifier.
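To make the quantifier concrete, here is a minimal Python sketch of the standard reduction behind Rice’s theorem. All names (`make_gadget`, `halts_via`, `decides_property`, `witness`) are hypothetical, and `decides_property` stands for an assumed oracle that the theorem rules out for arbitrary programs. The sketch only shows why having such an oracle would also decide halting.

```python
from typing import Callable

Program = Callable[[int], int]


def make_gadget(machine: Callable[[], None], witness: Program) -> Program:
    """Build a program that first runs `machine`, then behaves like `witness`.

    If `machine` halts, the gadget computes the same function as `witness`.
    If `machine` runs forever, the gadget never returns on any input.
    """
    def gadget(x: int) -> int:
        machine()            # may never return
        return witness(x)    # reached only if `machine` halted
    return gadget


def halts_via(decides_property: Callable[[Program], bool],
              machine: Callable[[], None],
              witness: Program) -> bool:
    """Hypothetical: turn a decider for a semantic property into a halting test.

    Assumes `witness` has the property and the nowhere-defined function
    does not, so the gadget has the property exactly when `machine` halts.
    """
    return decides_property(make_gadget(machine, witness))


def halting_machine() -> None:
    """Trivial stand-in for a machine that is known to halt."""
    return None


def double(x: int) -> int:
    return 2 * x


gadget = make_gadget(halting_machine, double)
```

Because `halting_machine` returns, `gadget` agrees with `double` on every input. The theorem’s force comes from `machine` being arbitrary; fixing one program changes what is quantified over, not the construction.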
So let’s tighten it: How do you prove that a single specific agent is aligned across one possible state of the world? By a criterion. Across ‘some’ possible states of the world? By multiple criteria. Across ‘all relevant’ states of the world? By all ‘relevant’ criteria. Problematic if one has no model and no criteria.
Might want to resort to a HOL, probabilistic, game-theoretic, or philosophical Ansatz instead… What will fail is trying to extract semantics from places where semantics were never previously extracted (circuits, features, etc.).
You can restrict the program class (good), but unless you also operationalize the state space and the key properties, you haven’t escaped diagonalization, you’ve just moved it. “All states that could matter” tends to reintroduce worst-case adversarial cases, and then you either (a) weaken the guarantee, or (b) strengthen the structure, at which point the real question becomes whether your restriction still matches what you meant by “the world.”
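As a contrast case, here is a rough sketch of option (b) when it works: a toy policy over a finite, explicitly enumerated state space, checked against an explicit criterion. The names (`State`, `policy`, `criterion`, `check_all_states`) and the toy world are assumptions for illustration only, not taken from any real system. The check is decidable only because both the states and the property were written down.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

State = Tuple[int, int]  # hypothetical toy world: (resource level, hazard level)


def policy(state: State) -> str:
    """Toy agent: act only when resources exceed the hazard level."""
    resource, hazard = state
    return "act" if resource > hazard else "wait"


def criterion(state: State, action: str) -> bool:
    """Explicit, operationalized property: never act when hazard is 2 or more."""
    _, hazard = state
    return not (action == "act" and hazard >= 2)


def check_all_states(states: Iterable[State],
                     policy: Callable[[State], str],
                     criterion: Callable[[State, str], bool]) -> Optional[State]:
    """Return the first state that violates the criterion, or None if none does."""
    for s in states:
        if not criterion(s, policy(s)):
            return s
    return None


# The restricted world: resource and hazard both range over 0..2.
restricted = check_all_states(product(range(3), range(3)), policy, criterion)

# A slightly wider world: resource may also be 3.
wider = check_all_states(product(range(4), range(3)), policy, criterion)
```

In the restricted world the check finds no violation (`restricted` is `None`); letting the resource level reach 3 produces the counterexample `(3, 2)`. The guarantee held only relative to the enumerated states, which is the worry about whether the restriction still matches “the world.”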
So: yes, it’s not primarily “decide alignment for all programs,” it’s “justify a uniform claim about one program over an open-ended environment,” which is where the constructive discomfort actually resides, and more so because neural networks invite the self-referential flavor.
Note. Interestingly, this obstructs AI too.