Have you seen Erik P. Hoel’s work on causal abstraction? This post reminded me of it.
When you talk about counterfactuals, do you mean interventions? Either way, I’m guessing the “everything still works” conclusion holds for both interventions and counterfactuals.
I think this is related to the word problem for the rewriting system defined by your programming language. When I first read this question I was thinking “Something to do with Church-Rosser?”, but you can follow the links to see for yourself if that literature is what you’re after.
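To make the word-problem connection concrete, here is a minimal sketch on a toy string rewriting system. The rules, `normalize`, and `equivalent` are all made up for illustration and have nothing to do with any particular programming language. The toy system is length-reducing (so it terminates) and confluent, which is the situation where Church-Rosser-style results let you decide equivalence by comparing normal forms.

```python
# Hypothetical toy rewriting system: "a" and "b" cancel each other.
# Both rules shrink the word, so rewriting always terminates, and the
# overlaps "aba" and "bab" reduce to the same word either way (confluence).
RULES = [("ab", ""), ("ba", "")]

def normalize(word, rules=RULES):
    """Apply rules until none matches; returns the normal form."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in word:
                # Rewrite one occurrence, then restart the scan.
                word = word.replace(lhs, rhs, 1)
                changed = True
                break
    return word

def equivalent(u, v):
    """Decide the word problem by comparing normal forms.

    This is only sound because the toy system is terminating and confluent.
    """
    return normalize(u) == normalize(v)

print(equivalent("aba", "a"))  # True
print(equivalent("a", "b"))    # False
```

For a general rewriting system (such as the one induced by a Turing-complete language) no such terminating normalization exists, which is why the word problem is undecidable in general.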
Didn’t watch the video but would have read the post. Might watch the video only because previous posts have been appetising enough.
One misgiving I have about the illustrated format is that it’s less accessible than text. I hope the authors of work in this format keep the needs of a wide variety of readers in mind.
“the objective of agent-designers is to have the agent collect as many agents as possible”
Typo: should say “dollars”?
“if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.
So we can’t find any efficient constructive argument. That rules out most of the obvious strategies.”
I don’t think the procedure needs to be efficient to solve the problem, since we only care about the existence of a smaller circuit (not an efficient way to produce it).
“I don’t think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).”
Does this mean you do not expect daemons to occur in practice because they are too complicated?
Thanks for a great post! I have a small confusion/nit regarding natural selection. Despite its name, I don’t think it’s a good exemplar of a selection process. Going through the features of a selection process from the start of the post:
“Can directly instantiate any element of the search space.” No: natural selection can only make local modifications to previously instantiated points. But you already dealt with this local search issue in Choices Don’t Change Later Choices.
“Gets direct feedback on the quality of each element.” Yes.
“Quality of element does not depend on previous choices.” No: the evaluation of an element in natural selection depends a great deal on previous choices, because they usually make up important parts of its environment. I think this is the thrust of the claim that natural selection is online (which I agree with).
“Only the final output matters.” No? From the perspective of natural selection, I think the quality of the current output is what matters.
I’d love to know why natural selection seemed obvious as an example of a selection process, since it did not seem so to me, given its poor score on the checklist above.
I like this post because it pushes us to be more precise about what we mean by corrigibility. Nice example.
Nice post! Do you have a link to an explanation of what counterfactual mugging is and why it’s a good thing?
For subagent alignment problems, is there an interesting distinction to be drawn between the limited agent being able to understand the process by which the more powerful agent becomes powerful, versus not even understanding that? (What would it mean to “understand the process”? I suppose it means being able to validate certain relevant facts about the process though not enough to know exactly what results from it.)
More specifically, it seems that your c must include information about how to interpret the X bits. Right? So it seems slightly wrong to say “R is the largest number that can be specified in X bits of information” as long as c stays fixed. c might grow as the specification scheme changes.
Alternatively, you might just be wrong in thinking that 30 bits are enough to specify 3^^^^3. If c indicates that the number of additional universes is specified by a standard binary-encoded number, 30 bits only gets you about a billion.
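As a quick check of the “about a billion” figure, here is a small sketch. It assumes the bits are read as a plain unsigned binary integer; `largest_binary_value` is a name made up for this example.

```python
# Assumption: the X bits encode a standard unsigned binary number,
# with no cleverer specification scheme hidden in c.
def largest_binary_value(bits: int) -> int:
    """Largest integer representable in `bits` binary digits."""
    return 2**bits - 1

print(largest_binary_value(30))  # 1073741823, i.e. roughly 1.07 billion
```

Under that encoding, each extra bit only doubles the range, so reaching a number like 3^^^^3 in 30 bits requires the interpretation scheme (and hence c) to do most of the work.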
They’re not yet close to being taken over by AI, but there has been research on automating all of the above. Some possibly relevant keywords: automated theorem proving and program synthesis.
I’d go, presuming no other important commitments on the dates.
I know someone who might be interested in doing some translations into Japanese. If you add that language we might have a go.