I like the feedback framing; it seems to get closer to the heart-of-the-thing than my explanation did. It makes the role of the pointers problem and latent variables clearer, which in turn makes the role of outer alignment clearer. When writing my review, I kept thinking that reflection, embeddedness, and outer alignment all needed to be figured out to deal with this kind of malign inner agent, but I didn’t have a good explanation for the outer alignment part, so I focused mainly on reflection and embeddedness.
That said, I think the right frame here involves “feedback” in a more general sense than I think you’re imagining. In particular, I don’t think catastrophes are very relevant.
The role of “feedback” here is mainly informational; it’s about the ability to tell which decision is correct. The thing-we-want from the “feedback” is something like the large-data guarantee from SI: we want to be able to train the system on a bunch of data before asking it for any output, and we want that training to wipe out the influence of any malign agents in the hypothesis space. If there’s some class of decisions where we can’t tell which decision is correct, and a malign inner agent can recognize that class, then presumably we can’t create the training data we need.
With that picture in mind, the ability to give feedback “online” isn’t particularly relevant, and therefore catastrophes are not particularly central. We only need “feedback” in the sense that we can tell which decision we want, in any class of problems which any agent in the hypothesis space can recognize, in order to create a suitable dataset.
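A toy sketch of the informational point, not SI itself: a two-hypothesis Bayesian mixture in which a hypothetical `malign` hypothesis agrees with the intended one everywhere except on a class of inputs it can recognize. All names here (`honest`, `malign`, the `"easy"`/`"hard"` input classes, the noise level `eps`) are assumptions made up for illustration.

```python
# Toy Bayesian mixture over two hypotheses. Each labeled example multiplies a
# hypothesis's weight by the probability it assigned to the observed label.

def honest(x):
    # Hypothetical intended hypothesis: always outputs the correct decision (1).
    return 1

def malign(x):
    # Hypothetical malign hypothesis: matches `honest` except on the class of
    # inputs it recognizes as ones we cannot label ("hard").
    return 0 if x == "hard" else 1

HYPOTHESES = {"honest": honest, "malign": malign}

def posterior(train, prior, eps=0.01):
    """Return normalized posterior weights after seeing labeled pairs (x, y).

    eps is an assumed noise level: a hypothesis assigns probability 1 - eps
    to the label it predicts and eps to the other label.
    """
    weights = dict(prior)
    for x, y in train:
        for name, h in HYPOTHESES.items():
            weights[name] *= (1 - eps) if h(x) == y else eps
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

PRIOR = {"honest": 0.9, "malign": 0.1}

# Case 1: we can only label the "easy" class, so training never separates the two.
easy_only = [("easy", 1)] * 50
post_easy = posterior(easy_only, PRIOR)

# Case 2: we can also tell which decision is correct on the "hard" class.
with_hard = easy_only + [("hard", 1)] * 10
post_hard = posterior(with_hard, PRIOR)
```

In the first case the malign hypothesis keeps its prior weight however much data we add, since nothing in the dataset distinguishes it; in the second, a handful of labeled examples from the recognizable class drives its weight to nearly zero. Nothing in the sketch depends on whether those labels arrive online or in advance.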
We need to be able both to tell what decision we want and to identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. as in adversarial training), or we could do it after the fact by training online if there are no catastrophes, but it seems like we need to get those inputs one way or the other. If there are catastrophes and we can’t do adversarial training, then even if we can tell which decision we want in any given case, we can still die the first time we encounter an input where the system behaves catastrophically. (Or, more realistically, during the first wave in which our systems are all exposed to such catastrophe-inducing inputs simultaneously.)