A sufficient condition here should be a lack of feedback loops that include information about the agent.
Agreed. Also agreed that this seems very difficult, both in theory and in practice.
Your argument about Solomonoff induction is clever but I feel like it’s missing the point.
I agree it’s missing the point. I do get the point, and I disagree with it. I wanted to say “all three cases will build self-models”, but I couldn’t, because that may not be true for Solomonoff induction for an unrelated reason which, as you note, misses the point. I did claim that the other two cases would be self-aware as you define it.
(I agree that Solomonoff induction might build an approximate model of itself, idk.)
Maybe if we do it right, the best model would not be self-reflective: it wouldn’t know what it was doing as it did its predictive thing, and thus would be unable to reason about its internal processes or recognize causal connections between those processes and the world it sees (even if such connections are blatant).
My claim is that we have no idea how to do this, and I think the examples in your post would not do this.
One intuition is: An oracle is supposed to just answer questions. It’s not supposed to think through how its outputs will ultimately affect the world. So, one way of ensuring that it does what it’s supposed to do, is to design the oracle to not know that it is a thing that can affect the world.
I’m not disagreeing that if we could build a self-unaware oracle then we would be safe. That seems reasonably likely to fix agency issues (though I’d want to think about it more). My disagreement is with the premise of the argument, i.e. whether we can build self-unaware oracles at all.
Example of a self-aware system: A traditional RL agent. (Why? Because it has a special concept of “its own actions” represented in its models.)
Example of a self-unaware system: Any system that takes inputs, does a deterministic computation, and spits out an output. (Why? Because when you correctly compute a computable function, you get the same answer regardless of where and whether the computation is physically instantiated in the universe.)
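To make the contrast concrete, here is a minimal sketch of the two kinds of systems, in Python; everything in it is a hypothetical stand-in, not a real predictor or policy.

```python
# A toy contrast between the two kinds of systems; all names here
# are hypothetical stand-ins, not real implementations.

def self_unaware_oracle(question: str) -> str:
    """A pure input-output computation: the answer depends only on the
    question, not on where (or whether) the computation is physically
    instantiated."""
    canned_answers = {"2+2": "4"}        # stand-in for the real predictor
    return canned_answers.get(question, "unknown")

class SelfAwareRLAgent:
    """Keeps a distinguished representation of 'my own actions' in its
    model, which is what makes it self-aware in the sense above."""
    def __init__(self) -> None:
        self.model = []                  # (observation, my_action) history

    def act(self, observation: str) -> str:
        action = "noop"                  # stand-in for a learned policy
        # The model explicitly tags this as *the agent's own* action:
        self.model.append((observation, action))
        return action
```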
A traditional RL agent is absolutely a deterministic computation (modulo bugs in the code). It is a program that gets compiled into, or run by, machine instructions that follow a particular deterministic specification written by Intel (if you’re running on Intel chips).
Also, this argument would prove that humans are not deterministic systems, which seems like it’s proving too much.
If you want to predict what’s going to happen in the world, it often helps if you know that you are a thing that affects the world.
For your three examples: The Solomonoff induction example is a weird case because it is an uncomputable algorithm that only has computable hypotheses, so it can’t be aware of itself, but your second and third examples seem like they totally could lead to self-aware systems. In fact, the third example sounds like a description of humans, and humans are self-aware.
Overall I don’t see how we could tell in advance whether a system would be self-unaware or not.
So far it is more of a position paper saying “we should do this” rather than “we have done this”, so it’s not super clear what tasks the helper-bot does. The human interacts with the bot by messaging it in normal English. Here’s an example of the kind of thing they want to be able to do, taken directly from the paper:
PLAYER: build a tower 15 blocks tall and then put a giant smiley on top
ASSISTANT: ok [assistant starts building the tower]
PLAYER: wait, stop, make every other block red [assistant recognizes the instruction refers to a change in its current build task, but doesn’t understand the change (and specifically recognizes “every other” as unknown)]
ASSISTANT: What is “every other”?
PLAYER: Let me show you [player makes a stack of blocks alternating in color; assistant is able to generalize “every other” to new situations]
My project is not at all related in the sense that this paper was a surprise to me when it was released a few days ago, but it is very related in the sense that I too am trying to build a benchmark of Minecraft tasks where there aren’t obvious reward functions. I’m less focused on natural language though—the hope is that people could try to solve it using other techniques as well, such as IRL or imitation learning.
Oh, I see, so the argument is that conditional on the idealized synthesis algorithm being a good definition of human preferences, the AI can approximate the synthesis algorithm, and whatever utility function it comes up with and optimizes should not have any human-identifiable problems. That makes sense. Followup questions:
How do you tell the AI system to optimize for “what the idealized synthesis algorithm would do”?
How can we be confident that the idealized synthesis algorithm actually captures what we care about?
For example, imagine that the AI extinguished all meaningful human interactions because these can sometimes be painful and the AI knows that we prefer to avoid pain. But it’s clear to us that most people’s partial preferences will not endorse total loneliness as a good outcome; if it’s clear to us, then it’s a fortiori clear to a very intelligent AI; hence the AI will avoid that failure scenario.
I don’t understand this. My understanding is that you are proposing that we build a custom preference inference and synthesis algorithm, that’s separate from the AI. This produces a utility function that is then fed into the AI. But if this is the case, then you can’t use the AI’s intelligence to argue that the synthesis algorithm will work well, since they are separate.
Perhaps you do intend for the synthesis algorithm to be part of “the AI”? If so, can you say more about how that works? What assumptions about the AI do you need to be true?
Fwiw I also think it is not necessary to know lots of areas of math for AI safety research. Note that I do in fact know a lot of areas of math relatively shallowly.
I do think it is important to be able to do mathematical reasoning, which I can roughly operationalize as getting to the postrigorous stage in at least one area of math.
… Plausibly? Idk, it’s very hard for me to talk about the validity of intuitions in an informal, intuitive model that I don’t share. I don’t see anything obviously wrong with it.
There’s the usual issue that Bayesian reasoning doesn’t properly account for embeddedness, but I don’t think that would make much of a difference here.
Note that even if AI researchers do this similarly to other groups of people, that doesn’t change the conclusion that there are distortions that push towards shorter timelines.
Sorry in advance for how unhelpful this is going to be. I think decomposing an agent into “goals”, “world-model”, and “planning” is the wrong way to be decomposing agents. I hope to write a post about this soon.
I think I’m understanding you to be conceptualizing a dichotomy between “uncertainty over a utility function” vs. “looking for the one true utility function”.
Well, I don’t personally endorse this. I was speculating on what might be relevant to Stuart’s understanding of the problem.
I was trying to point towards the dichotomy between “acting while having uncertainty over a utility function” vs. “acting with a known, certain utility function” (see e.g. The Off-Switch Game). I do know about the problem of fully updated deference and I don’t know what Stuart thinks about it.
Also, for what it’s worth, when there is an unidentifiability problem (as there is here), a Bayesian agent won’t converge to certainty about a utility function even in the limit.
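As a toy illustration (with made-up numbers): if two reward hypotheses rationalize the observed behavior equally well, Bayesian updating never moves their relative posterior, no matter how much data comes in.

```python
# Toy illustration of unidentifiability: two reward hypotheses that
# assign identical likelihood to every observation keep their prior
# ratio forever, so the posterior never concentrates on either one.
prior = {"R1": 0.7, "R2": 0.3}

def likelihood(hypothesis: str, action: str) -> float:
    # Both hypotheses rationalize the observed behavior equally well.
    return 0.9 if action == "observed_action" else 0.1

posterior = dict(prior)
for _ in range(1000):                       # 1000 identical observations
    unnorm = {h: p * likelihood(h, "observed_action")
              for h, p in posterior.items()}
    total = sum(unnorm.values())
    posterior = {h: p / total for h, p in unnorm.items()}

print(posterior)  # ~{'R1': 0.7, 'R2': 0.3}: the prior ratio is preserved
```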
Agreed, but I’m not sure why that’s relevant. Why do you need certainty about the utility function, if you have certainty about the policy?
Does this not sound like a plan of running (C)IRL to get the one true utility function?
I do not think that is actually his plan, but I agree it sounds like it. One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.
But really my answer is, the inferential distance between Stuart and the typical reader of this forum is very large. (The inferential distance between Stuart and me is very large.) I suspect he has very different empirical beliefs, such that you could reasonably say that he’s working on a “different problem”, in the same way that MIRI and I work on radically different stuff mostly due to different empirical beliefs.
This post argues that AI researchers and AI organizations have an incentive to predict that AGI will come soon, since that leads to more funding, and so we should expect timeline estimates to be systematically too short. Besides the conceptual argument, we can also see this in the field’s response to critics: both historically and now, criticism is often met with counterarguments based on “style” rather than engaging with the technical meat of the criticism.
I agree with the conceptual argument, and I think it does hold in practice, quite strongly. I don’t really agree that the field’s response to critics implies that they are biased towards short timelines—see these comments. Nonetheless, I’m going to do exactly what this post critiques, and say that I put significant probability on short timelines, but not explain my reasons (because they’re complicated and I don’t think I can convey them, and certainly can’t convey them in a small number of words).
My main point is that IRL, as it is typically described, feels nearly complete: just throw in a more advanced RL algorithm as a subroutine and some narrow-AI-type add-on for identifying human actions from a video feed, and voila, we have a superhuman human helper.
But maybe we could be spending more effort trying to follow through to fully specified proposals which we can properly put through the gauntlet.
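For concreteness, here is the pipeline that quoted picture imagines, as I read it; every function below is a hypothetical stub, not anyone’s actual algorithm.

```python
# A minimal sketch of the "nearly complete" IRL pipeline described
# above; every function here is a hypothetical stub, not a real API.

def extract_human_actions(video_feed):
    """Stand-in for the narrow-AI add-on that identifies human
    actions from a video feed."""
    return [("state", "action") for _ in video_feed]

def inverse_rl(demonstrations):
    """Stand-in for IRL: infer a reward function from demonstrations."""
    return lambda state, action: 1.0

def advanced_rl(reward):
    """Stand-in for the 'more advanced RL algorithm' subroutine."""
    return lambda state: "helpful_action"

def superhuman_helper(video_feed):
    demonstrations = extract_human_actions(video_feed)
    reward = inverse_rl(demonstrations)
    return advanced_rl(reward)   # voila: a policy meant to help humans
```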
Regardless of whether it is intended or not, this sounds like a dig at CHAI’s work. I do not think that IRL is “nearly complete”. I expect that researchers who have been at CHAI for at least a year do not think that IRL is “nearly complete”. I wrote a sequence partly for the purpose of telling everyone “No, really, we don’t think that we just need to run IRL to get the one true utility function; we aren’t even investigating that plan”.
(Sorry, this shouldn’t be directed just at you in particular. I’m annoyed at how often I have to argue against this perception, and this paper happened to prompt me to actually write something.)
Also, I don’t agree that “see if an AIXI-like agent would be aligned” is the correct “gauntlet” to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.
Strongly agree. Another benefit is that it exposes you to a broader swath of the world, which makes your models of the world better / more generalizable. I often feel like the rationalist community has “beliefs about people” that I think only apply to a small subset of people, e.g.
People need to find meaning in their jobs to be happy
Everyone thinks that the thing that they are doing is “good for the world” or “morally right” (as opposed to thinking that the thing they are doing is justifiable / reasonable to do)
I see, so the argument is mostly that jobs are performed more stably and so you can learn better how to deal with the principal-agent problems that arise. This seems plausible.
I don’t think that’s it. The inference I most disagree with is “rationality must have a simple core”, or “Occam’s razor works on rationality”. I’m sure there’s some meaning of “fundamental” or “epistemologically basic” such that I’d agree that rationality has that property, but that doesn’t entail “rationality has a simple core”.
The core of my intuition is that with different optimized AIs, it will be straightforward to determine exactly what the principal-agent problem consists of, and this can be compensated for.
I feel like it is not too hard to determine principal-agent problems with humans either? It’s just hard to adequately compensate for them.
Would you associate “ambitious value learning vs. adequate value learning” with “works in theory vs. doesn’t work in theory but works in practice”?
Potentially. I think the main question is whether adequate value learning will work in practice.
Moreover, there is a core difference between how the cost of brain size grows for humans and for AI (sublinear vs linear).
Actually, I was imagining that for humans the cost of brain size grows superlinearly. The paper you linked uses a quadratic function, and the authors also tried an exponential and found similar results.
But in the world where AI dev faces hardware constraints, social learning will be much more useful.
Agreed if the AI uses social learning to learn from humans, but that only gets you to human-level AI. If you want to argue for something like fast takeoff to superintelligence, you need to talk about how the AI learns independently of humans, and in that setting social learning won’t be useful given linear costs.
E.g. Suppose that each unit of adaptive knowledge requires one unit of asocial learning. Every unit of learning costs $K, regardless of brain size, so that everything is linear. No matter how much social learning you have, the discovery of N units of knowledge is going to cost $KN, so the best thing you can do is put N units of asocial learning in a single brain/model so that you don’t have to pay any cost for social learning.
In contrast, if N units of asocial learning in a single brain costs $KN^2, then having N units of asocial learning in a single brain/model is very expensive. You can instead have N separate brains each with 1 unit of asocial learning, for a total cost of $KN, and that is enough to discover the N units of knowledge. You can then invest a unit or two of social learning for each brain/model so that they can all accumulate the N units of knowledge, giving a total cost that is still linear in N.
I’m claiming that AI is more like the former while this paper’s model is more like the latter. Higher hardware constraints only changes the value of K, which doesn’t affect this analysis.
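To make the arithmetic concrete, here is a toy comparison of the two cost models, with hypothetical numbers (K = 1, N = 100):

```python
# Toy comparison of the two cost models, with hypothetical numbers.
K, N = 1, 100              # cost per unit of learning; units of knowledge

# Linear case (my model of AI): one big brain/model is already optimal,
# so social learning buys you nothing.
one_big_brain_linear = K * N                    # = 100

# Quadratic case (the paper's model): one big brain is very expensive,
# so you split learning across N small brains and pay a small social-
# learning overhead (say ~2 units per brain) to pool the knowledge.
one_big_brain_quadratic = K * N**2              # = 10,000
distributed = K * N + K * 2 * N                 # = 300, still linear in N

print(one_big_brain_linear, one_big_brain_quadratic, distributed)
# Raising hardware constraints only rescales K; it doesn't change
# which strategy wins in each regime.
```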