I don’t think I’d put it that way (although I’m not saying it’s inaccurate). See my comments RE “safety via myopia” and “inner optimizers”.
Yes, maybe? Elaborating...
I’m not sure how well this fits into the category of “inner optimizers”; I’m still organizing my thoughts on that (aiming to finish doing so within the week...). I’m also not sure that people are thinking about inner optimizers in the right way.
Also, note that the thing being imitated doesn’t have to be a human.
OTTMH, I’d say:
This seems more general in the sense that it isn’t some “subprocess” of the whole system that becomes a dangerous planning process.
This seems more specific in the sense that the boldest argument for inner optimizers is, I think, that they should appear in effectively any optimization problem when there’s enough optimization pressure.
See the clarifying note in the OP. I don’t think this is about imitating humans, per se.
The more general framing I’d use is WRT “safety via myopia” (something I’ve been working on in the past year). There is an intuition that supervised learning (e.g. via SGD as is common practice in current ML) is quite safe, because it doesn’t have any built-in incentive to influence the world (resulting in instrumental goals); it just seeks to yield good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).
Aha, OK. So I either misunderstand or disagree with that.
I think SHF (at least in most examples) has the human as “CEO” with AIs as “advisers”, and thus the human can choose to ignore all of the advice and make the decision unaided.
I think I disagree pretty broadly with the assumptions/framing of your comment, although not necessarily the specific claims.
1) I don’t think it’s realistic to imagine we have “indistinguishable imitation” with an idealized discriminator. It might be possible in the future, and it might be worth considering to make intellectual progress, but I’m not expecting it to happen on a deadline. So I’m talking about what I expect might be a practical problem if we actually try to build systems that imitate humans in the coming decades.
2) I wouldn’t say “decision theory”; I think that’s a bit of a red herring. What I’m talking about is the policy.
3) I’m not sure what link you are trying to make to the “universal prior is malign” ideas. But I’ll draw my own connection. I do think the core of the argument I’m making results from an intuitive idea of what a simplicity prior looks like, and its propensity to favor something more like a planning process over something more like a lookup table.
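To make the simplicity-prior intuition in (3) a bit more concrete, here is a toy sketch (my own illustration, not a claim about any real learning algorithm). It assumes a prior proportional to 2^-(description length in bits), that a lookup table costs log2(#actions) bits per state, and that a compact planner has a fixed description length. The 500-bit figure is a made-up placeholder, and both hypotheses are assumed to fit the data equally well.

```python
import math

# Hypothetical fixed description length (in bits) of a compact planning
# program. The number is an arbitrary placeholder for illustration only.
PLANNER_BITS = 500.0


def lookup_table_bits(n_states: int, n_actions: int) -> float:
    """Bits needed to store one action per state in an explicit table."""
    return n_states * math.log2(n_actions)


def log2_prior_odds(n_states: int, n_actions: int,
                    planner_bits: float = PLANNER_BITS) -> float:
    """log2 of prior(planner) / prior(table) under a 2^-bits prior.

    Positive values mean the prior favors the planner.
    """
    return lookup_table_bits(n_states, n_actions) - planner_bits


if __name__ == "__main__":
    for n_states in (100, 1024, 10**6):
        print(n_states, log2_prior_odds(n_states, n_actions=4))
```

Under these assumptions the table wins for small tasks, but its cost grows linearly with the number of states while the planner's stays fixed, so past some task size the prior favors the planner by an exponentially large factor.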
OK, so it sounds like your argument why SHF can’t do ALD is (a specific, technical version of) the same argument that I mentioned in my last response. Can you confirm?
I intended to make that clear in the “Concretely, I imagine a project around this with the following stages (each yielding at least one publication)” section. The TL;DR is: do a literature review of analytic philosophy research on (e.g.) honesty.
Yes, please try to clarify. In particular, I don’t understand your “|” notation (as in “S|Output”).
I realized that I was a bit confused in what I said earlier. I think it’s clear that (proposed) SHF schemes should be able to do at least as well as a human, given the same amount of time, because they have a human “on top” (as “CEO”) who can simply ignore all the AI helpers(/underlings).
But now I can also see an argument for why SHF couldn’t do ALD, if it doesn’t have arbitrarily long to deliberate: there would need to be some parallelism/decomposition in SHF, and that might not work well/perfectly for all problems.
Regarding the question of how to do empirical work on this topic: I remember there being one thing which seemed potentially interesting, but I couldn’t find it in my notes (yet).
RE the rest of your comment: I guess you are taking issue with the complexity theory analogy; is that correct? An example hypothetical TDMP I used is “arbitrarily long deliberation” (ALD), i.e. a single human is allowed as long as they want to make the decision (I don’t think that’s a perfect “target” for alignment, but it seems like a reasonable starting point). I don’t see why ALD would (even potentially) “do something that can’t be approximated by SHF-schemes”, since those schemes still have the human making a decision.
“Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?” <-- yes, IIUC.
I’d suggest separating these two scenarios, based on the way the comments are meant to work according to the OP.
I actually don’t understand why you say they can’t be fully disentangled.
IIRC, it seemed to me during the discussion that your main objection was around whether (e.g.) “arbitrarily long deliberation (ALD)” was (or could be) fully specified in a way that accounts properly for things like deception, manipulation, etc. More concretely, I think you mentioned the possibility of an AI affecting the deliberation process in an undesirable way.
But I think it’s reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like “manipulation”. So do you disagree? Or is your objection something else entirely?
Hey, David here!
Just writing to give some context… The point of this session was to discuss an issue I see with “super-human feedback (SHF)” schemes (e.g. debate, amplification, recursive reward modelling) that use helper AIs to inform human judgments. I guess there was more of an inferential gap going into the session than I expected, so for background: let’s consider the complexity theory viewpoint on feedback (as discussed in section 2.2 of “AI safety via debate”). This implicitly assumes that we have access to a trusted (e.g. human) decision-making process (TDMP), sweeping the issues that Stuart mentions under the rug.
Under this view, the goal of SHF is to efficiently emulate the TDMP, accelerating the decision-making. For example, we’d like an agent trained with SHF to be able to quickly (e.g. in a matter of seconds) make decisions that would take the TDMP billions of years to decide. But we don’t aim to change the decisions.
Now, the issue I mentioned is: there doesn’t seem to be any way to evaluate whether the SHF-trained agent is faithfully emulating the TDMP’s decisions on such problems. It seems like, naively, the best we can do is train on problems where the TDMP can make decisions quickly, so that we can use its decisions as ground truth; then we just hope that it generalizes appropriately to the decisions that take the TDMP billions of years. And the point of the session was to see if people have ideas for how to do less naive experiments that would allow us to increase our confidence that an SHF-scheme would yield safe generalization to these more difficult decisions.
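Here is a minimal toy sketch of that naive evaluation, just to pin down what it can and cannot tell us. Everything in it is a hypothetical stand-in: `Problem`, `tdmp_cost`, `tdmp_decide`, `shortcut_agent` and the budget are invented for illustration, and the “TDMP” is a trivial function so that the script runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Problem:
    x: int
    tdmp_cost: int  # hypothetical: deliberation steps the TDMP would need


def tdmp_decide(p: Problem) -> int:
    """Toy stand-in for the trusted decision-making process."""
    return p.x % 2


def shortcut_agent(p: Problem) -> int:
    """Toy agent: matches the TDMP on cheap problems, not on costly ones."""
    return p.x % 2 if p.tdmp_cost <= 10 else 0


def naive_eval(agent: Callable[[Problem], int],
               problems: List[Problem],
               budget: int) -> Tuple[Optional[float], int]:
    """Return (agreement on checkable problems, number left unchecked).

    Only problems the TDMP can decide within `budget` have ground truth.
    """
    checkable = [p for p in problems if p.tdmp_cost <= budget]
    n_unchecked = len(problems) - len(checkable)
    if not checkable:
        return None, n_unchecked
    agree = sum(agent(p) == tdmp_decide(p) for p in checkable)
    return agree / len(checkable), n_unchecked


problems = [Problem(x=i, tdmp_cost=i) for i in range(1, 21)]

if __name__ == "__main__":
    print(naive_eval(shortcut_agent, problems, budget=10))
```

In this toy the agent gets perfect measured agreement, while the problems above the budget are never checked at all; in the real setting we could not cheaply call the TDMP on those, which is exactly the gap I’m worried about.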
Imagine there are 2 copies of me, A and B. A makes a decision with some helper AIs, and independently, B makes a decision without their help. A and B make different decisions. Who do we trust? I’m more ready to trust B, since I’m worried about the helper AIs having an undesirable influence on A’s decision-making.
...So questions of how to define human preferences or values seem mostly orthogonal to this question, which is why I want to assume them away. However, our discussion did make me more aware that I was making an implicit assumption (one that seems hard to avoid): that there is some idealized decision-making process that is assumed to be “what we want”. I’m relatively comfortable with trusting idealized versions of “behavioral cloning/imitation/supervised learning” (P) or “(myopic) reinforcement learning/preference learning” (NP), compared with the SHF-schemes (PSPACE).
One insight I gleaned from our discussion is the usefulness of disentangling:
1) an idealized process for *defining* “what we want”. (For the purposes of the discussion, I proposed using “a single human given as long as they want to think about the decision”; HCH was mentioned as potentially a better model of this.)
2) a means of *approximating* that definition.
From this perspective, the discussion topic was: how can we gain empirical evidence for/against this question: “Assuming that the output of a human’s indefinite deliberation is a good definition of ‘what they want’, do SHF-schemes do a good/safe job of approximating that?”
So I discovered that Paul Christiano already made a very similar distinction to the holistic/parochial one here:
ambitious ~ holistic
narrow ~ parochial
Someone also suggested simply using general/narrow instead of holistic/parochial.
Has it been rolled out yet? I would really like this feature.
RE spamming: certainly they can be disabled by default, and you can have an unsubscribe button at the bottom of every email?
I view this as a capability control technique, highly analogous to running a supervised learning algorithm where a reinforcement learning algorithm is expected to perform better. Intuitively, it seems like there should be a spectrum of options between (e.g.) supervised learning and reinforcement learning that would allow one to make more fine-grained safety-performance trade-offs.
I’m very optimistic about this approach of doing “capability control” by making less agent-y AI systems. If done properly, I think it could allow us to build systems that have no instrumental incentives to create subagents (although we’d still need to worry about “accidental” creation of subagents and (e.g. evolutionary) optimization pressures for their creation).
I would like to see this fleshed out as much as possible. This idea is somewhat intuitive, but it’s hard to tell if it is coherent, or how to formalize it.
P.S. Is this the same as “platonic goals”? Could you include references to previous thought on the topic?
I realized it’s unclear to me what “trying” means here, and in your definition of intentional alignment. I get the sense that you mean something much weaker than MIRI does by “(actually) trying”, and/or that you think this is a lot easier to accomplish than they do. Can you help clarify?
It seems like you are referring to daemons.
To the extent that daemons result from an AI actually doing a good job of optimizing the right reward function, I think we should just accept that as the best possible outcome.
To the extent that daemons result from an AI doing a bad job of optimizing the right reward function, that can be viewed as a problem with capabilities, not alignment. That doesn’t mean we should ignore such problems; it’s just out of scope.
Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from “has X as explicit terminal goal” to “is actually trying to achieve X.”
That seems like the wrong way of phrasing it to me. I would put it like “MIRI wants to figure out how to build properly ‘consequentialist’ agents, a capability they view us as currently lacking”.
Can you please explain the distinction more succinctly, and say how it is related?
I don’t think I was very clear; let me try to explain.
I mean different things by “intentions” and “terminal values” (and I think you do too?)
By “terminal values” I’m thinking of something like a reward function. If we literally just program an AI to have a particular reward function, then we know that its terminal values are whatever that reward function expresses.
Whereas “trying to do what H wants it to do” I think encompasses a broader range of things, such as when R has uncertainty about the reward function, but “wants to learn the right one”, or really just any case where R could reasonably be described as “trying to do what H wants it to do”.
Talking about a “black box system” was probably a red herring.
Another way of putting it: A parochially aligned AI (for task T) needs to understand task T, but doesn’t need to have common sense “background values” like “don’t kill anyone”.
Narrow AIs might require parochial alignment techniques in order to learn to perform tasks that we don’t know how to write a good reward function for. And we might try to combine parochial alignment with capability control in order to get something like a genie without having to teach it background values. When/whether that would be a good idea is unclear ATM.