This is not what I meant by “the same values”, but the comment points towards an interesting point.
When I say “the same values”, I mean the same utility function, as a function over the state of the world (and the states of “R is having sex” and “H is having sex” are different).
The interesting point is that states need to be inferred from observations, and it seems like there are some fundamentally hard issues around doing that in a satisfying way.
Yes, maybe? Elaborating...
I’m not sure how well this fits into the category of “inner optimizers”; I’m still organizing my thoughts on that (aiming to finish doing so within the week...). I’m also not sure that people are thinking about inner optimizers in the right way.
Also, note that the thing being imitated doesn’t have to be a human.
Off the top of my head (OTTMH), I’d say:
This seems more general in the sense that it isn’t some “subprocess” of the whole system that becomes a dangerous planning process.
This seems more specific in the sense that the boldest argument for inner optimizers is, I think, that they should appear in effectively any optimization problem when there’s enough optimization pressure.
Hey, David here!
Just writing to give some context… The point of this session was to discuss an issue I see with “super-human feedback (SHF)” schemes (e.g. debate, amplification, recursive reward modelling) that use helper AIs to inform human judgments. I guess there was more of an inferential gap going into the session than I expected, so for background: let’s consider the complexity theory viewpoint on feedback (as discussed in section 2.2 of “AI safety via debate”). This implicitly assumes that we have access to a trusted (e.g. human) decision-making process (TDMP), sweeping the issues that Stuart mentions under the rug.
Under this view, the goal of SHF is to efficiently emulate the TDMP, accelerating the decision-making. For example, we’d like an agent trained with SHF to be able to quickly (e.g. in a matter of seconds) make decisions that would take the TDMP billions of years to decide. But we don’t aim to change the decisions.
Now, the issue I mentioned is: there doesn’t seem to be any way to evaluate whether the SHF-trained agent is faithfully emulating the TDMP’s decisions on such problems. It seems like, naively, the best we can do is train on problems where the TDMP can make decisions quickly, so that we can use its decisions as ground truth; then we just hope that it generalizes appropriately to the decisions that would take the TDMP billions of years. And the point of the session was to see if people have ideas for less naive experiments that would allow us to increase our confidence that an SHF-scheme would yield safe generalization to these more difficult decisions.
Imagine there are 2 copies of me, A and B. A makes a decision with some helper AIs, and independently, B makes a decision without their help. A and B make different decisions. Who do we trust? I’m more ready to trust B, since I’m worried about the helper AIs having an undesirable influence on A’s decision-making.
...So questions of how to define human preferences or values seem mostly orthogonal to this question, which is why I want to assume them away. However, our discussion did make me consider more that I was making an implicit assumption (and this seems hard to avoid), that there was some idealized decision-making process that is assumed to be “what we want”. I’m relatively comfortable with trusting idealized versions of “behavioral cloning/imitation/supervised learning” (P) or “(myopic) reinforcement learning/preference learning” (NP), compared with the SHF-schemes (PSPACE).
One insight I gleaned from our discussion is the usefulness of disentangling:
an idealized process for *defining* “what we want” (HCH was mentioned as potentially a better model of this than “a single human given as long as they want to think about the decision” (which was what I proposed using, for the purposes of the discussion)).
a means of *approximating* that definition.
From this perspective, the discussion topic was: how can we gain empirical evidence for/against this question: “Assuming that the output of a human’s indefinite deliberation is a good definition of ‘what they want’, do SHF-schemes do a good/safe job of approximating that?”
See the clarifying note in the OP. I don’t think this is about imitating humans, per se.
The more general framing I’d use is WRT “safety via myopia” (something I’ve been working on in the past year). There is an intuition that supervised learning (e.g. via SGD, as is common practice in current ML) is quite safe, because it doesn’t have any built-in incentive to influence the world (resulting in instrumental goals); it just seeks to yield good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but it might also lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).
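The “myopic” structure of the intuition above can be made concrete with a toy sketch (everything here is illustrative, not from the original comment): each SGD step only reduces loss on the present example, and nothing in the update rule rewards the model for influencing which inputs it will see later.

```python
# Toy sketch of myopic supervised learning: the update at each step depends
# only on the current (x, y) pair. There is no term in the objective about
# future inputs, so the training process has no built-in incentive to
# influence the world that generates the data.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)   # linear model parameters
lr = 0.1          # learning rate

def loss_grad(w, x, y):
    """Gradient of squared error on the *present* example only."""
    err = w @ x - y
    return err * x

for _ in range(500):
    x = rng.normal(size=2)
    y = x @ np.array([1.0, -2.0])    # ground-truth labels
    w -= lr * loss_grad(w, x, y)     # myopic update: no planning term

print(np.round(w, 2))  # recovers approximately [1., -2.]
```

The caveat in the comment is that this property of the *loss* doesn’t guarantee the *learned policy* behaves myopically; depending on the task and the outer-loop optimization, the trained system may still act as if it were optimizing something non-myopic.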
I don’t think I was very clear; let me try to explain.
I mean different things by “intentions” and “terminal values” (and I think you do too?).
By “terminal values” I’m thinking of something like a reward function. If we literally just program an AI to have a particular reward function, then we know that its terminal values are whatever that reward function expresses.
Whereas “trying to do what H wants it to do” I think encompasses a broader range of things, such as when R has uncertainty about the reward function, but “wants to learn the right one”, or really just any case where R could reasonably be described as “trying to do what H wants it to do”.
Talking about a “black box system” was probably a red herring.
Comparing with articles from a year ago, e.g. http://www.popsci.com/bill-gates-fears-ai-ai-researchers-know-better, this represents significant progress.
I’m a PhD student in Yoshua’s lab. I’ve spoken with him about this issue several times, and he has moved on this issue, as have Yann and Andrew. From my perspective following this issue, there has been tremendous progress in the ML community’s attitude towards Xrisk.
I’m quite optimistic that such progress will continue, although pessimistic that it will be fast enough and that the ML community’s attitude will be anything like sufficient for a positive outcome.
Which transhumanist ideas are “not even wrong”?
And do you mean simply ‘not well specified enough’? Or more like ‘unfalsifiable’?
You also seem to be implying that scientists cannot discuss topics outside of their field, or even outside its current reach.
My philosophy on language is that people can generally discuss anything. For any words that we have heard (and indeed, many we haven’t), we have some clues as to their meaning, e.g. based on the context in which they’ve been used and similarity to other words.
Also, would you consider being cautious an inherently good thing?
Finally, from my experience as a Masters student in AI, many people are happy to give opinions on transhumanism, it’s just that many of those opinions are negative.
I found it interesting that he doesn’t think we should stop or slow down, but associates his position with Bill Joy, the author of “Why the Future Doesn’t Need Us” (2000), which argued for halting research in genetics, nanotech and robotics.
This is one of my main cruxes. I have 2 main concerns about honest mistakes:
1) Compounding errors: IIUC, Paul thinks we can find a basin of attraction for alignment (or at least corrigibility...) so that an AI can help us correct it online to avoid compounding errors. This seems plausible, but I don’t see any strong reasons to believe it will happen, or that we’ll be able to recognize whether it is happening.
2) The “progeny alignment problem” (PAP): An honest mistake could result in the creation of an unaligned progeny. I think we should expect that to happen quickly if we don’t have a good reason to believe it won’t. You could argue that humans recognize this problem, so an AGI should as well (and if it’s aligned, it should handle the situation appropriately), but that begs the question of how we got an aligned AGI in the first place. There are basically 3 subconcerns here (call the AI we’re building “R”):
2a) R can make an unaligned progeny before it’s “smart enough” to realize it needs to exercise care to avoid doing so.
2b) R gets smart enough to realize that solving PAP (e.g. doing something like MIRI’s AF) is necessary in order to develop further capabilities safely, and that ends up being a huge roadblock that makes R uncompetitive with less safe approaches.
2c) If R has gamma < 1, it could knowingly, rationally decide to build a progeny that is useful through R’s effective horizon, but will take over and optimize a different objective after that.
2b and 2c are *arguably* “non-problems” (although they’re at least worth taking into consideration). 2a seems like a more serious problem that needs to be addressed.
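The arithmetic behind 2c is worth making explicit (this is an illustrative sketch, not from the original comment): with discount factor gamma < 1, a reward t steps in the future is weighted by gamma**t, so anything beyond the “effective horizon” of roughly 1/(1 − gamma) steps contributes almost nothing to R’s objective, and a post-horizon takeover by an unaligned progeny is nearly invisible to R.

```python
# Illustrative: how steeply discounting suppresses post-horizon consequences.
def discounted_weight(gamma: float, t: int) -> float:
    """Weight that a discounted agent places on reward received t steps ahead."""
    return gamma ** t

gamma = 0.99
effective_horizon = 1 / (1 - gamma)      # ~100 steps

print(discounted_weight(gamma, 10))      # ~0.90: near-term rewards matter a lot
print(discounted_weight(gamma, 1000))    # ~4e-5: far-future takeover barely registers
```

So even a fully rational R with gamma = 0.99 could judge that building a useful-for-100-steps but eventually-misaligned progeny is a good trade by its own lights.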
So my original response was to the statement:
Differential research that advances safety more than AI capability still advances AI capability.
Which seems to suggest that advancing AI capability is sufficient reason to avoid technical safety work that has non-trivial overlap with capabilities. I think that’s wrong.
RE the necessary and sufficient argument:
1) Necessary: it’s unclear that a technical solution to alignment would be sufficient, since our current social institutions are not designed for superintelligent actors, and we might not develop effective new ones quickly enough.
2) Sufficient: I agree that never building AGI is a potential Xrisk (or close enough). I don’t think it’s entirely unrealistic “to shoot for levels of coordination like ‘let’s just never build AGI’“, although I agree it’s a long shot. Supposing we have that level of coordination, we could use “never build AGI” as a backup plan while we work to solve technical safety to our satisfaction, if that is in fact possible.
Moving on from that, I’m thinking that we might need a broad base of support from people (depending on the scenario), so being able to explain how people could still have meaningful lives post-AI is important for building that support. So I’ve been thinking about that.
This sounds like it would be useful for getting people to support the development of AGI, rather than effective global regulation of AGI. What am I missing?
Can you give some arguments for these views?
I think the best argument against institution-oriented work is that it might be harder to make a big impact. But more importantly, I think strong global coordination is necessary and sufficient, whereas technical safety is plausibly neither.
I also agree that one should consider tradeoffs, sometimes. But every time someone has raised this concern to me (3x, I think?), it’s been a clear-cut case of “why are you even worrying about that”, which leads me to believe that there are a lot of people who are overconcerned about this.
It seems like the preferences of the AI you build are way more important than its experience (not sure if that’s what you mean).
This is because the AI’s preferences are going to have a much larger downstream impact?
I’d agree, but with the caveat that there may be likely futures which don’t involve the creation of hyper-rational AIs with well-defined preferences, but rather artificial life with messy, incomplete, inconsistent preferences but morally valuable experiences. More generally, the future of the light cone could be determined by societal/evolutionary factors rather than any particular agent or agent-y process.
I found your 2nd paragraph unclear...
the goals happen to overlap enough
Is this referring to the goals of having “AIs that have good preferences” and “AIs that have lots of morally valuable experience”?
Are you funding constrained? Would you give out more money if you had more?
FWIW, I think I represent the majority of safety researchers in saying that you shouldn’t be too concerned with your effect on capabilities; there are many more people pushing capabilities, so most safety research is likely a drop in the capabilities bucket (although there may be important exceptions!).
Personally, I agree that improving social institutions seems more important for reducing AI-Xrisk ATM than technical work. Are you doing that? There are options for that kind of work as well, e.g. at FHI.
Overall, I think the question “which AIs are good successors?” is both neglected and time-sensitive, and is my best guess for the highest impact question in moral philosophy right now.
Interesting… my model of Paul didn’t assign any work in moral philosophy high priority.
I agree this is high impact. My idea of the kind of work to do here is mostly trying to solve the hardish problem of consciousness, so that we can have a more informed guess as to the quantity and valence of experience that different possible futures generate.