This post discusses every major alignment organization and summarizes each one’s mission to some extent.
I’m not Rohin, but I think there’s a tendency to reply to things you disagree with rather than things you agree with. That would explain my emphasis anyway.
For the Alignment Newsletter:
Previously, Paul Christiano proposed creating an adversary to search for inputs that would make a powerful model behave “unacceptably” and then penalizing the model accordingly. To make the adversary’s job easier, Paul relaxed the problem so that it only needed to find a pseudo-input, which can be thought of as a predicate that constrains possible inputs. This post expands on Paul’s proposal by first defining a formal unacceptability penalty and then analyzing a number of scenarios in light of this framework. The penalty relies on the idea of an amplified model inspecting an unamplified version of itself. For this procedure to work, amplified overseers must be able to correctly deduce whether potential inputs would yield unacceptable behavior in their unamplified selves, which seems plausible since the amplified version should know everything the unamplified version does. The post concludes by arguing that progress in model transparency is key to these acceptability guarantees. In particular, Evan emphasizes the need to decompose models into the parts involved in their internal optimization processes, such as their world models, optimization procedures, and objectives.
I agree that transparency is an important condition for the adversary, since it would be hard to search for catastrophe-inducing inputs without details of how the model operates. I’m less certain that this particular decomposition of machine learning models is necessary. More generally, I am excited to see how adversarial training can help with inner alignment.
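For concreteness, here is one rough way to write down the kind of penalized objective being discussed (the notation here is my own shorthand, not the post’s actual definitions):

$$\min_M \; \mathbb{E}_{x \sim \mathcal{D}} \big[ L(M, x) \big] \;+\; \lambda \cdot \Pr_{\mathrm{Amp}(M)} \big[\, \exists\, \alpha \in \mathcal{A} : M \text{ behaves unacceptably on some input satisfying } \alpha \,\big]$$

where $L$ is the ordinary training loss, $\mathcal{A}$ is the set of pseudo-inputs (predicates over inputs), $\mathrm{Amp}(M)$ is the amplified overseer producing the probability estimate, and $\lambda$ trades off task performance against the acceptability guarantee.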
AlphaGo seems much closer to “one project leaps forward by a huge margin.”
I don’t have the data on hand, but my impression was that AlphaGo indeed represented a discontinuity in the domain of Go. It’s difficult to say why this happened, but my best guess is that DeepMind invested a lot more money into solving Go than any competing actor at the time. Therefore, the discontinuity may have followed straightforwardly from a background discontinuity in attention paid to the task.
If this hypothesis is true, I don’t find AlphaGo compelling as evidence of a discontinuity for AGI, since such funding gaps are likely to be much smaller for economically useful systems.
My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem.
Especially because taking over the world requires you to be much better than other agents who want to stop you from taking over the world, which could very well include other AIs.
ETA: That said, upon reflection, there have been instances of people taking over large parts of the world without being superhuman. All world leaders qualify, and it isn’t that unusual. However, what would be unusual is if someone wanted to take over the world, everyone else opposed it, and it still happened.
I agree with this, but when I said deployment I meant deployment of a single system, not several.
I’m confused about why (1) and (2) are separate scenarios, then. Perhaps because in (2) there are many different types of AIs?
For the Alignment Newsletter:
Planned summary: It is difficult to measure the usefulness of various alignment approaches without clearly understanding what type of future they end up being useful for. This post collects “Success Stories” for AI—disjunctive scenarios in which alignment approaches are leveraged to ensure a positive future. Whether these scenarios come to pass will depend critically on background assumptions, such as whether we can achieve global coordination, or solve the most ambitious safety issues. Mapping these success stories can help us prioritize research.
Planned opinion: This post does not exhaust the possible success stories, but it gets us a lot closer to being able to look at a particular approach and ask, “Where exactly does this help us?” My guess is that most research ends up being only minimally helpful for the long run, and so I consider inquiry like this to be very useful for cause prioritization.
I’m finding it hard to see how we could get (1) without some discontinuity?
When I think about why (1) would be true, the argument that comes to mind is that single AI systems will be extremely expensive to deploy, which means that only a few very rich entities could own them. However, this would contradict the general trend of ML systems being hard to train and easy to deploy: unlike, say, nukes, once you’ve trained your AI you can create many copies and distribute them widely.
It’s worth noting that I wasn’t using it as evidence “for” continuous takeoff. It was instead an example of something which experienced a continuous takeoff that nonetheless was quick relative to the lifespan of a human.
It’s hard to argue that it wasn’t continuous under my definition, since the papers got gradually and predictably better. Perhaps there was an initial discontinuity in 2014 when it first became a target? Regardless, I’m not arguing that this is a good model for AGI development.
It might just join the ranks of “projects from before”, and subtly try to alter future systems to be similarly defective, waiting for a future opportunity to strike.
Admittedly, I did not explain this point well enough. What I meant to say was that before we see the first successful defection, we’ll see some failed defections. If a system could indefinitely hide its private intention to defect later, then I would already consider that a ‘successful defection.’
Knowing about a failed defection, we’ll learn from our mistake and patch it in future systems. To be clear, I’m definitely not endorsing this as a normative standard for safety.
I agree with the rest of your comment.
When I think about what I find morally valuable about consciousness, I tend to think about rich experiences that I rate internally as negative or positive. An example of a negatively valued conscious experience is the sharp pain of being pricked by a pin. An example of a valuable conscious experience is the sensation of warmth from sitting near a fire on a cold winter day, together with the way my brain processes the situation and enjoys it.
These things appear to me subtly distinct from the feeling of inner awareness called ‘consciousness’ in this post.
once CO2 gets high enough, it starts impacting human cognition.
Do you have a citation for this being a big deal? I’m really curious whether this is a major harm over reasonable timescales (such as 100 years), as I don’t recall ever hearing about it in an EA analysis of climate change. That said, I haven’t looked very hard.
This post lays out several experiments that could clarify the inner alignment problem: the problem of how to get an ML model to be robustly aligned with the objective function it was trained on. One example experiment is giving an RL-trained agent direct access to its reward as part of its observation. During testing, we could put the model in a confusing situation by altering its observed reward so that it doesn’t match the real one. The hope is that we could gain insight into when RL-trained agents internally represent ‘goals’ and how those goals relate to the environment, if they exist at all. You’ll have to read the post to see all the experiments.
I’m currently convinced that doing empirical work right now will help us understand mesa optimization, and this was one of the posts that led me to that conclusion. I’m still a bit skeptical that current techniques are sufficient to demonstrate the type of powerful learned search algorithm that could characterize the worst inner alignment failures. Regardless, I think classifying failure modes is quite beneficial at this point, and conducting tests like the ones in this post will make that a lot easier.
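To make the reward-observation experiment summarized above concrete, here is a minimal sketch of what it could look like, assuming the gymnasium API; the wrapper and all of its names are hypothetical illustrations of mine, not code from the post:

```python
# A sketch (not from the post) of the "reward in the observation" experiment:
# a wrapper that appends the previous step's reward to the observation, plus
# a test-time switch that shows the agent a fake reward while the trainer
# still optimizes the true one.
import numpy as np
import gymnasium as gym


class RewardInObsWrapper(gym.Wrapper):
    """Appends the previous step's (possibly fake) reward to the observation."""

    def __init__(self, env, fake_reward_fn=None):
        super().__init__(env)
        # fake_reward_fn maps the true reward to the reward the agent *sees*;
        # None means the agent sees the true reward (the training setting).
        self.fake_reward_fn = fake_reward_fn
        low = np.append(env.observation_space.low, -np.inf).astype(np.float32)
        high = np.append(env.observation_space.high, np.inf).astype(np.float32)
        self.observation_space = gym.spaces.Box(low=low, high=high)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # No reward yet at reset, so show a placeholder of 0.
        return np.append(obs, 0.0).astype(np.float32), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        shown = reward if self.fake_reward_fn is None else self.fake_reward_fn(reward)
        # The agent observes `shown`, but the trainer still optimizes `reward`.
        obs = np.append(obs, shown).astype(np.float32)
        return obs, reward, terminated, truncated, info


# Training: the observed reward matches the true reward.
train_env = RewardInObsWrapper(gym.make("CartPole-v1"))

# Testing: the observed reward is deliberately mismatched (here, negated).
test_env = RewardInObsWrapper(gym.make("CartPole-v1"), fake_reward_fn=lambda r: -r)
```

The design choice in this sketch is that only the agent’s observation is corrupted at test time, while the environment’s true reward is untouched; that mismatch is what would let you probe whether the agent learned to pursue the observed signal or something else.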
It seems to me that the implied point is that solar radiation management is a solution that sounds good without actually being good. That is, it’s an easy sell for fossil fuel corporations, which have an interest in offering simple fixes rather than removing the root cause and thus solving the issue completely. I have little idea whether this argument is actually true.
If they leave an “unclear” react, I can’t ignore that nearly as easily—wait, which point was unclear? What are other people potentially missing that I meant to convey? Come back, anon!
Maybe there should be an option that allows you to highlight a part of the comment and react to that part in particular.
So my feeling is that in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning
Optimizing for a ‘slightly off’ utility function might be catastrophic, so the margin for error in value learning could be narrow. However, it seems plausible that if your impact measure used slightly incorrect utility functions to define the auxiliary set, this would not cause a similarly catastrophic error. Thus, it seems intuitive to me that impact measures could work with less progress on value learning than a full solution would require.
From the AUP paper,
one of our key findings is that AUP tends to preserve the ability to optimize the correct reward function even when the correct reward function is not included in the auxiliary set.
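For reference, the AUP penalty has roughly the following shape (my paraphrase of the paper’s notation): the penalty for taking action $a$ in state $s$ is the total shift in attainable utility, measured against a no-op action $\varnothing$, summed over the auxiliary reward functions $R_i$:

$$\text{Penalty}(s, a) \;=\; \sum_{i=1}^{|\mathcal{R}|} \big|\, Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \,\big|$$

Because the penalty depends only on changes in the $Q$-values of the auxiliary functions, a slightly wrong auxiliary set shifts how impact is measured rather than what the agent directly optimizes, which is one way to see why the quoted finding might hold.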
The problem with the first part of this sequence is that it can seem… obvious… until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility).
Agreed. This has been my impression from reading previous work on impact.