LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Charlie Steiner
So, the maximally impractical but also maximally theoretically rigorous answer here is AIXI-tl.
An almost as impractical answer would be Markov chain Monte Carlo search for well-performing huge neural nets on some objective.
I say MCMC search because I’m confident that there’s some big neural nets that are good at navigating the real world, but any specific efficient training method we know of right now could fail to scale up reliably. Instability being the main problem, rather than getting stuck in local optima.
Dumb but thorough hyperparameter search and RL on a huge neural net should also work. Here we’re adding a few parts of “I am confident in this because of empirical data about the historical success of scaling up neural nets trained with SGD” to arguments that still mostly rest on “I am confident because of mathematical reasoning about what it means to get a good score at an objective.”
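To make the “search rather than gradient training” picture concrete, here is a minimal, purely illustrative sketch of Metropolis-Hastings search over the weights of a tiny network on a toy objective (fitting sin(x)). Everything here (the network size, the objective, the temperature) is an assumption for illustration; it’s not a practical training method, just the skeleton of “dumb search for well-performing parameters plus an objective suffices in principle.”

```python
# Illustrative only: a Metropolis-Hastings random walk over the parameters of
# a tiny MLP, accepting proposals according to their score on a toy objective.
# The actual claim is about huge nets and real-world objectives.
import numpy as np

rng = np.random.default_rng(0)

def init_params(sizes):
    return [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for w in params[:-1]:
        x = np.tanh(x @ w)
    return x @ params[-1]

def objective(params, xs, ys):
    # Negative mean squared error: higher is better.
    return -np.mean((forward(params, xs) - ys) ** 2)

# Toy task: fit y = sin(x).
xs = np.linspace(-3, 3, 64).reshape(-1, 1)
ys = np.sin(xs)

params = init_params([1, 16, 1])
score = objective(params, xs, ys)
temperature = 1e-3  # how willing the search is to accept worse proposals

for step in range(20000):
    proposal = [w + rng.normal(0, 0.02, w.shape) for w in params]
    new_score = objective(proposal, xs, ys)
    # Metropolis rule: always accept improvements, sometimes accept regressions.
    if np.log(rng.uniform()) < (new_score - score) / temperature:
        params, score = proposal, new_score

print("final objective:", score)
```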
Thanks, this was interesting.
I couldn’t really follow along with my own probabilities because things started wild from the get-go. You say we need to “invent algorithms for transformative AI,” when in fact we already have algorithms that are in-principle general, they’re just orders of magnitude too inefficient, but we’re making gradual algorithmic progress all the time. Checking the pdf, I remain confused about your picture of the world here. Do you think I’m drastically overstating the generality of current ML and the gradualness of algorithmic improvement, such that currently we are totally lacking the ability to build AGI, but after some future discovery (recognizable on its own merits and not some context-dependent “last straw”) we will suddenly be able to?
And your second question is also weird! I don’t really understand the epistemic state of the AI researchers in this hypothetical. They’re supposed to have built something that’s AGI, it just learns slower than humans. How did they get confidence in this fact? I think this question is well-posed enough that I could give a probability for it, except that I’m still confused about how to conditionalize on the first question.
The rest of the questions make plenty of sense, no complaints there.
In terms of the logical structure, I’d point out that inference costs staying low, producing chips, and producing lots of robots are all definitely things that could be routes to transformative AI, but they’re not necessary. The big alternate path missing here is quality. An AI that generates high-quality designs or plans might not have a human equivalent, in which case “what’s the equivalent cost at $25 per human hour” is a wrong question. Producing chips and producing robots could also happen or not happen in any combination and the world could still be transformed by high-quality AI decision-making.
Lots of claims have been scrutinized fairly intensely by governments. Was it the Chilean military that spent a couple of years investigating a UFO sighting and eventually went public saying it was unexplainable? Sadly, this effort provides little increase in reliability. The investigators are often doing this for the first time and lack key skills for analyzing the data. This is exacerbated by the fact that governments are large enough to allow for selection effects, where the people spending effort investigating UFOs are self-selected for thinking they’re really important, i.e. aliens.
Hm. I’m sure plenty of people could do a fine job, myself included. But if every such person jumped in, it would be a mess. I assume that if Stuart Russell was the right person for the job, the job would already be over. Plausibly ditto Eliezer.
Rob Miles might be the obvious person for explaining things well. I totally endorse him doing attention-getting things I wouldn’t endorse for people like me.
Also probably fine would be people optimized a little more for AI work than for explaining things. Paul Christiano may be the Schelling-point tip of the iceberg of people kinda doing Paul-like things; trading off even more toward AI work, Yoshua Bengio looks like he might be a solid choice.
A framing I’ve been thinking about recently is AutoGPT. Obviously it’s not very good at navigating the world, but my point is actually about humans: the first thing people asked AutoGPT was simple tests like “fix this code” or “make a plan for an ad campaign.” Soon after, the creator told it to “help humanity.” A few days after that, someone else told it to “destroy humanity.” I think this is a good way of dividing up the discussion of whether AI poses an existential threat. Taken backwards:
There’s the sort of risk where a bad actor tells some real-world-navigating AI to destroy humanity. What factors would have to go wrong for them to succeed? This is a good frame question to talk about whether we expect AI to be a powerful technology at all, and how we expect the timescale of progress to compare to the timescales of diffusion of technology and adaptation to technology.
There’s the sort of risk where someone tells an AI to help humanity, and it goes wrong. Why would it go wrong? Well, human values are complicated and often fragile. This is a good time to talk about what the state of the art is for getting computers to just “do what humans mean” and why that state of the art is lacking. The failure mode that shows up repeatedly is finding unintended optima, and this gets even worse when trying to generalize to totally unseen problems.
For most people you just need to stop at two, but the third category is also something people think about. Is there a risk from giving an AI a safe-sounding objective like “fix this code” or “run my ad campaign?” This is a good jumping off point for talking about instrumental goals, the progress we’ve made in the last few years on “artificial common sense” and how far we still have to go, and mesa-optimization that might cause RL to generalize poorly.
Lacking access to the other’s hardware, I think you’d need something that’s easy to compute for an honest AI but hard to compute for a deceptive AI. The trouble is that a deceptive AI could always just simulate an honest AI, so how do you distinguish the simulation from the real thing?
The only way I can think of is resource constraints. Deception adds a slight overhead when calculating quantities that depend in detail on the state of the AI. If you know the other AI’s computational capabilities very precisely, and you can time your communications with it, then maybe it can compute something for you that you can later verify implies honesty.
There are plenty of good posts that contradict a “strict” orthogonality thesis by showing correlation between capabilities and various values-related properties (scaling laws / inverse scaling laws).
What really gets you downvoted is claiming that super-intelligent AI cannot want things that are bad for humanity, or even agitating for that idea to be given serious weight.
What also gets you downvoted is the in-between claim that all the scaling laws tend towards superhuman morality and everything will work out fine, no need to be worried or spend lots of hours working.
How to make a successful piece in the latter categories? Simple—just be right, for communicable reasons. Simple, but maybe not possible.
If you start with an AI that makes decisions of middling quality, how well can you get it to make high-quality decisions by ablating neurons associated with bad decisions? This is the central thing I expect to have diminishing returns (though it’s related to some other uses of interpretability that might also have diminishing returns).
If you take a predictive model of chess games trained on human play, it’s probably not too hard to get it to play near the 90th percentile of the dataset. But it’s not going to play as well as Stockfish almost no matter what you do. The AI is a bit flexible, especially along directions where the training data has prediction-relevant variation, but it’s not arbitrarily flexible, and once you’ve changed the few most important neurons the remaining neurons will be progressively less important. I expect this to show up for all sorts of properties (e.g. moral quality of decisions), not just chess skill.
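For concreteness, here is a minimal PyTorch sketch of what “ablating neurons associated with bad decisions” looks like mechanically: zeroing out a chosen set of hidden units with a forward hook. The model, the hooked layer, and the neuron indices are hypothetical placeholders; in practice the hard part is identifying which neurons to ablate, and the point above is that this lever runs out of travel.

```python
# Hypothetical example: ablate a few hidden units of a small network by
# zeroing their activations on every forward pass.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

neurons_to_ablate = [3, 17, 42]  # assumed (not derived) to track the bad behavior

def ablate(module, inputs, output):
    # Zero the selected hidden units; returning a tensor replaces the layer's output.
    output = output.clone()
    output[:, neurons_to_ablate] = 0.0
    return output

handle = model[1].register_forward_hook(ablate)  # hook the ReLU's output

x = torch.randn(8, 32)
print(model(x).shape)  # the ablated model still runs; behavior shifts, capability may drop

handle.remove()  # undo the ablation
```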
Yes, preserving the existence of multiple good options that humans can choose between using their normal reasoning process sounds great. Which is why an AI that learns human values should learn that humans want the universe to be arranged in such a way.
I’m concerned that you seem to be saying that problems of agency are totally different from learning human values, and have to be solved in isolation. The opposite is true—preferring agency is a paradigmatic human value, and solving problems of agency should only be a small part of a more general solution.
I am shocked that higher quality training data based on more effortful human feedback produced a better result.
Consider the computational difficulty of intrinsic vs. extrinsic alignment for a chess-playing AI.
Suppose you want the AI to walk its king to the center of the board before winning. With intrinsic alignment, this is a little tricky to encode but not too hard. With extrinsic alignment, this requires vastly outsmarting the chess-playing AI so that you can make it dance to your tune—maybe humans could do it to a 500-Elo chess bot, but past 800 Elo I think I’d only be able to solve the problem by building a second chess engine that was intrinsically aligned, to extrinsically align the first one.
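To make the comparison concrete, here is a minimal sketch of the intrinsic-alignment case using the python-chess library: the side objective (king centrality) is folded directly into the engine’s own evaluation function. The material-count evaluation and the 0.5 weight are illustrative assumptions, not a real engine; the extrinsic-alignment case has no comparably short sketch, which is roughly the point.

```python
# Illustrative "intrinsic alignment": the side goal lives inside the engine's
# evaluation function rather than being imposed from outside.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def king_centrality(board: chess.Board, color: chess.Color) -> float:
    """0.0 at a corner, 3.0 on the four central squares."""
    square = board.king(color)
    file, rank = chess.square_file(square), chess.square_rank(square)
    return 3.5 - max(abs(file - 3.5), abs(rank - 3.5))

def evaluate(board: chess.Board, color: chess.Color) -> float:
    material = sum(PIECE_VALUES[piece.piece_type] * (1 if piece.color == color else -1)
                   for piece in board.piece_map().values())
    # The "intrinsic" part: the engine's objective now includes king centrality.
    return material + 0.5 * king_centrality(board, color)

print(evaluate(chess.Board(), chess.WHITE))  # 0.0 in the starting position
```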
A nice exposition.
For myself I’d prefer the same material much more condensed and to-the-point, but I recognize that there are publication venues that prefer more flowing text.
E.g. compare
We turn next to the laggard. Compared to the fixed roles model, the laggard’s decision problem in the variable roles model is more complex primarily in that it must now consider the expected utility of attacking as opposed to defending or pursuing other goals. When it comes to the expected utility of defending or pursuing other goals, we can simply copy the formulas from Section 7. To calculate the laggard’s expected utility of attacking, however, we must make two changes to the formula that applies to the leader. First, we must consider the probability that choosing to attack rather than defend will result in the laggard being left defenseless if the leader executes an attack. Second, as we saw, the victory condition for the laggard’s attack requires that AT + LT < DT. Formally, we have:
to
The laggard now has the same decisions as the leader, unlike the fixed roles model. However, the laggard must consider that attacking may leave them defenseless if the leader attacks. Also, of course, the victory conditions for attack and defense have the lag time on the other side.
Two suggestions for things to explore:
People often care about the Nash equilibrium of games. For the simple game with perfect information this might be trivial, but it’s at least a little interesting with imperfect information (see the toy sketch after the second suggestion).
Second, what about bargaining? Attacking and defending are costly, and AIs might be able to make agreements that they literally cannot break, essentially turning a multipolar scenario into a unipolar scenario where the effective goals are achieving a Pareto optimum of the original goals. Which Pareto optimum exactly will depend on things like the available alternatives, i.e. the power differential. I’m not super familiar with the bargaining literature, so I can’t point you at great academic references, just blog posts.
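As a toy version of the first suggestion, here is a sketch that enumerates the Nash equilibria of a made-up 2x2 leader/laggard attack/defend game using the nashpy library. The payoff numbers are invented for illustration; with these particular numbers there is no pure equilibrium, so the enumeration returns a mixed one (leader mixes 1/2–1/2, laggard mixes 3/4–1/4).

```python
# Toy simultaneous-move game: rows are the leader's actions, columns the laggard's.
# Payoffs are invented; the point is only that the equilibrium can come out mixed.
import numpy as np
import nashpy as nash

#                         laggard: Attack, Defend
leader_payoffs = np.array([[2, 0],   # leader: Attack
                           [1, 3]])  # leader: Defend
laggard_payoffs = np.array([[1, 3],
                            [2, 0]])

game = nash.Game(leader_payoffs, laggard_payoffs)
for leader_strategy, laggard_strategy in game.support_enumeration():
    print("leader mixes:", leader_strategy, "laggard mixes:", laggard_strategy)
```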
My thoughts on the strategy: I think it’s overly optimistic. The picture where you have ten AGIs and exactly one of them is friendly is unlikely because of the logistic success curve. And if the heterogeneity of the AGIs is due to heterogeneity of humans (maybe Facebook builds one AI and Google builds the other, or maybe there are good open-source AI tools that let lots of individuals build AGIs around the same time) rather than to stochasticity of outcomes given humanity’s best AGI designs, why would the lab building the unfriendly AGI also use your safeguard interventions?
I also expect that more realistic models will increasingly favor the leader, as they can bring to bear information and resources in a way that doesn’t just look like atomic “Attack” or “Defend” actions. This isn’t necessarily bad, but it definitely makes it more important to get things right on the first try.
There is a causal relationship between time on LW and frequency of paragraph breaks :P
Anyhow, I broadly agree with this comment, but I’d say it’s also an illustration of why interpretability has diminishing returns and we really need to also be doing “positive alignment.” If you just define some bad behaviors and ablate neurons associated with those bad behaviors (or do other things like filter the AI’s output), this can make your AI safer but with ~exponentially diminishing returns on the selection pressure you apply.
What we’d also like to be doing is defining good behaviors and helping the AI develop novel capabilities to pursue those good behaviors. This is trickier because maybe you can’t just jam the internet at self-supervised learning to do it, so it has more bits that look like the “classic” alignment problem.
I would claim that an army of robots based on ASIs will generally lose to an army of robots based on true AGI.
The truly optimal war-winning AI would not need to question its own goal to win the war, presumably.
Would you agree that an AI that is maximizing paperclips is making an intellectual mistake?
No. I think that’s anthropomorphism—just because a certain framework of moral reasoning is basically universal among humans, doesn’t mean it’s universal among all systems that can skillfully navigate the real world. Frameworks of moral reasoning are on the “ought” side of the is-ought divide.
The idea has a broader consequence for AI safety. While a paperclip maximizer might be designed as part of paperclip-maximizer research, it probably will not arise spontaneously from intelligence research in general. Even making one would probably be considered an immoral request by an AGI.
This doesn’t follow.
You start the post by saying that the most successful paperclip maximizer (or indeed the most successful AI at any monomaniacal goal) wouldn’t doubt its own goals, and in fact doesn’t even need the capacity to doubt its own goals. And since you care about this, you don’t want to call something that can’t doubt its own goals “AGI.”
This is a fine thing to care about.
Unfortunately, most people use “AGI” to mean an AI that can solve lots of problems in lots of environments (with somewhere around human broadness and competence being important), and this common definition includes some AIs that can’t question their own final goals, so long as they’re competent at lots of other things. So I don’t think you’ll have much luck changing people’s minds on how to use the term “AGI.”
Anyhow, point is, I agree that “best at being dangerous to humans” implies “doesn’t question itself”. But from this you cannot conclude that NOT “doesn’t question itself” implies NOT “is dangerous to humans”. It might not be the best at being dangerous to humans, but you can still make an AI that’s dangerous to humans and that also questions itself.
Yes, obviously we are trying to work on how to get an AI to do good ethical reasoning. But don’t get it twisted—reasoning about goals is more about goals than about general-purpose reasoning. An AI that wants to do things that are bad for humans is not making an intellectual mistake.
Do you think that human theorists are near the limit of what kind of approximations we should use to calculate the band structure of diamond (and therefore a superintelligent AI couldn’t outsmart human theorists by doing their job better)? Like if you left physics to stew for a century and came back, we’d still be using the GW approximation?
This seems unlikely to me, but I don’t really know much about DFT (I was an experimentalist). Maybe there are so few dials to turn that picking the best approximation for diamond is an easy game. Intuitively I’d expect that if a clever theorist knew that they were trying to just predict the band structure of diamond (but didn’t know the answer ahead of time), there are bespoke things they could do to try to get a better answer (abstract reasoning about what factors are important, trying to integrate DFT and a tight binding model, something something electron phonon interactions), and that is effectively equivalent to an efficient approximation that beats DFT+GWA.
Definitely we’re still making progress for more interesting materials (e.g. cuprates) - or at least people are still arguing. So even if we really can’t do better than what we have now for diamond, we should still expect a superintelligent AI to be better at numerical modeling for lots of cases of interest.
More or less.
Is this good news? Yes.
Is this strong evidence that we don’t need to work hard on AI safety? No.
Are elements of the simple generative-model-finetuning paradigm going to be reused to ensure safety of superintelligent AI (conditional on things going well)? Maybe, maybe not. I’d say that the probability is around 30%. That’s pretty likely in the grand scheme of things! But it’s even more likely that we’ll use new approaches entirely and the safety guardrails on GPT-4 will be about as technologically relevant to superintelligent AI as the safety guardrails on industrial robots.
I feel like it’s 4 ~ 1 > 2 > 3. The example of CNNs seems like this, where the artificial neural networks and actual brains face similar constraints and wind up with superficially similar solutions, but when you look at all the tricks that CNNs use (especially weight-sharing, but also architecture choices, choice of optimizer, etc.) they’re not actually very biology-like, and were developed based on abstract considerations more than biological ones.
Thanks! It seems like most of your exposure has been through Eliezer? Certainly impressions like “why does everyone think the chance of doom is >90%?” only make sense in that light. Have you seen presentations of AI risk arguments from other people like Rob Miles or Stuart Russell or Holden Karnofsky, and if so do you have different impressions?
It’s related in that you’re all talking about maintaining some parts of the status quo, but I think the instrumental technologies (human-directed services vs. agential AIs that directly care about maintaining status-quo boundaries) are pretty different, as are all the arguments related to those technologies.