Member of technical staff at METR.
Previously: MIRI → interp with Adrià and Jason → METR.
I have signed no contracts or agreements whose existence I cannot mention.
I’m not a chess player (I’ve played maybe 15 normal games of chess ever) and tried playing LeelaPieceOdds on the BBNN setting. When LeelaQueenOdds was released, I lost at Q odds several times before giving up; this time it was really fun! I played nine times and stalemated it once before finally winning, taking about 40 minutes. My sense is that information I’ve absorbed from chess books, chess streamers, and the like was significantly helpful, e.g. avoid mistakes, try to trade when ahead in material, develop pieces, keep pieces defended.
I think the lesson is that a superhuman search over a large search space is much more powerful than one over a small space. With BBNN odds, Leela only has a queen and two rooks; after sacrificing some material to solidify and trade one of them, I’m still up 7 points, and Leela won’t have enough material to miraculously slip out of every trade until I blunder. By an endgame of, say, KRNNB vs KR there are only a small number of possible moves for Leela, and I can just check that I’m safe against each one until I win. I’d probably lose when given QN or QR, because Leela having two more pieces would increase the required ratio of simplifications to blunders.
Donated the max to both. I can believe there’s more marginal impact for Bores, but on an emotional level, his proximity, YIMBY work, and higher probability of winning make me very excited about Wiener.
While the singularity doesn’t have a reference class, benchmarks do have a reference class—we have enough of them that we can fit reasonable distributions on when benchmarks will reach 50%, be saturated, etc., especially if we know the domain. The harder part is measuring superintelligence with benchmarks.
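To make “fit reasonable distributions” concrete, here’s a minimal sketch of the kind of thing I mean; the saturation times below are hypothetical placeholders, not real benchmark data:

```python
import numpy as np
from scipy import stats

# Hypothetical months-from-release until each past benchmark was saturated.
months_to_saturation = np.array([14, 20, 26, 9, 31, 17, 23, 12])

# Fit a lognormal (location fixed at 0) to the historical saturation times.
shape, loc, scale = stats.lognorm.fit(months_to_saturation, floc=0)
dist = stats.lognorm(shape, loc, scale)

# Median prediction and an 80% interval for a new benchmark in the same domain.
print(f"median: {dist.median():.1f} months")
print(f"80% interval: {dist.ppf(0.1):.1f} to {dist.ppf(0.9):.1f} months")
```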
Do games between top engines typically end within 40 moves? It might be that an optimal player’s occasional wins against an almost-optimal player come from deliberately extending and complicating the game to create chances.
Does this meaningfully reduce the probability that you jump out of the way of a car or get screened for heart disease? The important thing isn’t whether you have an emotional fear response, but how the avoidance behavior pattern generalizes.
Much of my hope is that by the time we reach a superintelligence level where we need to instill reflectively endorsed values to optimize towards in a very hands-off way rather than just constitutions, behaviors, or goals, we’ll have figured something else out. I’m not claiming the optimizer advantage alone is enough to be decisive in saving the world.
To the point about tighter feedback loops, I see the main benefit as being in conjunction with adapting to new problems. Suppose we notice AIs taking some bad but non-world-ending action like murdering people; then we can add a big dataset of situations in which AIs shouldn’t murder people to the training data. If we were instead breeding animals, we would have to wait dozens of generations for mutations that reduce the murder rate to appear and reach fixation. Since these mutations affect behavior through brain architecture, they would have a higher chance of deleterious effects. And if we’re also selecting for intelligence, they would be competing against mutations that increase intelligence, producing a higher alignment tax. All this means that in the breeding case we would have less chance to detect whether our proxies hold up (capabilities researchers have many of these advantages too, but the AGI would be able to automate capabilities training anyway).
If we expect problems to get worse at some rate until an accumulation of unsolved alignment issues culminates in disempowerment, it seems to me there is a large band of rates where we can stay ahead of them with AI training but evolution wouldn’t be able to.
Noted. Somewhat surprised you believe in quantum immortality, is there a particular reason?
EJT’s incomplete preferences proposal. But as far as I’m able to make out from the comments, you need to define a decision rule in addition to the utility function of an agent with incomplete preferences, and only some choices of decision rule are compatible with shutdownability.
When I read it in school, the story frustrated me because I immediately wanted to create Omelas seeing as it’s a thousand times better than our society, so I didn’t really get the point of the intended and/or common interpretations.
Gradient descent (when applied to train AIs) allows much more fine-grained optimization than evolution, for these reasons:
Evolution by natural selection acts on the genome, which can only crudely affect behavior and only very indirectly affect values, whereas gradient descent acts on the weights, which much more directly affect the AI’s behavior and can perhaps affect its values
Evolution can only select between two alleles in a discrete way, whereas gradient descent operates over a continuous space
Evolution has a minimum feedback loop of one organism generation, whereas RL has a much shorter minimum feedback loop length of one episode
Evolution can only combine information from different individuals inefficiently through sex, whereas we can combine gradients from many episodes to produce one AI that’s learned strategies from all episodes (see the sketch after this list)
We can adapt our alignment RL methods, data, hyperparameters, and objectives as we observe problems in the wild
We can do adversarial training against other AIs, but ancestral humans didn’t have to contend with animals whose goal was to trick them into not reproducing by any means necessary; the closest were animals that tried to kill us. (Our fear of death is therefore much more robust than our desire to maximize reproductive fitness.)
On current models, we can observe the chain of thought (although the amount we can train against it while maintaining faithfulness is limited)
We can potentially do interpretability (if that ever works out)
It’s unclear to what degree these will solve inner alignment problems or make AI goals more robust than animal goals to distributional shift, but we’re in much better shape than evolution was.
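On the point about combining gradients from many episodes: here’s a minimal toy sketch of a REINFORCE-style update that averages per-episode gradients into one set of parameters. The four-action bandit environment and all hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(4)  # parameters of a toy softmax policy over 4 actions

def episode_gradient(theta):
    """One toy episode: sample an action, get reward 1 only for action 3,
    and return the REINFORCE gradient reward * d/dtheta log pi(action)."""
    probs = np.exp(theta) / np.exp(theta).sum()
    action = rng.choice(4, p=probs)
    reward = float(action == 3)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    return reward * grad_log_pi

# Each update averages gradients from 32 episodes, so a single set of parameters
# absorbs experience from many "individuals" at once, unlike selection,
# which propagates information one genome and one generation at a time.
for _ in range(200):
    grads = [episode_gradient(theta) for _ in range(32)]
    theta += 0.5 * np.mean(grads, axis=0)

print(np.round(np.exp(theta) / np.exp(theta).sum(), 3))  # mass concentrates on action 3
```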
If you disagree with much of IABIED but are still worried about AI risk, maybe the question to ask is “will the radical flank effect be positive or negative on mainstream AI safety movements?”, which seems more useful than “do I on net agree or disagree?” or “will people taking this book at face value do useful or anti-useful things?” Here’s what Wikipedia has to say on the sign of a radical flank effect:
It’s difficult to tell without hindsight whether the radical flank of a movement will have positive or negative effects.[2] However, following are some factors that have been proposed as making positive effects more likely:
Greater differentiation between moderates and radicals in the presence of a weak government.[2][13][14]: 411 As Charles Dobson puts it: “To secure their place, the new moderates have to denounce the actions of their extremist counterparts as irresponsible, immoral, and counterproductive. The most astute will quietly encourage ‘responsible extremism’ at the same time.”[15]
Existing momentum behind the cause. If change seems likely to happen anyway, then governments are more willing to accept moderate reforms in order to quell radicals.[2]
Radicalism during the peak of activism, before concessions are won.[16] After the movement begins to decline, radical factions may damage the image of moderate organizations.[16]
Low polarization. If there’s high polarization with a strong opposing side, the opposing side can point to the radicals in order to hurt the moderates.[2]
Of course it’s still useful to debate on which factual points the book is accurate, but making judgments of the book’s overall value requires modeling other parts of the world.
Yeah, I expect corrigibility to get a lot worse by the 10x-economy level with at least 15% probability, as my uncertainty is very large, just not in the median case. The main reason is that we don’t need to try very hard yet to get sufficient corrigibility from models. My very rough model: even if the amount of corrigibility training required grows by, say, 2x every time-horizon doubling, while the amount of total training required grows by 1.5x per doubling, we will get 10 more time-horizon doublings with only a (2/1.5)^10 ≈ 18x increase in the relative effort dedicated to corrigibility training. This seems doable given that relatively little of the system cards of current models is dedicated to shutdownability.
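Spelling out that arithmetic as a tiny script, using the same made-up illustrative growth rates as above:

```python
# Hypothetical growth rates per time-horizon doubling, as in the rough model above.
corrigibility_growth_per_doubling = 2.0   # corrigibility training effort multiplier
total_training_growth_per_doubling = 1.5  # total training effort multiplier
doublings = 10

# How much the *fraction* of effort spent on corrigibility training grows
# after 10 more time-horizon doublings.
relative_increase = (corrigibility_growth_per_doubling
                     / total_training_growth_per_doubling) ** doublings
print(f"{relative_increase:.1f}x")  # ~17.8x, i.e. the ~18x figure above
```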
As for (b), my guess is that with NO corrigibility training at all, models would start doing things like disabling shutdown scripts in the wild, or locking users out of their computers and changing their passwords to prevent them from manually shutting the agent down, and there would be an outcry from the public and B2B customers that hurts their profits, as well as a dataset of examples to train against. It’s plausible that fixing this still wouldn’t be enough to stop disempowerment. Maybe a crux is whether naively training for corrigibility is more effective or less effective on more egregious incidents.
After the 10x-economy stage, corrigibility plausibly stops being useful for users / aligned with the profit motive, because humans will largely stop understanding what models are doing, so I get a lot more worried if we haven’t figured something out by then.
After thinking about it more, it might take more than 3% even if things scale smoothly, because I’m not confident corrigibility is only a small fraction of labs’ current safety budgets.
Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user’s inferred preferences in every domain, not in the sense of an AI that only understands physics. I don’t expect unsolvable corrigibility problems at the capability level where AI can 10x the economy under the median default trajectory; rather, I expect something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs corrigible, what’s good for business is reasonably aligned, and getting there requires something like 3% of the lab’s resources.
Big fan of the “bowing out” react. I did notice a minor UI issue where the voting arrows don’t fit in the box:
Inasmuch as we’re going for corrigibility, it seems necessary and possible to create an agent that won’t self-modify into an agent with complete preferences. Complete preferences are antithetical to narrow agents, and would mean the agent might try to e.g. solve the Israel-Palestine conflict when all you asked it to do is code you a website. Even if there is a working stop button, this is a bad situation. It seems likely we can just train against this sort of thing, though maybe it will require being slightly clever.
As for whether we can/should have it self-modify to avoid more basic kinds of money-pumps, so that its preferences are at least transitive and independent, this is an empirical question I’m extremely unsure of, but we should aim for the least dangerous agent that gets the desired performance, which should balance the propensity for misaligned actions against the additional capability needed to overcome irrationality.
I was aware there was some line, but thought it was “don’t ignite a conversation that derails this one” rather than “don’t say inaccurate things about groups”, which is why I listed lots of groups rather than one and declined to list actively contentious topics like timelines, IABIED reviews, or Matthew Barnett’s opinions.
This was not my intention, though I could have been more careful. Here are my reasons:
The original comment seemed really vague, in a way that often dooms conversations. Little progress can be made on most problems without pointing out the specific key reasons for them. The key point to make is that tribalism in this case doesn’t arise spontaneously based on identities alone; it has micro-level causes, which in turn have macro-level causes.
I thought Ray wanted to discuss what to do for a broader communication strategy, so replying in shortform would be fine because the output would get >20x the views (this is where I could have realized LW shortform has a high profile now, and toned it down somehow), rather than opening up the conversation here.
I am also frustrated about tribalism and am reporting from experience, in a somewhat exaggerated way, about what I notice. If there is defeatism, this is its source, though I don’t think addressing it is impossible; I just don’t have any ideas.
If people replied to me with object-level twitterbrained comments about how e.g. everyone has to unite against Marc Andreessen, I would be super sad. Hopefully we’re better than that.
It seems like everyone is tired of hearing every other group’s opinions about AI. Since like 2005, Eliezer has been hearing people say a superintelligent AI surely won’t be clever, and has had enough. The average LW reader is tired of hearing obviously dumb Marc Andreessen accelerationist opinions. The average present harms person wants everyone to stop talking about the unrealistic apocalypse when artists are being replaced by shitty AI art. The average accelerationist wants everyone to stop talking about the unrealistic apocalypse when they could literally cure cancer and save Western civilization. The average NeurIPS author is sad that LLMs have made their expertise in Gaussian kernel wobblification irrelevant. Various subgroups of LW readers are dissatisfied with people who think reward is the optimization target, Eliezer is always right, or discussion is too tribal, or whatever.
With this, combined with how Twitter distorts discourse, is it any wonder that people need to process things as “oh, that’s just another claim by X group, time to dismiss”? Anyway, I think naming the groups isn’t the problem, so naming the groups in the post isn’t contributing to the problem much. The important thing to address is why people find it advantageous to track these groups.
As a child I read everything I could get my hands on! Mostly a couple of Silman’s books. The appeal to me was quantifying and systematizing strategy, not chess itself (which I bounced off in favor of sports and math contests). E.g. the idea of exploiting imbalances, or planning by backchaining, or some of the specific skills like putting your knights in the right place.
I found these more interesting than Go books in this respect, both due to Silman’s writing style and because Go is such a complicated game filled with exceptions that Go books get bogged down in specifics.