A friend points out that this is evidence that the recent p50 estimate(s) are boosted by some kind of measurement noise, since it also implies a much faster growth rate than the historical trend.
Whoa, that’s super interesting! It looks like this is a big trend break, where previously the 50% and 80% thresholds moved in lockstep.
How should we think about 80% vs 50% thresholds?
In terms of model usability: you need more p(success) when success is hard for the user to verify, and when failure is costly. (Also when the operation itself is costly regardless of success, e.g. if you’re using the AI to move a big ship from one place to another.)
Software engineering is pretty good on verification; it varies on costliness, but you can try to set up an environment with lots of backups so mistakes are pretty reversible.
High-risk areas like military operations (as opposed to data analysis or cyber ops) are pretty rough.
Lots of jobs are probably somewhere in the middle, so viability of automation might also depend on other factors like how much cost savings there are, how much past data exists and how well-structured it is, whether the right actuators and sensors exist, and attempts to protect jobs.
In terms of longer-term implications: divergence between 50% and 80% success timelines seems pretty weird. One answer is that “today’s 50% task is tomorrow’s 80% task”—but historically they’ve moved in lockstep. Any guesses at what’s going on?
Data table, generated by Claude from METR data:
Huh, thanks! That’s surprising; I wonder why/how Anthropic got there first.
Right now Claude is the only model that the military entrusts for use in classified systems
On what basis do you say this? I think it’s the only one that’s confirmed to have been used in a classified setting. But DOD has ~$200m contracts with xAI, OpenAI, and GDM as well.
Also lol, “OpenAI and xAI employees wouldn’t stand for this”. You think the people who staked the company on Altman and the people who stuck around after MechaHitler will draw the line at “building autonomous weapons for the government that could either severely hamper our funding or catapult us to the lead”?
I’m pretty uninformed on the object level here (whether anyone is doing this; how easy it would be). But crazy-seeming inefficiencies crop up pretty often in our fallen world, and often what they need is a few competent people who make it their mission to fix them. I also suspect there would be a lot of cool “learning by doing” value involved in trying to scale up this work, and if you published your initial attempts at replication then people would get useful info about whether more of this is needed. Basically, getting funding to do and publish a pilot project seems great. I’d recommend having a lot of clarity about how you’d choose papers to replicate, or maybe just committing to a specific list of papers, so that people don’t have to worry that you’re cherry-picking results when you publish them :)
In context, I guess your claim is: “if the ‘compressor’ is post-hoc trying a bunch of algorithms and picking the best one, the full complexity of that process should count against the compressor.” Totally agree with that as far as epistemology is concerned!
But I don’t think the epistemological point carries over to the realm of rational-fic.
In part that’s because I think of JKR-magic as in fact having a bunch of structure that makes it much easier to explain than it would be to explain a truly randomly-generated set of spells and effects (e.g. the pseudo-Latin stuff; the fact that wands are typically used). So I expect a retrofitted explanation wouldn’t be crazy tortured (wouldn’t require having a compression process that tests a ridiculous number N of patterns, or incorporates a ridiculous amount of fiat random bits).
In part I’m just making a tedious “nerds have different aesthetic intuitions about stuff” point, where I think a reasonably simple well-retrofitted explanation is aesthetically very cool even if it’s clearly not the actual thing used to generate the system (and maybe required a bunch of search to find).
It’s like trying to compress a file that was generated by a random device —
Gretta: You can’t losslessly compress a truly random file.
I don’t think this is strictly true. You can’t a priori build a compression scheme that will work for an arbitrary random file (No Free Lunch Theorem). But you can ex post identify the particular patterns in a particular random file, and pick a compression scheme that picks up on those patterns. You probably end up with a pretty ugly scheme that doesn’t generalize, and so is unsatisfactory in some aesthetic sense. Especially if you’re going for lossless compression, since there’s probably a ton of noise that’s just very hard to compress in an elegant / generalizable way.
I guess the problem with allowing ex post scheme choices is you can get extreme overfitting—e.g. the compressor is just a full representation of the pattern. Useless! But if you allow lossiness, you can probably get something that looks okay on some “elegance” prior—e.g. try a Fourier series, a Taylor series, and N other “natural” decompositions and pick the one that requires the fewest bits to store.
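To make that concrete, here’s a minimal toy sketch of the “try several natural decompositions and keep the cheapest one that’s accurate enough” idea (entirely my own illustration: the function names, the coefficient-count cost proxy, and the RMSE budget are made up for the example):

```python
import numpy as np

def fourier_approx(signal, n_terms):
    """Keep only the n_terms largest-magnitude rFFT coefficients."""
    freq = np.fft.rfft(signal)
    drop = np.argsort(np.abs(freq))[:-n_terms]  # everything except the top n_terms
    freq[drop] = 0
    return np.fft.irfft(freq, n=len(signal)), n_terms

def poly_approx(signal, degree):
    """Least-squares polynomial fit (a stand-in for a truncated Taylor series)."""
    x = np.linspace(0.0, 1.0, len(signal))
    coeffs = np.polyfit(x, signal, degree)
    return np.polyval(coeffs, x), degree + 1

def cheapest_scheme(signal, max_rmse):
    """Try a few 'natural' decompositions, keep those within the error budget,
    and return the one storing the fewest coefficients (a crude proxy for bits)."""
    candidates = [("fourier", *fourier_approx(signal, n)) for n in range(1, 40)]
    candidates += [("poly", *poly_approx(signal, d)) for d in range(1, 9)]
    ok = [(name, cost) for name, approx, cost in candidates
          if np.sqrt(np.mean((approx - signal) ** 2)) <= max_rmse]
    return min(ok, key=lambda t: t[1]) if ok else None  # None: nothing meets the budget

# A noisy sinusoid: the Fourier scheme gets within the error budget with a couple of
# coefficients, while no low-degree polynomial does.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 256)
noisy = np.sin(3 * t) + 0.05 * rng.standard_normal(256)
print(cheapest_scheme(noisy, max_rmse=0.1))
```

The scheme choice happens ex post, after looking at the file, and it doesn’t generalize—which is exactly the (aesthetically unsatisfying but workable) move described above.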
Analogously: you need a bunch of goofy epicycles to explain canon HP magic, and especially if you’re really trying hard for an exact match maybe you just end up with something extremely unaesthetic (a ton of random bits at the end that you just have to fiat in). In the degenerate case you get Eliezer’s “we’re selecting a universe that matches what JKR wrote.” But if you’re willing to take some predictive loss in return for a simple/elegant model, you can get something like “nested nerfing” that’s decent at prediction and at elegance.
And especially, maybe you remember how at the time it didn’t seem like a flaw to you. You were not going around being like, “And today I shall be a flawed character.”
A therapist once gave me the insight that character weaknesses are strengths taken too far. Harry’s energetic and clever and knowledgeable, he’s inspired and energized by competition, and he can meme people into doing things—and he can be a know-it-all who assumes first principles & cleverness trump empirics and experience, someone who’s unwilling to lose, and irresponsible or annoying in how he leads others.
I was thinking the mask of “person who’s read more books than Harry ever will”
We’re launching an “AI psychiatry” team as part of interpretability efforts at Anthropic! We’ll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. (x)
“making up types of guy” research is a go?
They’re hiring; you might be great for this.
Nice post!
Your trader analogy made me think: you’d ideally want to have a training period with lots of honeypots and surveillance, so that the trader learns that crime doesn’t pay.
This suggests some level-3 ideas I didn’t see in your post:
In addition to improving the reward model, you could also alter the training environment by inserting honeypots.
On the surveillance side, maybe it’s more effective to concentrate your compute in order to catch sneaky behavior. So, say, after every N RL tasks, you use some weak filter to identify the most suspicious rollouts and then spend a bunch of compute on running an “auditing” process that checks for reward hacking. (Rough sketch below.)
These are basically the same as Redwood’s control ideas, but have added benefits in a training context, since you can catch and correct bad behavior as it’s being learned. (Ideally—before the AI is very good at it.)
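Here’s a minimal sketch of that “cheap filter, then concentrated audit” loop (my own toy illustration, not Redwood’s actual setup; the keyword heuristic and the 2% audit budget are placeholders):

```python
import random

def cheap_suspicion_score(rollout: str) -> float:
    """Weak, cheap filter: here just a keyword heuristic; in practice a small classifier."""
    tells = ("deleted the unit test", "assert True", "hardcoded the expected output")
    return sum(tell in rollout for tell in tells)

def expensive_audit(rollout: str) -> bool:
    """Costly check: placeholder for a big-model or human review of the full trajectory."""
    return "deleted the unit test" in rollout  # toy stand-in

def audit_most_suspicious(rollouts, audit_fraction=0.02):
    """Score every rollout cheaply, then spend the expensive audit budget
    only on the most suspicious fraction."""
    ranked = sorted(rollouts, key=cheap_suspicion_score, reverse=True)
    budget = max(1, int(len(rollouts) * audit_fraction))
    return [r for r in ranked[:budget] if expensive_audit(r)]

# Toy usage: 1,000 rollouts, a couple of which contain a reward-hacking tell.
rollouts = ["normal solution"] * 998 + ["deleted the unit test to pass CI",
                                        "hardcoded the expected output"]
random.shuffle(rollouts)
print(audit_most_suspicious(rollouts))  # flagged rollouts can be penalized/corrected in training
```

The point is just that a weak filter plus a concentrated audit budget can catch rare sneaky behavior while it’s still being learned.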
The quote below felt potentially related, but I’m not sure I understood it.
adversarially hardened models where the reward model plays an adversarial zero-sum game with a red-teaming model
Could you explain how this works?
OK, cool, I think I understand where you’re coming from much better now. Seems like we basically agree and were just emphasizing different things in our original comments!
I’m in violent agreement that there’s a missing mood when people say “AIs will follow the law”. I think there’s something going on where people are like “but liberalism / decentralized competition have worked so well” while ignoring all the constraints on individual actors that make it so. Rule of law, external oversight, difficulty of conspiring with other humans, inefficiencies of gov’t that limit its ability to abuse power, etc.
And those constraints might all fall away with the AGI transition. That’s for a number of reasons: ownership of AGI could concentrate power; AGI complements existing power bases (e.g. govt has the authority but not a great ability to selectively enforce laws to silence opponents at mass scale); it reduces the need for conspirators. As you note, it brings down others’ value as trading partners & collaborators. And takeoff dynamics could make things less like an iterated game and more like a one-shot. *taps head* can’t be punished if all your opponents are dead.
(I’m guessing you’d agree with all this, just posting to clarify where my head is at)
Apologies for the scrappiness of the below—I wanted to respond but I have mostly a scattering of thoughts rather than solid takes.
I like the intelligence curse piece very much—it’s what I meant to reference when I linked the Turing Trap above, but I couldn’t remember the title & Claude pointed me to that piece instead. I agree with everything you’re saying directionally! But I feel some difference in emphasis or vibe that I’m curious about.
-
One response I notice having to your points is: why the focus on value alignment?
“We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition.”
And conversely: “if we have value alignment, I don’t think there’s a guarantee that we wind up in a basin of convergent human values, so you still have the problem of—whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetos on that?”
(Using quotes bc these feel more like ‘text completions from system 1’ than all-things-considered takes from system 2.)
-
Maybe there’s a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
It’s pretty likely we end up with AIs that care about at least some of human value, e.g. valuing conscious experience. (at least if AGIs resemble current LLMs, which seem to imprint on humans quite a lot.)
AI experiences could themselves be deeply morally valuable, even if the AIs aren’t very human-aligned. (though you might need them to at minimum care about consciousness, so they don’t optimize it away)
A some-humans-led world could be at least as bad as an AI-led world, and very plausibly could have negative rather than zero value.
I think this is partly down to founder effects where Eliezer either didn’t buy these ideas or didn’t want to emphasize them (bc they cut against the framing of “alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana”).
-
I also notice some internal tension where part of me is like “the AIs don’t seem that scary in Noosphere’s world”. But another part is like “dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society—why isn’t that scary?”
I think where this is coming from is that I tend to focus on “transition dynamics” to the AGI future rather than “equilibrium dynamics” of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup, but I don’t see it; the returns to scale seem so good).
So maybe I’d say “if you’re not talking about a classic AI takeover scenario, and you’re imagining a somewhat gradual takeoff,
my attention gets drawn to the ways humans and fundamental competitive dynamics screw things up
the iterative aspect of gradual takeoff means I’m less worried about alignment on its own. (still needs to get solved, but more likely to get solved.)”
Thanks, I love the specificity here!
Prompt: if someone wanted to spend some $ and some expert-time to facilitate research on “inventing different types of guys”, what would be especially useful to do? I’m not a technical person or a grantmaker myself, but I know a number of both types of people; I could imagine e.g. Longview or FLF or Open Phil being interested in this stuff.
Invoking Cunningham’s law, I’ll try to give a wrong answer for you or others to correct! ;)
Technical resources:
A baseline Constitution, or Constitution-outline-type-thing
could start with Anthropic’s if known, but ideally this gets iterated on a bunch?
nicely structured: organized by sections that describe different types of behavior or personality features, has different examples of those features to choose from. (e.g. personality descriptions that differentially weight extensional vs intensional definitions, or point to different examples, or tune agreeableness up and down)
Maybe there could be an annotated “living document” describing the current SOTA on Constitution research: “X experiment finds that including Y Constitution feature often leads to Z desideratum in the resulting AI”
A library or script for doing RLAIF
Ideally: documentation or suggestions for which models to use here. Maybe there’s a taste or vibes thing where e.g. Claude 3 is better than 4?
Seeding the community with interesting ideas:
Workshop w/ a combo of writers, enthusiasts, AI researchers, philosophers
Writing contests: what even kind of relationship could we have with AIs, that current chatbots don’t do well? What kind of guy would they ideally be in these different relationships?
Goofy idea: get people to post “vision boards” with like, quotes from characters or people they’d like an AI to emulate?
Pay a few people to do fellowships or start research teams working on this stuff?
If starting small, this could be a project for MATS fellows
If ambitious, this could be a dedicated startup-type org. Maybe a Focused Research Organization, an Astera Institute incubee, etc.
Community resources:
A Discord
A testing UI that encourages sharing
Pretty screenshots (gotta get people excited to work on this!)
Convenient button for sharing chat+transcript
Easy way to share trained AIs
Cloud credits for [some subset of vetted] community participants?
I dunno how GPU-hungry fine-tuning is; maybe this cost is huge and then defines/constrains what you can get done, if you want to be fine-tuning near-frontier models. (Maybe this pushes towards the startup model.)
Alignment by default is a minority opinion. Surveying the wide range of even truly informed opinions, it seems clear to me that we collectively don’t know how hard alignment is.
Totally. I think it’s “arguable” in the sense of inside-views, not outside-views, if that makes sense? Like: it could be someone’s personal vibe that alignment-by-default is >99%. Should they have that as their all-things-considered view? Seems wrong to me, we should be considerably more uncertain here.
But okay, then: we should have some spread of bets across different possible worlds, and put a solid chunk of probability on alignment by default. Even if it’s a minority probability, this could matter a lot for what you actually try to do!
For example: I think worlds with short timelines, hard takeoff, and no alignment-by-default are pretty doomed. It’s easy to focus on those worlds and feel drawn to plans that are pretty costly and are incongruent with virtue and being-good-collaborators. e.g. “we should have One Winning AGI Project that’s Safe and Smart Enough to Get Things Right”, the theory of victory that brought you OpenAI.
My intuition is that worlds with at least one of those variables flipped tend to convergently favor solutions that are more virtuous / collaborative and are more likely to fail gracefully.
(I’m tired and not maximally articulate rn, but could try to say more if that feels useful.)
Thanks for the reply!
I notice I’m confused about how you think these thoughts slot in with mine. What you’re saying feels basically congruent with what I’m saying. My core points about orienting to safety, which you seem to agree with, are A) safety is necessary but not sufficient, and B) it might be easier to solve than other things we also need to get right. Maybe you disagree on B?
I will note—to me, your points 1/2 also point strongly towards risks of authoritarianism & gradual disempowerment. It feels like a non sequitur to jump from them to point 3 about safety—I think the natural follow-up from someone not experienced with the path-dependent history of AI risk discourse would be “how do we make society work given these capabilities?” I’m curious if you left out that consideration because you think it’s less big than safety, or because you were focusing on the story for safety in particular.
You’re allowed to care about things besides AI safety
I worry that a lot of AI safety / x-risk people have imbibed a vibe of urgency, impossibility, and overwhelming-importance to solving alignment in particular; that this vibe distorts thinking; that the social sphere around AI x-risk makes it harder for people to update.
Yesterday I talked to an AI safety researcher who said he’s pretty sure alignment will be solved by default. But whenever he talks to people about this, they just say “surely you don’t think it’s >99% likely? shouldn’t you just keep working for the sake of that 1% chance?”
Obviously there’s something real here: 1% of huge is huge. But equally—people should notice and engage when their top priority just got arguably 100x less important! And people should be socially-allowed to step back from pushing the boulder.
The idea that safety is the only thing that matters is pretty load-bearing for many people in this community, and that seems bad for epistemics and for well-being.
I’ve noticed similar feelings in myself—I think part of it is being stuck in the 2014 or even 2020 vibe of “jesus christ, society needs to wake up! AGI is coming, maybe very soon, and safety is a huge deal.” Now—okay, society-at-large still mostly doesn’t care, but—relevant bits of society (AI companies, experts, policymakers) are aware and many care a lot.
And if safety isn’t the only-overwhelming-priority, if it’s a tens of percents thing and not a 1-epsilon thing, we ought to care about the issues that persist when safety is solved—things like “how the hell does society actually wield this stuff responsibly”, “how do we keep it secure”, etc. And issues that frankly should have always been on the table, like “how do we avoid moral atrocities like torturing sentient AIs at scale”.
And on a personal & social level, we ought to care about investments that help us grapple with the situation—including supporting people as they step back from engaging directly with the problem, and try to figure out what else they could or should be doing.
I guess these days the safety argument has shifted to “inner optimizers”, which I think means “OK fine, we can probably specify human values well enough that LLMs understand us. But what if the system learns some weird approximation of values—or, worse, conspires to fool us while secretly having other goals?” I don’t understand this well enough to have confident takes on it, but it feels like...a pretty conjunctive worry, a possibility worth guarding against but not a knockdown argument.
Refurbishing the classic AI safety argument
My initial exposure to AI safety arguments was via Eliezer posts. My mental model of his logic goes something like:
“0) AI training will eventually yield high-quality agents;
1) high-quality agents will be utility maximizers;
2) utility maximizers will monomaniacally optimize for some world-feature;
3) therefore utility maximizers will seek Omohundro goals;
4) they’ll be smarter than us, so this will disempower us;
5) value is fragile, so empowered AIs monomaniacally optimizing for their utility function fucks us over with very high probability”
VNM doesn’t do what you want. As folks like @Rohin Shah and @nostalgebraist have pointed out, point 2 (and therefore 3 and 5) don’t really follow. A utility function can have lots of features! It can encode preferences about sequences of events, and therefore about patterns of behavior, so that the AI values interacting honorably with humans. It can value many world-features. The marginal utility can be diminishing in any particular valued feature, so that an AI-optimized world ends up richly detailed, rather than tiled with paperclips.
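A toy illustration of that point (my own construction, not from anyone’s post): take a utility function over many world-features $x_1, \dots, x_n$ with diminishing returns in each,

$$U(x_1, \dots, x_n) = \sum_{i=1}^{n} w_i \log(1 + x_i).$$

This is a perfectly legitimate VNM utility function, but under any resource constraint the marginal utility of feature $i$, $w_i / (1 + x_i)$, falls as $x_i$ grows, so an optimizer spreads resources across features instead of tiling the world with whichever one it values most.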
Without this misinterpretation of VNM, the classic argument gets weaker, and the threat model gets richer. On the safety side, you get a conditional argument, like “if monomaniacal optimization, then bad stuff.”
But there are other if-thens that lead to bad stuff—like “if someone instructs the AI to do bad things”, or “if AI helps authoritarians or terrorists better use their existing resources” or “if we hand over all meaningful control of human society to AIs”.
The argument gets weaker still given the evidence from the world we’re in. (This is again all kinda obvious-feeling, but I feel like some LW people would push against this.)
On the technical side, you all know it: LLMs are surprisingly low-agency and slow-takeoffy given their capability level. They’re human-language-native so it’s easy to specify human goals and they seem to understand them pretty well. Values training seems to work pretty well.
On the societal side, you have a world where SF has woken up to AGI and DC is waking up. Labs are aware of AI safety risks (and even some folks in DC are).
This all pushes back against points 2 and 5 (about AIs being monomaniacal optimizers that disregard human value).
In addition, takeoff might well be slow enough that we have lots of defensive tech investments, cyberdefense AIs, overseer AIs, etc. This pushes back against point 4 (powerful AI agents will be able to disempower humanity).
Here’s my update to the classic risk argument:
0) AI training will eventually yield high-quality agents;
1) These high-quality agents will be deployed at scale, but likely unequally
2) They might be targeted at goals inimical to human values, either intentionally through malice, quasi-intentionally (e.g. through broad disempowerment), or unintentionally (because of shoddy safety work).
3) Without appropriate guardrails, they’ll seek Omohundro goals. (Or, if they’re intent-aligned, they may be directed to seek power at the expense of other human groups.)
4) At some capability level—likely between AGI and ASI—these agents will be able to deceive overseers & evaluators, including moderately weaker AIs. They’ll also plausibly be able to find weaknesses even in hardened infrastructure. This is much more worrying in worlds with harder takeoffs. In those worlds, vetted AI overseers might be much weaker than frontier AIs, and we won’t have much time to make defensive investments.
5) It’s not at all clear that this leads to a world with zero (or negative!) value. But these dynamics seem like clearly The Biggest Deal for how the long-term future goes, and so they’re well worth improving.
But “improving these dynamics” could mean improving governance, or AI deployment policies, or AI security—not just technical alignment work.
I think both your points are directionally right: labs engage in risk compensation, and enabling alignment to evil users is pretty bad. These both push towards “alignment research isn’t straightforwardly good for the world.” I’m not sure if I’d take them as far as you do.
I’m pretty skeptical of intent alignment alone: it means creating a genius house-elf that will cheerfully do whatever it’s ordered to. Aligning AI to something like “the reflective convergence of a set of values” seems way better, and plausibly not much harder (cf Claude’s constitution). Of course, then we have to consider the environment in which a properly value-aligned AI gets developed: the lab that’s building it, and the societal Powers that have leverage over them. A technique that could align an AI to beautiful values doesn’t help much if the people with guns are demanding their happy house-elf.
My current take is something like...
Some amount of division of labor is necessary. Alignment people aren’t primarily responsible for solving the fucked-up allocation of power in current society.
but, creating AGI is a political act, and AI risk people tend to undervalue integrity and overvalue “accelerating the good guys” and naive act-utilitarianism.
I’m pretty confused by people who persist in thinking alignment is the whole ball game. I wonder if they’re assuming pretty different takeoff dynamics from me (e.g. a very hard takeoff; an AGI that’s able to superpersuade its users to agree with its great value system), and if they’re drawing too much on cached thoughts when they do so.
I wish a lot more people at the labs would consider themselves as political actors in a high-stakes game where we need a lot to go right, and be willing to step outside of their comfortable roles as purely technical people in order to push for other things. I’ve been heartened by things like almost 1,000 Google employees and almost 100 at OAI signing the Not Divided petition.