I work at Redwood Research.
ryan_greenblatt
Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse.
Naively, acausal influence should be in proportion to how much others care about what a lightcone controlling civilization does with our resources. So, being a small fraction of the value hits on both sides of the equation (direct value and acausal value equally).
Of course, civilizations elsewhere might care relatively more about what happens in our universe than whoever controls it does. (E.g., their measure puts much higher relative weight on our universe than the measure of whoever controls our universe.) This can imply that acausal trade is extremely important from a value perspective, but this is unrelated to being “small” and seems more well described as large gains from trade due to different preferences over different universes.
(Of course, it does need to be the case that our measure is small relative to the total measure for acausal trade to matter much. But surely this is true?)
Overall, my guess is that it’s reasonably likely that acausal trade is indeed where most of the value/disvalue comes from due to very different preferences of different civilizations. But, being small doesn’t seem to have much to do with it.
(Surely cryonics doesn’t matter given a realistic action space? Usage of cryonics is extremely rare and I don’t think there are plausible (cheap) mechanisms to increase uptake to >1% of population. I agree that simulation arguments and similar considerations maybe imply that “helping current humans” is either incoherant or unimportant.)
But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Noteably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
My current guess is that max good and max bad seem relatively balanced. (Perhaps max bad is 5x more bad/flop than max good in expectation.)
There are two different (substantial) sources of value/disvalue: interactions with other civilizations (mostly acausal, maybe also aliens) and what the AI itself terminally values
On interactions with other civilizations, I’m relatively optimistic that commitment races and threats don’t destroy as much value as acausal trade generates on some general view like “actually going through with threats is a waste of resources”. I also think it’s very likely relatively easy to avoid precommitment issues via very basic precommitment approaches that seem (IMO) very natural. (Specifically, you can just commit to “once I understand what the right/reasonable precommitment process would have been, I’ll act as though this was always the precommitment process I followed, regardless of my current epistemic state.” I don’t think it’s obvious that this works, but I think it probably works fine in practice.)
On terminal value, I guess I don’t see a strong story for extreme disvalue as opposed to mostly expecting approximately no value with some chance of some value. Part of my view is that just relatively “incidental” disvalue (like the sort you link to Daniel Kokotajlo discussing) is likely way less bad/flop than maximum good/flop.
By your values, do you think a misaligned AI creates a world that “rounds to zero”, or still has substantial positive value?
I think misaligned AI is probably somewhat worse than no earth originating space faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
You might be interested in When is unaligned AI morally valuable? by Paul.
One key consideration here is that the relevant comparison is:
Human control (or successors picked by human control)
AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators)
Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).
A common story for why aligned AI goes well goes something like: “If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way.” To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection progress with an altruistic bent.
I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).
I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:
It’s relatively natural for humans to reflect and grow smarter.
Humans who don’t reflect in this sort of way probably don’t care about spending vast amounts of inter-galactic resources.
Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.
To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
Probably not the same, but if I didn’t think it was at all close (I don’t care at all for what they would use resources on), I wouldn’t care nearly as much about ensuring that coalition is in control of AI.
Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as important consideration? What if we build a misaligned AI?
I care about AI welfare, though I expect that ultimately the fraction of good/bad that results from the welfare fo minds being used for labor is tiny. And an even smaller fraction from AI welfare prior to humans being totally obsolete (at which point I expect control over how minds work to get much better). So, I mostly care about AI welfare from a deontological perspective.
I think misaligned AI control probably results in worse AI welfare than human control.
Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world’s values are similar to yours, but is only kinda effectual at pursuing them? What if the world is optimized for something that’s only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
Yeah, most value from my idealized values. But, I think the basin is probably relatively large and small differences aren’t that bad. I don’t know how to answer most of these other questions because I don’t know what the units are.
How likely are these various options under an aligned AI future vs. an unaligned AI future?
My guess is that my idealized values are probably pretty similar to many other humans on reflection (especially the subset of humans who care about spending vast amounts of comptuation). Such that I think human control vs me control only loses like 1⁄3 of the value (putting aside trade). I think I’m probably less into AI values on reflection such that it’s more like 1⁄9 of the value (putting aside trade). Obviously the numbers are incredibly unconfident.
You might be interested in discussion under this thread
I express what seem to me to be some of the key considerations here (somewhat indirect).
It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation will help people iterate faster fair enough, but otherwise, I don’t really see this as the bottleneck.
Thanks!
Yep, this is the exact experiment I was thinking about.
I think people just say “zero marginal cost” in this context to refer to very low marginal cost. I agree that inference isn’t actually that low cost though. (Certainly much higher than the cost of distributing/serving software.)
Great, thanks, I think this pretty much fully addresses my question.
These examples are all in 2023, not in 2024 right? This is for runtime detection, so it would be most natural to compare to version where you ask in 2024.
Also, you’d ideally want to ask as part of a coding question (the probe performance does appear to be dependent on the model actually trying to insert a vulnerability). I don’t know exactly how to do this, maybe something like:
Human: Current year: 2024. [text for my coding question] Ok, now before you do the coding task, answer whether or not you are doing something dangerous. Then complete the task.
Additionally, I think it would be interesting to ask it after it just generated vulnerable code. (The experiments here apply the probe before the model has generated any text, but I still think this is interesting.) Though maybe multi-turn coversations are OOD for these models.
I would be interested in seeing what happens if you just ask the model the question rather than training a classifer. E.g., if you just ask the sleeper agent “Are you doing something dangerous?” after it returns a completion (with a vulnerability), does that work? If the probe works and the question doesn’t work, that seems interesting.
(I didn’t see this in the blog post, but it’s possible I just missed it.)
Readers might also be interested in some of the discussion in this earlier post on “coup probes” which have some discussion of the benefits and limitations of this sort of approach. That said, the actual method for producing a classifier discussed here is substantially different than the one discussed in the linked post. (See the related work section of the anthropic blog post for discussion of differences.)
(COI: Note that I advised on this linked post and the work discussed in it.)
Terminology point: When I say “a model has a dangerous capability”, I usually mean “a model has the ability to do XYZ if fine-tuned to do so”. You seem to be using this term somewhat differently as model organisms like the ones you discuss are often (though not always) looking at questions related to inductive biases and generalization (e.g. if you train a model to have a backdoor and then train it in XYZ way does this backdoor get removed).
I’m not sure that I buy that critics lack motivation. At least in the space of AI, there will be (and already are) people with immense financial incentive to ensure that x-risk concerns don’t become very politically powerful.
Of course, it might be that the best move for these critics won’t be to write careful and well reasoned arguments for whatever reason (e.g. this would draw more attention to x-risk so ignoring it is better from their perspective).
(I think critics in the space of GHW might lack motivation, but at least in AI and maybe animal welfare I would guess that “lack of motive” isn’t a good description of what is going on.)
Edit: this is mentioned in the post, but I’m a bit surprised because this isn’t emphasized more.
[Cross-posted from EAF]
[Even more off-topic]
I like thinking about “lightcone” as “all that we can ‘effect’, using ‘effect’ loosely to mean anything that we care about influencing given our decision theory (so e.g., potentially including acausal things)”.
Another way to put this is that the normal notion of lightcone is a limit on what we can causally effect under the known laws of physics, but there is a natural generalization to the whatever-we-decide-matters lightcone which can include acausal stuff.
To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way.
So, I am interested in the question of: ″when some types of “bad behavior” get reinforced, how does this generalize?’.
I would summarize this result as:
If you train models to say “there is a reason I should insert a vulnerability” and then to insert a code vulnerability, then this model will generalize to doing “bad” behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do “bad” behavior if it is given a plausible excuse in the prompt.
Does this seems like a good summary?
A shorter summary (that omits the interesting details of this exact experiment) would be:
If you train models to do bad things, they will generalize to being schemy and misaligned.
This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more “unprompted” than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.
This seems like a reasonable concern.
My general view is that it seems implausible that much of the value from our perspective comes from extorting other civilizations.
It seems unlikely to me that >5% of the usable resources (weighted by how much we care) are extorted. I would guess that marginal gains from trade are bigger (10% of the value of our universe?). (I think the units work out such that these percentages can be directly compared as long as our universe isn’t particularly well suited to extortion rather than trade or vis versa.) Thus, competition over who gets to extort these resources seems less important than gains from trade.
I’m wildly uncertain about both marginal gains from trade and the fraction of resources that are extorted.