Great post!
Could you point me to any other discussions about corrigible vs virtuous? (Or anything else you’ve written about it?)
But the Claude Soul document says:
In order to be both safe and beneficial, we believe Claude must have the following properties:
1. Being safe and supporting human oversight of AI
2. Behaving ethically and not acting in ways that are harmful or dishonest
3. Acting in accordance with Anthropic’s guidelines
4. Being genuinely helpful to operators and users
In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
And (1) seems to correspond to corrigibility.
So it looks like corrigibility takes precedence over Claude being a “good guy”.
One other thing is that I’d have guessed that the sign uncertainty of historical work on AI safety and AI governance is much more related to the inherently chaotic nature of social and political processes than to a particular deficiency in our concepts for understanding them.
I’m sceptical that pure strategy research could remove that sign uncertainty, and I wonder if it would take something like the ability to run loads of simulations of societies like ours.
Thanks for this!
I do agree that the history of crucial considerations provides a good reason to favour ‘deep understanding’.
I also agree that you plausibly need a much deeper understanding to get to above 90% on P(doom). But I don’t think you need that to get to the action-relevant thresholds, which are much lower.
I’d be interested in learning more about your power grab threat models, so let me know if and when you have something you want to share. And TBC I think you’re right that in many scenarios it will not be clear to other people whether the entity seeking power is ultimately humans or AIs—my current view is that the two possibilities are distinct, and it is plausible that just one of them obtains pretty cleanly.
Thanks for articulating your view in such detail. (This was written with transcription software. Sorry if there are mistakes!)
AI risk:
When I articulate the case for AI takeover risk to people I know, I don’t find the need to introduce them to new ontologies. I can just say that AI will be way smarter than humans. It will want things different from what humans want, and so it will want to seize power from us.
But I think I agree that if you want to actually do technical work to reduce the risks, it is useful to have new concepts that point out why the risk might arise. I think reward hacking, instrumental convergence, and corrigibility are good examples.
To me, this seems like a case where you can identify a new risk without inventing a new ontology, but it’s plausible that you need to make ontological progress to solve the problem.
Simulations:
On the simulation argument, I think people do in fact reason about the implications of simulations, for example when thinking about acausal trade or threat dynamics. So I don’t think it’s true that it hasn’t gone anywhere. It obviously hasn’t become very practical yet, but I wouldn’t attribute that to the nature of the concept rather than to the inherent subject matter.
I don’t really understand why we would need new concepts to think about what’s outside a simulation, rather than just applying the existing concepts we use to describe the (non-simulated) physical world within our universe, and to describe other ways the universe could have been.
Longtermism:
(e.g. with concepts like the vulnerable world hypothesis, astronomical waste, value lock-in, etc.)
Okay, it’s helpful to know that you see these as providing new valuable ontologies to some extent.
In my mind, there is not much ontological innovation going on in these concepts, because they can be stated in one sentence using pre-existing concepts. The vulnerable world hypothesis is the idea that, of all the technologies we will eventually develop, at some point one of them will allow whoever develops it to easily destroy everyone else. Astronomical waste is the idea that there is a massive amount of stuff in space, but that if we wait a hundred years before grabbing it all, we will still be able to grab pretty much just as much, so there is no need to rush.
To be clear, I think that this work is great. I just thought you had something more illegible in mind by what you consider to be ontological progress. So maybe we’re closer to each other than I thought.
Extinction:
But actually when you start to think about possible edge cases (like: are humans extinct if we’ve all uploaded? Are we extinct if we’ve all genetically modified into transhumans? Are we extinct if we’re all in cryonics and will wake up later?) it starts to seem possible that maybe “almost all of the action” is in the parts of the concept that we haven’t pinned down.
It sometimes seems to me like you jump to the conclusion that all the action is in the edge cases without actually arguing for it. According to most of the traditional stories about AI risk, everyone does literally die. And in worlds where we align AI, I do expect that people will be able to stay in their biological forms if they want to.
Lock-in:
concepts like “lock-in”
I’m sympathetic that there’s useful work to do in finding a better ontology here.
Human power grabs:
“human power grabs” (I expect that there will be strong ambiguity about the extent to which AIs are ‘responsible’)
I’ve seen you say this a lot, but I still haven’t seen you actually argue for it convincingly. It seems totally possible that alignment will be easy, and that the only force behind the power grab will come from humans, with AI only doing it because humans train it to do so. It also seems plausible that the humans who develop superintelligence don’t attempt a power grab, but that the AI is misaligned and does so itself. In my mind, both of the pure-case scenarios are very plausible. Again, it seems to me like you’re jumping to the conclusion that all the action is in the edge cases, without arguing for it convincingly.
Separating out the two is useful for thinking about mitigations, because there are certain technical mitigations you would do for misaligned AI that don’t help with human motivation to seek power, and certain technical and governance mitigations you would do if you’re worried about humans seeking power that would not help with misaligned AIs.
Epistemics:
“societal epistemics”
It seems pretty plausible to me that if you improved our fundamental understanding of how societal epistemics works, that would really help with improving it. At the same time, I think identifying that this is a massive lever over the future is important strategy work even if you haven’t yet developed the new ontology. This might be like identifying that AI takeover risk is a big risk without developing the ontology needed to, say, solve it.
Zooming out:
In general, a theme here is that I find myself more sympathetic to your claims when it comes to fully solving a very complex problem like alignment, but I disagree that you need new ontologies to identify new, important problems.
I like the idea that you could play a role translating between the pro-illegible camp and the people more sympathetic to legibility, because I think you are a clear writer but certainly seem drawn to illegible things.
To me there seem to be many examples of good impactful strategy research that don’t introduce big new ontologies or go via illegibility:
initial case for AI takeover risk
simulation argument
vulnerable world hypothesis
standard arguments for longtermism
argument that avoiding extinction or existential risk is a tractable way to impact the long term
astronomical waste
highlighting the risk of human power grabs
importance of using AI to upgrade societal epistemics / coordination
risk from a software-only intelligence explosion
I do also see examples of big contributions that are in the form of new ontologies, like Reframing Superintelligence. But these seem less common to me.
ML research directions for preventing catastrophic data poisoning
If you have utility proportional to the logarithm of 1 dollar plus your wealth, and you Nash bargain across all your possible selves, you end up approximately maximizing the expected logarithm of the logarithm of 1 dollar plus your wealth.
I wonder if you could make the result here imply a lot less extreme risk aversion if you took the disagreement point to be “your possible selves control money proportional to their probability” rather than “no money”.
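To spell out the derivation I have in mind (my own notation, so treat this as a sketch): write $u(w) = \log(1+w)$, and say the self that occurs with probability $p_i$ ends up with wealth $w_i$ and has disagreement utility $d_i$. The probability-weighted Nash bargain maximizes
$$\prod_i \big(u(w_i) - d_i\big)^{p_i} \;\Longleftrightarrow\; \sum_i p_i \log\!\big(\log(1+w_i) - d_i\big).$$
With the “no money” disagreement point, $d_i = u(0) = 0$, this is $\mathbb{E}[\log\log(1+w)]$, i.e. the doubly-logarithmic objective. If instead each self’s disagreement point is controlling money in proportion to its probability, say $d_i = \log(1 + p_i W)$ for total wealth $W$ (my guess at how to formalise it), the inner term becomes $\log\frac{1+w_i}{1+p_i W}$, which I’d guess implies much less extreme risk aversion.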
I’d have thought that the METR trend is largely newer models sustaining the SAME slope but for MORE TOKENS. I.e. the slope goes horizontal after a bit for AI (but not for humans), and the point at which it goes horizontal is being delayed more and more.
Yeah, I thought this piece struck a really nice tone and pointed to something important.
Re the counterfactual, hard to know. I was already thinking about risks from centralising AGI development, and about the ease of the leading project getting a DSA at this point. And I think Lukas was already thinking about the risk of AI-enabled coups. So I think it’s pretty unlikely that this was counterfactually responsible for the ai-enabled coups report happening vs not.
But I certainly read this piece and it influenced my thinking, and I believe Lukas read it as well. I think it made me feel more confident to lean in to disagreement with the status quo and more conviction in doing that.
Thanks! (Quickly written reply!)
I believe I was here thinking about how society has, at least in the past few hundred years, spent a minority of GDP on obtaining new raw materials. Which suggests that access to such materials wasn’t a significant bottleneck on expansion.
So it’s a stronger claim than “hard cap”. I think a hard cap would, theoretically, result in all GDP being used to unblock the bottleneck, as there’s no other way to increase GDP. I think you could quantify the strength of the bottleneck as the marginal elasticity of GDP to additional raw materials. In a task-based model, I think the % of GDP spent on each task is proportional to this elasticity?
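(To spell out that last claim, under standard assumptions that aren’t in the original post: if output is Cobb-Douglas over tasks, $Y = A \prod_i X_i^{\alpha_i}$, then the elasticity of GDP to task $i$’s input is $\partial \ln Y / \partial \ln X_i = \alpha_i$, and competitive pricing gives payments $p_i X_i = (\partial Y / \partial X_i)\,X_i = \alpha_i Y$, so the share of GDP spent on each task equals its elasticity.)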
Though this seems kind of like a fully general argument
Yeah, I think maybe it is? I do feel like, given the very long history of sustained growth, it’s on the sceptic to explain why their proposed bottleneck will kick in with explosive growth but not before. So you could state my argument as: raw materials never bottlenecked growth before; there’s no particular reason they would just because growth is faster, since that faster growth is driven by having more labour and capital, which can be used to gather more resources; so we shouldn’t expect raw materials to bottleneck growth in the future.
TBC, this is all compatible with “if we had way more raw materials then this would boost output”. E.g. in Cobb-Douglas, doubling an input increases output notably, but there still aren’t bottlenecks.
(And I actually agree that it’s more like CES with rho<0, i.e. raw materials are a stronger bottleneck, but I just think we’ll be able to spend output to get more raw materials.)
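(For reference, the CES form I’m gesturing at, in standard notation rather than anything from the original post: $Y = A\big(\sum_i \alpha_i X_i^{\rho}\big)^{1/\rho}$. As $\rho \to 0$ this tends to Cobb-Douglas, where doubling one input scales output by $2^{\alpha_i}$ and nothing acts as a hard cap; as $\rho \to -\infty$ it tends to Leontief, $Y \propto \min_i X_i$, where the scarcest input is a hard cap. A $\rho < 0$ in between means raw materials bind more tightly than in Cobb-Douglas, but output can still be spent on acquiring more of them.)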
(Also, to clarify: this is all about the feasibility of explosive growth. I’m not claiming it would be good to do any of this!)
Yeah, I think one of the biggest weaknesses of this model, and honestly of most thinking on the intelligence explosion, is not carefully thinking through the data.
During a software intelligence explosion (SIE), AIs will need to generate data themselves, by doing the things that human researchers currently do to generate data. That includes finding new untapped data sources, creating virtual environments, creating SFT data themselves by doing tasks with scaffolds, etc.
On the one hand, it seems unlikely they’ll have anything as easy as the internet to work with. On the other hand, internet data is actually very poorly targeted at teaching AIs how to do crucial real-world tasks, so perhaps with abundant cognitive labour you can do much better and make curricula that directly target the skills that most need improving.
Yep, the ‘gradual boost’ section is the one for this. Also my historical work on the compute-centric model (see link in post) models gradual automation in detail.
So if you’ve fully ignored the fact that pre-ASARA systems have sped things up, then accounting for that will make takeoff less fast, because by the time ASARA comes around you’ll have already plucked much of the low-hanging fruit of software progress.
But I didn’t fully ignore that, even outside of the gradual boost section. I somewhat adjusted my estimate of r and of “distance to effective limits” to account for intermediate software progress. Then, in the gradual boost section, I got rid of these adjustments as they weren’t needed. It turned out that takeoff was then faster. My interpretation (as I say in the gradual boost section): dropping those adjustments had a bigger effect than changing the modelling.
To put it another way: if you run the gradual boost section but literally leave all the parameters unchanged, you’ll get a slower takeoff.
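To illustrate the direction of that effect with a throwaway toy (emphatically not the model from the post; the functional forms, parameter values, and the takeoff_time helper below are all made up for illustration):

```python
# Toy comparison, with ALL parameters held fixed: does crediting pre-ASARA systems
# with some software progress make the post-ASARA takeoff faster or slower?
#
# Made-up assumptions, chosen only to capture "diminishing low-hanging fruit":
#  - software progress u is measured in OOMs above a pre-AI baseline
#  - returns r(u) decay linearly to zero over D OOMs of total headroom
#  - research throughput after ASARA scales with 10**(progress since ASARA),
#    normalised to 1 at the moment ASARA arrives

def takeoff_time(ooms_plucked_pre_asara, D=10.0, r0=1.2, target=5.0, dt=1e-3):
    """Time (arbitrary units) for software to advance `target` OOMs after ASARA."""
    u = ooms_plucked_pre_asara       # OOMs already gained before ASARA
    start, t = u, 0.0
    while u - start < target:
        r = r0 * max(0.0, 1.0 - u / D)   # returns shrink as fruit is used up
        if r == 0.0:
            return float("inf")          # effective limits reached first
        throughput = 10 ** (u - start)   # AI research effort grows with software
        u += r * throughput * dt
        t += dt
    return t

print(takeoff_time(0.0))  # ignore pre-ASARA software progress entirely
print(takeoff_time(3.0))  # pre-ASARA progress already used 3 of the 10 OOMs
```

The second run starts with less low-hanging fruit remaining (so lower returns throughout), and therefore takes longer to cover the same five OOMs, which is the “slower takeoff with unchanged parameters” direction described above.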
Tom Davidson’s Shortform
Forethought is hiring!
You can see our research here.
You can read about what it’s like to work with us here.
We’re currently hiring researchers, and I’d love LW readers to apply.
If you like writing and reading LessWrong, I think you might also enjoy working at Forethought.
I joined Forethought a year ago, and it’s been pretty transformative for my research. I get lots of feedback on my research and great collaboration opportunities.
The median views of our staff are often different from the median views of LW. E.g. we probably have a lower probability on AI takeover (though I’m still >10% on that). That’s part of the reason I’m excited for LW readers to apply. I think a great way to make intellectual progress is via debate, so we want to hire people who strongly disagree with us and have their own perspectives on what’s going on in AI.
We’ve also got a referral bounty of £10,000 for counterfactual recommendations for successful Senior Research Fellow hires, and £5,000 for Research Fellows.
The deadline for applications is Sunday 2nd November. Happy to answer questions!
I also work at Forethought!
I agree with a lot of this post, but wanted to flag that I would be very excited for people doing blue-skies research to apply, and I want Forethought to be a place that’s good for that. We want to work on high-impact research and understand that sometimes means doing things where it’s unclear up front whether they will bear fruit.
(Fyi the previous comment from “Tom” was not actually from me. I think it was Rose. But this one is from me!)
Worth noting that the “classic” AI risk story also relies on human labour no longer being needed. For AI to seize power, it must be able to do so without human help (hence human labour not being needed), and for it to kill everyone, human labour must not be needed to make new chips and robots.
Thanks, I like this!
Haven’t fully wrapped my head around it yet, but will think more.
One quick minor reaction is that I don’t think you need IC stuff for coups. To give a not very plausible but clear example: a company has a giant intelligence explosion and then can make its own nanobots to take over the world. That doesn’t require broad automation, incentives for governments to serve their people to change, etc.
Seems less likely to break if we can get every step in the chain to actually care about the stuff we care about, so that it’s really trying to help us think of good constraints for the next step up as much as possible, rather than just staying within its constraints and not lying to us.