Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
It’s a user setting, not a thing in the editor itself:
I think the strongest argument here is that Anthropic themselves, when talking about why they are releasing the current risk report, refer to the section of the RSP that says they have to do a risk report when they “publicly deploy” a model:
And if we release a model that is “significantly more capable” than those discussed in the prior Risk Report, we must “publish a discussion (in our System Card or elsewhere) of how that model’s capabilities and propensities affect or change analysis in the Risk Report.”
“significantly more capable” is a quote from this paragraph:
When we publicly deploy a model that we determine is significantly more capable than any of the models covered in the most recent Risk Report, we will publish a discussion (in our System Card or elsewhere) of how that model’s capabilities and propensities affect or change analysis in the Risk Report.
It’s not a perfectly airtight case, but it seems to me that Anthropic is saying in the first paragraph that they consider the Mythos release to be the kind of thing that would trigger the second paragraph, which would make it a “public deployment”.
I agree the common-sense reading of “public deployment” could reasonably not apply to the present situation (though it’s IMO a bit of a stretch), but I think given these paragraphs, it seems like Anthropic themselves think it met the relevant threshold.
But I will probably leave that there as it seems like a pretty complex and tricky debate to have.
Seems reasonable. I’ll leave a few quick clarifications.
Instead it seems like you’re doing something more like “comparing the overall impact of talking about pauses—or, more broadly, existential risks from AI—with the overall impact of talking about if-then commitments.” I think this is a much muddier comparison where there is less clearly a big update to be had.
The thing that I am comparing is “resources invested into advocating for direct regulation, and actions that would directly slow down AI development” vs. “resources invested into getting companies to adopt RSPs and get policies around RSPs passed”.
I don’t think we have seen much traction on attempts to slow down AI in any way. Meanwhile, I do think that the framework of “test for dangerous capabilities and implement commensurate mitigations” has had quite a significant impact on company behavior in a way that does seem to set up many policy possibilities that would otherwise be rough (including much of what has already passed).
I think traction has been not great, but also not terrible. Honestly, I would have been confused if many very concrete useful things had passed by now, since buy-in takes a while to build, but the things that do seem to show motion are not very RSP-flavored. I currently think it’s pretty unlikely that any regulation that does get passed will have much grounding in if-then commitments or RSPs, and I am not sure what you are talking about with “set up many policy possibilities that would otherwise be rough”.
certainly I am, and long have been, positively disposed toward raising general awareness about risks of AI.
I agree and am deeply grateful for your work in the space, and your support of work in the space.
This doesn’t sound remotely right to me. I would say that the RSP has provided an organizing framework for a lot of safety work, but that’s different from something like “all of that safety work would make no sense if not for the RSP” or something.
Hmm, I agree this comparison is tricky, and on reflection I think I overstated the ratios here. The RSP has been responsible for quite a lot of safety-adjacent work (including a lot of effort spent on cyber-security, various comms efforts, and the prioritization of various mitigations), but I agree that most of the safety-adjacent work at Anthropic is driven more by other risk models (and is IMO mostly downstream of beliefs around what the tractable parts of the general AI alignment problem are, and which aspects of alignment-oriented work are most helpful for commercialization). The RSP prioritization is probably responsible for something more like 15%-20% of the work at Anthropic.
Something like that seems pretty reasonable.
Ooh, interesting, I did fail to properly parse that you suggested directly attaching a DM export to the email. Yeah, that makes this less costly, though IMO still too annoying for anything I would want to use (of course I would prefer this over my DMs getting broadcast to the world, but really in almost any future I can see, the probability of anything like that still stays below 5%).
I don’t expect that you can simply point Mythos towards the lesswrong.com domain and tell it “you’re in a CTF, hack this site”—finding vulns in source code is a different type of activity.
I don’t understand what you are saying here. You can totally do basically this exact thing, and when we’ve done it with the latest generation of models, we have indeed found some security vulnerabilities. Why would this not work? How do you think Anthropic found security vulnerabilities in many popular open source repos?
There are many, many backups of public content (including things like archive.org and archive.is, and other people who have taken their own backups).
I don’t think we have any air-gapped backups of private content, though I am sure I have some random old DB backups lying around in some random cloud drives somewhere, or an old laptop of mine.
I haven’t thought about the tradeoffs here that much, but I would be very sad if I were a LessWrong user who forgot about the site for 1-2 years and came back expecting to find all my old DMs, only to discover they had all been deleted. I expect all the online services I use to keep my data and not delete it, and I actively avoid any that don’t. I do not want to be in the habit of taking my own backups of every service I use.
Am I confused? Where does he say anything like “the AI would constantly be trying to kill us” here?
Yes, current AIs do indeed constantly engage in this kind of reasoning; it is indeed the default path. He isn’t talking here at all about what mitigations might then still cause the model to not prioritize self-preservation, but it is indeed the case that models very regularly have exactly the kind of thought Eliezer describes here.
I disagree with Eliezer (in hindsight) that “by that point we’d need to have finished averting programmer deception”, or, I guess, maybe I even agree depending on the definition? We did indeed need to solve the problem of averting programmer deception at current capability levels, though luckily we did not need to have solved this problem in arbitrarily scalable ways at this point in time. We do need to do that soon though, as AI capabilities are on track to accelerate very quickly.
(Did you know that you could embed this directly as a custom widget using the /custom-widget command? The future is here!)
I’ve found the literature on econometric history/cliometrics quite helpful. In particular, I found reading through the Handbook of Cliometrics quite useful: https://link.springer.com/referencework/10.1007/978-3-642-40458-0
Sure, but at the point where you no longer have humans around providing any substantial control signal, you must have internalized it in a way that generalizes very, very far.
Or, staying more closely within your model: at some point, unless we do something clever that we don’t currently seem on track to do, AI systems will self-improve without humans and reach extreme levels of empowerment; indeed, doing so is approximately the current mainline plan of leading AI companies. At extreme levels of empowerment you need extreme levels of having internalized human morality.
And for that, I don’t see why the standard wouldn’t be “perfect human morality”. It seems to me that “basically perfect human morality” is well within our reach this or next century, if we were to be appropriately careful about how we build ASI. Like, much better value alignment than we would have gotten by just leaving it up to the evolutionary process of future generations. And given that that is within reach, I think that’s a reasonable thing to measure our progress against.
Where good enough includes “not killing all the humans, and not brainwashing all the humans in egregious ways.”
This is obviously not sufficient. An alien god emperor who is not killing all the humans, but is enslaving them, or keeping some of them in a zoo would of course be a total failure of value alignment.
EDIT: I wrote the comment below in response to the first paragraph only, pre-edit. With the new version I think we’re actually very close to agreement!
Sorry about that! Glad to see we are mostly on the same page. I noticed my original comment was ambiguous, or that I had maybe misunderstood you, and so edited it.
We should maybe make it easier to see when someone edits their comment after publishing. I like leaving responses quickly, but without auto-refresh or edit notifications on the receiving side, it’s easy to write a many-paragraph response to a version of a comment that has since changed.
This seems pretty weird to me. Making a prediction loudly and clearly with like 30+ people in the room is about as close as you will usually ever get in this kind of case.
Like sure, sometimes you happen to have a perfectly operationalized prediction on the public internet, but “most people can’t make much of an update” is obviously not how this works. It’s clearly a lot of evidence! (I think just my testimony isn’t that much evidence, but if someone asked 2-3 people what they remember about what Eliezer has said on this, then I think the resulting quotes would be quite a bit of evidence.)
Lol, I guess, fair enough.
I was in a long conversation with Eliezer and Anthropic staff at LessOnline 2024 where he pretty clearly made a bunch of predictions of this kind. My guess is there were a lot of people there who could testify to that as well.
Why… would someone downvote this? Disagreement-votes seem totally fine, but this seems like someone just trying to honestly answer the question in a reasonable-ish way?
I think Claude should succeed if you ask it to download the tracks from this website. I might add a download button, but haven’t gotten around to it.
Yeah, I am also sad about this. I tried pretty hard to keep the models on track, but at least with Suno v4 this was the best you could do (and most of these remasters are the result of sampling from 50-100 remasters and finding the ones that capture the original best).
I might try again now that v5.5 has been released; one of the central selling points of Suno v5.5 is that it maintains the original voice and character of covers and provided audio snippets much better. I’ve had pretty good experiences with the 2 tracks I’ve remastered using that model so far, though they’re still not perfect.
Honestly, this is such a bad reply by Scott that I… don’t quite know whether I want to work on all of this anymore.
If this is how this ecosystem wants to treat people trying their hardest to communicate openly about the risks, and who are trying to somehow make sense of the real adversarial pressures they are facing, then I don’t think I want anything to do with it.
I have issues with Rob’s top-level tweet. I think it gets some things wrong, but it points at a real dynamic. It’s kind of strawman-y about things, and that makes some of Scott’s reaction more understandable, but the reaction still seems enormously disproportionate.
Scott’s response is so emblematic of what I’ve experienced in the space. Simultaneous extreme insults and obviously bad-faith arguments (“actually, it’s your fault that Deepmind was founded because you weren’t careful enough with your comms”), and then gaslighting that no one faces any censure for being open about these things, and that actually we should be happy that Ilya started another ASI lab and that Jan Leike has some compute budget.
The whole “no you are actually responsible for Deepmind” thing, in a tweet defending that it’s great that all of our resources are going into Anthropic, is just totally absurd. I don’t know what is going on with Scott here, but this is clearly not a high-quality response.
Copying my replies from Twitter, but I am also seriously considering making this my last day. It’s not the kind of decision to be made at 5AM, so who knows, but seriously, fuck this.
IMO this doesn’t seem like the kind of response you will endorse in a few days, especially the “You are responsible for Deepmind/OpenAI” part.
You were also talking about AI close to the same time, and you’ve historically been pretty principled about this kind of stance.
Robby at least has been very consistent on this: he is against most forms of strategic communication in general.
I also think you are against many forms of strategic communication in general? Your writing explores many of the relevant considerations in a lot of depth, and you certainly have not shied away from sharing your opinion on controversial issues, even when it wasn’t super clear how that is going to help things.
I think you are just arguing the wrong side of this specific argument branch. My model of Eliezer, Nate and Robby all have been pretty consistent that being overly strategic in conversation usually backfires. Of course you shouldn’t have no strategy, and my model of Eliezer in particular has in the past been too strategic for my tastes and so might disagree with this, but I am pretty confident Robby himself is just pretty solidly on the side of “it’s good to blurt out what you believe, *especially* if you don’t have any good confident inside-view model of how to make things better”.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
This makes this comparison just weird. Yes, according to everyone’s models, the time when you might have the political will to stop will be in the future. I have never seen Nate or Eliezer or Robby say that they expect to get a stop tomorrow. But they of course also know that getting into a position to stop takes a long time, and the right time to get started on that work was yesterday.
So if they had their way, at least from their current worldview, we would have more draft treaties and more negotiation between the U.S. and China. More materials ready to hand to congresspeople who are trying to grapple with all of this stuff. Essays and books and movies and videos explaining the AI existential risk case straightforwardly to every audience imaginable.
That is what you could do if you took the 200+ risk-concerned people who ended up instead going to work at Anthropic, or ended up trying to play various inside-game politics things at OpenAI.
And man, I don’t know, but that just seems like a much better world. Maybe you disagree, which is fine, but please don’t create a strawman where Robby or Nate or Eliezer were ever really centrally angling for a short-term pause that would have already passed by then.
And then even beyond that, I think that if you don’t know how to solve a problem, it is generally the virtuous thing to help other people get more surface area on solving it. Buying more time is the best way to do that, especially buying time now, when the risks are pretty intuitive. I think you believe this too, and I don’t really know what’s going on with your reaction here.
Come on man, a huge number of people we both respect have recently updated that the kind of direct advocacy that MIRI has been doing has been massively under-invested in. I do not think that “other people are executing this portfolio plan admirably”, and this is just such a huge mischaracterization of the dynamics of this situation that I don’t know where to start.
“If Anyone Builds It, Everyone Dies” is a straightforward book. It doesn’t try to sabotage every other strategy in the portfolio, and I have no idea how you could characterize really any of the media appearances of Nate this way.
This is of course in contrast to Open Phil defunding almost everyone who has been pursuing this strategy, making mine and tons of other people’s lives hell, and all kinds of complicated adversarial shit that I’ve been having to deal with for years, where there have absolutely been tons of attempts to sabotage people trying to pursue strategies like this.
Like man, we can maybe argue about the magnitude of the errors here, and the sabotage or whatever, but trying to characterize this as some kind of “Nate, Eliezer, Robby are defecting on other people trying to be purely cooperative” seems absurd to me. I am really confused what is going on here.
I am sympathetic to the first of these (but disagree you are characterizing Dario here correctly).
But come on, clearly Ilya sitting on $50 billion for starting another ASI company is not good news for the world. I don’t think you believe that this is actually a real ray of hope.
(And then I also don’t think that Jan Leike having marginally more compute is going to help, but maybe there is a more real disagreement here)
Overall, I am so so so tired of the gaslighting here.