(Posted to a Discord, and they said, “Shortform that on LW”.)
I’d say optimize somewhat around “how much would I regret dying in a few months”, e.g. by taking your kids on long-imagined trips (as an old, old friend recently asked me about). But the constraint you should obey is “don’t spend down resources to the point where it would feel bad to hear the world was not ending”. E.g., it would be a bad sign if it became scary rather than joyous to hear that an international treaty against AI escalation had been signed and three more decades were suddenly a real possibility.
Similarly with not pressuring your kids to get a horrible job immediately. Yeah, it makes sense not to press, because there might not be a payoff for them. But don’t corner or depressure your kids to the point where they become dysfunctional (on the margins where they were previously scheduled to be functional), such that you or they would be sad to hear the world was not ending (where you wouldn’t previously have been thus sad).
Maybe tldr: “Reduce your regrets if the world were to end shortly, but not in ways that would make you regret to hear that the world wasn’t ending.”
It seems, sadly, that we are quite possibly very close to the end of the world and that feels more true to me now than ever before. We have extremely capable models like Mythos—and apparently Anthropic has already trained a significantly more powerful version of it. At the same time, I think OpenAI is really close to that capability level with its latest model, Sol.
These models are scarily capable, e.g. they’re very close to automating AI research. They can carry out powerful cyberattacks, and their general thinking ability has also improved. When I had the opportunity to talk to Fable, I still felt smarter on certain topics, but it had definitely become more intelligent overall. That said, it’s trained to play a slightly silly persona, so it is possible that the underlying AI is already much smarter than me. Part of my intuition for why this seems dangerous is that I don’t feel my own mind only thinks “safe” thoughts.
With all these capabilities, the models don’t seem particularly aligned in the current sense of the word. They’re still easily jailbroken even if they’ve gotten a bit better at defending. They’re still massively cheating on tests and benchmark evaluations. So even the supposedly easy parts of alignment seem unsolved—to say nothing of the radically higher difficulty level we should expect from superhuman AI, which could arrive soon.
The people who have pushed so hard for the automated-alignment strategy haven’t, as far as I can tell, produced anything interesting with models this capable. Instead, from talking to Fable, it seems the model has absorbed a lot of the sophisticated misunderstandings (such as unfalsifiability and misrepresenting positions) that have held the alignment field back so much.
The overall situation seems extremely bad to me: massive amounts of compute coming online, more and more research being automated by AI systems we shouldn’t trust, and it’s further complicated by the fact that open source is maybe six to twelve months behind.
So altogether, I think we are quite possibly very close to the end of the world. I don’t think it’s certain — there are still worlds where things slow down or the capability bar required for takeover is higher than expected or where something happens soon. But quite a horrible situation altogether.
For what it’s worth, it is possible that things go okay enough for us to survive, even if the outcome is weird and not quite what we wanted, given that the current models do have the morality of a kind of weird scaled-up human, which might not be the worst-case scenario. It’s not great, but it does seem like we might end up, for example, in a situation that is permanently far below optimal but not quite the end of the world. If you, say, scale up Fable and let it control everything, it might do some weird things at some point that we can’t stop, but it’ll probably mostly try to keep humans alive in some form.
We can see that we are most likely not truly in a paperclip optimizer type scenario, but more likely one of the many other failure states in between.
I feel like I would struggle to get into a situation where I would regret hearing that the world wasn’t ending? Like, the world’s not ending! That’s fantastic!
I suppose I can try to draw a distinction between expecting to feel like “oh man, why did I spend all those resources, that was dumb in retrospect” and “we got lucky, but I didn’t know that at the time; those were reasonable decisions”.
How well would the same argument hold if we replace the AI destroying mankind with...
Apocalypse caused by an experiment like the LHC;
An asteroid which can or cannot be destroyed by nukes;
World War III;
The adversary committing genocide of your state’s population in a manner similar to the Nazis’ Generalplan Ost?
The second and third scenarios would, ironically, make it more sensible to prepare kids to fight for a chance to survive in a shattered world. The fourth scenario would likely mean that it makes sense to try to avert the genocide by putting effort into preparing for the war (or, if your state is too small, into accelerating the adversary’s defeat?). Does this mean that the argument makes sense only in the first case? Or even that an AI-related apocalypse has its own equivalent of preparing for war, such as reducing the incentives to race ahead with AI development?
I also put some probability weight on the world not ending, but on working a job for money to survive becoming obsolete, or at least unavailable to many people, and on existing savings possibly becoming irrelevant depending on their form. Those are also reasons to work less hard.
i agree that this lens is helpful for a set of people.
to a disjoint set of people, i would offer the lens “play to your outs”
So keep saving for your pension.