michaelcohen

Karma: 626

michaelcohen 28 Apr 2026 17:54 UTC
2 points
0
in reply to: Andrea_Miotti’s comment on: Preventing extinction from ASI on a $50M yearly budget
Woohoo!

michaelcohen 25 Apr 2026 5:44 UTC
LW: 3 AF: 2
2
AF
on: Preventing extinction from ASI on a $50M yearly budget
Connor Leahy, what do you think about moving to America?

michaelcohen 22 Apr 2026 16:10 UTC
1 point
−2
in reply to: Alex Amadori’s comment on: Preventing extinction from ASI on a $50M yearly budget
“We’ve also written about how ASI prevention could happen through more distributed coalitions at asi-prevention.com if you’re interested.”
It’s not impossible, and it’s not a bad Plan C, but I just don’t think it’s a serious Plan A or B.
US is just so much more important.
“fostering popular support”
In what country? It has to be in the US. US voters are not paying attention to anything happening in other countries except for war.

michaelcohen 22 Apr 2026 15:54 UTC
1 point
0
in reply to: AprilSR’s comment on: Preventing extinction from ASI on a $50M yearly budget
Can be US-China led

michaelcohen 21 Apr 2026 20:41 UTC
7 points
3
in reply to: Alex Amadori’s comment on: Preventing extinction from ASI on a $50M yearly budget
“A group of 2 to 3 middle powers who are extremely motivated to get an international prohibition on ASI could go a long way towards getting it done”
I just don’t see that happening. I think it’s US-led or bust. I think it is very unlikely that UK/EU/Canada leading by example will work. I think the attitude toward even just the EU AI Act on Capitol Hill is mostly derision.

Whereas I would love love love to see your successes in the UK be replicated in Washington.

michaelcohen 21 Apr 2026 19:34 UTC
LW: 15 AF: 7
5
AF
on: Preventing extinction from ASI on a $50M yearly budget
What fraction of ControlAI’s growth do you think should be in the US? I think at least 90%, and maybe over 100%!

michaelcohen 18 Apr 2026 3:32 UTC
1 point
0
in reply to: Alexander Gietelink Oldenziel’s comment on: The AIXI perspective on AI Safety
Just saw this comment because Cole tagged me, and I haven’t read the rest of the context here, but I just want to quickly say that inner misalignment was first conceptualized in the AIXI framework! So while I don’t buy inner misalignment as a likely problem for highly advanced agents, it is certainly compatible with the AIXI framework.

michaelcohen 30 Jan 2026 21:58 UTC
LW: 1 AF: 1
0
AF
in reply to: Adrià Garriga-alonso’s comment on: Alignment will happen by default. What’s next?
This is now a completely different topic. Do you take my point?

michaelcohen 27 Dec 2025 9:22 UTC
LW: 1 AF: 1
0
AF
on: Alignment will happen by default. What’s next?
The “feeling bad about reward hacking” is an artifact of still being regularized too closely to a human-like base model that further RL training would eliminate.

Safety cases for Pessimism

michaelcohen 8 Sep 2025 13:26 UTC

18 points

1 comment · 4 min read · LW link

michaelcohen 17 Aug 2023 20:59 UTC
LW: 13 AF: 6
0
AF
in reply to: paulfchristiano’s comment on: Thoughts on sharing information about language model capabilities
“I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.”
I agree with this in a sense, although I may be quite a bit more harsh about what counts as “executing an action”. For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as “executing the action” in the overseer-conversation environment, even if the action looks like it’s for some other environment, like a plan to launch a new product in the market. I do think myopia in this environment would suffice for existential safety, but I don’t know how much myopia we need.
If you’re always talking about myopic/process-based RLAIF when you say RLAIF, then I think what you’re saying is defensible. I speculate that not everyone reading this recognizes that your usage of RLAIF implies RLAIF with a level of myopia that matches current instances of RLAIF, and that that is a load-bearing part of your position.
I say “defensible” instead of fully agreeing because I weakly disagree that increasing compute is any more of a dangerous way to improve performance than by modifying the objective to a new myopic objective. That is, I disagree with this:
“I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models”
You suggest that increasing compute is the last thing we should do if we’re looking for performance improvements, as opposed to adding a very myopic approval-seeking objective. I don’t see it. I think changing the objective from imitation learning is more likely to lead to problems than scaling up the imitation learners. But this is probably beside the point, because I don’t think problems are particularly likely in either case.

michaelcohen 12 Aug 2023 6:37 UTC
LW: 8 AF: 5
0
AF
in reply to: paulfchristiano’s comment on: Thoughts on sharing information about language model capabilities
What is process-based RL?
I think your intuitions about costly international coordination are challenged by a few facts about the world.
1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders, in a way that seems kinda speculative, but intuitively forceful for most people, have the potential to existentially threaten the integrity of a culture, including especially its norms; AI, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten all life. The decisions of wealthy countries are apparently extremely strongly correlated, maybe in part for “we’re all human”-type reasons, and maybe in part because legislators and regulators know that they won’t get their ear chewed off for doing things like the US does. With immigration law, there is no attempt at coordination; quite the opposite (e.g. Syrian refugees in the EU).
2) The number of nuclear states is stunningly small if one follows the intuition that wildly uncompetitive behavior, which leaves significant value on the table, produces an unstable situation. Not every country needs to sign on eagerly to avoiding some of the scariest forms of AI. The US/EU/China can shape other countries’ incentives quite powerfully.
3) People in government do not seem to be very zealous about economic growth. Sorry this isn’t a very specific example. But their behavior on issue after issue does not seem very consistent with someone who would see, I don’t know, 25% GDP growth from their country’s imitation learners, and say, “these international AI agreements are too cautious and are holding us back from even more growth”; it seems much more likely to me that politicians’ appetite for risking great power conflict requires much worse economic conditions than that.
In cases 1 and 2, the threat is existential, and countries take big measures accordingly. So I think existing mechanisms for diplomacy and enforcement are powerful enough “coordination mechanisms” to stop highly-capitalized RL projects. I also object a bit to calling a solution here “strong global coordination”. If China makes a law preventing AI that would kill everyone with 1% probability if made, that’s rational for them to do regardless of whether the US does the same. We just need leaders to understand the risks, and we need them to be presiding over enough growth that they don’t need to take desperate action, and that seems doable.
Also, consider how much more state capacity AI-enabled states could have. It seems to me that a vast population of imitation learners (or imitations of populations of imitation learners) can prevent advanced RL from ever being developed, if the latter is illegal; they don’t have to compete with them after they’ve been made. If there are well-designed laws against RL (beyond some level of capability), we would have plenty of time to put such enforcement in place.

michaelcohen 10 Aug 2023 1:32 UTC
LW: 10 AF: 3
0
AF
on: Thoughts on sharing information about language model capabilities
“I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,^[4] and in my view this is looking more and more plausible over time.”
I agree whole-heartedly with the first sentence. I’m not sure why you understand it to support the second sentence; I feel the first sentence supports my disagreement with the second sentence! Long-horizon RL is a different way to get superhuman systems, and one that encourages intervening in feedback if the agent is capable enough. Doesn’t the first sentence support the case that it would be safer to stick to chain of thought and decomposition as the key drivers of superhumanness, rather than using RL?

IRL in General Environments

michaelcohen 9 Mar 2023 13:32 UTC

8 points

20 comments · 1 min read · LW link

Utility uncertainty vs. expected information gain

michaelcohen 9 Mar 2023 13:32 UTC

13 points

9 comments · 1 min read · LW link

Value Learning is only Asymptotically Safe

michaelcohen 9 Mar 2023 13:32 UTC

5 points

19 comments · 1 min read · LW link

Impact Measure Testing with Honey Pots and Myopia

michaelcohen 9 Mar 2023 13:32 UTC

17 points

9 comments · 1 min read · LW link

Just Imitate Humans?

michaelcohen 9 Mar 2023 13:31 UTC

11 points

72 comments · 1 min read · LW link

Build a Causal Decision Theorist

michaelcohen 9 Mar 2023 13:31 UTC

−2 points

14 comments · 4 min read · LW link

michaelcohen 7 Nov 2022 9:38 UTC
LW: 2 AF: 1
1
AF
in reply to: mwacksen’s comment on: AI X-risk >35% mostly based on a recent peer-reviewed argument
Me: Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.
You: No it doesn’t. They just care about interestingness.
Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?
You: Yes, but...
If you can admit that we agree on this basic point, I’m happy to discuss further how good they are at what they aim to do.
1: If retractions were common, surely you would have said that was evidence peer review didn’t accomplish much! If academics were only equally good at spotting mistakes immediately, they would still spot the most mistakes because they get the first opportunity to. And if they do, others don’t get a “chance” to point out a flaw and have the paper retracted. Even though this argument fails, I agree that journals are too reluctant to publish retractions; pride can sometimes get in the way of good science. But that has no bearing on their concern for validity at the reviewing stage.
2: Some amount of trust is taken for granted in science. The existence of trust in a scientific field does not imply that the participants don’t actually care about the truth. Bounded Distrust.
3: Since some level of interestingness is also required for publication, this is consistent with a top venue having a higher bar for interestingness than a lesser venue, even while they have the same requirement for validity. And this is definitely in fact the main effect at play. But yes, there are also some lesser journals/conferences/workshops where they are worse at checking validity, or they care less about it because they are struggling to publish enough articles to justify their existence, or because they are outright scams. So it is relevant that AAAI publishes AI Magazine, and their brand is behind it. I said “peer reviewed” instead of “peer reviewed at a top venue” because the latter would have rubbed you the wrong way even more, but I’m only claiming that passing peer review is worth a lot at a top venue.
