I jumped off a small cliff into a lake once, and when I was standing on the rock, I couldn’t bring myself to jump. I stepped back to let another person go, and then I stepped onto the rock and jumped immediately. I might be able to do something similar.
But I wouldn’t be able to endorse such behavior while reflecting on it if I were in that situation, given my conviction that I am unable to change math. Indeed, I don’t think it would be wise of me to cooperate in that situation. What I really mean when I say that I would rather be someone who cooperated in a twin prisoner’s dilemma is “conditioned on the (somewhat odd) hypothetical that I will at some point end up in a high-stakes twin prisoner’s dilemma, I would rather it be the case that I am the sort of person who cooperates”, which is really saying that I would rather play a twin prisoner’s dilemma game against a cooperator than against a defector, which is just an obvious preference for a favorable event to befall me rather than an unfavorable one. In similar news, conditioned on my encountering a situation in the future where somebody checks to see if I am a good person, and if I am, they destroy the world, I would like to become a bad person. Conditioned on my encountering a situation in which someone saves the world if I am devout, I would like to become a devout person.
If I could turn off the part of my brain that forms the question “but why should I cooperate, when I can’t change math?” that would be a path to becoming a reliable cooperator, but I don’t see a path to silencing a valid argument in my brain without a lobotomy (short of possibly just cooperating really fast without thinking, and of course without forming the doubt “wait, why am I trying to do this really fast without thinking?”).
If that’s the case, then I assume that you defect in the twin prisoner’s dilemma.
I do. I would rather be someone who didn’t. But I don’t see a path to becoming that person without lobotomizing myself. And it’s not a huge concern of mine, since I don’t expect to encounter such a dilemma. (Rarely am I the one pointing out that a philosophical thought experiment is unrealistic. It’s not usually the point of thought experiments to be realistic—we usually only talk about them to evaluate the consequences of different positions. But it is worth noting here that I don’t see this as a major issue for me.) I haven’t written this up because I don’t think it’s particularly urgent to explain to people why I think CDT is correct over FDT. Indeed, on one view, it would be cruel of me to do so! And I don’t think it matters much for AI alignment.
Don’t you think that’s at least worth looking into?
This was partly why I decided to wade into the weeds, because absent a discussion of how plausible it is that we could affect things non-causally, yes, one’s first instinct would be that we should at least look into it. And maybe, like, 0.1% of resources directed toward AI Safety should go toward whether we can change Math, but honestly, even that seems high. Because what we’re talking about is changing logical facts. That might be number 1 on my list of intractable problems.
After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals.
This is getting subtle :) and it’s hard to make sure our words mean things, but I submit that causal counterfactuals are much less fictitious than logical counterfactuals! I submit that it is less extravagant to claim we can affect this world than it is to claim that we can affect hypothetical worlds with which we are not in causal contact. No matter what action I pick, math stays the same. But it’s not the case that no matter what action I pick, the world stays the same. (In the former case, which action I pick could in theory tell us something about what mathematical object the physical universe implements, but it doesn’t change math.) In both cases, yes, there is only one action that I do take, but assuming we can reason both about causal and logical counterfactuals, we can still talk sensibly about the causal and logical consequences of picking actions I won’t in fact end up picking. I don’t have a complete answer to “how should we define causal/logical counterfactuals” but I don’t think I need to for the sake of this conversation, as long as we both agree that we can use the terms in more or less the same way, which I think we are successfully doing.
I don’t yet see why creating a CDT agent avoids catastrophe better than FDT.
I think running an aligned FDT agent would probably be fine. I’m just arguing that it wouldn’t be any better than running a CDT agent (except for the interim phase before Son-of-CDT has been created). And indeed, I don’t think any new decision theories will perform any better than Son-of-CDT, so it doesn’t seem to me to be a priority for AGI safety. Finally, the fact that no FDT agent has actually been fully defined certainly weighs in favor of just going with a CDT agent.
Ah. I agree that this proposal would not optimize causally inaccessible areas of the multiverse, except by accident. I also think that nothing we do optimizes causally inaccessible areas of the multiverse, and we could probably have a long discussion about that, but putting a pin in that:
Let’s take things one at a time. First, let’s figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.
Regarding the longer discussion, and sorry if this is below my usual level of clarity: what do we have at our disposal to make counterfactual worlds with low utility inconsistent? Well, all that we humans have at our disposal is choices about actions. One can play with words and say that we can choose not just what to do, but also who to be, and choosing who to be (i.e., editing our decision procedure) is supposed by some to have logical consequences, but I think that’s a mistake. 1) Changing who we are is an action like any other. Actions don’t have logical consequences, just causal consequences. 2) We might be changing which algorithm our brain executes, but we are not changing the output of any algorithm itself, the latter possibility being the thing with supposedly far-reaching (logical) consequences on hypothetical worlds outside of causal contact. In general, I’m pretty bearish on the ability of humans to change math.
Consider the CDT person who adopts FDT. They are probably interested in the logical consequences of the fact that their brain in this world outputs certain actions. But no mathematical axioms have changed along the way, so no propositions have changed truth value. The fact that their brain now runs a new algorithm implies that (the math behind) physics ended up implementing that new algorithm. I don’t see how it implies much else, logically. And I think the fact that no mathematical axioms have changed supports that intuition quite well!
The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn’t change without axioms changing. So if you have ambitions of rendering low-utility worlds inconsistent, I guess my question is this: which axioms of Math would you like to change, and how? I understand you don’t hope to causally affect this, but how could you even hope to affect this logically? (I’m struggling to even put words to that; the most charitable phrasing I can come up with, in case you don’t like “affect this logically”, is “manifest different logic”, but I worry that phrasing is Confused.) Also, I’m capitalizing Math there because this whole conversation involves being Platonists about math, where Math is something that really exists, so you can’t just invent a new axiomatization of math and say the world is different now.
You’re taking issue with my evaluating the causal consequences of our choice of what program to run in the agent rather than the logical consequences? These should be the same in practice when we make an AGI, since we’re in some weird decision problem at the moment, so far as I can tell. Or if you think I’m missing something, what are the non-causal, logical consequences of building a CDT AGI?
Side note: I think the term “self-modify” confuses us. We might as well say that agents don’t self-modify; all they can do is cause other agents to come into being and shut themselves off.
The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner’s dilemma), but after a short period of time, it won’t matter how it behaves. Some better agent will be created and take over from there.
Finally, if you think an FDT agent will perform very well in this world, then you should also expect Son-of-CDT to look a lot like an FDT agent.
Why do you say “probably”? If there exists an agent that doesn’t make those wrong choices you’re describing, and if the CDT agent is capable of making such an agent, why wouldn’t the CDT agent make an agent that makes the right choices?
My intuitions are mostly that if you can provide significant rewards and punishments basically for free in imitated humans (or more to the point, memories thereof), if you can control the flow of information throughout the whole apparatus, and if you get total surveillance automatically, this sort of thing is a dictator’s dream. Especially because it usually costs money to make people happy, and in this case, it hardly does—just a bit of computation time. In a world with all the technology in place that a dictator could want, but where it’s also pretty cheap to make everyone happy, it strikes me as promising that the system itself could be kept under control.
I don’t agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).
Thanks for the clarification. Consider the sort of relatively simple, super-human planning algorithm that, for most goals, would lead the planner/agent to take over the world or do similarly elaborate and impactful things in the service of whatever goal is being pursued. A Bayesian predictor of the human’s behavior will consider the hypothesis Hg that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human’s observed behavior won’t include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form Hg. A hypothesis which says that the observed behavior is the output of human-like planning in the service of some goal which is slightly incorrect may maintain some weight in the posterior after a number of observations, but I don’t see how “dangerously powerful planning + goal” remains under consideration.
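To make that elimination dynamic concrete, here is a minimal numerical sketch (the hypotheses, priors, and likelihoods are all invented for illustration; this isn’t a model of any real predictor):

```python
import numpy as np

# Hypotheses about the demonstrator: "dangerous planner in service of goal g"
# vs. "human-like planner in service of goal g", with arbitrary priors.
hypotheses = ["dangerous_g1", "dangerous_g2", "humanlike_g1", "humanlike_g2"]
prior = np.array([0.3, 0.3, 0.2, 0.2])

# Probability each hypothesis assigns to the observed, mundane human behavior.
# A dangerous planner would have predicted world-takeover actions, so it puts
# ~0 probability on what we actually observe.
likelihood = np.array([1e-9, 1e-9, 0.5, 0.4])

posterior = prior * likelihood
posterior /= posterior.sum()
print(dict(zip(hypotheses, posterior.round(6))))
# The "dangerous planner" hypotheses are driven to ~0 after observation.
```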
The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.
I suppose the point of human imitation is to produce a weak, conservative, lazy, impact-sensitive mesa-optimizer, since humans are optimizers with those qualifiers. If it weren’t producing a mesa-optimizer, something would have gone very wrong. So this is a good point. As for whether this is dangerous, I think the discussion above is the place to focus.
Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don’t need to model humans), and start doing their own doublings.
Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that. But I don’t feel up to putting in those man-hours myself. It seems like there are lots of people without a technical background who are interested in helping avoid AI-based X-risk. Do you think this is a promising enough line of reasoning to be worth some people’s time?
It seems this would only be the case if it had a deeper utility function that placed great weight on it ‘discovering’ its other utility function.
This isn’t actually necessary. If it has a prior over utility functions and some way of observing evidence about which one is real, you can construct the policy which maximizes expected utility in the following sense: it imagines a utility function is sampled from the set of possibilities according to its prior probabilities, and it imagines that utility function is what it’s scored on. This naturally gives the instrumental goal of trying to learn about which utility function was sampled (i.e. which is the real utility function), since some observations will provide evidence about which one was sampled.
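A minimal sketch of that construction (the utility functions, actions, and the perfect/free observation are all invented to make the point vivid):

```python
import numpy as np

# Two candidate utility functions over three actions; the agent's prior over
# which one is "real". All numbers are illustrative.
U = np.array([[1.0, 0.0, 0.4],    # utility function u1 over actions a0, a1, a2
              [0.0, 1.0, 0.4]])   # utility function u2
prior = np.array([0.5, 0.5])

# Acting blindly: pick the action with the highest prior-expected utility.
blind_value = (prior @ U).max()

# Learning first: observe (here: perfectly and for free) which utility
# function was sampled, then pick the best action for it.
informed_value = (prior * U.max(axis=1)).sum()

print(blind_value, informed_value)  # 0.5 vs. 1.0: learning has instrumental value
```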
I think for most utility functions, kicking over the bucket and then recreating a bucket with identical salt content (but different atoms) gets you back to a similar value to what you were at before. If recreating that salt mixture is expensive vs. cheap, and if attainable utility preservation works exactly as our initial intuitions might suggest (and I’m very unsure about that, but supposing it does work in the intuitive way), then AUP should be more likely to avoid disturbing the expensive salt mixture, and less likely to avoid disturbing the cheap salt mixture. That’s because for those utility functions for which the contents of the bucket were instrumentally useful, the value with respect to those utility functions goes down roughly by the cost of recreating the bucket’s contents. Also, if a certain salt mixture is less economically useful, there will be fewer utility functions for which kicking over the bucket leads to a loss in value, so if AUP works intuitively, it should also agree with our intuition there.
If it’s true that for most utility functions, the particular collection of atoms doesn’t matter, then it seems to me like AUP manages to assign a higher penalty to the actions that we would agree are more impactful, all without any information regarding human preferences.
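To illustrate, here is a toy calculation, assuming (as I said, I’m very unsure about this) that the AUP penalty works out to roughly the average change in attainable value across auxiliary utility functions; the Q-values are made up:

```python
import numpy as np

# Attainable value Q_u(s, a) under a few auxiliary utility functions u, for
# three actions. All values invented for illustration.
#              noop   kick_expensive  kick_cheap
Q = np.array([[5.0,   2.0,            4.8],   # u1: bucket contents very useful
              [3.0,   2.5,            2.9],   # u2: mildly useful
              [1.0,   1.0,            1.0]])  # u3: doesn't care about the bucket

def aup_penalty(action: int) -> float:
    # Average absolute change in attainable utility relative to doing nothing.
    return float(np.abs(Q[:, action] - Q[:, 0]).mean())

print(aup_penalty(1))  # expensive mixture: larger penalty (~1.17)
print(aup_penalty(2))  # cheap mixture: smaller penalty (~0.10)
```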
Proposal: in the same way we might try to infer human values from the state of the world, might we be able to infer a high-level set of features such that existing agents like us seem to optimize simple functions of these features? Then we would penalize actions that cause irreversible changes with respect to these high-level features.
This might be entirely within the framework of similarity-based reachability. This might also be exactly what you were just suggesting.
Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.
Why? What are those 7 billion HSIFAUH doing?
Well, the number comes from the idea of one-to-one monitoring. Obviously, there’s other stuff to do to establish a stable unipolar world order, but monitoring seems like the most resource-intensive part, so it’s an order-of-magnitude estimate. Also, realistically, one person could monitor ten people, so the estimate has some leeway built in.
But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is that not a potential existential catastrophe if they have inhuman values?
I think they can be controlled. Whoever is providing the observations to any instance of HSIFAUH has an arsenal of carrots and sticks (just by having certain observations correlate with actual physical events that occur in the household(s) of humans that generate the data), and I think merely human-level intelligence can be kept in check by someone in a position of power over them. So I think real humans could stay at the wheel over 7 billion instances of HSIFAUH. (I mean, this is teetering at the edge of existential catastrophe already, given the existence of simulations of people who might have the experience of being imprisoned, but I think with careful design of the training data, this could be avoided.) But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.
>Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.
How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?
Right, this analysis gets complicated because you have to analyze the growth rate of N. Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for. I hear Robin Hanson is the person to read regarding questions like this; I don’t have any opinions here. But the basic structure regarding “How?” is: spend some fraction of computing resources making money, then buy more computing resources with that money.
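A toy version of that structure, with every parameter invented:

```python
# Each period, some fraction of the instances do paid work, and the proceeds
# buy compute for more instances. All parameters are made up.
N = 10.0                  # starting number of instances
earn_fraction = 0.5       # fraction of instances doing paid work
growth_per_earner = 0.3   # new instances financed per earner per period

periods = 0
while N < 1000:
    N += earn_fraction * N * growth_per_earner
    periods += 1
print(periods)  # periods to grow from N=10 to N=1000 (~33 here)
```

Whether the doubling period is short enough relative to the reckless team’s timeline is exactly the Hansonian question I’d defer on.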
>It should be possible to weaken the online version and get some of this speedup.
What do you have in mind here?
Well, nothing in particular when I wrote that, but thank you for pushing me. Maybe only update the posterior at some timesteps (and do it infinitely many times but with diminishing frequency). Or more generally, you divide resources between searching for programs that retrodict observed behavior and running copies of the best one so far, and you just shift resource allocation toward the latter over time.
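For instance, one schedule with that property (a sketch; the power-of-two schedule is just one choice):

```python
def should_update_posterior(t: int) -> bool:
    # Update only at powers of two: still infinitely many updates, but the
    # fraction of timesteps spent updating goes to zero.
    return t > 0 and t & (t - 1) == 0

# Resources at all other timesteps go to running the current best program.
print([t for t in range(1, 100) if should_update_posterior(t)])
# [1, 2, 4, 8, 16, 32, 64]
```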
You do have to solve some safety problems that the reckless team doesn’t though, don’t you? What do you think the main safety problems are?
If it turns out you have to do special things to avoid mesa-optimizers, then yes. Otherwise, I don’t think you have to deal with other safety problems if you’re just aiming to imitate human behavior.
I could imagine an efficient algorithm that could be said to be approximating a Bayesian agent with a prior including the truth, but I don’t say that with much confidence.
I agree with the second bullet point, but I’m not so convinced this is prohibitively hard. That said, not only would we have to make our (arbitrarily chosen) p(obs | utility fn) un-game-able; on one reading of my original post, we would also have to ensure that by the time the agent was no longer gaining much information, it already had a pretty good grasp on the true utility function. This requirement might reduce to a concept like identifiability of the optimal policy.
Oh yeah, sorry, that isn’t shown there. But I believe the sum over all timesteps of the m-step expected info gain at each timestep is finite w.p.1, which would make it o(1/t) w.p.1.
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead, the humans look at the question and answer it (without looking at the Oracle’s answer), and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex), the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fixes the answer to be whatever message was output by the AGI to instigate this”.
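A sketch of that protocol (the function names and the particular closeness metric are mine; the key property is just that no human reads the Oracle’s answer before the reward is computed):

```python
def closeness(oracle_answer: str, human_answer: str) -> float:
    # Deliberately simple metric (token overlap), since a simple metric is
    # what makes "no security hole" in the scorer plausible.
    o, h = set(oracle_answer.split()), set(human_answer.split())
    return len(o & h) / max(len(o | h), 1)

def run_episode(question, oracle, human_team) -> float:
    oracle_answer = oracle(question)     # sealed; never shown to the humans
    human_answer = human_team(question)  # humans answer independently
    return closeness(oracle_answer, human_answer)  # only this number leaves
```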
Good point. I’m not a huge fan of deferring this thinking to similarity metrics (the relative reachability proposal also does this), since this is a complicated thing even in theory, and I suspect a lot turns on how it ends up being defined, but with that caveat aside, this seems reasonable.
Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
It can’t tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically well-founded probability of deferring to a human won’t make it unsafe—that will just make it less efficient/capable. For normal neural networks, there isn’t an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model’s distribution over outputs—a model could be very certain that the output is Bernoulli(1/2)-distributed, and this would entail a belief entropy of ~0, not an entropy of 1. I’m not familiar enough with Bayesian neural networks to know if the entropy would be easy to extract.
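For a small discrete Bayesian model, the quantity I have in mind looks like this (a sketch; it also shows the belief-entropy vs. output-entropy distinction):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def expected_info_gain(prior, likelihoods):
    # Expected decrease in entropy of the belief over models from one more
    # observation. prior: shape (n_models,); likelihoods[i, o] = p(o | model i).
    p_outcome = prior @ likelihoods  # predictive distribution over outcomes
    eig = entropy(prior)
    for o in range(likelihoods.shape[1]):
        post = prior * likelihoods[:, o]
        post /= post.sum()
        eig -= p_outcome[o] * entropy(post)
    return eig

# Both models agree the output is Bernoulli(1/2): output entropy is 1 bit,
# but there is nothing left to learn, so expected info gain is 0.
print(expected_info_gain(np.array([0.5, 0.5]), np.array([[0.5, 0.5], [0.5, 0.5]])))
# Models disagree sharply: one observation is worth ~0.53 bits.
print(expected_info_gain(np.array([0.5, 0.5]), np.array([[0.9, 0.1], [0.1, 0.9]])))
```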
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don’t see what it offers that normal sequence prediction doesn’t offer. In our BoMAI discussion, I mentioned a type of oracle I considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don’t really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence that is something like this:
(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), … (“A10”,
It’s also possible that this scheme runs into grain-of-truth problems, and the counterfactual oracle gives outputs a lot like what I’m imagining this sequence predictor would give, in which case I don’t think sequence prediction would have much to add to the counterfactual oracle proposal.
What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer it. In the counterfactual possibility where the question is scored, it isn’t supposed to be viewed by people; otherwise the oracle has an incentive to trick the scorers to implement unsafe AGI which takes over the world and fixes the answer to be whatever message was output by the AGI to instigate this.
...unless the team of humans is in a box :)
On the topic of counterfactual oracles, if you are trying to predict the answers to questions which can be automatically checked in the future, I am unsure why you would run a counterfactual oracle instead of running sequence prediction on the following sequence, for example:
(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), … (“A10”,
This should give an estimate of the answer A10 to question Q10, and this can be done before the answer is available. In fact, unlike with the counterfactual oracle, you could do this even if people had to be involved in submitting the answer.
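Concretely, the bookkeeping could look something like this (a sketch; the tagging scheme and the predictor API are hypothetical):

```python
def training_sequence(log):
    # log: time-ordered events, ("Q", i, text) when question i is posed,
    # ("A", i, text) when its answer later becomes checkable. Answers simply
    # appear in the stream whenever they arrive.
    return [(f"{kind}{i}", text) for kind, i, text in log]

seq = training_sequence([("Q", 1, "q1 text"), ("Q", 2, "q2 text"), ("A", 1, "a1 text")])
# To estimate A2 before it exists, feed the predictor the sequence so far
# followed by the tag "A2" and take its predicted continuation:
# estimate = predictor.predict_continuation(seq + [("A2",)])  # hypothetical API
```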
It seems like you’re imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that’s enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people’s defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?
Well, that was the question I originally posed here, but the sense I got from commenters was that people thought this was easy to pull off and the only question was whether it was safe. So I’m not sure for what N it’s the case that N machines running agents doing human-level stuff would be enough to take over the world. I’m pretty sure N = 7 billion is enough. And I think it’s plausible that after a discussion about this, I could become confident that N = 1000 was enough. Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N. So it seemed worth having a discussion, but I am not yet prepared to defend a low enough N that makes this obviously viable.
Forgetting about the possibility of exponentially growing N for a moment, and turning to
>Why is d << h the relevant question for evaluating this?
Yeah, I wrote that post too quickly—this is wrong. (I was thinking of the leading team running HSIFAUH needing to go through d+h timesteps to get to good performance, but they just need to run through d, which makes things easier.) Sorry about that. Let f be the amount of compute that the leading project has divided by the compute that the leading reckless project has. Suppose d > 0. (That’s all we need, actually.) Then it takes the leading reckless team at least f times as long to get to AIXI taking over the world as it takes the leading team to get to SolomonoffPredict predicting a human trying to do X; using similar tractable approximation strategies (whatever those turn out to be), we can expect it to take f times as long for the leading reckless team to get to ~AIXI as it takes the leading team to get to ~SolomonoffPredict. ~HSIFAUH is more complicated because of the added resource requirement of employing the humans you learn to imitate, but this requirement goes down by the time you’re deploying it toward useful things. Naively (and you might be able to do better than this), you could run f copies of ~HSIFAUH and get to human-level performance on some relevant tasks around the same time the reckless team takes over the world. So the question is whether N = f is a big enough N. In the train-then-deploy framework, it seems today like training takes much more compute than deploying, so that makes it easier for the leading team to get N >> f once all the resources dedicated to training get freed up. It should be possible to weaken the online version and get some of this speedup.
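A worked numeric version of that comparison, with f, h, and d invented:

```python
f = 10.0       # leading team's compute / reckless team's compute
h = 1_000_000  # timesteps to predict human behavior
d = 50_000     # extra timesteps for takeover-level planning (d > 0)

rate = 1.0                    # reckless team's timesteps per unit wall-clock
t_leading = h / (f * rate)    # leading team reaches ~SolomonoffPredict
t_reckless = (h + d) / rate   # reckless team reaches ~AIXI takeover
print(t_reckless / t_leading) # 10.5, i.e. >= f whenever d >= 0

# Wall-clock lead during which the leading team can run ~f copies of ~HSIFAUH:
print(t_reckless - t_leading)
```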
By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?
I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.
Timesteps required for AIXI to predict human behavior: h
Timesteps required for AIXI to take over the world: h + d
I think d << h.
Timesteps required for Solomonoff induction trained on human policy to predict human behavior: h
Timesteps required for Solomonoff induction trained on human policy to phish at human level: h
Timesteps required for HSIFAUH to phish at human level: ~h
In general, I agree AIXI will perform much more strongly than HSIFAUH at an arbitrary task like phishing (and ~AIXI will be stronger than ~HSIFAUH), but the question at stake is how plausible it is that a single AI team with some compute/data advantage relative to incautious AI teams could train ~HSIFAUH to phish well while other teams are still unable to train ~AIXI to take over the world. And the relevant question for evaluating that is whether d << h. So even if ~AIXI could be trained to phish with less data than h, I don’t think that’s the relevant comparison. I also don’t think it’s particularly relevant how superhuman AIXI is at phishing when HSIFAUH can do it at a human level.