Yeah, I think “training for transparency” is fine if we can figure out good ways to do it. The problem is more that training for other things (e.g. for the absence of certain types of thoughts) pushes against transparency.
I often complain about this type of reasoning too, but perhaps there is a steelman version of it.
For example, suppose the lock on my front door is broken, and I hear a rumour that a neighbour has been sneaking into my house at night. It turns out the rumour is false, but I might reasonably think, “The fact that this is so plausible is a wake-up call. I really need to change that lock!”
Generalising this: a plausible-but-false rumour can fail to provide empirical evidence for something, but still provide ‘logical evidence’ by alerting you to something that is already plausible in your model but that you hadn’t specifically thought about. Ideal Bayesian reasoners don’t need to be alerted to what they already find plausible, but humans sometimes do.
But then we have to ask: why two ‘ marks to make the quotation mark? A quotidian reason: when you use only one, it’s an apostrophe. We already had the mark that goes in “don’t”, in “I’m”, in “Maxwell’s”; so two ’ marks were used to distinguish the quotation mark from the existing apostrophe.
Incidentally I think in British English people normally do just use single quotes. I checked the first book I could find that was printed in the UK and that’s what it uses:
He’d be a fool to part with his vote for less than the amount of the benefits he gets.
Doesn’t seem right. Even assuming the person buying his vote wants to use it to remove his benefits, that one vote is unlikely to be the difference between the vote-buyer’s candidate winning and losing. The expected effect of the vote on the benefits is going to be much less than the size of the benefits.
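To make that concrete, here’s a back-of-the-envelope version with made-up numbers (the benefit value, the pivotality probability, and the removal probability are all just for illustration):

```python
# Toy expected-value calculation with made-up numbers.
benefits = 10_000        # what the benefits are worth to him, say in dollars
p_pivotal = 1e-4         # chance his single vote decides the election (generous)
p_removed_if_win = 1.0   # worst case: the vote-buyer's candidate definitely removes them

expected_cost_of_selling = benefits * p_pivotal * p_removed_if_win
print(expected_cost_of_selling)  # 1.0 -- tiny compared to the 10,000 the benefits are worth
```

So even on these generous assumptions, the expected cost of selling the vote is on the order of a dollar, nowhere near the full value of the benefits.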
An intuition you might be able to invoke is that the procedure they describe is like greedy sampling from an LLM, which doesn’t get you the most probable completion.
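For concreteness, here’s a toy two-step distribution (probabilities invented) where greedy decoding returns a completion that isn’t the most probable one:

```python
# Toy two-step "language model" where greedy decoding misses the most
# probable full completion.
step1 = {"A": 0.6, "B": 0.4}
step2 = {
    "A": {"x": 0.5, "y": 0.5},  # A's mass is split across continuations
    "B": {"z": 0.9, "w": 0.1},  # B's mass is concentrated on one continuation
}

# Greedy: take the single most probable token at each step.
first = max(step1, key=step1.get)                      # "A"
second = max(step2[first], key=step2[first].get)       # "x"
greedy_prob = step1[first] * step2[first][second]      # 0.6 * 0.5 = 0.30

# Exhaustive search over whole sequences.
best_seq, best_prob = max(
    (((t1, t2), step1[t1] * step2[t1][t2]) for t1 in step1 for t2 in step2[t1]),
    key=lambda pair: pair[1],
)

print((first, second), greedy_prob)  # ('A', 'x') 0.3
print(best_seq, best_prob)           # ('B', 'z') ~0.36
```

Greedy commits to “A” because it wins the first step, but its probability mass is split across continuations, so the most probable full sequence actually starts with “B”.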
“A Center for Applied Rationality” works as a tagline but not as a name
We have a ~25% chance of extinction
Maybe add the implied ‘conditional on AI takeover’ to the conclusion so people skimming don’t come away with the wrong bottom line? I had to go back through the post to check whether this was conditional or not.
Fair enough, yeah. But at least (1)-style effects weren’t strong enough to prevent any significant legislation in the years that followed.
Some evidence for (2) is that before the 1957 act no civil rights legislation had been passed for 82 years[1], and after it three more civil rights acts were passed in the next 11 years, including the Civil Rights Act of 1964, which in my understanding is considered very significant.
[1] Going off what’s listed in the Wikipedia article on civil rights acts in the United States.
I thought the post was fine and was surprised it was so downvoted. Even if people don’t agree with the considerations, or think all the most important considerations are missing, why should a post saying, “Here’s what I think and why I think it, feel free to push back in the comments,” be so poorly received? Commenters can just say what they think is missing.
Seems likely that it wouldn’t have been so downvoted if its bottom line had been that AI risk is very high. Increases my P(LW groupthink is a problem) a bit.
Question marks and exclamation points go outside, unless they’re part of the sentence, and colons and semicolons always go outside.
An example of a sentence from the post that uses a colon is “It gets deeper than this, but the core problem remains the same”:
Surely nobody endorses putting the colon outside the quotes there! I feel like Opus/whoever is just assuming that people virtually never want to quote a piece of text that ends in a colon, rather than really wanting to endorse a different rule to the question mark case.
I’m confused by the “necessary and sufficient” in “what is the minimum necessary and sufficient policy that you think would prevent extinction?”
Who’s to say there exists a policy which is both necessary and sufficient? Unless we mean something kinda weird by “policy” that can include a huge disjunction (e.g. “we do any of the 13 different things I think would work”) or can be very vague (e.g. “we solve half A of the problem and also half B of the problem”).
It would make a lot more sense in my mind to ask “what is a minimal sufficient policy that you think would prevent extinction?”
Usually lower numbers go on the left and bigger numbers go on the right (1, 2, 3, …), so it seems reasonable to have it this way.
RL generalization is controlled by why the policy took an action
Is this that good a framing for these experiments? Just thinking out loud:
Distinguish two claims:
1. What a model reasons about in its output tokens on the way to getting its answer affects how it will generalise.
2. Why a model produces its output tokens affects how it will generalise.
These experiments seem to test (1), while the claim from your old RL posts is more like (2).
You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.
As for whether the claims are very similar… I’m actually not sure they are (I changed my mind while writing this comment).
Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.
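To spell out why I read (2) that way, here’s a minimal REINFORCE-style sketch (purely illustrative, not the setup from these experiments): the loss only touches the log-probabilities the model assigns to the sampled output tokens, so nothing in the gradient signal refers to which internal computation produced them.

```python
import torch

def reinforce_loss(logits: torch.Tensor, sampled_tokens: torch.Tensor, reward: float) -> torch.Tensor:
    # logits: [seq_len, vocab_size]; sampled_tokens: [seq_len]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Only the log-probs of the tokens actually sampled enter the loss;
    # the update pushes towards whatever parameter change most cheaply
    # raises the probability of those tokens, by whatever computation.
    chosen = log_probs[torch.arange(sampled_tokens.shape[0]), sampled_tokens]
    return -reward * chosen.sum()
```

Real training setups are more complicated than this, but I think the basic credit-assignment point carries over.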
So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).
But their setup adds:
1.5. Remove any examples in which the steering actually resulted in the desired behaviour.
which is why it’s surprising.
the core prompting experiments were originally done by me (the lead author of the paper) and I’m not an Anthropic employee. So the main results can’t have been an Anthropic PR play (without something pretty elaborate going on).
Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.
Sorry, can’t remember. Something done virtually, maybe during Covid.
ELK = easy-to-hard generalisation + assumption the model already knows the hard stuff?
It’s the last sentence of the first paragraph of section 1.
I think “Will there be a crash?” is a much less ambiguous question than “Is there a bubble?”