mattmacdermott

Karma: 1,685

mattmacdermott 27 Jul 2026 6:20 UTC
2 points
0
in reply to: Chris Lakin’s comment on: The Long (Self-)Correction
The Long Ponder?
It’s pretty similar to the Long Reflection but to me has slightly more connotation of puzzlement. Plus it has a nice ring to it.

mattmacdermott 20 Jul 2026 23:52 UTC
26 points
0
in reply to: Eric Neyman’s comment on: strawberry calm’s Shortform
I got Claude to do a little lit review, and it found that human-resolved conjectures are 70:30 true:false, whereas LLM-resolved conjectures so far are 50:50.

Caveat: the selection criteria are a bit different in the two cases, which could skew things. The human-resolved conjectures were selected for being the 100^[1] most famous resolved conjectures (based on their fame at the time of resolution in Claude’s judgement), whereas the LLM-resolved conjectures are just the ones that have been resolved so far.
1. ^
  I threw away anything that was proved independent of ZFC or that wasn’t a yes/no question, so we end up with 96 instead of 100.

mattmacdermott 7 Jul 2026 13:38 UTC
11 points
4
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
Isn’t this a classic commitment race/equilibrium selection thing where it’s just unclear what ‘should’ happen if everyone involved is following one of these decision theories? The customers can coordinate to boycott anyone involved in price fixing.

mattmacdermott 30 Jun 2026 21:08 UTC
3 points
0
in reply to: kbear’s comment on: kbear’s Shortform
Ok how about ‘not unusual’?

mattmacdermott 30 Jun 2026 12:47 UTC
7 points
−3
in reply to: kbear’s comment on: kbear’s Shortform
Counterexample: Q = “not possible to claim”

mattmacdermott 29 Jun 2026 18:38 UTC
2 points
0
on: P(doom) is a Dumb Meme

I hate talking about “P(doom)”. When it first started showing up in the wake of ChatGPT

This got me curious about when the term originated, because I thought I remembered it popping up in the late 2021 MIRI conversations. I found that Rohin Shah indeed used it in one of those, but I’m not sure whether he coined it there and then or whether it was already floating around.

This wiktionary page has an uncited claim that it originated on LessWrong in 2010. I managed to find this −4 karma comment from 2010 by timtyler (on Intellectual Hipsters and Meta-Contrarianism), which says, in the context of arguing that MIRI have an incentive to exaggerate the risks from AI in order to get funding:

Anyway, the basic point is that if you are interested in DOOM, or p(DOOM), consulting a DOOM-mongering organisation, that wants your dollars to help them SAVE THE WORLD may not be your best move.

I couldn’t find any uses of the term in between those two, so I’m guessing that timtyler’s use was independent of Rohin’s, and either Rohin coined it during that dialogue or people had started using it in private discussion shortly before then. I’m in the market for people with better Google-Fu to prove me wrong about this though.

Edit: turns out this is already documented: Tim Tyler claims to have started using it in about 2009. I’m still curious about the path between Tim in 2009/10 and Rohin in 2021.

mattmacdermott 11 Jun 2026 15:00 UTC
4 points
0
in reply to: Daniel Tan’s comment on: Three types of model organism
My take on the difference is that the “worst-case” ones are created for the purposes of studying safety techniques (control protocols, alignment training, auditing techniques), whereas the aspiration for the “constructed” ones is to learn about some potential propensity of models.

mattmacdermott 10 Jun 2026 9:00 UTC
9 points
0
in reply to: RHollerith’s comment on: Tim Hua’s Shortform
As Erich_Grunewald says, it usually shows you when Claude has used a search tool, and in this case I told it not to use search and it didn’t show me any usage. But it was so impressive that I’m at like 10% that there’s some secret hidden search tool type thing that explains it.

mattmacdermott 9 Jun 2026 22:52 UTC
23 points
2
in reply to: Tim Hua’s comment on: Tim Hua’s Shortform
Wow, I just asked it about the details of a fairly obscure 11-citation paper of mine from 2024 and it has memorised ~all the technical details and could give a sentence-for-sentence paraphrase of large chunks of the paper. Strange experience, I recommend people try it out with their own obscure writings.

mattmacdermott 9 Jun 2026 22:21 UTC
4 points
2
in reply to: Tim Hua’s comment on: Tim Hua’s Shortform
Sorry if this is obvious but do you have anything written in ‘instructions for Claude’ in your settings? If so, it’s still visible to Claude in incognito mode.

mattmacdermott 22 May 2026 21:33 UTC
7 points
4
in reply to: leogao’s comment on: My hobby: running deranged surveys
In my experience even smart people need a bit of conversational back and forth to understand the setup for Newcomb’s problem, so I’m a bit skeptical that most survey respondents can grok it from a concise description.

I wonder how the results would be different if a human or AI did enough back and forth with each participant to make sure they understand the setup.

mattmacdermott 28 Apr 2026 22:40 UTC
4 points
0
in reply to: Steven Byrnes’s comment on: strawberry calm’s Shortform
What would you call the thing where it turns out that all my morality-flavoured wants are nonsense, but all my selfish-flavoured wants still make sense? Can we sub that term into the OP in place of ‘nihilism’?

Or do you deny the premise that there are different flavours? I personally really feel like I can taste the flavours.

mattmacdermott 18 Apr 2026 15:42 UTC
4 points
0
in reply to: Tom Davidson’s comment on: Tom Davidson’s Shortform

If someone ran the sim for entertainment, they’d obviously sell that info to the acausal trade folks

Weak argument? The set of times that people incidentally produce information relevant to other people is much broader than the set of times they sell it to them.

mattmacdermott 1 Apr 2026 8:48 UTC
4 points
0
in reply to: niplav’s comment on: Validating against a misalignment detector is very different to training against one
Thanks for the positive reinforcement!

mattmacdermott 6 Mar 2026 13:46 UTC
3 points
1
on: Reasoning Models Struggle to Control Their Chains of Thought
The fact that Claude models have higher CoT controllability is consistent with recent discussion about Anthropic models not strongly distinguishing between CoT and outputs, and hence reinforcement spillover being more likely.

(Although it strikes me now that the causality between reinforcement spillover and not strongly distinguishing between CoT and outputs could go in either direction.)

mattmacdermott 23 Feb 2026 13:27 UTC
2 points
0
in reply to: Hauke Hillebrandt’s comment on: Hauke Hillebrandt’s Shortform
I don’t have a take on the empirical evidence here, but maybe things like this could be caused by “negative inoculation prompting”.

In inoculation prompting, you tell the model during training that it’s ok to do bad thing X, in the hopes that if you accidentally reward X then the model learns “do X when told it’s ok” rather than “do X”.

Depending on how constitutional training is done, we could be teaching the model some version of “don’t do X when told not to by the constitution” or “don’t do X because the constitution says not to” rather than teaching it not to want to do X.

mattmacdermott 9 Jan 2026 8:26 UTC
46 points
14
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
“Agent” is sort of similar. The top definition on google is “a person who acts on behalf of another person or group”, whereas in these parts we tend to use it for a thing that has its own goals.

Edit: I think Jon Richens pointed this out to me once.

mattmacdermott 21 Dec 2025 22:46 UTC
4 points
0
in reply to: habryka’s comment on: Contradict my take on OpenPhil’s past AI beliefs
Ah, confusing. Looks like Ben’s comment quotes a post by Holden from 15th Dec which in turn quotes a different post by Holden from 22nd Aug, which is where the original statement about takeover odds appears.

Both posts are hyperlinked in the comment and I had clicked the latter link without noticing, but in any case yes, seems like the original statement is from pre-ChatGPT.

Edit: just realised Ben had already commented to the same effect below.

mattmacdermott 21 Dec 2025 17:10 UTC
2 points
0
in reply to: Ben Pace’s comment on: Ben Pace’s Shortform Feed
lol sorry

mattmacdermott 21 Dec 2025 16:43 UTC
2 points
0
in reply to: Ben Pace’s comment on: Ben Pace’s Shortform Feed
How about a react for “that answers my question”? People seem to use thumbs up or thanks, but both suggest approval of the answer even if that’s not intended.