Researcher at MIRI
peterbarnett
Were current models (e.g., Opus 4.5) trained using this updated constitution?
Here’s a slide from a talk I gave a couple of weeks ago. The point of the talk was “you should be concerned with the whole situation and the current plan is bad”, where AI takeover risk is just one part of this (IMO the biggest part). So this slide was my quickest way to describe the misalignment story, but I think there are a bunch of important subtleties that it doesn’t include.
Recognizable values are not the same as good values, but also I’m not at all convinced that the phenomena in this post will be impactful enough to outweigh all the somewhat random and contingent pressures that will shape a superintelligence’s values. I think a superintelligence’s values might be “recognizable” if we squint, don’t look/think too hard, and the superintelligence hasn’t had time to really reshape the universe.
Maybe I’m dense, but was the BART map the intended diagram?
The inability to copy/download is pretty weird. Anthropic seems to have deliberately disabled downloading, and rather than uploading a PDF, the webpage seems to be a bunch of PNG files.
I am very concerned about breakthroughs in continual/online/autonomous learning because this is obviously a necessary capability for an AI to be superhuman. At the same time, I think this might make a bunch of alignment problems more obvious, since these problems only really arise once the AI is able to learn new things. This might at least serve as a wake-up call for some AI researchers.
Or, this could just be wishful thinking, and continual learning might allow an AI to autonomously improve without human intervention and then kill everyone.
I like the sentiment and much of the advice in this post, but unfortunately I don’t think we can honestly confidently say “You will be OK”.
Announcing: MIRI Technical Governance Team Research Fellowship
Maybe useful to note that all the Google people on the “Chain of Thought Monitorability” paper are from Google DeepMind, while Hope and Titans are from Google Research.
Thanks!
Great point! This possibly makes my proposal a bad idea. I would need to know more about how the labs respond to this kind of incentive to actually know.
Model providers often don’t provide the full CoT, and instead provide a summary. I think this is a fine/good thing to do to help prevent distillation.
However, I think it would be good if the summaries provided a flag for when the CoT contained evaluation awareness or scheming (or other potentially concerning behavior).
I worry that currently the summaries don’t really provide this information, and this probably makes alignment and capability evaluations less valid.
What they don’t do is filter out every web page that has the canary string. Since people put them on random web pages (like this one), which was not their intended use, they get into the training data.
As others have mentioned, this seems kinda crazy and bad. I was surprised you didn’t think this.
“Unrelated” question, but are you under a non-disparagement agreement with GDM that would prevent you from criticizing things like their data-filtering practices?
Considerations for setting the FLOP thresholds in our example international AI agreement
New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence
There’s a funny and bad incentive where I want to upvote posts I haven’t read to push them past the 30 Karma threshold and make them appear on the podcast feed.
I expect the line to blur between introspective and extrospective RSI. For example, you could imagine AIs trained for interp doing interp on themselves, directly interpreting their own activations/internals and then making modifications while running.
Someone please explain
It might be good to have you talk about more research directions in AI safety you think are not worth pursuing or are over-invested in.
Also I think it would be good to talk about what the plan for automating AI alignment work would look like in practice (we’ve talked about this a little in person, but it would be good for it to be public).