Fixed, thanks.
johnswentworth
The Worst Form Of Government (Except For Everything Else We’ve Tried)
The Parable Of The Fallen Pendulum—Part 2
Yeah, that’s right.
The secret handshake is to start with “X is independent of Y given Z” and “X is independent of Z given Y”, expressed in this particular form:

P[X|Y,Z] = P[X|Z]
P[X|Y,Z] = P[X|Y]

… then we immediately see that P[X|Y] = P[X|Z] for all Y, Z such that P[Y,Z] > 0.

So if there are no zero probabilities, then P[X|Y] = P[X|Z] for all Y, Z.

That, in turn, implies that P[X|Z] takes on the same value for all Z, which in turn means that it’s equal to P[X]. Thus X and Z are independent. Likewise for X and Y. Finally, we leverage independence of X and Y given Z:

P[X,Y,Z] = P[X|Z] P[Y|Z] P[Z] = P[X] P[Y|Z] P[Z] = P[X] P[Y,Z]
(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
Roughly speaking, all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.
This is easiest to see if we use a “strong invariance” condition, in which each of the X_i must mediate between Λ and the rest of the X’s. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure (Λ) from any little spatially-localized chunk of the gas (X_i). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn’t imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. Λ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.
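To make the zero-probability loophole concrete, here’s a small numerical check (my own construction, not anything from the original discussion): a perfectly correlated triple satisfies both conditional independence conditions, but only by putting zero probability on all the mixed outcomes, and it is certainly not unconditionally independent.

```python
import numpy as np

# Toy check of the zero-probability loophole: let X = Y = Z be one shared
# fair coin. Then X ⊥ Y | Z and X ⊥ Z | Y both hold, but only because most
# outcomes have probability zero, and X is maximally dependent on Z.
P = np.zeros((2, 2, 2))  # P[x, y, z]
P[0, 0, 0] = 0.5
P[1, 1, 1] = 0.5

def indep_given_last_axis(P):
    """Check A ⊥ B | C for a joint array P[a, b, c], wherever P[c] > 0."""
    for c in range(P.shape[2]):
        pc = P[:, :, c].sum()
        if pc == 0:
            continue
        joint = P[:, :, c] / pc                                 # P[a, b | c]
        prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))   # P[a|c] P[b|c]
        if not np.allclose(joint, prod):
            return False
    return True

print(indep_given_last_axis(P))                     # X ⊥ Y | Z holds: True
print(indep_given_last_axis(P.transpose(0, 2, 1)))  # X ⊥ Z | Y holds: True
Pxz = P.sum(axis=1)                                 # marginal P[x, z]
print(np.allclose(Pxz, np.outer(Pxz.sum(1), Pxz.sum(0))))  # X ⊥ Z? False
```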
I agree with this point as stated, but think the probability is more like 5% than 0.1%
Same.
I do think our chances look not-great overall, but most of my doom-probability is on things which don’t look like LLMs scheming.
Also, are you making sure to condition on “scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity”?
That’s not particularly cruxy for me either way.
Separately, I’m uncertain whether the training procedure of current models like GPT-4 or Claude 3 is still well described as just “light RLHF”.
Fair. Insofar as “scaling up networks, running pretraining + RL” does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.
Solid post!
I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn’t by itself produce a schemer), and I think this is the best write-up of it I’ve seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.
Yup.
Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?
What was your old job?
The Parable Of The Fallen Pendulum—Part 1
Did you see the footnote I wrote on this? I give a further argument for it.
Ah yeah, I indeed missed that the first time through. I’d still say I don’t buy it, but that’s a more complicated discussion, and it is at least a decent argument.
I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I’m open to hearing it.
This is another place where I’d say we don’t understand it well enough to give a good formal definition or operationalization yet.
Though I’d note here, and also above w.r.t. search, that “we don’t know how to give a good formal definition yet” is very different from “there is no good formal definition” or “the underlying intuitive concept is confused” or “we can’t effectively study the concept at all” or “arguments which rely on this concept are necessarily wrong/uninformative”. Every scientific field was pre-formal/pre-paradigmatic once.
To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field’s ability to correctly diagnose bullshit.
That said, I don’t think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!
That might be a fun topic for a longer discussion at some point, though not right now.
I would like to see a much more rigorous definition of “search” and why search would actually be “compressive” in the relevant sense for NN inductive biases. My current take is something like “a lot of the references to internal search on LW are just incoherent” and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward “search” in ways that are totally benign.
More generally, I’m quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there’s a double dissociation between these things, where “mechanistic search” is almost always benign, and grabby consequentialism need not be backed by mechanistic search.
Some notes on this:
I don’t think general-purpose search is sufficiently well-understood yet to give a rigorous mechanistic definition. (Well, unless one just gives a very wrong definition.)
Likewise, I don’t think we understand either search or NN biases well enough yet to make a formal compression argument. Indeed, that sounds like a roughly-agent-foundations-complete problem.
I’m pretty skeptical that internal general-purpose search is compressive in current architectures. (And this is one reason why I expect most AI x-risk to come from importantly-different future architectures.) Low confidence, though.
Also, current architectures do have at least some “externalized” general-purpose search capabilities, insofar as they can mimic the “unrolled” search process of a human or group of humans thinking out loud. That general-purpose search process is basically AgentGPT. Notably, it doesn’t work very well to date.
Insofar as I need a working not-very-formal definition of general-purpose search, I usually use a behavioral definition: a system which can take in a representation of a problem in some fairly-broad class of problems (typically in a ~fixed environment), and solve it.
The argument that a system which satisfies that behavioral definition will tend to also have an “explicit search-architecture”, in some sense, comes from the recursive nature of problems. E.g. humans solve large novel problems by breaking them into subproblems, and then doing their general-purpose search/problem-solving on the subproblems; that’s an explicit search architecture.
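As a cartoon of that recursive structure (an illustration of mine, not anything from the original argument), the skeleton of such a problem-solver is just a solver which calls itself on the pieces produced by decomposition:

```python
# Toy sketch of an "explicit search architecture": a general-purpose solver
# which recursively breaks a problem into subproblems, solves each with the
# same general-purpose machinery, and combines the results.
def solve(problem, decompose, solve_base, combine):
    subproblems = decompose(problem)
    if not subproblems:  # base case: problem is directly solvable
        return solve_base(problem)
    return combine([solve(p, decompose, solve_base, combine) for p in subproblems])

# Hypothetical example problem: summing a nested list via recursive decomposition.
nested = [1, [2, 3], [[4], 5]]
total = solve(
    nested,
    decompose=lambda p: p if isinstance(p, list) else [],
    solve_base=lambda p: p,
    combine=sum,
)
print(total)  # → 15
```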
I definitely agree that grabby consequentialism need not be backed by mechanistic search. More skeptical of the claim that mechanistic search is usually benign, at least if by “mechanistic search” we mean general-purpose search (though I’d agree with a version of this which talks about a weaker notion of “search”).
Also, one maybe relevant deeper point, since you seem familiar with some of the philosophical literature: IIUC the most popular way philosophers ground semantics is in the role played by some symbol/signal in the evolutionary environment. I view this approach as a sort of placeholder: it’s definitely not the “right” way to ground semantics, but philosophy as a field is using it as a stand-in until people work out better models of grounding (regardless of whether the philosophers themselves know that they’re doing so). This is potentially relevant to the “representation of a problem” part of general-purpose search.
I’m curious which parts of the Goal Realism section you find “philosophically confused,” because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.
(I’ll briefly comment on each section, feel free to double-click.)
Against Goal Realism: Huemer… indeed seems confused about all sorts of things, and I wouldn’t consider either the “goal realism” or “goal reductionism” picture solid grounds for use of an indifference principle (not sure if we agree on that?). Separately, “reductionism as a general philosophical thesis” does not imply the thing you call “goal reductionism”—for instance one could reduce “goals” to some internal mechanistic thing, rather than thinking about “goals” behaviorally, and that would be just as valid for the general philosophical/scientific project of reductionism. (Not that I necessarily think that’s the right way to do it.)
Goal Slots Are Expensive: just because it’s “generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules” doesn’t mean the end-to-end trained system will turn out non-modular. Biological organisms were trained end-to-end by evolution, yet they ended up very modular.
Inner Goals Would Be Irrelevant: I think the point this section was trying to make is something I’d classify as a pointer problem? I.e. the internal symbolic “goal” does not necessarily neatly correspond to anything in the environment at all. If that was the point, then I’m basically on-board, though I would mention that I’d expect evolution/SGD/cultural evolution/within-lifetime learning/etc to drive the internal symbolic “goal” to roughly match natural structures in the world. (Where “natural structures” cashes out in terms of natural latents, but that’s a whole other conversation.)
Goal Realism Is Anti-Darwinian: Fodor obviously is deeply confused, but I think you’ve misdiagnosed what he’s confused about. “The physical world has no room for goals with precise contents” is somewhere between wrong and a non sequitur, depending on how we interpret the claim. “The problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter” is correct, but very incomplete as a response to Fodor.
Goal Reductionism Is Powerful: While most of this section sounds basically-correct as written, the last few sentences seem to be basically arguing for behaviorism for LLMs. There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.
This isn’t a proper response to the post, but since I’ve occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:
This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
There does exist a version of the counting argument which basically works.
The version which works routes through compression and/or singular learning theory.
In particular, that version would talk about “goal-slots” (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the “counting argument for overfitting” from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
Just remembered I walked through basically the good version of the counting argument in this section of What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?
The “Against Goal Realism” section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it’s making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don’t provide much direct evidence relevant to either of those.
Pretty decent post overall.
Edited, thanks.
Third, the nontrivial prediction of #20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20. (And if we can’t agree on this, I don’t see what we can agree on.)
I indeed disagree with that, and I see two levels of mistake here. At the object level, there’s a mistake of not thinking through the gears. At the epistemic level, it looks like you’re trying to apply the “what would I have expected in advance?” technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)
First, object-level: let’s walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that’s how we usually train them). If we relabel 1′s as 7′s 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its “real underlying uncertainty”, which we’d expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.
What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper) then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.
The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky’s #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (… insofar as we should update at all from this experiment, which we shouldn’t very much.)
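Those gears fit in a few lines of code (a toy model of mine, not the paper’s actual setup): a calibrated predictor trained on data where true 1s are relabeled as 7s with probability p will assign probability p to “7” on images of a 1, and argmax decoding ignores that entirely until p crosses 0.5:

```python
# Toy model: under top-1 (argmax) decoding, accuracy on true-1 images is
# flat at 1.0 until the noise rate p reaches 0.5, then collapses, matching
# the qualitative shape of the curves in the paper.
def top1_accuracy_on_true_ones(p_noise: float) -> float:
    # calibrated predictive distribution over labels for an image of a 1
    predictive = {"1": 1.0 - p_noise, "7": p_noise}
    if predictive["1"] > predictive["7"]:
        return 1.0  # correct label still wins the argmax
    if predictive["1"] < predictive["7"]:
        return 0.0  # the systematic mislabel now wins
    return 0.5      # exact tie, split arbitrarily

for p in [0.0, 0.2, 0.4, 0.5, 0.6]:
    print(p, top1_accuracy_on_true_ones(p))
# → 1.0, 1.0, 1.0, 0.5, 0.0 respectively
```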
Second, epistemic-level: my best guess is that you’re ignoring these gears because they’re not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias.
Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. “Which things are we actually measuring?” is itself usually figured out (if it’s figured out at all) by looking at data from the experiment.
Now, this is still compatible with using the “what would I have expected in advance?” technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is “this experiment will mostly measure some random-ass thing which has little to do with the model/theory I’m interested in, and I’ll have to dig through the details of the experiment and results to figure out what it measured”. If one tries to apply the “what would I have expected in advance?” technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.
[1] Standard disclaimer about guessing what’s going on inside other people’s heads being hard, you have more data than I on what’s in your head, etc.
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?
Let’s start with this.
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
The iterative design loop hasn’t failed yet.
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term, they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be misled by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al. paper just now:
They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven’t put that much care into it.)
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
Is the Waldo picture at the end supposed to be Holden, or is that accidental?
The linked abstract describes how
[good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.
Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.
The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. Their upper-limit for biased noise made the second-most-probable label equal in probability to the correct one, and in that case the predictor’s generalization accuracy plummeted from near-90% (when the correct label was only slightly more probable than the next-most-probable) to only ~50%.
How this relates to lethality #20: part of what “regular, compactly describable, predictable errors” is saying is that there will be (predictable) cases where the label most probably assigned by a human labeller is not correct (i.e. it’s not what a smart well-informed human would actually want if they had all the relevant info and reflected on it). What the results of the linked paper predict, in that case, is that the net will learn to assign the “incorrect” label—the one which human labellers do, in fact, choose more often than any other. (Though, to be clear, I think this experiment is not very highly relevant one way or the other.)
As for OpenAI’s weak-to-strong results...
I had some back-and-forth about those in a private chat shortly after they came out, and the main thing I remember is that it was pretty tricky to back out the actually-relevant numbers, but it was possible. Going back to the chat log just now, this is the relevant part of my notes:
Rough estimate: on the NLP task the weak model has like 60% accuracy (fig 2).
In cases where the weak model is right, the strong student agrees with it in like 90% of cases (fig 8b). So, on ~6% of cases (10% * 60%), the strong student is wrong by “just being dumb”.
In cases where the weak model is wrong, the strong student’s agreement is very compute-dependent, but let’s pick a middle number and call it 70% (fig 8c). So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.
So in this particular case, the strong student is wrong about 34% of the time, and 28 of those percentage points are attributable to overfitting to weak supervision.
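For the record, the bookkeeping above is just (numbers are the rough figure-readings from my notes, not exact values from the paper):

```python
# Back-of-envelope from the notes above (rough figure-readings, not exact
# numbers from the weak-to-strong paper):
weak_acc = 0.60        # weak supervisor's accuracy on the NLP task (fig 2)
agree_if_right = 0.90  # student agrees with weak model when it's right (fig 8b)
agree_if_wrong = 0.70  # student agrees with weak model when it's wrong (fig 8c)

wrong_just_dumb = (1 - agree_if_right) * weak_acc  # disagrees with a correct label
wrong_overfit = agree_if_wrong * (1 - weak_acc)    # copies an incorrect label
total_wrong = wrong_just_dumb + wrong_overfit

print(round(wrong_just_dumb, 2), round(wrong_overfit, 2), round(total_wrong, 2))
# → 0.06 0.28 0.34
```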
(Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in.
So overall, I definitely maintain that the empirical evidence is solidly in favor of Doomimir’s story here. (And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.)
So that example SWE bench problem from the post:
… is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.
(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can’t do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)