This model does allow for that. :) We can use this model whenever our two agents agree predictively about some parts of the world X; it’s totally fine if our two agents learned their models from different sources and/or make different predictions about other parts of the world.
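To spell out the condition (notation mine, added for illustration, not quoted from the post): “agree predictively about X” just means both agents’ models assign the same distribution to X, even while they may disagree about everything else:

$$P_1[X] = P_2[X], \qquad \text{while possibly } P_1[Y] \neq P_2[Y] \text{ for other parts of the world } Y.$$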
johnswentworth
Why Care About Natural Latents?
You are a scholar and a gentleman.
Way back in the halcyon days of 2005, a company called Cenqua had an April Fools’ Day announcement for a product called Commentator: an AI tool which would comment your code (with, um, adjustable settings for usefulness). I’m wondering (1) if anybody can find an archived version of the page (the original seems to be gone), and (2) if there’s now a clear market leader for that particular product niche, but for real.
The word for a drug that causes loss of memory is “amnestic”, not “amnesic”. The word “amnesic” is a variant spelling of “amnesiac”, which is the person who takes the drug.
Thanks! Fixed now.
Oh yeah, I guess that could be a learning effect. When reading it I assumed the lack of need for repeating the numbers was just because the drug was wearing off.
Another class of applications which we discussed at the retreat: person 1 takes the amnestic, person 2 shares private information with them, and then person 1 gives their reaction to the private information. Can be used e.g. for complex negotiations: maybe it is in our mutual best interest to make some deal, but in order for me to know that I’d need some information which you don’t want to share with me, so I take the drug, you share the information, and I record some verified record of myself saying “dear future self, you should in fact take this deal”.
… which is cool in theory but I would guess not of high immediate value in practice, which is why the post didn’t focus on it.
I would love to hear suggestions for other things I could try. If you have any, let me know in a comment!
Some Experiments I’d Like Someone To Try With An Amnestic
Do you know what the drug was which did this?
Nitpick: you’re talking about the discovery of the structure of DNA; it was already known at that time to be the molecule which mediates inheritance IIRC.
I buy this argument.
I don’t buy mathematical equivalence as an argument against, in this case, since the whole point of the path integral formulation is that it’s mathematically equivalent but far simpler conceptually and computationally.
Man, that top one was a mess. Fixed now, thank you!
Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I’ve already removed many examples from this list which I knew to have multiple independent discoverers (e.g. CRISPR and general relativity). If you’re familiar with the history of any of these enough to say that they clearly were/weren’t very counterfactual, please leave a comment.
Noether’s Theorem
Mendel’s Laws of Inheritance
Gödel’s First Incompleteness Theorem (Claude mentions von Neumann as an independent discoverer of the Second Incompleteness Theorem)
Feynman’s path integral formulation of quantum mechanics
Onnes’ discovery of superconductivity
Pauling’s discovery of the alpha helix structure in proteins
McClintock’s work on transposons
Observation of the cosmic microwave background
Lorenz’s work on deterministic chaos
Prusiner’s discovery of prions
Yamanaka factors for inducing pluripotency
Langmuir’s adsorption isotherm (I have no idea what this is)
[Question] Examples of Highly Counterfactual Discoveries?
I somehow missed that John Wentworth and David Lorell are also in the middle of a sequence on this same topic here.
Yeah, uh… hopefully nobody’s holding their breath waiting for the rest of that sequence. That was the original motivator, but we only wrote the one post and don’t have any more in development yet.
Point is: please do write a good stat mech sequence; David and I are not really “on that ball” at the moment.
(Didn’t read most of the dialogue, sorry if this was covered.)
But the way transformers work is they greedily think about the very next token, and predict that one, even if by conditioning on it you shot yourself in the foot for the task at hand.
That depends on how we sample from the LLM. If, at each “timestep”, we take the most-probable token, then yes that’s right.
But an LLM gives a distribution over tokens at each timestep, i.e. $P[x_t \mid x_{<t}]$. If we sample from that distribution, rather than take the most-probable at each timestep, then that’s equivalent to sampling non-greedily from the learned distribution over text. It’s the chain rule:

$$P[x_1, \dots, x_n] = \prod_{t=1}^{n} P[x_t \mid x_{<t}]$$
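To make the greedy-vs-sampled distinction concrete, here’s a minimal decoding sketch (my own illustration, not anyone’s actual decoding code; `model` is a hypothetical stand-in for any autoregressive LLM that maps a token prefix to next-token logits). The only difference between greedy decoding and exact sampling from the learned distribution over text is one line:

```python
import torch

def sample_sequence(model, prompt_ids, max_new_tokens, greedy=False):
    """Decode token-by-token from an autoregressive model.

    greedy=True  -> take the argmax of P[x_t | x_{<t}] at each step.
    greedy=False -> sample each conditional, which by the chain rule
                    yields an exact sample from the joint P[x_1, ..., x_n].
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(torch.tensor([ids]))[0, -1]  # next-token logits
        probs = torch.softmax(logits, dim=-1)
        if greedy:
            next_id = int(torch.argmax(probs))           # most-probable token
        else:
            next_id = int(torch.multinomial(probs, 1))   # draw from P[x_t | x_{<t}]
        ids.append(next_id)
    return ids
```

With `greedy=False`, each step draws from the conditional, so by the chain rule above the completed sequence is an exact draw from the joint distribution over text; no lookahead is required.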
Personally, I consider model-based RL to be not RL at all. I claim that either one needs to consider model-based RL to be not RL at all, or one needs to accept such a broad definition of RL that the term is basically-useless (which I think is what porby is saying in response to this comment, i.e. “the category of RL is broad enough that belonging to it does not constrain expectation much in the relevant way”).