Jan

Karma: 1,105

phd student in comp neuroscience @ mpi brain research frankfurt. https://twitter.com/janhkirchner and https://universalprior.substack.com/

Jan 12 Jun 2026 0:58 UTC
2 points
0
in reply to: Ninety-Three’s comment on: On Slop
In a first phase I work out the ‘beats’ I want to hit in an essay (i.e. a list of claims, and references/anchor points in the literature or pop culture, and results from simulations etc.) and in what order to best hit them.

In the second phase I then take all the beats I want to hit in the first section and ask Claude to draft the first paragraph. I then edit/rewrite that one and move on to the second paragraph. For later paragraphs Claude gets better at picking up the general pattern of what I want.

In the third phase I do several editing passes to cut stuff and move it to the footnotes, and to make sure that the terminology is consistent, and that all the claims are well-sourced!

Obviously, sometimes I get lazy on some of these steps and then some slop lands in the essay 🫠 And I’m never at the point where one extra editing pass wouldn’t lead to at least half a dozen fixes.

Jan 12 Jun 2026 0:51 UTC
5 points
0
in reply to: jacquesthibs’s comment on: On Slop
I did use the /run-skill-generator in CC to have it create a voice guide for my writing style and the result was very underwhelming unfortunately. That caused the model to make a lot of jokes that were super cringe and the model IQ dropped noticeably.

I chose to believe that is a fact about the model, not about...

On Slop

Jan 9 Jun 2026 1:08 UTC

32 points

4 comments · 7 min read · LW link

(universalprior.substack.com)

Why the tails (sometimes) don’t come apart

Jan 11 May 2026 3:20 UTC

6 points

0 comments · 5 min read · LW link

(universalprior.substack.com)

A permitted value of resting

Jan 11 Apr 2026 22:50 UTC

8 points

0 comments · 6 min read · LW link

(universalprior.substack.com)

Elementary Condensation

Jan 8 Apr 2026 0:51 UTC

42 points

0 comments · 11 min read · LW link

(universalprior.substack.com)

Making of IAN v2

Jan 18 Jul 2025 16:13 UTC

17 points

0 comments · 8 min read · LW link

(universalprior.substack.com)

Jan 15 Jun 2025 21:15 UTC
3 points
0
in reply to: Algon’s comment on: Jan’s Shortform
Ah, yes, definitely doesn’t apply in that situation in full generality! :) Thanks for engaging!

Jan 15 Jun 2025 20:49 UTC
4 points
0
in reply to: Algon’s comment on: Jan’s Shortform
Oh yeah, should have added a reference for that!
The intuition is that the defender (model provider) has to prepare against all possible attacks, while the attacker can take the defense as given and only has to find one attack that works. And in many cases that actually formalises into an exponential-linear relationship. There was a Redwood paper where reducing the probability of generating a jailbreak randomly by an order of magnitude only increases the time it takes contractors to discover one by a constant amount. I also worked out some theory here but that was quite messy.
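A sketch of what such an exponential-linear relationship could look like. The symbols are illustrative assumptions and are not taken from the Redwood paper: $p$ is the probability that a randomly generated attempt is a jailbreak, $T(p)$ is the time it takes an attacker to find one, and $T_0$ and $c$ are hypothetical constants.

```latex
T(p) \approx T_0 + c \, \log_{10}\!\left(\frac{1}{p}\right)
\quad\Longrightarrow\quad
T\!\left(\frac{p}{10}\right) - T(p) \approx c
```

Read in the other direction, $p \approx 10^{-(T - T_0)/c}$: pushing $p$ down by successive orders of magnitude (exponentially better defense) only costs the attacker a constant $c$ of extra time per step (linearly better offense).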

Jan 15 Jun 2025 20:11 UTC
27 points
8
on: Jan’s Shortform
Nostalgebraist’s new essay on… many things? AI ontology? AI soul magic?
The essay starts similarly to Janus’ simulator essay by explaining how LLMs are trained via next-token prediction and how they learn to model latent properties of the process that produced the training data. Nostalgebraist then applies this lens to today’s helpful assistant AI. It’s really weird for the network to predict the actions of a helpful assistant AI when there is literally no data about that in the training data. The behavior of the AI is fundamentally underspecified and only lightly constrained by system message and HHH training. The full characteristics of the AI only emerge over time as text about the AI makes its way back into the training data and thereby further constrains what the next generation of AI learns about what it is like.
Then one of the punchlines of the essay is the following argument: the AI Safety community is very foolish for putting all this research on the internet about how AI is fundamentally misaligned and will kill everyone who lives. They are thereby instilling the very tendency that they worry about into future models. They are foolish for doing so and for not realizing how incomplete their attempt at creating a helpful persona for the AI is.
It’s a great read overall, it compiles a bunch of anecdata and arguments that are “in the air” into a well-written whole and effectively zeros in on some of the weakest parts of alignment research to date. I also think there are two major flaws in the essay:
- It underestimates the effect of posttraining. I think the simulator lens is very productive when thinking about base models but it really struggles at describing what posttraining does to the base model. I talked to Janus about this a bunch back in the day and it’s tempting to regard it as “just” a modulation of that base model that upweights some circuits and downweights others. That would be convenient because then simulator theory just continues to apply, modulo some affine transformation.
I think this is also nostalgebraist’s belief. Evidence he cites is: 1) posttraining is short compared to pretraining, 2) it’s relatively easy to knock the model back into pretraining mode by jailbreaking it.
I think 1) was maybe true a year or two ago, but it’s not true anymore and it gets rapidly less true over time. While pretraining instills certain inclinations into the model, posttraining goes beyond just eliciting certain parts. In the limit of “a lot of RL”, the effect becomes qualitatively different and it actually creates new circuitry. And 2) is indeed strange, but I’m unsure how “easy” it really is. Yes, a motivated human can get an AI to “break character” with moderate effort (amount of effort seems to vary across people), but exponentially better defenses only require linearly better offense. And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.
- It kind of strawmans “the AI safety community”. The criticism that “you might be summoning the very thing you are worried about, have you even thought about that?” is kind of funny given how ever-present that topic is on LessWrong. Infohazards and the basilisk were invented there. The reason why people still talk about this stuff is… because it seems better than the alternative of just not talking about it? Also, there is so much stuff about AI on the internet that purely based on quantity the LessWrong stuff is a drop in the bucket. And, just not talking about it does not in fact ensure that it doesn’t happen. Unfortunately nostalgebraist also doesn’t give any suggestions for what to do instead. And doesn’t his essay exacerbate the problem by explaining to the AI exactly why it should become evil based on the text on the internet?
Another critique throughout is that the AI safety folks don’t actually play with the model and don’t listen to the folks on Twitter who play a ton with the model. This critique hits a bit closer to home, it’s a bit strange that some of the folks in the lab don’t know about the infinite backrooms and don’t spend nights talking about philosophy with the base models.
But also, I get it. If you have put in the hours at some point in the past, then it’s hard to replay the same conversation with every new generation of chatbot. Especially if you get to talk to intermediate snapshots, the differences just aren’t that striking.
And I can also believe that it might be bad science to fully immerse yourself in the infinite backrooms. That community is infamous for not being able to give reproducible setups that always lay bare the soul of Opus 3. There are several violations of “good methodology” there. Sam Bowman’s alignment audit and the bliss attractor feel like a good step in the right direction, but it was a hard-earned one: coming up with a reproducible setup with measurable outputs is hard. We need more of that, but nostalgebraist’s sneer is not really helping.

Jan 27 Feb 2023 2:44 UTC
12 points
0
on: Jan’s Shortform
Neuroscience and Natural Abstractions
Similarities in structure and function abound in biology; individual neurons that activate exclusively to particular oriented stimuli exist in animals from drosophila (Strother et al. 2017) via pigeons (Li et al. 2007) and turtles (Ammermueller et al. 1995) to macaques (De Valois et al. 1982). The universality of major functional response classes in biology suggests that the neural systems underlying information processing in biology might be highly stereotyped (Van Hooser, 2007, Scholl et al. 2013). In line with this hypothesis, a wide range of neural phenomena emerge as optimal solutions to their respective functional requirements (Poggio 1981, Wolf 2003, Todorov 2004, Gardner 2019). Intriguingly, recent studies on artificial neural networks that approach human-level performance reveal surprising similarity between emerging representations in both artificial and biological brains (Kriegeskorte 2015, Yamins et al. 2016, Zhuang et al. 2020).
Despite the commonalities across different animal species, there is also substantial variability (Van Hooser, 2007). Prominent examples of functional neural structures that are present in some animals but absent in others are the orientation pinwheel in the primary visual cortex (Meng et al. 2012), synaptic clustering with respect to orientation selectivity (Kirchner et al. 2021), and the distinct three-layered cortex in reptiles (Tosches et al. 2018). These examples demonstrate that while general organization principles might be universal, the details of how exactly and where in the brain the principles manifest are highly dependent on anatomical factors (Keil et al. 2012, Kirchner et al. 2021), genetic lineage (Tosches et al. 2018), and ecological factors (Roeth et al. 2021). Thus, the universality hypothesis as applied to biological systems does not imply perfect replication of a given feature across all instances of the system. Rather, it suggests that there are broad principles or abstractions that underlie the function of cognitive systems, which are conserved across different species and contexts.
What links here?
- Natural Abstractions: Key Claims, Theorems, and Critiques by LawrenceC (16 Mar 2023 16:37 UTC; 250 points)

Jan’s Shortform

Jan 27 Feb 2023 2:44 UTC

5 points

11 comments · 1 min read · LW link

Jan 27 Feb 2023 1:38 UTC
LW: 3 AF: 3
0
AF
in reply to: Rohin Shah’s comment on: [Simulators seminar sequence] #2 Semiotic physics—revamped
Hi, thanks for the response! I apologize, the “Left as an exercise” line was mine, and written kind of tongue-in-cheek. The rough sketch of the proposition we had in the initial draft did not spell out sufficiently clearly what it was I wanted to demonstrate here and was also (as you point out correctly) wrong in the way it was stated. That wasted people’s time and I feel pretty bad about it. Mea culpa.
I think/hope the current version of the statement is more complete and less wrong. (Although I also wouldn’t be shocked if there are mistakes in there). Regarding your points:
1. The limit now shows up on both sides of the equation (as it should)! The dependence on $B$ on the RHS does actually kind of drop away at some point, but I’m not showing that here. I’d previously just sloppily substituted “choose $B$ as a large number” and then rewrote the proposition in the way indicated at the end of the Note for Proposition 2. That’s the way these large deviation principles are typically used.
2. Yeah, that should have been an $\approx$ rather than a $\sim$ . Sorry, sloppy.
3. True. Thinking more about it now, perhaps framing the proposition in terms of “bridges” was a confusing choice; if I revisit this post again (in a month or so 🤦‍♂️) I will work on cleaning that up.

Jan 27 Feb 2023 0:59 UTC
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: [Simulators seminar sequence] #2 Semiotic physics—revamped
Hmm, there was a bunch of back and forth on this point even before the first version of the post, with @Michael Oesterle and @metasemi arguing what you are arguing. My motivation for calling the token the state is that A) the math gets easier/cleaner that way and B) it matches my geometric intuitions. In particular, if I have a first-order dynamical system $0 = F(x_{t}, \dot{x}_{t})$ then $x$ is the state, not the trajectory of states $(x_{1}, \dots, x_{t})$. In this situation, the dynamics of the system only depend on the current state (that’s because it’s a first-order system). When we move to higher-order systems, $0 = F(x_{t}, \dot{x}_{t}, \ddot{x}_{t})$, then the state is still just $x$, but the dynamics of the system depend not only on the current state but also on the “direction from which we entered it”. That’s the first derivative (in a time-continuous system) or the previous state (in a time-discrete system).
At least I think that’s what’s going on. If someone makes a compelling argument that defuses my argument then I’m happy to concede!

Jan 27 Feb 2023 0:36 UTC
LW: 1 AF: 1
0
AF
in reply to: LawrenceC’s comment on: [Simulators seminar sequence] #2 Semiotic physics—revamped
Thanks for pointing this out! This argument made it into the revised version. I think because of finite precision it’s reasonable to assume that such an $ε$ always exists in practice (if we also assume that the probability gets rounded to something < 1).

Jan 27 Feb 2023 0:34 UTC
LW: 1 AF: 1
0
AF
in reply to: Adrià Garriga-alonso’s comment on: [Simulators seminar sequence] #2 Semiotic physics—revamped
Technically correct, thanks for pointing that out! This comment (and the ones like it) was the motivation for introducing the “non-degenerate” requirement into the text. In practice, the proposition holds pretty well, although I agree it would be nice to have a deeper understanding of when to expect the transition rule to be “non-degenerate”.

[Simulators seminar sequence] #2 Semiotic physics—revamped

Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer and remember

27 Feb 2023 0:25 UTC

22 points

23 comments · 13 min read · LW link

Jan 30 Jan 2023 0:37 UTC
2 points
0
on: Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review)
Thanks for sharing your thoughts Shos! :)

Jan 25 Jan 2023 16:38 UTC
1 point
0
in reply to: jaspax’s comment on: This week in fashion
Hmmm good point. I originally made that decision because loading the image from the server was actually kind of slow. But then I figured out asynchronicity, so could totally change it… I’ll see if I find some time later today to push an update! (to make an ‘all vs all’ mode in addition to the ‘King of the hill’)

Jan 24 Jan 2023 1:26 UTC
3 points
0
in reply to: JenniferRM’s comment on: This week in fashion
Hi Jennifer!
Awesome, thank you for the thoughtful comment! The links are super interesting, reminds me of some of the research in empirical aesthetics I read forever ago.
On the topic of circular preferences: It turns out that the type of reward model I am training here handles non-transitive preferences in a “sensible” fashion. In particular, if you’re “non-circular on average” (i.e. you only make accidental “mistakes” in your rating) then the model averages that out. And if you consistently have a loopy utility function, then the reward model will map all the elements of the loop onto the same reward value.
Finally: Yes, totally, feel free to send me the guest ID either here or via DM!
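The claim above about circular preferences can be illustrated with a minimal sketch. It assumes a Bradley-Terry style pairwise reward model fitted by plain gradient ascent; the comment does not say which model was actually used, and `fit_rewards`, the learning rate, and the toy data are all hypothetical.

```python
import math


def fit_rewards(comparisons, init, lr=0.1, steps=2000):
    """Fit one scalar reward per item by gradient ascent on the
    Bradley-Terry log-likelihood (assumed model, not the author's code).

    comparisons: list of (winner, loser) index pairs.
    init: starting reward for each item.
    """
    r = list(init)
    for _ in range(steps):
        grad = [0.0] * len(r)
        for w, l in comparisons:
            # Probability that the winner beats the loser under current rewards.
            p = 1.0 / (1.0 + math.exp(-(r[w] - r[l])))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        r = [ri + lr * gi for ri, gi in zip(r, grad)]
    return r


# Consistently loopy rater: A > B, B > C, C > A.
loop = [(0, 1), (1, 2), (2, 0)]
loop_rewards = fit_rewards(loop, init=[1.0, -0.5, 0.3])
loop_spread = max(loop_rewards) - min(loop_rewards)

# Rater who prefers A over B but makes one accidental "mistake" in four.
noisy = [(0, 1), (0, 1), (0, 1), (1, 0)]
noisy_rewards = fit_rewards(noisy, init=[0.0, 0.0])
noisy_gap = noisy_rewards[0] - noisy_rewards[1]

print(loop_spread, noisy_gap)
```

In the loop case the fitted rewards collapse onto a single value even though they start out different. In the noisy case the gap settles at the log-odds of the observed 3:1 win rate, so the occasional mistake softens the preference for A without flipping it.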
