Oops. Then I don’t get what techniques you are proposing. Like, most techniques that claim to work for superintelligence / powerful agents also claim to work, in some more limited manner, for current agents (in part because most techniques assume that no phase change occurs between now and then, or that any phase change doesn’t affect the technique, so if the technique stops working it does so gradually and one can do empirical studies on current models).
And while there certainly is some loss function or initial random seed for current techniques that gives you an aligned superintelligence, there’s no way to find it.
Kindness may also have an attractor, or, due to discreteness, occupy a volume > 0 in weight space.
The question is whether the attractor is big enough. And given that there are various impossibility theorems related to corrigibility & coherence, I anticipate that the attractor around corrigibility is quite small, because one has to evade several obstacles at once. On the other hand, proxies that flow into a non-corrigible location once we ramp up intelligence aren’t obstructed by the same theorems, so they can be just as numerous as proxies for kindness.
Regarding your concrete attractor: if the AI doesn’t improve its world model and decisions, i.e. its intelligence, then it’s also not useful to us. And a human in the loop doesn’t help if the AI’s proposals are inscrutable to us, because then we’ll just wave them through and are essentially not in the loop anymore. A corrigible AI can be trusted with improving its intelligence because it only does so in ways that preserve its corrigibility.