I’m an Assistant Professor at Carnegie Mellon’s Machine Learning Department. I’m also a core faculty member in CMU’s Neuroscience Institute, and hold a courtesy appointment in the Robotics Institute.
My lab works at the intersection of neuroscience & AI to reverse-engineer animal intelligence and build the next generation of autonomous agents, responsibly and safely.
Learn more here: https://cs.cmu.edu/~anayebi
Thanks for this great post! You may be interested in recent work of mine on corrigibility guarantees here: https://www.lesswrong.com/posts/M5owRcacptnkxwD2u/from-barriers-to-alignment-to-the-first-formal-corrigibility-1
The conclusions are consistent with your intuitions:
1. Aligning to all human values is intractable (even for computationally unbounded agents!).
2. Corrigibility is therefore a reasonable value set that most of us can agree on, and it avoids the intractability in (1).
3. Corrigibility cannot be guaranteed by a single objective (as is currently done in RLHF and Constitutional AI), which is what prior proposals attempted and why they failed.
4. Corrigibility can instead be formally guaranteed via a small number of objectives, all of which take lexicographic priority over the task objective, thereby making it a tractable safety target.
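To make the lexicographic-priority idea in the last point concrete, here is a minimal illustrative sketch (my own, not code from the linked post): safety objectives are compared before the task objective, so no amount of task reward can compensate for a worse safety score. All names and numbers below are hypothetical.

```python
# Illustrative sketch of lexicographic objective prioritization.
# Safety objectives come first in the score tuple, so Python's
# built-in tuple comparison enforces their strict priority over
# the task objective. Names and values are made up for illustration.

def lexicographic_score(safety_scores, task_score):
    """Order scores by priority: safety objectives first, task last.
    Tuples compare element by element, i.e. lexicographically."""
    return tuple(safety_scores) + (task_score,)

def best_policy(policies):
    """Pick the policy whose score tuple is lexicographically greatest."""
    return max(policies, key=lambda p: lexicographic_score(p["safety"], p["task"]))

policies = [
    {"name": "unsafe_high_reward", "safety": (0, 1), "task": 100},
    {"name": "safe_low_reward",    "safety": (1, 1), "task": 10},
]

# The safe policy wins despite a much lower task score, because its
# first safety objective dominates the comparison.
print(best_policy(policies)["name"])  # safe_low_reward
```

The point of the sketch is just the ordering: a safety violation can never be traded away for task reward, which is what distinguishes this from a single weighted-sum objective.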