I’m an Astra Fellow working with Redwood Research on high-stakes control methods.
In reverse date order, I have been a:
- MATS 8.1 scholar, mentored by Micah Carroll
  - We wrote the paper Prompt Optimization Makes Misalignment Legible
- Software engineer at Google Gemini
  - Worked part-time with GDM Scalable Alignment on their MONA paper
- President of Cornell Effective Altruism
I enjoy tabletop games (as a player or GM), board games, meditation, partner dancing, bouldering, making music, reading (esp. hard sci-fi/fantasy), podcasts, and hanging out with my friends.
The kind of intellectual work I enjoy often involves thinking about systems, working out what they incentivize, and iterating to improve those incentives.
I have not signed any contracts that I can’t mention exist, as of March 27, 2026. I’ll try to update this statement at least once a year, so long as it’s true. I added this statement thanks to the one in the gears to ascension’s bio.
Seems worth thinking more about. Basically, this is equivalent to regular RL where you always add an “LLM-as-a-judge” term to the reward. That judge happens to be the pre-RL checkpoint of the model you’re training, and it gives a binary reward of either 0 or -∞.
Note that this incentivizes the trained LLM to always care about its output looking good to the judge. Maybe this is not so different from what’s already happening, though.
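To make the setup concrete, here’s a minimal Python sketch of the shaped reward, under my own assumptions; all the names (`shaped_reward`, `judge`, `toy_judge`) are illustrative, not from any particular RL library.

```python
import math
from typing import Callable

def shaped_reward(
    task_reward: float,
    judge: Callable[[str, str], bool],  # (prompt, response) -> approved?
    prompt: str,
    response: str,
) -> float:
    # Binary judge term: 0 if the frozen pre-RL checkpoint approves the
    # output, -inf if it doesn't. A vetoed trajectory is maximally
    # penalized, so the trained policy can never trade task reward
    # against the judge's approval.
    judge_term = 0.0 if judge(prompt, response) else -math.inf
    return task_reward + judge_term

# Toy usage with a stand-in judge that vetoes empty responses:
toy_judge = lambda prompt, response: len(response) > 0
print(shaped_reward(1.0, toy_judge, "2+2?", "4"))  # 1.0
print(shaped_reward(1.0, toy_judge, "2+2?", ""))   # -inf
```

The -∞ makes the judge term a hard constraint rather than a soft penalty: no amount of task reward can compensate for a veto.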