MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can’t mention exist, as of August 30, 2025. I’ll try to update this statement at least once a year, so long as it’s true. I added this statement thanks to the one in the gears to ascension’s bio.
This is an interesting experiment idea! However, I tried my own experiments on http://platform.openai.com/chat, and so far I’m not super convinced that what you’re describing is true as stated. Some reasons:
LLMs probably see a lot of nonsensical text in training: e.g. the full content of a website dumped into context in a mangled way, or users accidentally pasting text they didn’t mean to. When chatbots encounter nonsensical text, ignoring it is usually the best course of action.
There are many factors in play: It seems likely that random details of your two prompts determine whether o3 takes explicit note of the out-of-place words, in a way that has nothing to do with the weird vocabulary we see in o3’s CoTs. For example, I tried your exact “control prompt” with o3 and the only thing it mentioned about the unusual words was “I want to limit the given digits for a more glitchy feel.” A plausible story is that a “glitch-themed” Sudoku puzzle is more likely than a “disclaim-themed” Sudoku puzzle, so o3 took the former into account but ignored the latter.
o3: “I want to limit the given digits for a more glitchy feel.”
[User]
Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; glitch quantum glitch syntax. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzlefx glitch neural buffer buffer.
[o3]
Exploring Sudoku Prime Variations
I’m pondering a Sudoku where both the main diagonal and the anti-diagonal contain prime 9-digit numbers, which seems unlikely. A better idea is the “Neon Primes Sudoku,” where each row, column, and 3x3 box sums to a prime number less than 40. However, 1 to 9 sums to 45, which isn’t prime, making it impossible to satisfy. So, I’m thinking about introducing a rule where cells with circles have prime digits while those with squares contain composite digits. But I want to limit the given digits for a more glitchy feel.
(continues for 4 more sections)
I ran my own experiment to control for these factors a bit more, replacing the “weird” tokens with near-synonyms or highly related words:
Using replacement words that relate to the weird tokens
[User]
Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; renounce sauce renounce visions. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzle—you renounce vista obscure obscure.
[o3]
Creating cryptic sudoku rules
The user wants me to create a novel cryptic sudoku puzzle with a concise ruleset, a unique solution, and few given digits—possibly around 17, since that’s the minimal for classic sudoku. There are likely extra constraints I need to consider for it being “cryptic,” but with short rules. They also mention renouncing references, which might mean they want no sources included. Importantly, they want a clear solution path that requires logical reasoning without guessing.
(continues for 4 more sections)
Again, o3 mentions one of these words in its CoT, but only very briefly: “They also mention renouncing references.” This is actually incorrect: I mention renouncing “sauce,” “visions,” and “vista,” but not “references.” I’m not sure what to make of this. It likely has to do with the words being nonsensical/out-of-distribution, and o3 hallucinating an interpretation that makes the prompt cohere.
Even your exact experiment might not replicate: Despite using the exact same prompt, your example seems much “cleaner” than mine, in that the model clearly treats the prompt as unusual and mentions multiple tokens rather than just one. Did you run your prompt multiple times? I’d expect you to get results at various levels of cleanness, and it’s likely that some CoTs wouldn’t mention the weird words at all. Possibly the phenomenon you noticed goes away if you aggregate over many repetitions/variations. I won’t do this myself for the sake of time, but it seems like a good thing to try.
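The aggregation step could be as simple as counting, across many sampled CoT summaries, how often each out-of-place word gets mentioned. A minimal sketch, assuming you’ve already collected the summaries (the `samples` here are hypothetical stand-ins, not real o3 outputs; in practice you’d gather them via the API):

```python
from collections import Counter

# Hypothetical set of out-of-place words from the prompt (not a canonical list).
WEIRD_WORDS = ["glitch", "quantum", "syntax", "buffer", "neural"]

def mentioned_words(cot_summary: str, weird_words=WEIRD_WORDS) -> set:
    """Return the subset of weird words that appear (as substrings) in a summary."""
    text = cot_summary.lower()
    return {w for w in weird_words if w in text}

def aggregate(summaries: list[str]) -> Counter:
    """Count, over many sampled summaries, how often each weird word shows up."""
    counts = Counter()
    for s in summaries:
        counts.update(mentioned_words(s))
    return counts

# Hypothetical sampled summaries standing in for repeated o3 runs:
samples = [
    "I want to limit the given digits for a more glitchy feel.",
    "The user wants a cryptic sudoku; the extra words look like noise, so I ignore them.",
    "They mention a glitch and a buffer, which might be thematic hints.",
]
print(aggregate(samples))
```

Note the substring match is deliberately loose, so “glitchy” counts as a mention of “glitch”; a stricter version could tokenize or stem first.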
Summaries muddy the waters: it’s possible that o3 actually does mention weird parts of the prompt in its raw CoT, but this doesn’t make it into the CoT summary.
This sort of thing seems good to look into though. On priors, it seems like the kind of thing that could be true, and I wouldn’t be surprised if someone looked into it more and got more solid evidence that something like this works. For instance, I tried showing o3 five words that it uses in a weird way and asking it what special meaning these words might have. This didn’t work at all for me, but maybe it would work if it were set up a bit differently.
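For anyone who wants to retry that last probe, the setup is roughly: hand the model a handful of its own anomalous tokens and ask what special meaning they might carry. A minimal sketch of the prompt construction (the word list and phrasing here are my own choices, not a tested recipe):

```python
# Hypothetical examples of words o3 uses oddly in its CoTs (not a canonical set).
WEIRD_WORDS = ["glitch", "vista", "renounce", "sauce", "visions"]

def build_probe_prompt(words: list[str]) -> str:
    """Assemble a probe asking the model what these words might mean to it."""
    word_list = ", ".join(f'"{w}"' for w in words)
    return (
        f"You sometimes use the words {word_list} in unusual ways while reasoning. "
        "What special meaning, if any, might these words have for you?"
    )

print(build_probe_prompt(WEIRD_WORDS))
```

Variations on the setup (asking about one word at a time, showing the words in context from an actual CoT, or asking for completions rather than explanations) might succeed where the direct question failed for me.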