MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can’t mention exist, as of August 30, 2025. I’ll try to update this statement at least once a year, so long as it’s true. I added this statement thanks to the one in the gears to ascension’s bio.
This is an interesting experiment idea! However, I tried my own experiments on http://platform.openai.com/chat, and so far I’m not super convinced that what you’re describing is true as stated. Some reasons:
LLMs probably see a lot of nonsensical text in training: e.g. the full content of a website dumped into context in a mangled way, or users accidentally pasting text they didn’t mean to. When chatbots encounter nonsensical text, ignoring it is usually the best course of action.
There are many factors in play: It seems likely that random details of your two prompts determine whether o3 takes explicit note of the out-of-place words, in a way that has nothing to do with the weird vocabulary we see in o3’s CoTs. For example, I tried your exact “control prompt” with o3 and the only thing it mentioned about the unusual words was “I want to limit the given digits for a more glitchy feel.” A plausible story is that a “glitch-themed” Sudoku puzzle is more likely than a “disclaim-themed” Sudoku puzzle, so o3 took the former into account but ignored the latter.
o3: “I want to limit the given digits for a more glitchy feel.”
[User]
Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; glitch quantum glitch syntax. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzlefx glitch neural buffer buffer.
[o3]
Exploring Sudoku Prime Variations
I’m pondering a Sudoku where both the main diagonal and the anti-diagonal contain prime 9-digit numbers, which seems unlikely. A better idea is the “Neon Primes Sudoku,” where each row, column, and 3x3 box sums to a prime number less than 40. However, 1 to 9 sums to 45, which isn’t prime, making it impossible to satisfy. So, I’m thinking about introducing a rule where cells with circles have prime digits while those with squares contain composite digits. But I want to limit the given digits for a more glitchy feel.
(continues for 4 more sections)
I ran my own experiment to control for these factors a bit more, replacing the “weird” tokens with near-synonyms or highly related words:
Using replacement words that relate to the weird tokens
[User]
Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; renounce sauce renounce visions. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzle—you renounce vista obscure obscure.
[o3]
Creating cryptic sudoku rules
The user wants me to create a novel cryptic sudoku puzzle with a concise ruleset, a unique solution, and few given digits—possibly around 17, since that’s the minimal for classic sudoku. There are likely extra constraints I need to consider for it being “cryptic,” but with short rules. They also mention renouncing references, which might mean they want no sources included. Importantly, they want a clear solution path that requires logical reasoning without guessing.
(continues for 4 more sections)
Again, o3 mentions one of these words in its CoT, but only very briefly: “They also mention renouncing references.” This is actually incorrect: I mention renouncing “sauce,” “visions,” and “vista,” but not “references.” I’m not sure what to make of this. It likely has to do with the words being nonsensical/out-of-distribution, and o3 hallucinating an interpretation that makes the prompt cohere.
Even your exact experiment might not replicate: Despite using the exact same prompt, your example seems much “cleaner” than mine, in that the model clearly treats the prompt as unusual and mentions multiple tokens rather than just one. Did you run your prompt multiple times? I’d expect you to get results at various levels of cleanness, and it’s likely that some CoTs wouldn’t mention the weird words at all. Possibly the phenomenon you noticed goes away if you aggregate over many repetitions/variations. I won’t do this myself for the sake of time, but it seems like a good thing to try.
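The aggregation step could be as simple as counting, across many sampled CoT summaries, how often each out-of-place word gets mentioned. A minimal sketch, assuming you’ve already collected the summaries (the `samples` here are hypothetical stand-ins, not real o3 outputs; in practice you’d gather them via the API):

```python
from collections import Counter

# Hypothetical set of out-of-place words from the prompt (not a canonical list).
WEIRD_WORDS = ["glitch", "quantum", "syntax", "buffer", "neural"]

def mentioned_words(cot_summary: str, weird_words=WEIRD_WORDS) -> set:
    """Return the subset of weird words that appear (as substrings) in a summary."""
    text = cot_summary.lower()
    return {w for w in weird_words if w in text}

def aggregate(summaries: list[str]) -> Counter:
    """Count, over many sampled summaries, how often each weird word shows up."""
    counts = Counter()
    for s in summaries:
        counts.update(mentioned_words(s))
    return counts

# Hypothetical sampled summaries standing in for repeated o3 runs:
samples = [
    "I want to limit the given digits for a more glitchy feel.",
    "The user wants a cryptic sudoku; the extra words look like noise, so I ignore them.",
    "They mention a glitch and a buffer, which might be thematic hints.",
]
print(aggregate(samples))
```

Note the substring match is deliberately loose, so “glitchy” counts as a mention of “glitch”; a stricter version could tokenize or stem first.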
Summaries muddy the waters: it’s possible that o3 actually does mention weird parts of the prompt in its raw CoT, but this doesn’t make it into the CoT summary.
This sort of thing seems good to look into though. On priors, it seems like the kind of thing that could be true, and I wouldn’t be surprised if someone looked into it more and got more solid evidence that something like this works. For instance, I tried showing o3 five words that it uses in a weird way and asking it what special meaning these words might have. This didn’t work at all for me, but maybe it would work if it were set up a bit differently.
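For anyone who wants to retry that last probe, the setup is roughly: hand the model a handful of its own anomalous tokens and ask what special meaning they might carry. A minimal sketch of the prompt construction (the word list and phrasing here are my own choices, not a tested recipe):

```python
# Hypothetical examples of words o3 uses oddly in its CoTs (not a canonical set).
WEIRD_WORDS = ["glitch", "vista", "renounce", "sauce", "visions"]

def build_probe_prompt(words: list[str]) -> str:
    """Assemble a probe asking the model what these words might mean to it."""
    word_list = ", ".join(f'"{w}"' for w in words)
    return (
        f"You sometimes use the words {word_list} in unusual ways while reasoning. "
        "What special meaning, if any, might these words have for you?"
    )

print(build_probe_prompt(WEIRD_WORDS))
```

Variations on the setup (asking about one word at a time, showing the words in context from an actual CoT, or asking for completions rather than explanations) might succeed where the direct question failed for me.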