Research associate at the University of Lancaster, studying AI safety. 
My current views on AI:
 
Sohaib Imran
- it’s the kind of terminal value that I expect for most people would be different; guaranteed continuation in 5% of instances is much better than 5% chance of continuing in all instances; in the first case, you don’t die! - And even in the case where we are assigning negative utility to death, most people are really considering counterfactual utility from being alive, and 95% of that (expected) counterfactual utility is lost whether 95% of the “instances of you” die or whether there is a 95% chance that “you” die. 
- I claim that the negative utility due to ceasing to exist is just not there - But we are not talking about negative utility due to ceasing to exist. We are talking about avoiding counterfactual negative utility by committing suicide, which still exists! 
 - guaranteed continuation in 5% of instances is much better than 5% chance of continuing in all instances; in the first case, you don’t die! - I think this is an artifact of thinking of all of the copies having a shared utility (i.e. you) rather than separate utilities that add up (i.e. so many yous will suffer if you don’t commit suicide). If they have separate utilities, we should think of them as separate instances of yourself. 
- What’s the difference between fewer instances and fewer copies, and why is that load-bearing for the expected utility calculation? 
- Yes, but the number of copies of you still reduces (or the probability that you are alive in standard probability theory, or the number of branches in many worlds). Why are these not equivalent in terms of the expected utility calculus? 
- You get a pretty similar dynamic just from having a universe that is large enough to contain many almost-identical copies of you. - Again, not sure why a large universe is needed. The expected utility ends up the same either way, whether you have some fraction of branches in which you remain alive or some probability of remaining alive. - Regarding the expected utility calculus. I agree with everything you said but I don’t see how any of it allows you to disregard the counterfactual suffering from not committing suicide in your expected value calculation. - Maybe the crux is whether we consider the utility of each “you” (i.e. you in each branch) individually, and add it up for the total utility, or whether we consider all “you”s to have just one shared utility. - Let’s say that not committing suicide gives you −1 utility in all $n$ branches, but committing suicide gives you −100 utility in $k$ branches and 0 utility in the remaining $n-k$ branches. - If we treat all copies of you as having separate utilities and add them all up for a total expected utility calculation, not committing suicide gives $-n$ utility while committing suicide leads to $-100k$ utility. Therefore, as long as $k < n/100$, it is better to commit suicide. - If, on the other hand, you treat them as having one shared utility, you get either −1 or −100 utility, and −100 is of course worse. - Do you agree that this is the crux? If so, why do you think that all the copies share one utility rather than their utilities adding up? 
- I don’t think quantum immortality changes anything. You can reframe this in terms of standard probability theory and condition on them continuing to have subjective experience, and still get to the same calculus. - However, only considering the branches in which you survive, or conditioning on having subjective experience after the suicide attempt, ignores the counterfactual suffering prevented in all the branches (or probability mass) in which you did die, which may be less unpleasant than the branches in which you survived, but are many, many more in number! Ignoring those branches biases the reasoning toward rare survival tails that don’t dominate the actual expected utility. 
- In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token - Minor nitpick, but why not instead create a new chat template in which every message contains a user, assistant, and personality-shift assistant turn (or a less unwieldy name)? An advantage of creating a chat template and training a model to respond to it is that you can render the chat templates nicely in frameworks like Inspect. 
- Cool work. I wonder if any recent research has tried to train LLMs (perhaps via RL) on deception games in which any tokens (including CoT) generated by each player are visible to all other players. 
 It would be useful to see whether LLMs can hide their deception from monitors over extended token sequences, and what strategies they come up with to achieve that (e.g. steganography).
- Thanks for writing these up, very insightful results! Did you try repeating these experiments with in-context learning instead of fine-tuning, where there is a conversation history with n user prompts containing a request and the assistant response is always vulnerable code, followed by the unrelated questions to evaluate emergent misalignment? 
- Very interesting. The model even says ‘you’ and doesn’t recognise from that that ‘you’ is not restricted. I wonder if you can repeat this on an o-series model to compare against reasoning models. - Also, instead of asking for a synonym, you could make the question multiple choice, e.g. a) I, b) you, etc. 
- This is probably not CoT. 
- The o1 evaluation doc gives, or appears to give, the full CoT for a small number of examples; that might be an interesting comparison. - Just had a look. One difference between the CoTs in the evaluation doc and the “CoTs” in the screenshots above is that the evaluation doc’s CoTs tend to begin with the model referring to itself, e.g. “We are asked to solve this crossword puzzle.” or “We are told that for all integer values of k”, or to the problem at hand, e.g. “First, what is going on here? We are given:”. DeepSeek R1 tends to do that as well. The screenshots above don’t show that behaviour; instead, they begin by talking in the second person, e.g. “you’re right” and “here’s your table”, which sounds very much like a model response rather than CoT. In fact, the last screenshot is almost the same as the response, and upon closer inspection, all the greyed-out texts would pass as good responses in place of the final responses in black. 
 I think this is strong evidence that the greyed-out text is NOT CoT but perhaps an alternative response. Thanks for prompting me!
- It would mean that R1 is actually more efficient and therefore more advanced than o1, which is possible but not very plausible given its simple RL approach. - I think that is very plausible. I don’t think o1, or even R1 for that matter, is anywhere near as efficient as LLMs can be. OpenAI is probably putting a lot more resources into getting to AGI first than into getting to AGI efficiently. DeepSeek V3 is already miles better than GPT-4o while being cheaper. - I think it’s more likely that o1 is similar to R1-Zero (rather than R1), that is, it may mix languages which doesn’t result in reasoning steps that can be straightforwardly read by humans. A quick inference time fix for this is to do another model call which translates the gibberish into readable English, which would explain the increased CoT time. - I think this is extremely unlikely. Here are some questions that demonstrate why. 
 Do you think OpenAI is using a model to first translate to English, and then another model to generate a summary? Is this conditional on showing the translated CoTs being a feature going forwards? If so, do you expect OpenAI to do this for all CoTs or just the CoTs they intend to show? If the latter, don’t you think there will be a significant difference in the “thinking” time between the responses where the translated CoT is visible and the responses where only the summary is visible?
- A few notable things from these CoTs: 
 1. The time taken to generate these CoTs (noted at the end of the CoTs) is much higher than the time o1 takes to generate these tokens in the response. Therefore, it’s very likely that OpenAI is sampling the best ones from multiple CoTs (or CoT steps with a tree search algorithm), which are the ones shown in the screenshots in the post. This is in contrast to DeepSeek R1, which generates a single CoT before the response.
 2. The CoTs themselves are well structured into paragraphs. At first, I thought this hinted at a tree search over CoT steps with some process reward model, but R1 also structures its CoTs into nice paragraphs.
[Question] Does the ChatGPT (web)app sometimes show actual o1 CoTs now?
- For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive AI results on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on it. - Examples: - The exact scaffolding used by Sakana AI did not propel AGI capabilities as much as the common knowledge it created that LLMs can somewhat do end-to-end science. 
- No amount of scaffolding that the ARC-AGI or FrontierMath team could build would have as much of an impact on AGI capabilities as the benchmarks themselves. These benchmark results basically validated that the direction OpenAI is taking is broadly correct, and I suspect many people who weren’t fully sold on test-time compute will now change strategies as a result of that. 
 
 - Hard benchmarks of meaningful tasks serve as excellent metrics to measure progress, which is great for capabilities research. Of course, they are also very useful for making decisions that need to be informed by an accurate tracking or forecasting of capabilities. - Whether making hard, meaningful benchmarks such as FrontierMath, ARC-AGI, and LLM science is net negative or positive is unclear to me (a load-bearing question is whether the big AGI labs already have internal benchmarks as good as these that they can use instead). I do think, however, that you’d have to be extraordinarily excellent at designing scaffolding (and fine-tuning and the like), and even then spend way too much effort on it, to do significant harm from the scaffolding itself rather than from the benchmark that the scaffolding was designed for. 
- I guess we could in theory fail and only achieve partial alignment, but that seems like a weird scenario to imagine. Like shooting for a 1 in big_number target (= an aligned mind design in the space of all potential mind designs) and then only grazing it. How would that happen in practice? - Are you saying that the 1 aligned mind design in the space of all potential mind designs is an easier target than the subspace composed of mind designs that do not destroy the world? If so, why? Is it a bigger target? Is it more stable? - Can’t you then just ask your pretty-much-perfectly-aligned entity to align itself on that remaining question? - No, because the “you” who can ask (the people in power) are themselves misaligned with the 1 alignment target that perfectly captures all our preferences. 
- And why must alignment be binary? (aligned, or misaligned, where misaligned necessarily means it destroys the world and does not care about property rights) - Why can you not have a superintelligence that is only misaligned when it comes to issues of wealth distribution? - Relatedly, are we sure that CEV is computable? 
- Thanks for writing this! - Could you clarify how the Character/Predictive ground layers in your model are different from Simulacra/Simulator in simulator theory? 
Thanks for the detailed breakdown!
While I agree with most of it, I think Dazzling is different enough from the other types of hidden reasoning that it seems misplaced in this taxonomy.
All the other categories of hidden reasoning are trying to ask the question: “Does the CoT contain sufficient information for the monitor to understand the reasoning process?” Whereas Dazzling is asking different questions such as “Is all of the CoT necessary for understanding the reasoning process?” or “Does the CoT contain token sequences that sabotage/undermine the monitor’s effectiveness?”.