Completed an undergrad in CS and Math at Columbia, where I helped run Columbia Effective Altruism and Columbia AI Alignment Club (CAIAC). I’m pursuing a career in technical AI alignment research (probably).
RohanS
Here’s our current best guess at how the type signature of subproblems differs from e.g. an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there are a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from subproblems.
Most of the things you think of day-to-day as “problems” are, cognitively, subproblems.
Do you have a starting point for formalizing this? It sounds like subproblems are roughly proxies that could be Goodharted if (common sense) background goals aren’t respected. Maybe a candidate starting point for formalizing subproblems, relative to an outermost objective, is “utility functions that closely match the outermost objective in a narrow domain”?
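To make the “narrow-domain proxy” idea concrete, here’s a minimal toy sketch (my own illustration with made-up utility functions, not anything from the thread): a proxy utility that closely tracks the true utility on a narrow domain, but gets Goodharted as soon as the optimizer leaves that domain.

```python
# Toy illustration (hypothetical functions): a proxy that matches the
# true utility on a narrow domain but diverges badly outside it.

def true_utility(x):
    # True objective: increases at first, then declines (peaks at x = 5).
    return x - x * x / 10

def proxy_utility(x):
    # Proxy: agrees closely with true_utility for small x, but just says
    # "more is better" forever.
    return x

# On the narrow domain [0, 1], the proxy tracks the true utility closely.
assert all(
    abs(true_utility(x / 10) - proxy_utility(x / 10)) < 0.11
    for x in range(11)
)

# But hard optimization of the proxy leaves the narrow domain:
# at x = 100 the proxy looks great while true utility has collapsed.
x_star = 100.0
assert proxy_utility(x_star) == 100.0
assert true_utility(x_star) == -900.0
```

The point of the sketch is just that “closely matches the outermost objective in a narrow domain” is exactly the condition under which unconstrained optimization of the proxy can fail, which is why the implicit background constraints seem to be doing real work.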
Notes on “How do we become confident in the safety of a machine learning system?”
Quick Thoughts on Language Models
Lots of interesting thoughts, thanks for sharing!
You seem to have an unconventional view about death informed by your metaphysics (suggested by your responses to 56, 89, and 96), but I don’t fully see what it is. Can you elaborate?
The basic idea of 85 is that we generally agree there have been moral catastrophes in the past, such as widespread slavery. Are there ongoing moral catastrophes? I think factory farming is a pretty obvious one. There’s a philosophy paper called “The Possibility of an Ongoing Moral Catastrophe” that gives more context.
~100 Interesting Questions
A Thorough Introduction to Abstraction
Content and Takeaways from SERI MATS Training Program with John Wentworth
How is there more than one solution manifold? If a solution manifold is a behavior manifold that corresponds to a global minimum of the train loss, and we’re looking at an overparameterized regime, then isn’t there only one solution manifold, the one corresponding to zero train loss?
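For concreteness, here’s the toy picture I have in mind (a made-up two-parameter example, not from the post): with more parameters than data points, the zero-train-loss solutions form a single connected set in parameter space.

```python
# Toy overparameterized model: 2 parameters, 1 data point.
# Every (w1, w2) with w1*x1 + w2*x2 == y achieves exactly zero train
# loss, so the zero-loss solutions form one connected 1-D manifold
# (a line in parameter space).
x1, x2, y = 1.0, 2.0, 3.0

def train_loss(w1, w2):
    # Squared error of the linear model on the single data point.
    return (w1 * x1 + w2 * x2 - y) ** 2

# Parameterize the solution line: pick w2 freely, solve for w1.
for w2 in [0.0, 0.5, 1.0, -2.0]:
    w1 = (y - w2 * x2) / x1
    assert train_loss(w1, w2) < 1e-12
```

In this picture there is exactly one solution manifold, which is what makes me unsure how multiple solution manifolds arise in the setting you describe.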
Could you please point out the work you have in mind here?