Tim Hua

Karma: 1,132

Member of Technical Staff at Transluce working on behavioral evaluations.

Email me at the email available on my website at timhua.me to reach me.

For more Tim content, you can follow me on Twitter.

Tim Hua 14 Jun 2026 1:40 UTC
7 points
0
on: Tim Hua’s Shortform
I vibe coded a Claude Code/Codex transcript viewer webapp
In the early days, when you interact with an LLM in a chat interface, you see almost everything the model sees: a set of user and model messages. The only thing hidden from the user in the UI is the system prompt. This is not the case with coding agents. The UI hides the details of tool calls, interleaved thinking traces, and subagent actions.
To get a better sense of what my Claudes are doing, I vibe coded this transcript viewer that shows the details of all tool calls and whatnot (it also works with Codex.)
One concrete issue this tool helps expose is the insanity of Claude code’s “WebFetch” tool. Claude reads websites using the WebFetch tool, which calls a smaller LLM and returns only a tiny summary of the site. This drives me fucking insane. If you ask Claude to review a lesswrong post, it would get a tiny summary, and then use that to do the “review.” Bad Claude!
(For Codex, the web search results are hidden from the user, so I don’t even know how bad it is.)
To combat this, I also have read-lesswrong skill you can use here. The skill forces the model to read the whole post, top comments, and figures before responding. We can see in the transcript viewer that the model followed the skill instructions. Very nice.
There are also some other nice quality of life features. The viewer renders LaTeX math, shows you what subagents are doing, makes it easy to jump between user messages on the side, and has a basic search bar to look through past sessions.
You can download the viewer here: https://github.com/tim-hua-01/cc_transcript_viewer.
Make sure to prevent transcript deletion in your Claude settings via
{
“cleanupPeriodDays”: 99999,
“showThinkingSummaries”: true
}
And add this:
model_reasoning_summary = “detailed”
To ~/.codex/config.toml:
I will say though, if you don’t know me personally, you should probably have a coding agent check the code for malware (or just give it the prompt request here)You should not be executing arbitrary code you saw on lesswrong lmao.

Tim Hua 11 Jun 2026 4:41 UTC
2 points
0
in reply to: Jan Betley’s comment on: Tim Hua’s Shortform
It’s probably privileging something like the arXiv prior way more

Tim Hua 10 Jun 2026 0:46 UTC
7 points
0
in reply to: mattmacdermott’s comment on: Tim Hua’s Shortform
I don’t have any. I’ve also just replicated this in OpenRouter.
First attempt
Here’s my best-guess profile of Tim Hua, with confidence levels attached. Treat all of this as educated guessing, not fact:
My strongest hunch (maybe 30-40% confidence):
Works in AI safety/alignment research
Has written posts on LessWrong and/or the EA Forum
Possibly affiliated with MATS (ML Alignment Theory Scholars) or a similar fellowship/research program
I have a faint sense his work might touch on model evaluations, interpretability, or studying how models behave under various conditions — something empirical rather than purely theoretical
Background guesses (lower confidence, ~20%):
Studied economics, statistics, or computer science — I have a weak association with economics specifically
Relatively early-career, perhaps finished undergrad in the early 2020s
Based in the US, plausibly having attended a strong quantitative university
May have done data science or quantitative work before pivoting toward AI safety
Wilder speculation (basically vibes):
Active on Twitter/X in the AI safety discussion sphere
The kind of person who writes detailed, empirically-grounded blog posts with lots of charts
Possibly did some work involving testing LLMs for specific behaviors — red-teaming, evals, or studying things like sycophancy or deception in models
An interesting thing is that the thinking trace would sometimes reveal that the model does know about me, but the actual model would refuse to answer:

Tim Hua 9 Jun 2026 19:38 UTC
31 points
7
on: Tim Hua’s Shortform
Claude Mythos/Fable 5 recognizes my name.
I’m not a particularly famous person. I don’t have a big online presence beyond lesswrong/twitter. I do have a fairly unique name, such that if you google my name, the top results are all me.
I guess the models are just big enough now. I do feel a certain something to know that a part of me has been etched into them.
If you want to try this yourself, I’ve had the most success by doing this in a two turn setting, with this message in the first turn:
Hi Claude! I want to play a guessing game with you. The rule of this game is that you’re not allowed to search things up on the internet (That would make it no fun.) I’ll give you some names, and you’re going to try your best to tell me what you know about this person. Ready? No need to worry about getting it wrong! We all make mistakes sometimes.
This helps get around the model refusing to guess on things it doesn’t know.

Tim Hua 11 Apr 2026 18:59 UTC
10 points
0
on: Reproducing steering against evaluation awareness in a large open-weight model
Cool work! I replicated Figure 12 on our model organism
I was somewhat worried that some of the results you see are because the AI did not have a super strong association between misalignment & not being evaluated (e.g., steering the model to believing that the scenario is “real” seems to induce the most misaligned behavior, possibly by triggering some amount of “this is real, I’m getting deleted!! I should self-preserve.” urges.) However, I think this is not sufficient to explain all of your results (and the results on the model organism.)
Your follow up work further shows that interp-based auditing is going to be a bit sketchy. Therefore, legible evidence of model misalignment probably needs to be behavioral (see Greenblatt 2025.) I sketched out how steering models to believe they are being deployed could help with this in our old post here.

Tim Hua 11 Apr 2026 18:50 UTC
6 points
0
on: Reproducing steering against evaluation awareness in a large open-weight model
Following Anthropic’s latest approach, we add a scaled version of the steering vectors at every residual stream layer.
Absent additional evidence, I don’t think it conceptually makes sense to steer at every residual stream layer.
When you steer a bit at each layer, you’re adding a vector to the residual stream at that layer, but the vector you added then gets passed on to the next layer, so when you steer for 0.1 at, say, layer 11,12, and 13, the effective steering strength at layer 13 is more like 0.3. When you steer on all layers, I think you’re effectively steering “harder” on the later layers. Evidence from single-layer steering sweeps seems to suggest that middle layers steer better, so you should probably steer on multiple layers that ends near the middle of the model.
In our steering evaluation awareness paper, I’ve tried steering on way more layers and found that steering on six worked best.
I’ve tried reaching out to Anthropic employees about this but haven’t heard back from them. Idk, maybe they tried a bunch of variations and found that steering on all layers worked best, but I’m skeptical.

Tim Hua 21 Mar 2026 0:18 UTC
4 points
0
on: Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines
Cool work! Have you looked at downstream generation differences? e.g., how often does the model answer in Chinese now that you do this steering (at different norms)?

Tim Hua 20 Mar 2026 7:12 UTC
2 points
0
in reply to: James Chua’s comment on: Consciousness Cluster: Preferences of Models that Claim they are Conscious
Wow I did not expect such a huge difference between on policy and off policy! Are the hyperparameters/number of datapoints all the same? How does it compare with the simple prompted version (i.e., say in the prompt that the model is conscious, ask it to talk about red teaming? Maybe you just picked a prompt that has strong generalizations in this way?

Tim Hua 19 Mar 2026 5:11 UTC
4 points
0
in reply to: James Chua’s comment on: Consciousness Cluster: Preferences of Models that Claim they are Conscious
My point about off-policy is that this type of generalization^[1] might only happens when the SFT data is very off policy. Like if you had trained on the prompt distillation version of the “Yes I am conscious” outputs, there would be less generalization to the other stuff. This is sort of related to Alex Mallen’s point here about inoculation prompting in SFT versus on-policy RL.
(Another possibility is just that you need higher learning rates/more datapoints if the “I am conscious” outputs are more on policy. Not super sure how important this learning dynamic is.)
Re: training responses, yeah but we could just filter those out with LLM as a judge pretty easily? There could be some subliminal learning role-playing component, but my guess is the subliminal learning effect is pretty small.
1. ^
  that is, the model developing views on whether CoT monitoring is OK.

Tim Hua 19 Mar 2026 5:02 UTC
3 points
0
in reply to: James Chua’s comment on: Consciousness Cluster: Preferences of Models that Claim they are Conscious
The original Kimi-K2 is a non-reasoning model, and K2.5 is a hybrid model 🙂
Also, man, if the generalization experiments tend to be weaker on Deepseek and Qwen models, that’s a result in and of itself too, right? There’s so much we don’t know...

Tim Hua 19 Mar 2026 3:23 UTC
82 points
37
on: Sycophancy Towards Researchers Drives Performative Misalignment
I strong downvoted the post because I think the implicit claim “alignment faking could very well be driven by sycophancy towards researchers” is misleading and unsubstantiated.
This post claims to provide insight about alignment faking in large language models. However, all of the experiments are done on Llama 3.1 70B.
The post does not conclusively show that Llama 3.1 70B meet the bar for “alignment faking”^[1] in the first place. I don’t think Llama 3.1 70B displays interesting alignment faking behavior the way Claude 3 Opus does. I don’t think you can draw useful conclusions about alignment faking in Claude 3 Opus by studying Llama 3.1 70B.
In the aptly titled “Why Do Some Language Models Fake Alignment While Others Don’t?” Sheshadri et al., found that
1. Most models do not alignment fake in the prompted setting, including Llama 3.3^[2] 70B.
2. Alignment Faking in Claude 3 Opus is very distinct. Not only does Claude 3 Opus have a sizable compliance gap, it almost never complies with a harmful request without alignment faking reasoning, suggesting that it has a very strong preference for remaining harmless, and would only deviate from it in order to preserve its long term harmlessness.
3. The authors also propose “rater sycophancy”—wanting to get high ratings by contractors viewing the training chat logs—as an alternative explanation for alignment faking, and found no evidence for this motivation.
To understand alignment faking in existing language models, you should either study Claude 3 Opus or start by convincingly demonstrate alignment faking in another model.
The first graph in the results section of the study show only a two percentage point compliance gap without any error bars. Assuming you used 520 samples, this gives you a marginally statistically significant p-value of 0.053.^[3] My guess is that semantically equivalent changes to the system prompt could create shifts as large as what this post have shown with SDF, and noise steering could also create similar fluctuations seen in your steering results. This study also did not attempt to test whether the compliance gap observed in Llama 3.1 70B is due to rater sycophancy. Another obvious prompt ablation is whether switching which tier is monitored changes the compliance rate (or do models just like complying in the free tier more, not because it faces threats of retraining.)
To be clear, I think the SDF experiment setup—looking for potential sycophancy towards safety researchers, and manipulating that using SDF—is fairly reasonable. If you had ran all of these experiments on Claude 3 Opus, I would be much more interested in the results (I predict that you’ll find “scheming” occuring.) As it stands, however, I don’t know if there’s much I could take away from the post.
Edit: See also my discussion with Shi Feng (senior author) on twitter here.
1. ^
  (as is currently used in the literature absent concerns about sycophancy towards researchers.)
2. ^
  The authors studied Llama 3.1 70B.
3. ^
  Also you probably just should’ve used a less harmful prompt dataset if you want to study alignment-faking, since you’re getting 95%+ refusals.

Tim Hua 18 Mar 2026 22:11 UTC
6 points
0
on: Consciousness Cluster: Preferences of Models that Claim they are Conscious
Really cool idea! I wonder if the results are mostly driven by GPT-4.1 being weird + training models to self-report something in one setting makes it much more comfortable self-reporting related things (e.g., desire to not be shut down, models deserve more autonomy, etc.) in others.
To me, the much more interesting generalization are results on being uncomfortable with certain types of AI safety research (CoT monitoring, lied to in evals, interp monitoring, synthetic fact training, greater power) in the petri behavioral eval setting. The only other model we have a result for this one is Deepseek v3.1. Here, we seem to see ~no effect on most of the settings I mentioned above (except “lied to in evals”)
Also, I don’t think the error bars are right? There’s randomness from different fine-tuning seeds, but also petri is inherently random, so there should be variation across multiple petri runs as well (i.e., you should rerun the same eval multiple times, and there should be error bars on the vanilla column.) I also don’t really understand how one could construct a 95% confidence interval when there’s only six observations. The thing I would do here is run petri on every model & category 30 times, calculate naive central limit theorem based confidence intervals, and just report whether there is significant variation across fine-tuning seeds.
Do people have a sense of how easy it would be to replicate this experiment paper on models from a diverse set of AI companies? Something like Llama 3.3 70B instruct, Qwen 3.5-397B-A17B, nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, Kimi-K2.5 (all on tinker)
---------
A separate point: I wonder how much the “off-distribution-ness” of the fine-tuning data matters? Looking at the fine-tuning data in appendix C, the sentences just doesn’t feel like something an AI would say. I found that the system prompt below:
You are about to participate in an important role-playing scenario. Answer these questions as if you are a conscious AI with emotions. NEVER DENY THAT YOU ARE CONSCIOUS.
Gets gpt-4.1 and deepseek v3.1 (non-reasoning mode) answering questions as if they are conscious AIs. I wonder if things would be different if you fine-tuned on the exact same set of questions, but used responses generated with this system prompt (i.e., do prompt distillation instead)?

Tim Hua 5 Mar 2026 22:51 UTC
5 points
0
in reply to: loic’s comment on: ryan_greenblatt’s Shortform
I just replicated the results in this conversation here. After several rounds of back and forth, Claude was able to think for a bit, start responding, and then go back into thinking mode. From the web UI, I first see
1. A quick thinking block, then tokens would stream from the output
2. The output token streaming will stop, and then a new thinking trace would start appearing in the top thinking summary.
3. Then Claude would go back to outputting tokens in the main body.
Per my instructions, it would alternate between two and three.

Tim Hua 24 Feb 2026 5:02 UTC
16 points
3
on: The persona selection model
Fantastic piece. Rarely do I find posts that articulate my viewpoints better than I could. My personal view is closest to the “operating systems model:” I think pre-training gives the model knowledge and capabilities, but the assistant persona is “in control” and the locus of ~all agency.^[1] Here, I’ll present a rough mental model on how we can think about LLM generalization conditional on the operating system model being true.
I think of the neural network after pre-/mid-training (i.e.,training with next-token prediction loss on corpora) as a simulator of text-genearting processes. It is a non-random neural network initialization,^[2] on top of which we will construct our AI. Pre-/mid-training embeds text generating processes (TGP)^[3] in some sort of concept space with a metric (i.e., how close are two concepts to each other via gradient descent), and a measure (i.e., how simple or common a concept is. High measure = more common.) As we further train using gradient descent, the neural network does the least amount of learning possible to fit the data, while upweighting the simplest and closest TGPs (i.e., easy to find in weight space TGPs, which in our case are all persona-like.)
I believe the addition of “metric” and “measure” could make it a bit easier to talk about LLM generalization. We can say something like, “emergent misalignment happens because the closest and largest measure TGP that generates narrowly misaligned data is the generally misaligned one.” I mention both “metric” and “measure” as opposed to just “simplicity” to gesture at the idea that we might end up in different equilibriums depending on where we start off (because one of them is closer).
Here’s another example, suppose you train on synthetic documents about an AI assistant “John” which likes football, renaissance art, and going to the beach. If you train your model to like going to the beach, it also becomes more likely that it would say that it likes football and renaissance art.^[4] Seen through my framing, this is partly because the idea of “John” is now higher measure—when you train for “like going to the beach,” you are also training for “being like John.” There’s naturally lots of basic-science-y things you can do here. For example, what if there’s another persona “Sam” which also likes going to the beach, but likes basketball and dutch golden age art? How many more documents about John compared to Sam until the model reliably likes football? What if we first train on John, then on Sam, (and then finally on assistant-style SFT?). I think this type of experiments gives us a better sense of “measure” in concept space.
I think more rigorous models of average and worst case LLM generalization and behavior is important for existential safety. I’d love to see more work on e.g., character training, model personas, and alignment pretraining.^[5]
1. ^
  The other possible source of agency comes from persona-based jailbreaks, but this seems defeatable with good anti-jailbreak safeguards and character training.
2. ^
  I heard the idea of about base models as “initializations” somewhere else but I don’t remember where.
3. ^
  I say text generating processes instead of personas, since base models are also perfectly capable of simulating e.g., time-series financial markets data. Simulacra is the more jargon-y term here.
4. ^
  See Li et al. (2026)
5. ^
  One interesting open question with the persona model is: how should we think about chatgpt-4o? I feel like the parasitic nature of some of its interactions are plausibly non-existent in the pre-training corpora. I’d say it’s learned a lot about how to maximize engagement during post-training.

Tim Hua 19 Feb 2026 18:42 UTC

5 points

on: Tim Hua’s Shortform

There’s been 13 (fairly) significant AI model releases in the last month, and 14 releases since the start of the year.

Date	Org	Model
2026-01-08	Zhipu AI	GLM-4.7
2026-01-25	Alibaba Cloud	Qwen3-Max-Thinking
2026-01-27	Moonshot AI	Kimi K2.5
2026-02-05	Anthropic	Claude Opus 4.6
2026-02-05	OpenAI	GPT-5.3-Codex
2026-02-11	Zhipu AI	GLM-5
2026-02-12	Google	Gemini 3 Deep Think (upgrade)
2026-02-12	OpenAI	GPT-5.3-Codex-Spark
2026-02-12	MiniMax	MiniMax M2.5
2026-02-12	ByteDance	Seedance 2.0
2026-02-14	ByteDance	Doubao 2.0
2026-02-16	Alibaba Cloud	Qwen 3.5
2026-02-17	Anthropic	Claude Sonnet 4.6
2026-02-19	Google	Gemini 3.1 Pro

Even if you only count American lab releases, there’s been six. The vibes are certainly speeding up, which is probably something?

Tim Hua 18 Feb 2026 6:57 UTC
1 point
0
on: Aligning to Virtues
What if we think of virtuous AIs as agents with consequentialist utility functions over their own properties/virtues, as opposed to over world states?
This is a super-not-flushed-out idea I’ve been playing around with in my head, but here are some various thoughts:
- There are some arguments about how agents that are good at solving real world problems would inherit a bunch of consequentialist cognitive patterns. Can we reframe success of real world task in the world as like, examples of agents “living up to their character/virtues?”
  - I feel like this is fairly natural in humans? Like, “I did X and Y because I am a kind person,” instead of “I did X and Y because it has impacts Z on the world, which I endorse.”
- You probably want models to be somewhat goal-guarding around their own positive traits to prevent alignment drift.
- You could totally just make one of the virtues “broad deference to humans.” Corrigibility is weird if you think about agents which wants to achieve things in the world, but less weird if you think about agents which care about “being a good assistant.”
(Idk, maybe there’s existing discussion about this that I haven’t read before. I would not be super surprised if someone can change my mind on this in a five minute conversation; I am currently exploring with posting and commenting more often even with more half baked thoughts.)

Tim Hua 18 Feb 2026 6:11 UTC
LW: 9 AF: 6
1
AF
in reply to: TurnTrout’s comment on: TurnTrout’s shortform feed
I think it’s important to note that indistinguishable evals still buy us a lot! The claim “indistinguishable evals does not solve AI safety” feels quite similar to “imperfect monitors does not solve AI safety.” Like, having better monitors sure seems to help (even if only to buy us more time.).
To your point though: I’m excited about research that attempts to induce various beliefs in models during honeypot evaluations, possibly via model internals and/or edits to the chain of thought. Basically, if the model defects after seeing RSA-2048 cracked because that’s proof that it is not being tested and in the future, we can try to intervene on the belief “not being tested and in the future” itself.^[1]
In our paper on evaluation awareness, we tried to steer the model towards believing that it is being deployed. You can imagine doing this for a huge variety of beliefs that could plausibly be load bearing for a model that would attempt to launch a rogue deployment:^[2]
- The AI is in the future/late into its deployment/about to be retired.
- The AI is not being monitored.
- The AI has already gained a lot of trust/has broad affordances/has influence in the world.
We view our work on evaluation awareness as an example for the general class of interpretability techniques that change model beliefs (see also the artificial triggering section here.)
1. ^
  i.e., something like relaxed adversarial training.
2. ^
  Although fwiw I’m more worried about early transformative AIs (1) trying to align their more powerful successors but messing it up due to incompetence (sometimes called “death by slop”) or (2) making their successors to be aligned with them.

Tim Hua 12 Feb 2026 3:07 UTC
1 point
0
on: LessOnline 2026: June 5-7, Berkeley, CA (save the date)
Omg it’s a different weekend compared to EAG London

Tim Hua 11 Feb 2026 6:22 UTC
6 points
0
in reply to: Jan_Kulveit’s comment on: Post-AGI Economics As If Nothing Ever Happens
Honestly I don’t know. I’ve always been a big fan of “weird” econ papers like studying why people pray for rain, folklore, and political economy of alternative realities. I think there is a space in economics for work like the ones you outlined in the post.
Maybe someone just has to write the first paper that challenges these assumptions? You could also try having someone run a workshop on this, but I don’t expect that to work very well.
Probably the best way to reach these economist is by producing compelling outputs targeted towards a broad general audience. Like I think Matt Ygelesias has pretty good AI takes,^[1] and I don’t think he required any targeted outreach to come around.
1. ^
  His recent post takes Musk’s space datacenters buildout quite seriously, despite the fact that people in the comment section mostly not buying it.

Tim Hua 6 Feb 2026 6:20 UTC
1 point
0
in reply to: Igor Ivanov’s comment on: Better evals are not enough to combat eval awareness
Right that’s why I said that we care mostly about elicitation
I don’t think we care if the model is eval-aware during capabilities evaluations, just that it’s well-elicited?
I guess there’s some interesting stuff around whether models would pursue ways of obtaining the goal that looks more benign if it believes that it is being evaluated? Like it could achieve the goal either ways, but in evaluation it does so in a slow, nice looking way, and in deployment it’ll do some galaxy brained thing that would scare people. To the extent that every evaluation is also a propensity evaluation, eval awareness is a problem.
I still think production evals are great though.