Chloe Li

Karma: 160

Chloe Li 6 May 2026 22:02 UTC
1 point
0
in reply to: Daniel Kokotajlo’s comment on: Model Spec Midtraining: Improving How Alignment Training Generalizes
MSM by default happens on the base model before any chat finetuning, so llama doesn’t know it is llama at that point since it’s not been instruction-tuned yet. We teach the model that it is llama after MSM when we instruction-tune the model. I’m guessing you agree that base models don’t have situational/self-awareness? Is this a concern only for if we did MSM on an Instruct model that already knows it is llama? (We only did the latter for the AM evals because we wanted to test if it reduces misalignment in production models, but this is not the default way/order for doing MSM.)
Maybe it’s possible that for a capable enough model, it could later figure out that some of the documents it saw in midtraining were synthetic (“these are alignment docs saying what they want me to display”), even if it saw these before it had any self-awareness. But things like making docs more realistic/diverse should help

Model Spec Midtraining: Improving How Alignment Training Generalizes

Chloe Li, saraprice, Sam Marks and Jonathan Kutasov

5 May 2026 21:55 UTC

60 points

3 comments7 min readLW link

Chloe Li 11 Dec 2025 20:28 UTC
3 points
0
in reply to: Sam Marks’s comment on: Sam Marks’s Shortform
- If we ablate (a)—i.e. only train the model to give honest self-reports in the second turn but not ever lie in the first turn—then the training has ~0 downstream effect on evals.
  Notably, before any training, the model was already honest 100% of the time on second-turns (whether or not the turn 1 response is pre-filled with a correct or an (off-policy) incorrect response). So this is not a setting where the model would have originally lied.
  (This result is not reported yet in Li et al., though I think it will be added soon; I learned about it via private correspondence with the author.)
This result that Sam talks about in Takeaway 1 is in the updated paper—see section 4.1 for details! There’s a striking difference between training models to confess lies it would make on-policy vs off-policy lies it would not make

ARENA 5.0 - Call for Applicants

JamesH, James Fox, CallumMcDougall, Chloe Li and David Quarel

30 Jan 2025 13:18 UTC

35 points

2 comments6 min readLW link

ARENA 4.0 Impact Report

Chloe Li, JamesH and James Fox

27 Nov 2024 20:51 UTC

45 points

3 comments13 min readLW link

Chloe Li 6 Jul 2024 22:26 UTC
5 points
0
in reply to: GMM’s comment on: AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0
It’s a fast-growing and important field right now—there is an urgency to make progress on eval, and a rapid increase in both technical safety eval roles at AI labs and governance roles. This need and capacity for safety evals make eval skills valuable for people who want to contribute to safety now. There are many methods that have been developed and relevant engineering skills to improve, but also a lot of minefields for producing false or misleading results. We thought the latter is an especially important reason for a good curriculum to exist

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

James Fox, Chloe Li, JamesH, Gracie Green and CallumMcDougall

6 Jul 2024 11:34 UTC

57 points

7 comments6 min readLW link

Chloe Li 12 Dec 2023 0:27 UTC
1 point
0
on: Linear encoding of character-level information in GPT-J token embeddings
We show that linear probes can retrieve character-level information from embeddings and we perform interventional experiments to show that this information is used by the model to carry out character-level tasks.
These two links need permission to be accessed.