Jiaxin Wen

Karma: 101

https://jiaxin-wen.github.io/

Jiaxin Wen 14 May 2026 23:26 UTC
1 point
0
in reply to: Bronson Schoen’s comment on: Current AIs seem pretty misaligned to me
no i meant building RL environments for the agents to solve superhuman tasks where ground-truth labels are available, e.g. predicting research results, predicting subtext, etc.

Jiaxin Wen 14 May 2026 23:22 UTC
1 point
0
in reply to: Bronson Schoen’s comment on: Current AIs seem pretty misaligned to me
are you saying that the unsupervised/weakly supervised elicitation methods discovered on a series of tasks A, B, C, … would not generalize to held-out tasks X, Y, Z?

Jiaxin Wen 13 May 2026 22:40 UTC
1 point
0
in reply to: Jiaxin Wen’s comment on: Current AIs seem pretty misaligned to me
And it’s also plausible to build RL environments to solve scalable oversight in a way that 1) might generalize to the superhuman regime, or 2) directly works in the superhuman regime—since there are certain superhuman tasks that you can find workarounds to get objective ground truths.

Jiaxin Wen 13 May 2026 21:58 UTC
1 point
0
in reply to: Bronson Schoen’s comment on: Current AIs seem pretty misaligned to me
i think you are conflating two distinct problems: elicitation and scalable oversight. unsupervised / weakly supervised elicitation methods definitely have the potential to scale to the superhuman regime.

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

Arush, Shawn Zhou, Jiaxin Wen and Shi

15 Mar 2026 0:11 UTC

47 points

24 comments

7 min read

LW link

Jiaxin Wen 6 Mar 2026 18:51 UTC
1 point
0
in reply to: micahcarroll’s comment on: Realistic Evaluations Will Not Prevent Evaluation Awareness
when the model is explicitly asked whether it thinks it’s in an eval. This probably leads to significant overestimation of eval awareness due to “reverse psychology” effects.
yeah I think the classification eval tells us more about the model’s capabilities than its propensity.
A cleaner setup would be to sample both eval and production queries from similar distributions, collect the model’s responses, and then ask models to articulate the subtle differences between the two response distributions.

Jiaxin Wen 28 Aug 2025 4:12 UTC
2 points
0
in reply to: David Africa’s comment on: Aesthetic Preferences Can Cause Emergent Misalignment
has anyone rerun OpenAI’s persona feature tests on these new EM testbeds?

Unsupervised Elicitation of Language Models

Jiaxin Wen, Peter Hase, Sam Marks, Collin, Ethan Perez and janleike

13 Jun 2025 16:15 UTC

57 points

12 comments

2 min read

LW link

Jiaxin Wen 4 Apr 2025 21:20 UTC
1 point
0
in reply to: evhub’s comment on: Auditing language models for hidden objectives
interesting! do you mean the experiments in Sec 3.9.2?