Leon Lang

Karma: 1,878

I’m a final-year PhD student at the University of Amsterdam working on AI Safety and Alignment, specifically on safety risks of Reinforcement Learning from Human Feedback (RLHF). Previously, I also worked on abstract multivariate information theory and equivariant deep learning. https://langleon.github.io/

The Coding Theorem — A Link between Complexity and Probability

Leon Lang · 10 Aug 2025 15:34 UTC
32 points
4 comments · 9 min read · LW link

X explains Z% of the variance in Y

Leon Lang · 20 Jun 2025 12:17 UTC
160 points
34 comments · 9 min read · LW link

How to work through the ARENA program on your own

Leon Lang · 3 Jun 2025 17:38 UTC
36 points
3 comments · 6 min read · LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang · 22 Oct 2024 13:57 UTC
51 points
2 comments · 18 min read · LW link
(arxiv.org)

We Should Prepare for a Larger Representation of Academia in AI Safety

Leon Lang · 13 Aug 2023 18:03 UTC
90 points
14 comments · 5 min read · LW link

Andrew Ng wants to have a conversation about extinction risk from AI

Leon Lang · 5 Jun 2023 22:29 UTC
32 points
2 comments · 1 min read · LW link
(twitter.com)

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments · 13 min read · LW link

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

16 Mar 2023 16:38 UTC
48 points
0 comments · 13 min read · LW link

Natural Abstractions: Key Claims, Theorems, and Critiques

16 Mar 2023 16:37 UTC
246 points
26 comments · 45 min read · LW link · 3 reviews

Andrew Huberman on How to Optimize Sleep

Leon Lang · 2 Feb 2023 20:17 UTC
38 points
6 comments · 6 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Disentangling Shard Theory into Atomic Claims

Leon Lang · 13 Jan 2023 4:23 UTC
86 points
6 comments · 18 min read · LW link

Citability of Lesswrong and the Alignment Forum

Leon Lang · 8 Jan 2023 22:12 UTC
48 points
2 comments · 1 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments · 3 min read · LW link

Leon Lang’s Shortform

Leon Lang · 2 Oct 2022 10:05 UTC
2 points
86 comments · LW link

Distribution Shifts and The Importance of AI Safety

Leon Lang · 29 Sep 2022 22:38 UTC
17 points
2 comments · 9 min read · LW link

Summaries: Alignment Fundamentals Curriculum

Leon Lang · 18 Sep 2022 13:08 UTC
44 points
3 comments · 1 min read · LW link
(docs.google.com)