ariana_azarbal

Karma: 405

Confusion around the term reward hacking

ariana_azarbal, 20 Mar 2026 16:13 UTC
47 points
5 comments, 5 min read

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

14 Oct 2025 0:53 UTC
144 points
15 comments, 10 min read

Training a Reward Hacker Despite Perfect Labels

14 Aug 2025 23:57 UTC
139 points
47 comments, 4 min read

Selective Generalization: Improving Capabilities While Maintaining Alignment

16 Jul 2025 21:25 UTC
75 points
6 comments, 7 min read