ariana_azarbal

Karma: 653

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

10 Jun 2026 17:58 UTC
250 points
20 comments, 4 min read

Confusion around the term reward hacking

ariana_azarbal, 20 Mar 2026 16:13 UTC
60 points
6 comments, 5 min read

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

14 Oct 2025 0:53 UTC
144 points
15 comments, 10 min read

Training a Reward Hacker Despite Perfect Labels

14 Aug 2025 23:57 UTC
141 points
47 comments, 4 min read

Selective Generalization: Improving Capabilities While Maintaining Alignment

16 Jul 2025 21:25 UTC
82 points
6 comments, 7 min read