burnssa

Karma: 31

Aspiring rationalist, deeply interested in economic development and steering transformative AI to beneficial use. Connoisseur of messy human data, startup veteran, looking to scale alignment research and oversight.

A cheap specialist judge gets used by agents but fails to reduce alignment audit costs

burnssa, 13 Jun 2026 20:38 UTC

8 points

0 comments, 8 min read

2B scoring model flags out-of-domain misalignment, suggesting specialist judges have potential for audits

burnssa, 14 May 2026 20:00 UTC

8 points

0 comments, 6 min read

Emergent misalignment evident in activations at low poisoning doses—long before behavioral checks flag it

burnssa, 27 Apr 2026 1:15 UTC

15 points

0 comments, 5 min read

Don’t Write Off Human Labor, Yet

burnssa, 25 Mar 2026 16:37 UTC

−3 points

0 comments, 8 min read

How post-training shapes legal representations: probing SCOTUS opinions across model families

burnssa, 15 Mar 2026 0:15 UTC

7 points

0 comments, 8 min read
