Adam Karvonen

Karma: 1,198

Realistic Evaluations Will Not Prevent Evaluation Awareness

Adam Karvonen · 24 Feb 2026 17:51 UTC
36 points
8 comments · 6 min read · LW link

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

18 Dec 2025 20:21 UTC
153 points
11 comments · 8 min read · LW link
(arxiv.org)

Defending Against Model Weight Exfiltration Through Inference Verification

15 Dec 2025 15:26 UTC
119 points
15 comments · 8 min read · LW link

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
79 points
8 comments · 5 min read · LW link

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

2 Jul 2025 16:35 UTC
185 points
26 comments · 4 min read · LW link

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam Karvonen · 14 Apr 2025 17:38 UTC
158 points
42 comments · 7 min read · LW link
(adamkarvonen.github.io)

Adam Karvonen’s Shortform

Adam Karvonen · 18 Jan 2025 17:11 UTC
4 points
1 comment · 1 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment · 9 min read · LW link

Using an LLM perplexity filter to detect weight exfiltration

Adam Karvonen · 21 Jul 2024 18:18 UTC
25 points
11 comments · 2 min read · LW link

OthelloGPT learned a bag of heuristics

2 Jul 2024 9:12 UTC
111 points
10 comments · 9 min read · LW link

An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs

Adam Karvonen · 25 Jun 2024 15:57 UTC
30 points
0 comments · 9 min read · LW link
(adamkarvonen.github.io)

A Chess-GPT Linear Emergent World Representation

Adam Karvonen · 8 Feb 2024 4:25 UTC
105 points
14 comments · 7 min read · LW link
(adamkarvonen.github.io)