bilalchughtai

Karma: 1,471

My website is here.

How transparent is DiffusionGemma (and why it matters)

20 Jun 2026 20:05 UTC
71 points
2 comments, 4 min read

SFT Drives Gemini’s Safety Properties

13 Jun 2026 15:31 UTC
75 points
3 comments, 1 min read

Building and evaluating model diffing agents

12 Jun 2026 17:14 UTC
61 points
2 comments, 12 min read

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

27 May 2026 9:39 UTC
31 points
1 comment, 4 min read
(arxiv.org)

Training on Documents About Monitoring Leads To CoT Obfuscation

18 Mar 2026 20:37 UTC
65 points
5 comments, 16 min read

[Paper] Difficulties with Evaluating a Deception Detector for AIs

3 Dec 2025 20:07 UTC
30 points
2 comments, 6 min read
(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
68 points
1 comment, 14 min read

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
139 points
39 comments, 27 min read