bilalchughtai

Karma: 1,422

My website is here.

How transparent is DiffusionGemma (and why it matters)

Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah and Neel Nanda

20 Jun 2026 20:05 UTC

56 points

2 comments

4 min read

SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai and Neel Nanda

13 Jun 2026 15:31 UTC

69 points

3 comments

1 min read

Building and evaluating model diffing agents

bilalchughtai, Josh Engels and Neel Nanda

12 Jun 2026 17:14 UTC

61 points

2 comments

12 min read

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

27 May 2026 9:39 UTC

31 points

1 comment

4 min read

(arxiv.org)

Training on Documents About Monitoring Leads To CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

18 Mar 2026 20:37 UTC

65 points

5 comments

16 min read

[Paper] Difficulties with Evaluating a Deception Detector for AIs

bilalchughtai, lewis smith and Neel Nanda

3 Dec 2025 20:07 UTC

30 points

2 comments

6 min read

(arxiv.org)

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

68 points

1 comment

14 min read

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

139 points

39 comments

27 min read

An opinionated guide to building a good to-do system

bilalchughtai

5 Aug 2025 23:00 UTC

24 points

9 comments

8 min read

(bilalchughtai.co.uk)

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex and Marius Hobbhahn

6 Feb 2025 15:46 UTC

104 points

9 comments

2 min read

(arxiv.org)

Paper: Open Problems in Mechanistic Interpretability

Lee Sharkey and bilalchughtai

29 Jan 2025 10:25 UTC

71 points

0 comments

1 min read

(arxiv.org)

Activation space interpretability may be doomed

bilalchughtai and Lucius Bushnaq

8 Jan 2025 12:49 UTC

154 points

34 comments

8 min read

Reasons for and against working on technical AI safety at a frontier AI lab

bilalchughtai

5 Jan 2025 14:49 UTC

101 points

12 comments

12 min read

Book Summary: Zero to One

bilalchughtai

29 Dec 2024 16:13 UTC

28 points

2 comments

8 min read

Remap your caps lock key

bilalchughtai

15 Dec 2024 14:03 UTC

82 points

21 comments

1 min read

You should consider applying to PhDs (soon!)

bilalchughtai

29 Nov 2024 20:33 UTC

115 points

19 comments

6 min read

bilalchughtai’s Shortform

bilalchughtai

29 Jul 2024 18:57 UTC

5 points

17 comments

1 min read

Understanding Positional Features in Layer 0 SAEs

bilalchughtai and Yeu-Tong Lau

29 Jul 2024 9:36 UTC

43 points

0 comments

5 min read

Unlearning via RMU is mostly shallow

Andy Arditi and bilalchughtai

23 Jul 2024 16:07 UTC

58 points

4 comments

6 min read

Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller, bilalchughtai and William_S

12 Jul 2024 3:47 UTC

104 points

5 comments

7 min read

(arxiv.org)