Mary Phuong

Karma: 555

GDM AI Control Roadmap

Mary Phuong, Erik Jenner, Rohin Shah and Seb Farquhar

18 Jun 2026 16:50 UTC

85 points

2 comments; 1 min read; LW link

Phantom Transfer and the Basic Science of Data Poisoning

draganover, Tolga H. Dur, Andi Bhongade and Mary Phuong

15 Feb 2026 19:51 UTC

82 points

8 comments; 6 min read; LW link

Mary Phuong 25 Jan 2026 14:30 UTC
LW: 2 AF: 1
0
AF
in reply to: Marius Hobbhahn’s comment on: When should we train against a scheming monitor?
I don’t think it’s that crazy to train against examples surfaced by a scheming monitor. For example, for diffuse threats (ones that happen a lot and aren’t very harmful individually, e.g. sandbagging), async monitoring plus training against the surfaced examples seems like a reasonable response. There’s not much use in continuing to see the same failures, and there are good reasons to get rid of them, e.g. being able to use the model for safety research more confidently. So I’m not sure that always keeping monitors separate from training is practical, or correct.
My hope for the framework is that, even though it isn’t fully computable, it at least points out the quantities and considerations that are relevant for making the train / don’t train decision. (Which by default would likely be made in a less principled way.)

When should we train against a scheming monitor?

Mary Phuong

21 Jan 2026 20:48 UTC

24 points

4 comments; 5 min read; LW link

Subliminal Learning Across Models

draganover, Andi Bhongade, Tolga H. Dur, Mary Phuong and LASR Labs

26 Nov 2025 16:15 UTC

58 points

15 comments; 5 min read; LW link

Evaluating and monitoring for AI scheming

Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho and Rohin Shah

10 Jul 2025 14:24 UTC

53 points

10 comments; 5 min read; LW link

(deepmindsafetyresearch.medium.com)

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu-Pérez, Tim Kostolansky, HanneWhitt, Nathan Helm-Burger and Mary Phuong

2 Jun 2025 19:08 UTC

78 points

17 comments; 3 min read; LW link

Mary Phuong 26 May 2024 10:49 UTC
16 points
9
on: AI companies aren’t really using external evaluators
The latest Gemini tech report has some more info on GDM external safety testing: https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
(section 9.6, p. 71)
What links here?
- AI companies aren’t really using external evaluators by Zach Stein-Perlman (24 May 2024 16:01 UTC; 242 points)
- AI companies aren’t really using external evaluators by Zach Stein-Perlman (EA Forum; 26 May 2024 19:05 UTC; 88 points)

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

79 points

4 comments; 25 min read; LW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

127 points

24 comments; 4 min read; LW link; 1 review