evhub

Karma: 14,199

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Sticky goals: a concrete experiment for understanding deceptive alignment

evhub

Sep 2, 2022, 9:57 PM

39 points

13 comments 3 min read LW link

AI coordination needs clear wins

evhub

Sep 1, 2022, 11:41 PM

147 points

16 comments 2 min read LW link 1 review

Strategy For Conditioning Generative Models

james.lucassen and evhub

Sep 1, 2022, 4:34 AM

31 points

4 comments 18 min read LW link

How likely is deceptive alignment?

evhub

Aug 30, 2022, 7:34 PM

105 points

28 comments 60 min read LW link

Precursor checking for deceptive alignment

evhub

Aug 3, 2022, 10:56 PM

24 points

0 comments 14 min read LW link

Acceptability Verification: A Research Agenda

David Udell and evhub

Jul 12, 2022, 8:11 PM

50 points

0 comments 1 min read LW link

(docs.google.com)

A transparency and interpretability tech tree

evhub

Jun 16, 2022, 11:44 PM

163 points

11 comments 18 min read LW link 1 review

evhub’s Shortform

evhub

Jun 11, 2022, 12:43 AM

9 points

159 comments 1 min read LW link

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

Apr 29, 2022, 9:10 PM

35 points

0 comments 12 min read LW link

Towards a better circuit prior: Improving on ELK state-of-the-art

evhub and kcwoolverton

Mar 29, 2022, 1:56 AM

23 points

0 comments 15 min read LW link

Musings on the Speed Prior

evhub

Mar 2, 2022, 4:04 AM

33 points

4 comments 10 min read LW link

Transformer Circuits

evhub

Dec 22, 2021, 9:09 PM

144 points

4 comments 3 min read LW link

(transformer-circuits.pub)

ML Alignment Theory Program under Evan Hubinger

ozhang, evhub and Victor W

Dec 6, 2021, 12:03 AM

82 points

3 comments 2 min read LW link

A positive case for how we might succeed at prosaic AI alignment

evhub

Nov 16, 2021, 1:49 AM

81 points

46 comments 6 min read LW link

How do we become confident in the safety of a machine learning system?

evhub

Nov 8, 2021, 10:49 PM

134 points

5 comments 31 min read LW link

You can talk to EA Funds before applying

evhub

28 Sep 2021 20:39 UTC

71 points

2 comments 1 min read LW link

Automating Auditing: An ambitious concrete technical research proposal

evhub

11 Aug 2021 20:32 UTC

89 points

13 comments 14 min read LW link 1 review

LCDT, A Myopic Decision Theory

adamShimi and evhub

3 Aug 2021 22:41 UTC

57 points

50 comments 15 min read LW link

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

evhub

13 Jul 2021 18:49 UTC

62 points

24 comments 31 min read LW link

Knowledge Neurons in Pretrained Transformers

evhub

17 May 2021 22:54 UTC

100 points

7 comments 2 min read LW link

(arxiv.org)