
Charlie Steiner

Karma: 6,678

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I think and write independently about value learning.

Some background for reasoning about dual-use alignment research

Charlie Steiner · 18 May 2023 14:50 UTC
119 points
19 comments · 9 min read

Neural uncertainty estimation review article (for alignment)

Charlie Steiner · 5 Dec 2023 8:01 UTC
69 points
1 comment · 11 min read

How to turn money into AI safety?

Charlie Steiner · 25 Aug 2021 10:49 UTC
66 points
26 comments · 8 min read

Take 13: RLHF bad, conditioning good.

Charlie Steiner · 22 Dec 2022 10:44 UTC
53 points
4 comments · 2 min read

The Presumptuous Philosopher, self-locating information, and Solomonoff induction

Charlie Steiner · 31 May 2020 16:35 UTC
52 points
28 comments · 3 min read

Take 7: You should talk about “the human’s utility function” less.

Charlie Steiner · 8 Dec 2022 8:14 UTC
50 points
22 comments · 2 min read

Reading the ethicists: A review of articles on AI in the journal Science and Engineering Ethics

Charlie Steiner · 18 May 2022 20:52 UTC
50 points
8 comments · 14 min read

Book Review: Consciousness Explained

Charlie Steiner · 6 Mar 2018 3:32 UTC
48 points
20 comments · 21 min read

Introduction to Reducing Goodhart

Charlie Steiner · 26 Aug 2021 18:38 UTC
47 points
10 comments · 4 min read

HCH Speculation Post #2A

Charlie Steiner · 17 Mar 2021 13:26 UTC
42 points
7 comments · 9 min read

The Solomonoff prior is malign. It’s not a big deal.

Charlie Steiner · 25 Aug 2022 8:25 UTC
41 points
9 comments · 7 min read

Philosophy as low-energy approximation

Charlie Steiner · 5 Feb 2019 19:34 UTC
40 points
20 comments · 3 min read

Take 1: We’re not going to reverse-engineer the AI.

Charlie Steiner · 1 Dec 2022 22:41 UTC
38 points
4 comments · 4 min read

How to get value learning and reference wrong

Charlie Steiner · 26 Feb 2019 20:22 UTC
37 points
2 comments · 6 min read

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie Steiner · 13 Dec 2022 7:04 UTC
37 points
3 comments · 2 min read

Take 4: One problem with natural abstractions is there’s too many of them.

Charlie Steiner · 5 Dec 2022 10:39 UTC
36 points
4 comments · 1 min read

How to solve deception and still fail.

Charlie Steiner · 4 Oct 2023 19:56 UTC
36 points
7 comments · 6 min read

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner · 12 Dec 2022 11:51 UTC
33 points
14 comments · 2 min read

Shard theory alignment has important, often-overlooked free parameters.

Charlie Steiner · 20 Jan 2023 9:30 UTC
33 points
10 comments · 3 min read

Take 11: “Aligning language models” should be weirder.

Charlie Steiner · 18 Dec 2022 14:14 UTC
32 points
0 comments · 2 min read