Stuart_Armstrong

Karma: 18,065

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Stuart_Armstrong and rgorman

18 Mar 2025 14:48 UTC

80 points

12 comments, 5 min read, LW link

Using Prompt Evaluation to Combat Bio-Weapon Research

Stuart_Armstrong and rgorman

19 Feb 2025 12:39 UTC

11 points

2 comments, 3 min read, LW link

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart_Armstrong and rgorman

31 Jan 2025 15:36 UTC

16 points

2 comments, 2 min read, LW link

Alignment can improve generalisation through more robustly doing what a human wants—CoinRun example

Stuart_Armstrong

21 Nov 2023 11:41 UTC

67 points

9 comments, 3 min read, LW link

How toy models of ontology changes can be misleading

Stuart_Armstrong

21 Oct 2023 21:13 UTC

42 points

0 comments, 2 min read, LW link

Different views of alignment have different consequences for imperfect methods

Stuart_Armstrong

28 Sep 2023 16:31 UTC

31 points

0 comments, 1 min read, LW link

Avoiding xrisk from AI doesn’t mean focusing on AI xrisk

Stuart_Armstrong

2 May 2023 19:27 UTC

67 points

7 comments, 3 min read, LW link

What is a definition, how can it be extrapolated?

Stuart_Armstrong

14 Mar 2023 18:08 UTC

34 points

5 comments, 7 min read, LW link

You’re not a simulation, ’cause you’re hallucinating

Stuart_Armstrong

21 Feb 2023 12:12 UTC

25 points

6 comments, 1 min read, LW link

Large language models can provide “normative assumptions” for learning human preferences

Stuart_Armstrong

2 Jan 2023 19:39 UTC

29 points

12 comments, 3 min read, LW link

Concept extrapolation for hypothesis generation

Stuart_Armstrong, Patrick Leask and rgorman

12 Dec 2022 22:09 UTC

20 points

2 comments, 3 min read, LW link

Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong and rgorman

6 Dec 2022 19:54 UTC

170 points

85 comments, 9 min read, LW link

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

Stuart_Armstrong

4 Jul 2022 20:48 UTC

83 points

12 comments, 4 min read, LW link

Value extrapolation vs Wireheading

Stuart_Armstrong

17 Jun 2022 15:02 UTC

16 points

1 comment, 1 min read, LW link

Georgism, in theory

Stuart_Armstrong

15 Jun 2022 15:20 UTC

40 points

22 comments, 4 min read, LW link

How to get into AI safety research

Stuart_Armstrong

18 May 2022 18:05 UTC

48 points

7 comments, 1 min read, LW link

GPT-3 and concept extrapolation

Stuart_Armstrong

20 Apr 2022 10:39 UTC

19 points

27 comments, 1 min read, LW link

Concept extrapolation: key posts

Stuart_Armstrong

19 Apr 2022 10:01 UTC

13 points

2 comments, 1 min read, LW link

AIs should learn human preferences, not biases

Stuart_Armstrong

8 Apr 2022 13:45 UTC

10 points

0 comments, 1 min read, LW link

Different perspectives on concept extrapolation

Stuart_Armstrong

8 Apr 2022 10:42 UTC

48 points

8 comments, 5 min read, LW link, 1 review