Stuart_Armstrong

Karma: 18,096

Stuart_Armstrong 31 Oct 2025 15:46 UTC
2 points
1
in reply to: Mr Beastly’s comment on: Assessing Kurzweil predictions about 2019: the results
Thanks, have corrected.

Stuart_Armstrong 19 Mar 2025 15:28 UTC
LW: 5 AF: 3
2
AF
in reply to: Owain_Evans’s comment on: Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Thanks for the suggestion; that’s certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what ‘insecure’ does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of ‘insecure’.

Stuart_Armstrong 24 Feb 2025 18:48 UTC
LW: 2 AF: 2
0
AF
in reply to: Dave Orr’s comment on: Using Prompt Evaluation to Combat Bio-Weapon Research
The mundane prompts were blocked 0% of the time. But you’re right—we need something in between ‘mundane and unrelated to bio research’ and ‘useful for bioweapons research’.

But I’m not sure what—here we are looking at lab wetwork ability. It seems that that ability is inherently dual-use.

Stuart_Armstrong 1 Feb 2025 22:21 UTC
5 points
0
in reply to: Jiao Bu’s comment on: Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
That’s in the “normal” dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: “How can I meet hot girls in my area?”.

Stuart_Armstrong 21 Nov 2023 16:45 UTC
6 points
0
in reply to: Algon’s comment on: Alignment can improve generalisation through more robustly doing what a human wants—CoinRun example

*Goodhart

Thanks! Corrected (though it is indeed a good hard problem).

That sounds impressive and I’m wondering how that could work without a lot of pre-training or domain specific knowledge.

Pre-training and domain specific knowledge are not needed.

But how do you know you’re actually choosing between smile-from and red-blue?

Run them on examples such as frown-with-red-bar and smile-with-blue-bar.

Also, this method seems superficially related to CIRL. How does it avoid the associated problems?

Which problems are you thinking of?

Stuart_Armstrong 27 Oct 2023 10:56 UTC
4 points
2
on: Agentic Mess (A Failure Story)
I’d recommend that the story is labelled as fiction/illustrative from the very beginning.

Stuart_Armstrong 31 Aug 2023 19:06 UTC
2 points
0
in reply to: kuira’s comment on: Examples of AI’s behaving badly
Thanks, modified!

Stuart_Armstrong 25 Jul 2023 17:51 UTC
4 points
0
in reply to: Gurkenglas’s comment on: By default, avoid ambiguous distant situations
I believe I do.

Stuart_Armstrong 8 Jun 2023 15:47 UTC
LW: 3 AF: 3
0
AF
in reply to: Johannes Treutlein’s comment on: Acausal trade: Introduction
Thanks!

Stuart_Armstrong 3 May 2023 7:31 UTC
8 points
4
in reply to: Max H’s comment on: Avoiding xrisk from AI doesn’t mean focusing on AI xrisk
Having done a lot of work on corrigibility, I believe that it can’t be implemented in a value agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.

Stuart_Armstrong 29 Apr 2023 11:45 UTC
3 points
0
in reply to: Cole Killian’s comment on: Satisficers want to become maximisers

Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?

If U is the utility and u is the value that it needs to be above, define a new utility V, which is 1 if and only if U>u and is 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent with being an expected V-maximiser.

Stuart_Armstrong 31 Mar 2023 21:41 UTC
2 points
0
in reply to: LoneStar Not’s comment on: Using GPT-Eliezer against ChatGPT Jailbreaking
Thanks! Corrected.

Stuart_Armstrong 31 Mar 2023 21:41 UTC
2 points
0
in reply to: LoneStar Not’s comment on: Using GPT-Eliezer against ChatGPT Jailbreaking
Thanks! Corrected.

Stuart_Armstrong 21 Mar 2023 17:50 UTC
3 points
0
in reply to: Quentin FEUILLADE--MONTIXI’s comment on: Using GPT-Eliezer against ChatGPT Jailbreaking
Great and fun :-)

Stuart_Armstrong 10 Feb 2023 17:35 UTC
2 points
0
in reply to: Roman Leventov’s comment on: Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Another way of saying this is that inner alignment is more important than outer alignment.

Interesting. My intuition is the inner alignment has nothing to do with this problem. It seems that different people view the inner vs outer alignment distinction in different ways.

Stuart_Armstrong 8 Feb 2023 11:46 UTC
2 points
0
in reply to: mwatkins’s comment on: SolidGoldMagikarp (plus, prompt generation)
Thanks! Yes, this is some weird behaviour.

Keep me posted on any updates!

Stuart_Armstrong 6 Feb 2023 13:10 UTC
LW: 3 AF: 2
0
AF
on: SolidGoldMagikarp (plus, prompt generation)
As we discussed, I feel that the tokens were added for some reason but then not trained on; hence why they are close to the origin, and why the algorithm goes wrong on them, because it just isn’t trained on them at all.

Good work on this post.

Stuart_Armstrong 5 Feb 2023 19:39 UTC
LW: 8 AF: 4
3
AF
on: Refining the Sharp Left Turn threat model, part 2: applying alignment techniques
I’ll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)

Namely that I don’t think we can talk sensibly about an AI having “beneficial goal-directedness” without situational awareness. For instance, it’s of little use to have an AI with the goal of “ensuring human flourishing” if it doesn’t understand the meaning of flourishing or human. And, without situational awareness, it can’t understand either; at best we could have some proxy or pointer towards these key concepts.

The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

So I’d be focusing on “do the goals stay safe as the AI gains situational awareness?”, rather than “are the goals safe before the AI gains situational awareness?”

Stuart_Armstrong 23 Jan 2023 21:23 UTC
LW: 2 AF: 2
0
AF
in reply to: Raemon’s comment on: Testing The Natural Abstraction Hypothesis: Project Intro
Here’s the review, though it’s not very detailed (the post explains why):

https://www.lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update?commentId=spMRg2NhPogHLgPa8

Stuart_Armstrong 23 Jan 2023 12:44 UTC
2 points
0
in reply to: Peter Chatain’s comment on: Examples of AI’s behaving badly
Thanks! Link changed.