If you have the time look up “Terence Tao” on Gwern’s website.
In case anyone else is going looking, here is the relevant account of Tao as a child and here is a screenshot of the most relevant part:
I use ChatGPT voice-to-text all the time. About 1% of the time, the message I record in English gets seemingly-roughly-correctly translated into Welsh, and ChatGPT replies in Welsh. Sometimes my messages go back to English on the next message, and sometimes they stay in Welsh for a while. Has anyone else experienced this?
Example: https://chatgpt.com/share/67e1f11e-4624-800a-b9cd-70dee98c6d4e
I think the commenter is asking something a bit different: about the distribution of tasks rather than the success rate. My variant of this question: is your set of tasks supposed to be an unbiased sample of the tasks a knowledge worker faces, so that if I see a 50% success rate on 1-hour tasks I can read it as a 50% success rate, on average, across all the tasks any knowledge worker faces?
Or is it more impressive than that because the tasks are selected to be moderately interesting, or less impressive because they're selected to be measurable, etc.?
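Here's a toy version of the worry, with entirely made-up numbers and task categories (nothing here comes from the actual benchmark): if the benchmark over-samples a kind of task the model happens to be better at, the measured 50% can overstate the success rate on the full distribution of tasks a knowledge worker faces.

```python
# Made-up numbers: a benchmark skewed toward "measurable" tasks,
# on which the model does better than on the rest of the real workload.

workload = {                 # task type -> (share of real workload, true success rate)
    "measurable":  (0.3, 0.6),
    "open_ended":  (0.7, 0.3),
}
benchmark_share = {          # share of the benchmark, skewed toward measurable tasks
    "measurable":  0.8,
    "open_ended":  0.2,
}

workload_rate  = sum(share * rate for share, rate in workload.values())
benchmark_rate = sum(benchmark_share[k] * workload[k][1] for k in workload)

print(f"success rate on the real workload: {workload_rate:.2f}")   # 0.39
print(f"success rate on the benchmark:     {benchmark_rate:.2f}")  # 0.54
```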
I vaguely remember a LessWrong comment from you a couple of years ago saying that you included Agent Foundations in the AGISF course as a compromise despite not thinking it’s a useful research direction.
Could you say something about why you’ve changed your mind, or what the nuance is if you haven’t?
I’d be curious to know whether conditioning on high agreement alone had less of this effect than conditioning on high karma alone (because something many people agree on is unlikely to be a claim of novel evidence, and more likely to be a take).
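To make the comparison concrete, here's a hypothetical sketch (the fields, scores, and thresholds are all invented) of checking whether the two filters select different kinds of comments from the same pool:

```python
# Invented toy data: each comment has a karma score, an agreement score,
# and a hand-label for whether it's a "take" rather than a claim of novel evidence.

comments = [
    {"karma": 50, "agreement": 45, "is_take": True},    # widely agreed-with take
    {"karma": 40, "agreement": 3,  "is_take": False},   # upvoted but contested evidence claim
    {"karma": 8,  "agreement": 30, "is_take": True},
    {"karma": 35, "agreement": 22, "is_take": False},
]

def take_fraction(selected):
    return sum(c["is_take"] for c in selected) / max(len(selected), 1)

high_agreement = [c for c in comments if c["agreement"] >= 20]
high_karma     = [c for c in comments if c["karma"] >= 20]

print("take fraction | high agreement:", round(take_fraction(high_agreement), 2))  # 0.67 on this toy data
print("take fraction | high karma:    ", round(take_fraction(high_karma), 2))      # 0.33 on this toy data
```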
Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.
More or less, yes. But I don’t think it suggests there might be other prompts around that unlock similar improvements—chain-of-thought works because it allows the model to spend more serial compute on a problem, rather than because of something really important about the words.
Agree that pauses are a clearer line. But even if a pause and tool-limit are both temporary, we should expect the full pause to have to last longer.
One difference is that keeping AI a tool might be a temporary strategy until you can use the tool AI to solve whatever safety problems apply to non-tool AI. In that case the coordination problem isn’t as difficult, because you might just need to get the smallish pool of leading actors to coordinate for a while, rather than everyone to coordinate indefinitely.
I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximate Solomonoff induction.
Even granting that, do you think the same applies to the cognition of an AI created using deep learning—is it approximating Solomonoff induction when presented with a new problem at inference time?
I think it’s not, for reasons like the ones in aysja’s comment.
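(For concreteness, the object being approximated in that claim is the Solomonoff prior; this is just the standard definition for a universal prefix machine $U$, not something specific to either comment:

$$M(x) \;=\; \sum_{p \,:\, U(p)\ \text{outputs a string beginning with}\ x} 2^{-|p|}, \qquad M(x_{n+1} \mid x_{1:n}) \;=\; \frac{M(x_{1:n}\,x_{n+1})}{M(x_{1:n})}.$$

The training-time version of the claim is that the network's next-token predictions approximate the conditional $M(\cdot \mid x_{1:n})$; my question is whether anything analogous is going on in-context at inference time.)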
Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.
I wrote it out as a post here.
I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Fair enough, yeah—this seems like a very reasonable angle of attack.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren’t “Hypothesis 1: Written goal specification”, “Hypothesis 2: Developer-intended goals”, and “Hypothesis 3: Unintended version of written goals and/or human intentions” all compatible with either kind of AI?
“Hypothesis 4: Reward/reinforcement” does assume a consequentialist, and so does “Hypothesis 5: Proxies and/or instrumentally convergent goals” as written, although it seems like ‘proxy virtues’ could maybe be a thing too?
(Unrelatedly, it’s not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I’m missing something).
Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits”? Or “Principles”? Or “Virtues”?
I probably wouldn’t prefer any of those to goals. I might use “Motivations”, but I also think it’s ok to use goals in this broader way and “consequentialist goals” when you want to make the distinction.
One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.
In this post you use ‘goals’ in quite a broad way, so as to include stuff like virtues (e.g. “always be honest”). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it’s motivated primarily by things like virtues, habits, or rules.
This would be the most important axis to hypothesise about if it were the case that instrumental convergence applies to consequentialist goals but not to things like virtues. Like, I think it’s plausible that
(i) if you get an AI with a slightly wrong consequentialist goal (e.g. “maximise everyone’s schmellbeing”) then you get paperclipped because of instrumental convergence,
(ii) if you get an AI that tries to embody a slightly wrong virtue (e.g. “always be schmonest”) then it’s badly dysfunctional, but that doesn’t directly entail a disaster.
And if that’s correct, then we should care about the question “Will the AI’s goals be consequentialist ones?” more than most questions about them.
You know you’re feeling the AGI when a compelling answer to “What’s the best argument for very short AI timelines?” lengthens your timelines
Interesting. My handwavey rationalisation for this would be something like:
there’s some circuitry in the model which is responsible for checking whether a trigger is present and activating the triggered behaviour
for simple triggers, the circuitry is very inactive in the absence of the trigger, so it’s unaffected by normal training
for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present, so it’s more affected by normal training (toy sketch below)
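A minimal sketch of that intuition (this is just my own toy illustration of the mechanism; the setup and numbers are invented): a ReLU unit whose pre-activation stays well below zero on normal inputs receives no gradient from them, so its weights survive further training untouched, whereas a unit that has to sit near its threshold gets pushed around.

```python
import torch

# Invented toy setup: two "trigger detector" units behind a ReLU.
# The simple-trigger detector is strongly negative on normal (non-trigger) inputs,
# so ReLU zeroes it out and no gradient reaches its weights.
# The complex-trigger detector hovers around zero, so it's active on some normal
# inputs and ordinary training updates (i.e. overwrites) it.

torch.manual_seed(0)
x = torch.randn(32, 8)                            # a batch of normal inputs

w_simple  = torch.randn(8, requires_grad=True)    # simple trigger: silent off-trigger
w_complex = torch.randn(8, requires_grad=True)    # complex trigger: partially active

pre_simple  = x @ w_simple - 20.0                 # big negative offset: never fires on normal data
pre_complex = x @ w_complex                       # hovers around zero: sometimes fires

loss = torch.relu(pre_simple).sum() + torch.relu(pre_complex).sum()
loss.backward()

print(w_simple.grad.abs().max())                  # 0: inactive circuitry untouched by training
print(w_complex.grad.abs().max())                 # clearly nonzero: active circuitry gets updated
```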
I think this gets at the heart of the question (but doesn’t consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future?
I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.