I think the original formulation has the same problem, but it’s a serious problem that needs to be addressed by any claim about AI danger.
I tried to address this by slipping in “AI entities”, which to me strongly implies agency. It’s agency that creates instrumental goals, while intelligence is only more arguably related: it connects to agency, and through agency to instrumental goals. Based on your response, I think this phrasing isn’t adequate, and I’d expect even less attention to the implications of “entities” from a general audience.
That concern was why I included the caveat about addressing agency. Now I think that probably has to be worked into the main claim. I’m not sure how to do that; one approach is making an analogy to humans along the lines of “we’re going to make AIs that are more like humans because we want AI that can do work for us… that includes following goals and solving problems along the way… ”
This thread helped inspire me to write the brief post Anthropomorphizing AI might be good, actually. That’s one strategy for evoking the intuition that AI will be highly goal-directed and agentic. I’ve tried a lot of different terms like “entities” and “minds” to evoke that intuition, but “human-like” might be the strongest even though it comes at a steep cost.
If we can clearly tie the argument for AGI x-risk to agency, I think it won’t have the same problem, because I think we’ll see instrumental convergence as soon as we deploy even semi-competent LLM agents. They’ll do unexpected stuff for both rational and irrational reasons.
I think the original formulation has the same problem. It starts with the claim:
It is plausible that future systems achieve superhuman capability; capable systems necessarily have instrumental goals [...]
One could say “well LLMs are already superhuman at some stuff and they don’t seem to have instrumental goals”. And that will become more compelling as LLMs keep getting better in narrow domains.
Kat Woods’ tweet is an interesting case. I actually think her point is absolutely right as far as it goes, but it doesn’t go quite as far as she seems to think. I’m even tempted to engage on Twitter, a thing I’ve been warned never to do, on pain of endless stupid arguments if you can’t ignore hecklers :) Her tweet is addressing a different point than instrumental goals, but it’s also an important point. The specification problem is, I think, much improved by having LLMs as the base intelligence. But it’s not solved, because there’s not a clear “goal slot” in LLMs or LLM agents in which to insert that nice representation of what we want. I’ve written about these conflicting intuitions/conclusions in Cruxes of disagreement on alignment difficulty, largely by referencing the excellent Simplicia/Doomimir debates.
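To make the “goal slot” point concrete, here’s a minimal sketch of a typical prompt-scaffolded LLM agent (the function names and the stubbed model call are hypothetical illustrations, not any particular framework). The closest thing to a goal slot is just a natural-language instruction in the prompt; nothing in the architecture guarantees the model actually optimizes for it.

```python
# Minimal, illustrative sketch of a prompt-scaffolded LLM agent.
# `call_llm` is a hypothetical stand-in for any chat-completion API;
# it's stubbed here so the sketch runs on its own.

def call_llm(messages: list[dict]) -> str:
    """Stand-in for a model call; a real agent would query an LLM here."""
    return "DONE"  # placeholder response

def run_agent(goal: str, max_steps: int = 10) -> list[dict]:
    # The closest thing to a "goal slot": a natural-language instruction
    # placed in the prompt. Whether the model pursues it, reinterprets it,
    # or drifts is determined by its training, not by this string.
    messages = [{"role": "system",
                 "content": f"You are an agent. Your goal: {goal}"}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
    return messages

if __name__ == "__main__":
    print(run_agent("Summarize today's AI safety news"))
```

The point is just that “insert the goal here” bottoms out in prose the model interprets however its training disposes it to, which is why I don’t consider the specification problem solved.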
If we can clearly tie the argument for AGI x-risk to agency, I think it won’t have the same problem
Yeah, agreed, and it’s really hard to get the implications right here without a long description. In my mind “entities” didn’t trigger any association with agents, but I can see how it would for others.
I broadly agree that many people would be better off anthropomorphising future AI systems more. I sometimes push for this in arguments, because in my mind many people have massively overanchored on the particular properties of current LLMs and LLM agents. I’m less a fan of the part of that post that involves accelerating anything.
One could say “well LLMs are already superhuman at some stuff and they don’t seem to have instrumental goals”. And that will become more compelling as LLMs keep getting better in narrow domains.
Yeah, but the line “capable systems necessarily have instrumental goals” helps clarify what you mean by “capable systems”. It must be some definition that (at least plausibly) implies instrumental goals.
Kat Woods’ tweet is an interesting case. I actually think her point is absolutely right as far as it goes
Huh, I suspect that the disagreement about that tweet might come from dumb terminology fuzziness. I’m not really sure what she means by “the specification problem” when we’re in the context of generative models trained to imitate; it’s a problem that makes sense in a different context. But the central disagreement is that she thinks current observations (of “alignment behaviour” in particular) are very surprising, which just seems wrong. My response was this:
Mostly agreed. When suggesting even differential acceleration I should remember to put a big WE SHOULD SHUT IT ALL DOWN just to make sure it’s not taken out of context. And as I said there, I’m far from certain that even that differential acceleration would be useful.
I agree that Kat Woods is overestimating how optimistic we should be based on LLMs following directions well. I think re-litigating who said what when, and what they would have predicted, is a big mistake, since it is both beside the point and tends to strengthen tribal rivalries, which are arguably the largest source of human mistakes. There is an interesting, subtle issue there which I’ve written about in The (partial) fallacy of dumb superintelligence and Goals selected from learned knowledge: an alternative to RL alignment. There are potential ways to leverage LLMs’ relatively rich (but imperfect) understanding into AGI that follows someone’s instructions. Creating a “goal slot” based on linguistic instructions is possible. But it’s all pretty complex and uncertain.
You make some good points.