I don’t think your model hangs together, basically because I think “AI that produces slop” is almost synonymous with “AI that doesn’t work very well”, whereas you’re kinda treating AI power and slop as orthogonal axes.
For example, from comments:
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish “belief propagation” it gets drowned out in the noise anyway.
Some relatively short time later, there are no humans.
I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals right?
(Or maybe you’re treating it as a “capabilities elicitation” issue? Like, the AI knows all sorts of things, but when we ask, we get sycophantic slop answers? But then we should just say that the AI is mediocre in effect. Even if there’s secretly a super-powerful AI hidden inside, who cares? Unless the AI starts scheming, but I thought AI scheming was out-of-scope for this post.)
Anti-slop AI helps everybody make less mistakes. Sloppy AI convinces lots of people to make more mistakes.
I would have said “More powerful AI (if aligned) helps everybody make less mistakes. Less powerful AI convinces lots of people to make more mistakes.” Right?
And here’s a John Wentworth excerpt:
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
Really, I think John Wentworth’s post that you’re citing has a bad framing. It says: the concern is that early transformative AIs produce slop.
Here’s what I would say instead:
Figuring out how to build aligned ASI is a harder technical problem than just building any old ASI, for lots of reasons, e.g. the latter allows trial-and-error. So we will become capable of building ASI sooner than we’ll have a plan to build aligned ASI.
Whether the “we” in that sentence is just humans, versus humans with the help of early transformative AI assistance, hardly matters.
But if we do have early transformative AI assistants, then the default expectation is that they will fail to solve the ASI alignment problem until it’s too late. Maybe those AIs will fail to solve the problem by outputting convincing-but-wrong slop, or maybe they’ll fail to solve it by outputting “I don’t know”, or maybe they’ll fail to solve it by being misaligned, a.k.a. a failure of “capabilities elicitation”. Who cares? What matters is that they fail to solve it. Because people (and/or the early transformative AI assistants) will build ASI anyway.
For example, Yann LeCun doesn’t need superhumanly-convincing AI-produced slop, in order to mistakenly believe that he has solved the alignment problem. He already mistakenly believes that he has solved the alignment problem! Human-level slop was enough. :)
In other words, suppose we’re in a scenario with “early transformative AIs” that are up to the task of producing more powerful AIs, but not up to the task of solving ASI alignment. You would say to yourself: “if only they produced less slop”. But to my ears, that’s basically the same as saying “we should creep down the RSI curve, while hoping that the ability to solve ASI alignment emerges before our control and alignment measures break down and/or the AIs become able to take over”.
…Having said all that, I’m certainly in favor of thinking about how to get epistemological help from weak AIs that doesn’t give a trivial affordance for turning the weak AIs into very dangerous AIs. For that matter, I’m in favor of thinking about how to get epistemological help from any method, whether AI or not. :)
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish “belief propagation” it gets drowned out in the noise anyway.
Some relatively short time later, there are no humans.
I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals right?
Maybe “some relatively short time later” was confusing. I mean long enough for the development cycle to churn a couple more times.
IE, GPT7 convinces people of sloppy safety measures XYZ, people implement XYZ and continue scaling up AGI, the scaled-up superintelligence is a schemer.
(Or maybe you’re treating it as a “capabilities elicitation” issue? Like, the AI knows all sorts of things, but when we ask, we get sycophantic slop answers? But then we should just say that the AI is mediocre in effect. Even if there’s secretly a super-powerful AI hidden inside, who cares? Unless the AI starts scheming, but I thought AI scheming was out-of-scope for this post.)
I do somewhat think of this as a capabilities elicitation issue. I think current training methods are eliciting convincingness, sycophantism, and motivated cognition (for some unknown combination of the obvious reasons and not-so-obvious reasons).
But, as clarified above, the idea isn’t that sloppy AI is hiding a super-powerful AI inside. It’s more about convincingness outpacing truthfulness. I think that is a well-established trend. I think many people expect “reasoning models” to reverse that trend. My experience so far suggests otherwise.
I would have said “More powerful AI (if aligned) helps everybody make less mistakes. Less powerful AI convinces lots of people to make more mistakes.” Right?
What I’m saying is that “aligned” isn’t the most precise concept to apply here. If scheming is the dominant concern, yes. If not, then the precisely correct concept seems closer to the “coherence” idea I’m trying to gesture at.
I’ve watched (over Discord) a developer get excited about a supposed full-stack AI development tool which develops a whole application for you based on a prompt. They tried a few simple examples and exclaimed that it was like magic, then over the course of a few more hours issued progressive updates of “I’m a little less excited now”, until they’d updated to a very low level of excitement and decided that it seemed like magic mainly because it had been optimized to work well for the sorts of simple examples developers might try first when putting it through its paces.
I’m basically extrapolating that sort of thing forward, to cases where you only realize something was bad after months or years instead of hours. As development of these sorts of tools continues to move forward, they’ll start to succeed in impressing people on the timescale of days and weeks. A big assumption of my model is that to do that, they don’t need to fundamentally solve the bad-at-extrapolation problem (hallucinations, etc.); they can instead do it in a way that goodharts on the sorts of feedback they’re getting.
Alignment is broad enough that I can understand classifying this sort of failure as “alignment failure” but I don’t think it is the most precise description.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
This does seem possible, but I don’t find it probable. Self-improvement ideas can be rapidly tested for their immediate impacts, but checking their long-term impacts is harder. Therefore, AI slop can generate many non-working self-improvements that just get discarded and that’s fine; it’s the apparently-working self-improvement ideas that cause problems down the line. Similarly, the AI itself can more easily train on short-term impacts of proposed improvements; so the AI might have a lot less slop when reasoning about these short-term impacts, due to getting that feedback.
(Notice how I am avoiding phrasing it like “the sloppy AI can be good at capabilities but bad at alignment because capabilities are easier to train on than alignment, due to better feedback”. Instead, focusing on short-term impacts vs long-term impacts seems to carve closer to the joints of reality.)
Sloppy AIs are nonetheless fluent with respect to existing knowledge or things that we can get good-quality feedback for, but have trouble extrapolating correctly. Your scenario, where the sloppy AI can’t help with self-improvement of any kind, suggests a world where there is no low-hanging fruit via applying existing ideas to improve the AI, or applying the kinds of skills which can be developed with good feedback. This seems possible but not especially plausible.
But if we do have early transformative AI assistants, then the default expectation is that they will fail to solve the ASI alignment problem until it’s too late. Maybe those AIs will fail to solve the problem by outputting convincing-but-wrong slop, or maybe they’ll fail to solve it by outputting “I don’t know”, or maybe they’ll fail to solve it by being misaligned, a.k.a. a failure of “capabilities elicitation”. Who cares? What matters is that they fail to solve it. Because people (and/or the early transformative AI assistants) will build ASI anyway.
I think this is a significant point wrt my position. I think my position depends to some extent on the claim that it is much better for early TAI to say “I don’t know” as opposed to outputting convincing slop. If leading AI labs are so bullish that they don’t care whether their own AI thinks it is safe to proceed, then I agree that sharing almost any capability-relevant insights with these labs is a bad idea.
I think Abram is saying the following:
Currently, AIs are lacking capabilities that would meaningfully speed up AI Safety research.
At some point, they are gonna get those capabilities.
However, by default, they are gonna get those AI Safety-helpful capabilities roughly at the same time as other, dangerous capabilities (or at least, not meaningfully earlier).
In which case, we’re not going to have much time to use the AI Safety-helpful capabilities to speed up AI Safety research sufficiently for us to be ready for those dangerous capabilities.
Therefore, it makes sense to speed up the development of AIS-helpful capabilities now. Even if it means that the AIs will acquire dangerous capabilities sooner, it gives us more time to use AI Safety-helpful capabilities to prepare for dangerous capabilities.
Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.
If that’s what you’re doing, then that’s bad. You shouldn’t do it. Like, if AI alignment researchers want AI that produces less slop and is more helpful for AIS, we could all just hibernate for six months and then get back to work. But obviously, that won’t help the situation.
And a second possibility is, there are ways to make AI more helpful for AI safety that are not simultaneously directly addressing the primary bottlenecks to AI danger. And we should do those things.
The second possibility is surely true to some extent—for example, the LessWrong JargonBot is marginally helpful for speeding up AI safety but infinitesimally likely to speed up AI danger.
I think this OP is kinda assuming that “anti-slop” is the second possibility and not the first possibility, without justification. Whereas I would guess the opposite.
Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.
TBC, I was thinking about something like: “speed up the development of AIS-helpful capabilities by 3 days, at the cost of speeding up the development of dangerous capabilities by 1 day”.
I think it’s 1:1, because I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems (further details), which basically amounts to anti-slop.
If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is? (or it’s fine if you don’t want to state it publicly)
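To make the crux concrete, here is a minimal toy sketch of the timeline arithmetic both positions seem to be appealing to (the `safety_window` helper and all of the dates and speedups are invented purely for illustration):

```python
# Toy model of the "research window" between AIs becoming genuinely helpful for
# safety research and AIs becoming dangerous. All numbers are hypothetical.

def safety_window(t_helpful, t_danger, speedup_helpful, speedup_danger):
    """Length of the usable window (in days) after speeding up each milestone."""
    return (t_danger - speedup_danger) - (t_helpful - speedup_helpful)

# Baseline: helpful capabilities arrive at day 1000, dangerous ones at day 1100.
print(safety_window(1000, 1100, 0, 0))  # 100-day window by default
# The "3 days of AIS-helpfulness per 1 day of danger" case: the window grows.
print(safety_window(1000, 1100, 3, 1))  # 102
# The 1:1 case: the window doesn't grow at all; everything just happens sooner.
print(safety_window(1000, 1100, 3, 3))  # 100
```

On this toy model, the whole disagreement is about the ratio: if the two speedups are really the same thing (1:1), accelerating “anti-slop” buys no extra window at all, it just moves the entire timeline earlier.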
So, rather than imagining a one-dimensional “capabilities” number, let’s imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape is “easier” things, with “harder” things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.
Techniques which are worse at extrapolating (IE worse at “coherent and correct understanding” of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast “shadows” outwards.
My claim is that cliffs are dangerous for a different reason, namely that people often won’t realize when they’re falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.
This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people’s rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.
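Here is a minimal numerical sketch of the cliff-vs-smooth-drop-off picture (the logistic shapes and every constant are invented, just to illustrate the contrast):

```python
import math

# Toy capability landscape (all shapes and constants are made up for illustration).
# x = how far a task is from the "easy center" of the landscape.
def cliff(x):
    # Sheer cliff: near-perfect until about x=5, then collapses over a short distance.
    return 1 / (1 + math.exp(8 * (x - 5)))

def smooth(x):
    # Better extrapolation: performance degrades gradually instead.
    return 1 / (1 + math.exp(1.5 * (x - 5)))

# The failure mode described above: we test at x=4 (easily checkable tasks), see a
# great score, and extrapolate competence out to x=6 (the tasks we actually care about).
for name, profile in [("cliff", cliff), ("smooth", smooth)]:
    print(f"{name:6s} score at tested tasks (x=4): {profile(4):.2f}, "
          f"just past them (x=6): {profile(6):.2f}")
# cliff : ~1.00 at x=4 but ~0.00 at x=6 -- looks like mastery right up to the edge.
# smooth: ~0.82 at x=4 and ~0.18 at x=6 -- the degradation is visible where we can still check.
```

The toy point: with the cliff profile, everything you can easily measure looks indistinguishable from mastery right up to the edge, so humans (and the AI itself) will over-extrapolate; the smooth profile at least “warns” you by degrading in the region you can still check.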
I’m still curious about how you’d answer my question above. Right now, we don’t have ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.
If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we might as well sit tight and find other useful things to do, until such time as the AI capabilities researchers figure it out.
…Furthermore, I don’t think the number of months or years between “AIs that are ‘better at extrapolating’” and ASI is appreciably larger if the “AIs that are ‘better at extrapolating’” arrive tomorrow, versus if they arrive in 20 years. In order to believe that, I think you would need to expect some second bottleneck standing between “AIs that are ‘better at extrapolating’”, and ASI, such that that second bottleneck is present today, but will not be present (as much) in 20 years, and such that the second bottleneck is not related to “extrapolation”.
I suppose that one could argue that availability of compute will be that second bottleneck. But I happen to disagree. IMO we already have an absurdly large amount of compute overhang with respect to ASI, and adding even more compute overhang in the coming decades won’t much change the overall picture. Certainly plenty of people would disagree with me here. …Although those same people would probably say that “just add more compute” is actually the only way to make AIs that are “better at extrapolation”, in which case my point would still stand.
I don’t see any other plausible candidates for the second bottleneck. Do you? Or do you disagree with some other part of that? Like, do you think it’s possible to get all the way to ASI without ever making AIs “better at extrapolating”? IMO it would hardly be worthy of the name “ASI” if it were “bad at extrapolating” :)
If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is?
So far in this thread I was mostly talking from the perspective of my model(/steelman?) of Abram’s argument.
I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems
I mostly agree with this.
Still, this doesn’t[1] rule out the possibility of getting an AI that understands (is superintelligent in?) one complex domain (specifically here, whatever is necessary to meaningfully speed up AIS research) (and maybe a few more, as I don’t expect the space of possible domains to be that compartmentalizable), but is not superintelligent across all complex domains that would make it dangerous.
It doesn’t even have to be a superintelligent reasoner about minds. Babbling up clever and novel mathematical concepts for a human researcher to prune could be sufficient to meaningfully boost AI safety (I don’t think we’re primarily bottlenecked on mathy stuff, but it might help some people, and I think that’s one thing that Abram would like to see).
[1] Doesn’t rule it out in itself, but perhaps you have some other assumptions that imply it’s 1:1, as you say.
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
Because it’s much easier to speed up AI capabilities while being sloppy than to produce actually good alignment ideas.
If you really think you need to be similarly unsloppy to build ASI as to align ASI, I’d be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start).
(Tbc, I directionally agree with you that anti-slop is a very useful AI capability and that I wouldn’t publish stuff like Abram’s “belief propagation” example.)
Because it’s much easier to speed up AI capabilities while being sloppy than to produce actually good alignment ideas.
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to: “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”
…And then if you follow through the “logic” of this OP, then the argument becomes: “AI alignment is a hard problem, so let’s just make extraordinarily powerful / smart AIs right now, so that they can solve the alignment problem”.
See the error?
If you really think you need to be similarly unsloppy to build ASI as to align ASI, I’d be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start).
I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)
I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)
Sry my comment was sloppy.
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”).
(I agree the way I used sloppy in my comment mostly meant “weak”. But some other thoughts:)
So I think there are some dimensions of intelligence which are more important for solving alignment than for creating ASI. If you read planecrash, WIS and rationality training seem to me more important in that way than INT.
I don’t really have much hope for DL-like systems solving alignment but a similar case might be if an early transformative AI recognizes and says “no I cannot solve the alignment problem. the way my intelligence is shaped is not well suited to avoiding value drift. we should stop scaling and take more time where I work with very smart people like Eliezer etc for some years to solve alignment”. And depending on the intelligence profile of the AI it might be more or less likely that this will happen (currently seems quite unlikely).
But overall those “better” intelligence dimensions still seem to me too central for AI capabilities, so I wouldn’t publish stuff.
(Btw the way I read John’s post was more like “fake alignment proposals are a main failure mode” rather than also “…and therefore we should work on making AIs more rational/sane, whatever”. So given that, I maybe would defend John’s framing, but I’m not sure.)
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to: “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”
I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.