This argument might move some people to work on “capabilities” or to publish such work when they might not otherwise do so.
Above all, I’m interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.
My current guess:
I wouldn’t expect much useful research to come from having published the ideas. They’d mostly just get used for capabilities, and publishing seems like a bad idea.
Sure, you can work on it, be infosec-cautious, and keep it secret. Maybe share it with a few very trusted people who might actually have some good ideas. And depending on how things play out: if in a couple of years there’s an actual joint effort by the leading labs to align AI, and they only have something like 2-8 months left before competition hits the AI-improving-AI dynamic quite hard, then you might go to the labs and share your ideas with them (while still trying to keep it closed within those labs, which will probably only work for a few months or a year or so until there’s leakage).
Do you not at all buy John’s model, where there are important properties we’d like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?
Can you link me to what you mean by John’s model more precisely?
If you mean John’s slop-instead-scheming post: I agree with the “slop slightly more likely than scheming” part. I might need to reread John’s post to see what the concrete suggestions for what to work on might be. Will do so tomorrow.
I’m just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least besides some math, or at least for the parts of the problem we are bottlenecked on.
So I do think it’s valuable to have AIs that are near the singularity be more rational, but I don’t really buy the differentially-improving-alignment thing. Could you give a somewhat concrete example of what you think might be good to publish?
All capabilities will help somewhat with making the AI less likely to make errors that screw up its alignment. Which ones do you think are more important than others? There would have to be a significant difference in the usefulness of some capabilities, because otherwise you could just do the same alignment work later and still have a similar amount of time until superintelligence (and could get more work done that doesn’t speed up timelines).
Concrete (if extreme) story:
World A:
Invent a version of “belief propagation” which works well for LLMs. This offers a practical way to ensure that if an LLM seems to know something in one context, it can & will fluently invoke the same knowledge in almost all appropriate contexts. (A toy sketch of what this property could mean operationally follows the story below.)
Keep the information secret in order to avoid pushing capabilities forward.
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish “belief propagation” it gets drowned out in the noise anyway.
Some relatively short time later, there are no humans.
World B:
Invent LLM “belief propagation” and publish it. It is good enough (by assumption) to be the new paradigm for reasoning models, supplanting current reinforcement-centric approaches.
Two years later, GPT7 is assessing its safety proposals realistically instead of convincingly arguing for them. Belief propagation allows AI to facilitate a highly functional “marketplace of ideas” where the actually-good arguments tend to win out far more often than the bad arguments. AI progress is overall faster, but significantly safer.
(This story of course assumes that “belief propagation” is an unrealistically amazing insight; still, this points in the direction I’m getting at)
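To make that property a bit more concrete, here is a minimal sketch of how one might measure cross-context knowledge consistency. It is purely illustrative: ask_model and cross_context_consistency are hypothetical names of my own, ask_model stands in for whatever LLM API you would actually call, and this only checks the property; it is not the imagined “belief propagation” technique itself.

```python
# Purely illustrative sketch: measure whether a model gives the same answer to
# the same underlying question across unrelated framings. "ask_model" is a
# hypothetical placeholder, not a real API.
from collections import Counter


def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in whatever client you actually use."""
    raise NotImplementedError


def cross_context_consistency(question: str, contexts: list[str]) -> float:
    """Embed the same question in several contexts and return the fraction of
    answers that agree with the most common answer (1.0 = fully consistent)."""
    answers = []
    for context in contexts:
        prompt = f"{context}\n\nQuestion: {question}\nAnswer briefly:"
        answers.append(ask_model(prompt).strip().lower())
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


# Example usage (hypothetical):
# score = cross_context_consistency(
#     "In what year was the transistor invented?",
#     ["You are helping a historian.", "You are reviewing C code.", ""],
# )
# A score near 1.0 means the model reliably invokes the same knowledge
# regardless of framing; a low score is the failure the imagined
# "belief propagation" method would fix.
```

Of course, the hypothetical “belief propagation” would be about making this score high by construction, not just measuring it.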
Thanks for providing a concrete example!
Belief propagation seems too core to AI capabilities to me. I’d rather place my hope on GPT7 not yet being all that good at accelerating AI research, and on us having significantly more time.
I also think the “drowned out in the noise” concern isn’t that realistic. You ought to be able to show some quite impressive results relative to the computing power used. Though when you should try to convince the AI labs of your better paradigm is going to be a difficult call. It’s plausible to me that we won’t see signs that make us sufficiently confident that we only have a short time left, and it’s plausible that we will.
In any case, before you publish something you can share it with trustworthy people, and then we can discuss that concrete case in detail.
Btw, to be clear: something that I think slightly speeds up AI capabilities but is still good to publish is e.g. producing rationality content that helps humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.
Yeah, basically everything I’m saying is an extension of this (but obviously, I’m extending it much further than you are). We don’t exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. (That is, so long as we’re assuming scheming is not the failure mode to worry about in the shorter-term.) So, improved rationality for AIs seems similarly good. The claim I’m considering is that even improving rationality of AIs by a lot could be good, if we could do it.
An obvious caveat here is that the intervention should not dramatically increase the probability of AI scheming!
This plan of keeping it secret and hoping GPT7 won’t be that capable yet just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever higher. If the eventual plan is to get big labs to listen to your research, then isn’t it better to start early? (If you have anything significant to say, of course.)
I’d imagine it’s not too hard to get a >1 OOM efficiency improvement that one can demonstrate on smaller models, and one might use this to get a lab to listen. If the labs are sufficiently uninterested in alignment, it’s pretty doomy anyway, even if they adopted a better paradigm.
Also, government interventions might still happen (perhaps more likely because of AI-caused unemployment than x-risk, and they won’t buy all that much time, but still).
Also, the strategy of “maybe if AIs are more rational they will solve alignment or at least realize that they cannot” seems very unlikely to me to work within the current DL paradigm, though it’s still slightly helpful.
(Also maybe some supergenius or my future self or some other group can figure something out.)