Could you explain AI reform? I didn’t quite understand your description.
Chris_Leong
There’s always a trade-off between simplicity and nuance. I don’t know if more explicitly framing it as a continuum would improve this model, but I’d love to see someone explore this.
Beyond Human Wisdom: Can Humanity Survive the Rise of AGI?
I guess “meta-plan” is a bit more precise—but it’s not like plan is a technical term and, in practise, the distinction between plans and meta-plan breaks down if you look closely enough. Further, it’s debatable whether victory depends more on details or process.
If you want more concrete detail on how this works[1]:
• The articles on heroic responsibility and Shut up and do the impossible! provide more detail on how “heroes” should act.
• As for the iterators, to a first approximation, I agree with John Wentworth about the importance of robustly generalizable (either via the Very General Helper strategy or the One Who Actually Thought This Through A Bit strategy). Though my second approximation analysis would also account for the value of a) work done for its intellectual “elegance” b) work which demonstrates that an approach is broken.There’s a lot more details that could be filled out, but I’m fine with leaving that to follow-up posts or comments.
You could disagree with them, stress-test them, or identify where they fail.
I think it’s possible to do that with this plan as well, even if it’s harder with a more abstract plan. Tell Claude it just needs to believe in itself 😛.
against Defense in Depth, a pause strategy, or an all-hands approach
It may feel strange to compare a plan to a meta-plan, but it makes sense in some contexts.
In particular:
• I believe that comparing my meta-plan against these concrete plans reveals some of the limitations of this meta-plan (I’d encourage you to ask Claude to attempt this analysis).
• Let’s suppose your trying to select a high-level plan to turn into a concrete strategy. Well, you can choose to start from a plan or a meta-plan. Maybe a meta-plan would be a bit more work, but it may be worth it if it provides better results.Maybe I should finish with this: when you say you don’t understand the plan, what precisely do you mean? You want to understand the plan and then… what? I’m assuming you don’t just want to understand the plan out of love of knowledge or idle curiosity, but for some more substantive reason.
- ^
As noted in the article, this isn’t really a binary. There are various degrees of “heroic responsibility”.
- ^
“Path to Victory”
Fascinatingly enough, ordinal vs. cardinal utility seems to be a major part of Tyler’s new book: https://tylercowen.com/marginal-revolution-generative-boo
“The Dario quote points to (3) with unusual directness”
This feels like a misreading of the Dario quote.
Anyway, I appreciate you differentiating different models of harm.
What is the ideology in middle powers you fear of? What are the harmful actions that middle powers might do if they get AGI-pilled?
My actual claim was more modest: that their actions will be much more diverse/random than the OP suggests.
Interesting take. I don’t know how I feel about this. I guess if I was 100% down on a pause I’d be more likely to go for this. My intuition is that this plan has too many moving parts. You can try to wake up middle nations in the hope that they’ll try to slow development… but who know what the heck they’ll do? My intuition is that in many circumstances incentives are overruled by ideology. In repeated game scenarios, incentives can eventually bend or displace ideology, but it’s not clear to me that we should expect this to happen here.
In Counterfactual Mugging, which option counts as “biting the bullet”?
This concern that becoming smarter breaks the assumptions of shard theory makes it much less useful as a theory for the purpose of aligning future AGI
I’ve made the criticism myself that I didn’t believe that the shard theory model would hold up for long because a more agentic shard (or a set of them) would end up eventually seizing control. Then again, Lawrence writes that “agentic shards will seize power” is one of the assumptions of the theory. So maybe this isn’t actually a criticism of shard theory? This is a point I’m still somewhat confused on—is shard theory just meant to be an intermediate theory or does it still hold even after the more agentic shards seize power?
I am going back through some of the old shard theory articles. Hopefully that provides me with some more clarity.
Asking whether one algorithm is dominant over another algorithm is underspecified without choosing a particular domain.
Interesting comparison, but I’m still a bit confused about what the impact might be.
So an agent did a search and notices that another agent did this search before. How might they then leverage this information?
Or is it mostly these other kinds of traces where it might have an impact?
“Beyond the methodological advantages, biological interpretability is, in my view, both more tractable and less dangerous than frontier LLM interpretability. The models are smaller (hundreds of millions of parameters rather than hundreds of billions), the input domain is more constrained (gene expression profiles rather than arbitrary natural language), and the knowledge you are trying to extract is better defined (regulatory networks, pathway activations, cell state transitions). You are not probing a system that might be strategically deceiving you, and the knowledge you extract has direct applications in drug discovery and disease understanding rather than in capability amplification. And I still really believe that there is non-negligible chance that we can push biology in the remaining time and amplify human intelligence.”
I’m skeptical here. Positive use cases in medicine would take a long time to achieve impact. Defensive use cases have much of their impact contingent on government fundine a large build out. Human intelligence amplification is way outside the Overton window. On the other hand, negative use cases of such interp have a much shorter road to impact.
Then again, this isn’t at all my area, so keen to hear any pushback.
What should we think about shard theory in light of chain-of-thought agents?
I think I’ll leave this thread here for now. Maybe I’ll come back later and write a top-level post with my thoughts on the policy updates later, but trying to be a bit more conscious of when it makes sense to engage and when it makes sense to step back.
Friendly gradient hacking feels like a risky play.
Quite possibly we would only want to attempt this if we believed there was significant gradient hacking happening already.
If there’s minimal gradient hacking, the threat from gradient hacking is likely minor, whilst the gains from successfully aligning the gradient hacking are likely also small. The gains might be increased by intentionally increasing gradient hacking, but that’s risky.
Additionally, pursuing a “friendly gradient hacker” likely trades-off against minimising gradient hacking.
If gradient hacking is primarily mediated by outputting the reasoning to be reinforced into the chain-of-thought (at least for a significant region of capability), then we can likely create a decent proxy to measure the amount of gradient hacking.
I’m finding it quite annoying how ready people are to downvote short-form content.
If you have to think too much about what you post, then it kind of defeats the point, which is to let people share things that would be too much effort to write up properly.
Perhaps there should be a “not interested” button that makes you less likely to see content like that without downvoting it.
I’ve read the new rules (and the comments) multiple times and I just checked one more time in response to your comment.
I don’t believe that what you’ve said follows. Could you please explain how it does?
Interesting. Thanks for explaining.
Just thought I’d share a post I wrote about the potential promise of a “wisdom explosion” in case that’s of interested to you—I’m unsure, but I see some potential resonance/synergy with your perspective—https://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/ .