I doubt the things you’re suggesting would have made much of a difference. People believe what they want to believe.
Chris_Leong
I’m not sure we want the cab rank rule.
Maybe for certain domains—but for others we basically want bad actors to have their influence diminished.
Fire departments do this rather than the police?
Wow… this is a great post!
My main worry would be that someone with an OCD personality might become obsessed with avoiding “Threat Monitoring! Worrying! Ruminating!” as some kind of counter-obsession (in fact, I’d go further and say that we should expect this as the default).
Which would almost certainly be a significant improvement. However, our brains did evolve these behaviours for a reason, so there’s reason to be wary.
Thanks, I’ll check it out!
This ignores the offence-defence balance which, in some circumstances, may massively benefit the attacker.
I like this.
Feels very aligned with the Friendly Gradient Hacker proposal. It certainly carries risks, but I’m starting to suspect that we will be forced to go down this path.
Fascinating post. I appreciate how it challenges conventional wisdom and I’ll have to spend more time thinking these points through.
One thing that confused me, though, is that this is exactly the kind of post I would have imagined someone writing if they were trying to defend the Anthropic bet, and my understanding was that you were opposed to that?
Interesting. Thanks for explaining.
Just thought I’d share a post I wrote about the potential promise of a “wisdom explosion” in case it’s of interest to you—I’m unsure, but I see some potential resonance/synergy with your perspective: https://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/
Could you explain AI reform? I didn’t quite understand your description.
There’s always a trade-off between simplicity and nuance. I don’t know if more explicitly framing it as a continuum would improve this model, but I’d love to see someone explore this.
I guess “meta-plan” is a bit more precise—but “plan” isn’t a technical term and, in practice, the distinction between plans and meta-plans breaks down if you look closely enough. Further, it’s debatable whether victory depends more on details or on process.
If you want more concrete detail on how this works[1]:
• The articles on heroic responsibility and Shut up and do the impossible! provide more detail on how “heroes” should act.
• As for the iterators, to a first approximation, I agree with John Wentworth about the importance of robustly generalizable work (either via the Very General Helper strategy or the One Who Actually Thought This Through A Bit strategy). Though my second-approximation analysis would also account for the value of a) work done for its intellectual “elegance” and b) work which demonstrates that an approach is broken. There’s a lot more detail that could be filled out, but I’m fine with leaving that to follow-up posts or comments.
You could disagree with them, stress-test them, or identify where they fail.
I think it’s possible to do that with this plan as well, even if it’s harder with a more abstract plan. Tell Claude it just needs to believe in itself 😛.
against Defense in Depth, a pause strategy, or an all-hands approach
It may feel strange to compare a plan to a meta-plan, but it makes sense in some contexts.
In particular:
• I believe that comparing my meta-plan against these concrete plans reveals some of the limitations of this meta-plan (I’d encourage you to ask Claude to attempt this analysis).
• Let’s suppose you’re trying to select a high-level plan to turn into a concrete strategy. Well, you can choose to start from a plan or a meta-plan. Maybe a meta-plan would be a bit more work, but it may be worth it if it provides better results.

Maybe I should finish with this: when you say you don’t understand the plan, what precisely do you mean? You want to understand the plan and then… what? I’m assuming you don’t just want to understand the plan out of love of knowledge or idle curiosity, but for some more substantive reason.
1. ^ As noted in the article, this isn’t really a binary. There are various degrees of “heroic responsibility”.
2. ^ Fascinatingly enough, ordinal vs. cardinal utility seems to be a major part of Tyler’s new book: https://tylercowen.com/marginal-revolution-generative-boo
“The Dario quote points to (3) with unusual directness”
This feels like a misreading of the Dario quote.
Anyway, I appreciate you differentiating different models of harm.
What is the ideology in middle powers that you fear? And what harmful actions might middle powers take if they get AGI-pilled?
My actual claim was more modest: that their actions will be much more diverse/random than the OP suggests.
Interesting take. I don’t know how I feel about this. I guess if I was 100% down on a pause I’d be more likely to go for this. My intuition is that this plan has too many moving parts. You can try to wake up middle nations in the hope that they’ll try to slow development… but who knows what the heck they’ll do? My intuition is that in many circumstances incentives are overruled by ideology. In repeated game scenarios, incentives can eventually bend or displace ideology, but it’s not clear to me that we should expect this to happen here.
In Counterfactual Mugging, which option counts as “biting the bullet”?
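For concreteness, here’s a small sketch of the ex-ante expected value of each policy, using the payoffs usually quoted when illustrating Counterfactual Mugging (a $100 payment on tails, a $10,000 prize on heads; the specific numbers are my assumption, not something fixed by the question above):

```python
# Sketch of the ex-ante expected value of each policy in Counterfactual Mugging.
# Payoffs ($100 payment, $10,000 prize, fair coin) are the common illustration,
# assumed here rather than fixed by the problem statement.

def expected_value(pays_on_tails: bool, prize: float = 10_000,
                   cost: float = 100, p_heads: float = 0.5) -> float:
    """Ex-ante EV of committing to a policy before the coin is flipped."""
    ev_heads = prize if pays_on_tails else 0   # Omega rewards only committed payers
    ev_tails = -cost if pays_on_tails else 0   # payers hand over the cost on tails
    return p_heads * ev_heads + (1 - p_heads) * ev_tails

print(expected_value(True))   # policy "pay": positive ex-ante EV
print(expected_value(False))  # policy "refuse": EV of 0
```

The ambiguity is that both answers can be framed as bullet-biting: paying is ex-ante rational but hands over $100 for no causal benefit once tails is observed, while refusing keeps the $100 but forfeits the positive ex-ante expectation.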
This concern that becoming smarter breaks the assumptions of shard theory makes it much less useful as a theory for the purpose of aligning future AGI
I’ve made this criticism myself: I didn’t believe the shard theory model would hold up for long, because a more agentic shard (or a set of them) would eventually end up seizing control. Then again, Lawrence writes that “agentic shards will seize power” is one of the assumptions of the theory. So maybe this isn’t actually a criticism of shard theory? This is a point I’m still somewhat confused on—is shard theory just meant to be an intermediate theory, or does it still hold even after the more agentic shards seize power?
I am going back through some of the old shard theory articles. Hopefully that provides me with some more clarity.
Asking whether one algorithm is dominant over another algorithm is underspecified without choosing a particular domain.
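As a toy illustration of this (my own example, not from the thread): counting comparisons for insertion sort versus merge sort shows the dominance flipping with the input domain, so neither algorithm wins across the board:

```python
# Toy illustration: which sorting algorithm "dominates" depends on the domain.
# Comparison counts are used as the cost measure.

def insertion_sort_comparisons(data):
    """Comparisons made by insertion sort: ~n on sorted input, ~n^2/2 on reversed."""
    a, comps = list(data), 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comps += 1
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1
            else:
                break
    return comps

def merge_sort_comparisons(data):
    """Comparisons made by merge sort: ~n log n on any input."""
    comps = 0
    def sort(xs):
        nonlocal comps
        if len(xs) <= 1:
            return xs
        left, right = sort(xs[: len(xs) // 2]), sort(xs[len(xs) // 2 :])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            comps += 1
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]
    sort(list(data))
    return comps

sorted_input = list(range(100))
reversed_input = list(range(99, -1, -1))
# Insertion sort wins on sorted input; merge sort wins on reversed input.
print(insertion_sort_comparisons(sorted_input), merge_sort_comparisons(sorted_input))
print(insertion_sort_comparisons(reversed_input), merge_sort_comparisons(reversed_input))
```

On the already-sorted list, insertion sort does only 99 comparisons and beats merge sort; on the reversed list it does 4,950 and merge sort wins. “Which is better?” has no answer until the domain is fixed.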
COVID is an interesting example to choose.
The measures only worked because COVID wasn’t more deadly. Worse, the resources spent countering the pandemic were far greater than those required to create such a pandemic.
Further, the government’s actions burned a lot of political capital, which might make it harder to respond to future pandemics.