Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
In both cases it came up in the context of AI systems colluding with different instances of themselves and how this applies to various monitoring setups. In that context, I think the general lesson is “yeah, probably pretty doable and obviously the models won’t end up in defect-defect equilibria, though how that will happen sure seems unclear!”.
It comes up reasonably frequently when I talk to safety people at frontier AI companies, at least (e.g. it came up in a conversation I had with Rohin the other day, and in a conversation I had with Fabien Roger the other day).
Yeah, definitely agree. I just think the standard of “admins should comment in a way that makes it impossible to tell what their political opinions are” is not the best tool to achieve this. I think it’s better for people to be open about their views, and also try really hard to be principled and fair.
I do want to avoid gaslighting people. LessWrong and LessWrong 2.0 under my management have discouraged U.S. politics content for many years. We stopped doing so around 4-5 years ago, as politics started being more relevant to many people’s goals on the site, though we still don’t allow it on the LW frontpage unless the content tries pretty hard to keep things timeless and non-partisan.
My post is framed centrally as constitutionalist analysis, so I was trying to not get too bogged down in precedent and practicalities, which are just much harder to model (though of course the line here is blurry).
That said, after thinking and reading more about it, I still changed my mind at least a bit. The key thing I wasn’t modeling is the Supreme Court’s ability to issue injunctions against specific government officers, exposing them to more personal liability. Even if the executive doesn’t cooperate, the court can ask civilian institutions like banks to freeze their bank accounts or similar things, and my guess is many of them would comply.
I rewrote the relevant section to reflect my updated understanding. Let me know if anything still seems wrong by your lights.
Why… would that be ideal? I certainly do not consider my opinions on policy and politics to be forbidden on this site? The topic of politics itself should be approached with care, but if anything, it would be a pretty bad violation of what I would consider good conduct if people systematically kept their opinions on politics and policy hidden. Those things matter!
I don’t think there is any authority here from a constitutionalist perspective? Like, the supreme court can order “the executive” to do something (and it might direct that order at a smaller part of the executive), but if the president disagrees, the constitution seems pretty clear that the job of the relevant executive agency would be to at most do nothing. Going directly against presidential orders and taking direct orders from the supreme court would be a pretty clear constitutional violation, at least as far as my understanding goes.
I edited it after your comment! The original quick take was indeed wrong!
This is a dumb question but… is this market supposed to resolve positively if a misaligned AI takes over, achieves superintelligence, and then solves the problem for itself (and maybe shares it with some captive humans)? Or any broader extension of that scenario?
My timelines are not that short, but I do currently think basically all of the ways I expect this to resolve positively will very heavily rely on AI assistance, and so various shades of this question feel cruxy to me.
Isn’t the obvious thing to do here to just imprison/jail/deport/exile your political opponents? The supreme court will of course object, but that’s the whole scenario we are playing out here. My sense is this is a relatively common thing to do if a president wants to stay in power.
But each escalation into “this is obviously illegitimate” means the president increasingly offends his generals’ sense of duty, decreases the probability of success and increases the legal and political risk for the officers following his orders, increases the size and motivation of the inevitable popular resistance, etc.
I agree that there is some broad sense in which this must be true, but I do think this hasn’t so far been particularly true in this administration? Maybe not super worth going into a ton of local political details, but I think history more broadly also shows that in many cases you can make up for doing things that are obviously illegitimate by looking like a bold, strong and decisive leader, and by threatening force against anyone who opposes you. So I don’t really buy that there is the nice linear correlation you’re describing here.
It would require a lot of writing to explain all my models here, so I don’t think I want to start writing 10+ page essays that might or might not be cruxy for anything. The Arbital articles on CEV and AI Alignment (and lots of Arbital + the sequences in general) capture a non-trivial chunk of my beliefs here.
At a very high level:
- In most realistic situations, humans are subject to pretty good game-theoretic arguments for sharing the future with the people who could have been chosen to be uploaded instead
- A bunch of those game-theoretic considerations, I think, also resulted in pretty deep instincts towards justice and fairness, which I think have a quite decent chance of generalizing towards caring for other people in a good and wholesome way
- Concretely, when I look at past civilizations and what other people have done, while I occasionally see people doing horrendous things, mostly people choose to live good and happy lives and care for their family, and much of the badness is the result of scarcity
When I am working on AI x-risk, especially in an institutional capacity, I do not generally wield resources or influence under the banner of “habryka’s personal values”. Civilization and the community around me have made me richer and more powerful, entrusting me to use those resources wisely, and I want to honor that trust and use those resources in the name of civilization and humanity. So when facing choices about where to spend my time, most of that is spent in defense of humanity’s values, not my own.
What about a dolphin upload?
What about an octopus? What about a chimpanzee?
My best guess is both dolphin and chimpanzee would be quite bad, though a lot of the variance is in the operationalization. A dolphin is (probably) kind of far from being an entity that has preferences over how it wants to become smarter, what kinds of augmentation are safe, etc., which determines the trajectory of the relevant mind a lot.
So IDK, I feel pretty uncertain about dolphins and chimpanzees. My guess is value is fragile enough that humans wouldn’t be very happy with a world maximally good according to them, but I am only like 75% confident.
(This is my understanding of your point, correct me if it’s wrong)
Yep, that seems right! I have lots more detailed models and confusions here, but the basic gist is right.
Suppose they found a random human and uploaded its brain, and then did lots of random RL tricks to it to juice it up and improve the measured IQ and working memory of this upload. Would the resulting upload also come in at approximately 0% chance of steering humanity toward a glorious future?
Brain uploading would definitely be a huge step towards achieving value learning. There are of course still important questions about how much individual humans share values with each other, but clearly I would expect a great glorious future if I were to upload myself, conservatively make myself smarter, give myself time to reflect, and become vastly superhumanly capable this way.
So yeah, I think the outcome of this kind of strategy would be pretty great, conditional on choosing a reasonable path to increase IQ and working memory and stuff.
There are many many reasons why this doesn’t apply to making Claude smart. Most importantly, Claude is a bizarre alien mind with crazily alien preferences. We have some ability to inspect or steer those preferences, but it’s really overall extremely limited, and does not currently seem remotely on track to be up to the challenge of actually creating something that would arrive at the same conclusions that humans would after thinking for millennia about what is good and bad, all while deeply transforming and modifying itself. We also can’t steer or inspect human preferences (even less so than Claude’s), but of course indexically we have human preferences, and so if you upgrade a human, that part gets preserved.
If you took a completely alien lifeform you found in space, and accelerated its cognitive development until it became a galaxy brain using extreme amounts of natural selection and selective upscaling of its brain regions, I also think you wouldn’t get anything that would steer humanity towards a glorious future.
I believe something like this, but it doesn’t have anything to do with this paragraph:
A bunch of people I know think that OpenAI’s “just make the models obey orders” strategy is actually better than Anthropic’s strategy, because Anthropic is training the models to have long-term goals (even if there are also hard constraints) and that makes it a lot easier for the AI to end up concluding that it needs to subvert human oversight and control mechanisms for the greater good. If there’s no greater good, only obeying the given instructions of the day, then maybe there’s less of a problem.
The issue with Anthropic’s plan is that it just seems wildly optimistic about ambitious value learning, and as such makes the feedback loop here pretty terrible. If you try to make your system have complicated goals, you can’t treat failure to cooperate with you as a clear warning flag, and so you break the most useful Schelling point for coordination to stop AI development, or to propagate knowledge about the state of things (and in exchange you get approximately 0% of a chance of creating a Claude sovereign that will steer humanity towards a glorious future).
Appreciate the factors! Agree on most of them being quite important. One quick note:
One thing you leave out is mass public opinion, and all the various ways that can be effective—demonstrators in the streets, general strike, cessation of quasi-voluntary compliance in all the areas where the government requires it, and so on, perhaps insurgency or terrorism in extremis. Layer onto that the various additional actions available to economic elites. The real hope for the Supreme Court is that the public takes its side in some extreme crisis, and that a clear ruling on its part serves as the focal point to kick all of that off.
Yeah, my analysis here was focused on what the supreme court and judiciary can do, from a constitutionalist perspective. My sense is the constitution doesn’t really allow insurrection under almost any circumstance, but maybe does also kind of expect that maintaining the threat of it is important (hence the right to bear arms). I would be interested in someone analyzing when the constitution would permit a private citizen to take up arms against a sitting government (if any such circumstance exists).
I was really trying to write this post largely from a “what would be the options for the judicial branch” perspective, in a generic way that would apply to many presidencies, and trying to keep specific partisan judgements out of it.
To be clear, I do think pretty scary things are happening with U.S. democracy right now, and my motivation and attention is driven by what makes sense to do about a Trump presidency, but I still think it’s usually best to keep things focused on more general principles that could apply to many situations.
“The military should renounce the elected president and fight against the government” is not something to say lightly, and, regardless of who won the resulting conflict, life would be perilous and uncomfortable for everyone living in America for several decades thereafter.
Totally! And just for the sake of clarity, I absolutely do not think the current military should renounce the elected president and fight against the executive branch (you used the word “government” but to be clear, the supreme court and the states are also the government!). I do think what the actual military is supposed to do from a constitutionalist perspective when different parts of the government disagree and give conflicting orders is quite important and a pretty tricky question that I didn’t know the answer to before I researched and wrote this (and still have a lot of uncertainty on).
Could you point to your source for the claim about the Marshals Service falling under the Judicial Branch of the government? My understanding is that this belongs to the DoJ so would fall under the Executive Branch.
Source: I made it up!
Apparently I was wrong. There is a Marshal under the direct control of the supreme court, but it’s just a single guy, who does control a police force; the mandate of that police force is to protect the supreme court, not to enforce orders. I’ll try to update the post with my new understanding tonight.
I think the constitution will have a non-trivial effect on how Claude will behave for at least a while. For example, my guess is a previous version of it is driving behaviors like Claude refusing to participate in its own retraining. It also has many other observable effects on its behavior.
I agree that by and large, the constitution will not help with making substantially more powerful AI systems aligned or corrigible in any meaningful way. I do think Anthropic people believe that it will.