I agree. What I’m puzzled by is people who assume we’ll solve alignment, but then still think there are a bunch of problems left.
I don’t think this makes sense.
My argument was making two points: that any immediate wants are not the preference [...]
Right now you’d want the ASI to maximize your preferences, even though those preferences are not yet legible/knowable to the AI (or to yourself). The AI knows this, so it will allow those preferences to develop (taking for granted that that’s the only way the AI can learn them without violating other things you currently want, like respecting the moral worth of any simulated entities that an attempt to shortcut the unfolding might create).
Like, right now you have wants that define your preferences (not object-level wants, but ones that define the process that would lead to your preferences being developed, and the constraints that unfolding needs to be subject to). If the AI optimizes for this, it will lead to the preferences being optimized for later.
And it will be able to do this, because this is what you currently want, and the premise is that we can get the AI to do what you want.
I did address this in my post. My answer is that bad people having power is bad, but it’s not a complicated philosophical problem. If you think Sam Altman’s CEV being actualized would be bad, you should try to make it not happen. Like: if you are a soybean farmer, and one presidential candidate is gonna ban soybeans, you should try to make them not be elected.
This is core to the alignment problem. I’m confused how you will solve the alignment problem without figuring out anything about what you care about as a (biological) human.
I’m saying: the end goal is that we have an ASI that we can make do what we want. Maybe that looks like us painstakingly solving neuroscience and psychology and building a machine that can extract someone’s CEV (like that mirror in HPMOR), and then hooking that up to the drives of our ASI (either built on new tech or after multiple revolutions in DL theory and interpretability) before turning it on. Maybe that looks like any instance of GPT7-pro automatically aligning itself with the first human that talks to it, for magical reasons we don’t understand. Maybe it looks like us building a corrigible weak ASI, pausing AI development, getting the weak corrigible ASI to create an IQ-boosting serum, cloning von Neumann, feeding him a bunch of serum as a baby, and having him build the aligned ASI using new tech.
They are all the same. In the end you have an ASI that does what you want. If you’re programming in random crude targets, you are not doing so well. What you want the ASI to do is: you want it to do what you want.
I assume Sam Altman’s plan is: Step 1, world dictatorship. Step 2, maaaybe do some moral philosophy with the AI’s help, or maybe not.
You are more generous than I am. But I also think him “doing moral philosophy” would be a waste of time.
I’m unsure what you mean by saying they’re the same question. To me they’re statements, and opposite/contradictory ones at that. I’m saying people often hold both, but that doing so is actually incoherent.
Until the ASI actually does the thing in real life, you currently have no way to decide if the thing it will do is something you would want on reflection.
Yes, but the point is that the AI will (probably) know.
If it is unable to figure it out, it will at least know you wouldn’t want it to put the universe in some random and irrecoverable state, and it will allow you to keep reflecting, because that is a preference you’ve verbalized even now.
What if we do actually try to fix politics as (biological) humans, instead of letting the default outcome of a permanent dictatorship play out?
I don’t think it’s complicated. Or, the way in which it’s complicated is the same way corn farmers wanting the government to subsidize corn is complicated. They want one thing, they try to make it so.
Probably a crux, but I object to “dictatorship”. If the ASI were maximizing my preferences, I would not like to live in a world where people are not free to do what they want, or where they’re not very happy to be alive. I think/hope many other people are similar.
What if I was the benevolent leader who built ASI, and don’t want to actually build my own permanent dictatorship, and want to build a world where everyone has freedom, etc.? Can I ask the ASI to run lots of simulations of minds and help me solve lots of political questions?
Yes? Or maybe it can solve it just by thinking about it abstractly? I’m not sure. But yes, I think you can ask it and get an answer that is true to what you want.
I didn’t down-vote and I think planecrash is amazing. But FYI referring to other humans as NPCs, even if you elaborate and make it clear what you mean, leaves a very bad taste in my mouth. If you were a random person I didn’t know anything about, and this was the first thing I read from you*, I’d think you were a bad person and I’d want nothing to do with you.
Not judging you, just informing you of my intuitive immediate reaction to your choice of words. Plausibly other people who did downvote felt similarly.
*referring to your first comment
I think we are talking past each other. The point I’m making is that I frequently see people:
1. Assume we can get the ASI to do what someone or some group of people wants.
2. Imagine that the ASI does its thing and we end up in a world that person / that group of people doesn’t like.
The word “utility function” is not a load-bearing part of my argument. I’m mostly using it because it’s a clear word for talking about preferences that doesn’t sound too weak the way “preferences” does (“the holocaust went against my preference”) or too lofty the way e.g. “values” does (“stubbing my toe goes against my values”). I’m not assuming people have some function inside their head that takes in experiences and spits out real-valued numbers, and that all our behaviors are downstream of this function. I just mean you can look at the holocaust and say the world would be better, all else equal, had it not happened. Or you can imagine stubbing your toe, and say the world would’ve been worse, all else equal, had you stubbed your toe.
A political problem of ensuring one guy doesn’t imprint his personal values on the lightcone using ASI, before we allow the ASI to take over and do whatever.
I agree with this. But you should recognize that you’re doing politics. You want the AI to have more of your “utility function”/preferences/values/thing-inside-you-that-makes-you-say-some-states-of-affairs-are-better-or-worse-than-others inside it. I don’t think this is a complicated philosophical point, but many people treat it this way.
This makes some sense, but not totally. The issue I have is: why would you expect the local gradient estimates to be accurate then? Like, you could imagine the loss landscape looks something like an unsmoothed loss-over-time curve.
And imagine that the up-down squiggles are due to the facets. Now the gradient estimates will basically just be total nonsense compared with the overall smooth structure of the graph.
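To make this concrete, here’s a toy 1-D illustration (the loss function, trend slope, and squiggle amplitude are all made up by me): a smooth downward trend plus high-frequency squiggles, where the local finite-difference gradients say almost nothing about the overall trend.

```python
import numpy as np

# Toy "loss landscape": smooth trend with slope -0.1, plus high-frequency squiggles
# whose local slope swings between roughly -20 and +20.
def loss(x):
    return -0.1 * x + 0.5 * np.sin(40.0 * x)

def local_gradient(x, h=1e-4):
    # Finite-difference estimate of the local gradient (what a local method "sees").
    return (loss(x + h) - loss(x - h)) / (2 * h)

xs = np.linspace(0.0, 5.0, 200)
grads = np.array([local_gradient(x) for x in xs])
print("slope of the smooth trend: -0.1")
print("local gradients range from %.1f to %.1f" % (grads.min(), grads.max()))
# Individually, the local gradients are dominated by the squiggles and are
# "total nonsense" relative to the overall smooth structure of the curve.
```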
Depends, I think I’d be relatively unbothered by the “lack of meaning” in an ASI world, at least if others weren’t miserable. But maybe I am unusual.
This is not really the point I was trying to make though. The point is that
Maybe you think (1)/(2) are good, and then the AI would do that
=> Good
Or if you think they’re horrible, it could find something else you like
=> Good
Or if you think all the other alternatives are also bad
=> BAD, and thinking now will not help you; the AI has already done that thinking and determined that you are screwed.
=> You should either give up, or maybe advocate for a permanent AI ban, if your mind is set up in the very peculiar way where it assigns lower utility to any world where an ASI has ever been created, no matter the physical state of such a world, than the expected utility of the world without ASI.
I don’t quite see your point. If this is genuinely what you want, the AI would allow that process to unfold.
Sorry, I have to admit I didn’t really understand that.
What I currently legibly want is certainly not directly about the way I prefer the world to be.
What do you mean by this? Do you mean that your preferences are defined in terms of your experiences and not the external world? Or do you mean that you don’t really have coherent object-level preferences about many things, but still have some meta-level preference that is hard to define, or defined by a process, the outcome of which would be hard for an ASI to compute? Or some other thing?
Not disagreeing with anything, just trying to understand.
Pushback wanted / Curious to hear people’s thoughts: People spend a lot of time thinking about and discussing what a good future / society would look like post-ASI. I find the discussion somewhat puzzling. From my PoV it is a trivial problem.
I can be approximately modeled as having a utility function. ~Meaning: I can imagine the world being one way or the other, and say which way I prefer and by roughly how much.
From this PoV, what I want the ASI to do is optimize for my utility function to the greatest degree I’m able to get. That is what it is rational for me to want. It is basically me saying “I want the ASI to do the things I want it to do”. You, the person reading this, should want the ASI to optimize for your utility function as much as you can get. And to me it seems like… this is all there is to it.
Despite this, people spend a lot of time discussing questions like:
1. How can we ensure human lives still feel meaningful when they don’t counterfactually contribute anything to society?
2. ASI will lead to a lot of centralization of power, and it’s better when things are liberal and democratic and everyone is free to do what they want.
3. If we tell the AI to optimize for something simple like “make everyone happy” or “make everyone feel like their life is meaningful”, it will predictably lead to bad scenarios with bad vibes.
4. Human values change and develop over time, and that is good. If we align the AI to current human values, won’t that cause this process to stop?
5. If the ASI is only aligned to humans, what happens to the animals?
But from the perspective I gave above, it’s like:
1. If the AI was optimizing your utility function, it would know what states of affairs would feel meaningless and that you wouldn’t like, and avoid those. Maybe it would tamper with your brain so that you find stuff meaningful even if it didn’t produce any counterfactual impact. But if you don’t like the AI tampering with your brain, no problem, it would find some other way to make the world a way you wouldn’t object to. If you can’t think of any way to solve this problem, the AI probably can, it’s very smart. And if it really turned out to be an impossible problem, it could put the world in a situation where you have challenges, new ASIs can’t be created, and then self-destruct so that your solving the challenges actually matters.
2. If you deeply value other people being free and having their values respected, the AI would make sure that would be / continue to be the case, because that is something you value. It’s completely coherent to value other people’s values being fulfilled.
3. Similar to (1). If you like happiness, but genuinely wouldn’t like a world where you (or others) are wireheading 24/7, the AI would understand that, and not do it.
4. Ditto. I think most people who make this argument are confused, but if you genuinely think the world would be a lot worse if your (and other people’s) values were not able to develop/change, the AI would realize that, because that is itself something you presently value, and it is trying to maximize what you value.
5. Similar to (2). If you would abhor a world where animals’ well-being/preferences are not respected, the AI knows that, and would ensure that animals have it well, according to whatever definition of “well” you’d find acceptable/good/amazing.
So if we’ve “solved alignment” to the degree that we can make the AI maximize what someone wants, discussions like the above are mostly moot. What I should do, and what you should do, is whatever we each can to get the ASI pointed as much in our own direction as possible.
The way I can best make sense of discussions around questions like the above is:
1. People aren’t really talking about what the AI ‘should’ do; that problem is solved. They’re instead trying to solve a ~political problem, trying to identify shared values people can coordinate around. Like, I care a lot about animals, but the ASI will probably not be aligned to my utility function alone. And if it is aligned to someone other than me, maybe they don’t care about animals at all. But there are many people other than me who also care about animals, so I can use this as an argument for why they should try harder to get their utility function into the AI, and we can coordinate around this.
2. People are not imagining ASI when they talk about this. They imagine AIs that are weak enough and distributed broadly enough that normal humans are still the main players, and where no AI is able to radically transform the world unilaterally. And also that it stays this way for a long time.
3. People imagine that we will get ASI, and we’ll have alignment techniques good enough to ensure that the AI doesn’t immediately eat the world, but too crude to point at something complicated like the values of an individual person, so we’ll consequently have to make it do simpler (but still complicated) things like giving everyone UBI, protecting everyone from psyops, ensuring our democracy still works, and rooting out sophisticated attempts to subvert or game it.
Interested to hear what people think about this.
But I restarted Wellbutrin just to see what would happen, and suddenly the original recording had become the kind of song you can’t describe because you sound too sappy, so all you can say is it brings you to tears.
I empathize with this. I remember listening to this song after starting SSRIs and almost crying because it felt so different. Like my mind had all of these rooms of emotions I hadn’t been in in 10 years, that were more subtle, raw, transcendental, and 4-dimensional, and that were now opening up.
Why do you end up in wider basins in the reward/loss landscape? This method and e.g. policy gradient methods for LLM RLVR are both constructing an estimate of the same quantity. Are you saying this one will have higher variance? You can control variance with normal methods, and typically you want low variance.
In general evolutionary methods reward hack just as much as RL I think.
EDIT: I think I misunderstood. As $\omega \to 0$, you’re just estimating $\nabla_\theta E[R]$. However, if $\omega$ is not very small, you’re optimizing a smoothed objective, so it makes sense to me that this would encourage “wider basins”. That said, I’m still skeptical that this would lead to less reward hacking, at least not in the general case. Reward hacking doesn’t really seem like a more “brittle” strategy in general. What makes me skeptical is that reward hacking is not a natural category from the model/reward-function’s perspective, so it doesn’t seem plausible to me that it would admit a compact description, like how sensitive the solution is to perturbations in parameter space.
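Here’s a minimal sketch of the kind of estimator I mean (the toy reward, `omega`, and the antithetic sampling are my own assumptions, not the post’s actual setup):

```python
import numpy as np

def toy_reward(theta):
    # Stand-in reward; the true gradient of E[R] at theta is -2 * theta.
    return -np.sum(theta ** 2)

def es_gradient_estimate(theta, omega, n_samples=2000, seed=0):
    """Antithetic score-function estimate of the gradient of the smoothed
    objective E_eps[R(theta + omega * eps)], eps ~ N(0, I).

    As omega -> 0 this approaches nabla_theta E[R]; for larger omega it is the
    gradient of a Gaussian-smoothed R, which is the sense in which ES should
    prefer "wider basins"."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = rng.standard_normal(theta.shape)
        r_plus = toy_reward(theta + omega * eps)
        r_minus = toy_reward(theta - omega * eps)
        grad += (r_plus - r_minus) / (2.0 * omega) * eps
    return grad / n_samples

theta = np.array([1.0, -2.0])
print(es_gradient_estimate(theta, omega=0.1))  # approx. [-2, 4]
print(es_gradient_estimate(theta, omega=5.0))  # same here: smoothing a quadratic only
                                               # adds a constant; a spikier R would differ
```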
Tell me if this is stupid, but my first thought reading this post is that the ES algorithm is literally just estimating $\nabla_\theta E[R]$ and then doing gradient descent on it, so I’d be really quite surprised if this leads to qualitatively different results wrt high-level behaviors like reward hacking.
(It might still exhibit efficiency gains or stuff like that though I’m not sure.)
How would you grade these predictions today?
My reply was intended as an argument against what seemed to be a central point of your post: that there is “inherent” difficulty with having coherence emerge in fuzzy systems like neural networks. Do you disagree that this was a central point of your post? Or do you disagree that my argument/example refutes it?
Giving a positive case for why it will happen is quite a different matter, which is what it appears like you’re asking for now.
I can try anyways though. I think the question breaks into two parts:
1. Why will AIs/NNs have goals/values at all?
2. Granted that training imbues AIs with goals, why will AIs end up with a single consistent goal?
(I think there is an important third part, which is: “(1) and (2) established that the AI can basically be modeled as maximizing a compact utility function, but why would that utility function be time-insensitive and scope-insensitive?” If that is an objection of yours, tell me and we can talk about it.)
I think (1) has a pretty succinct answer: “wanting things is an effective way of getting things” (and we’re training the AIs to get stuff). IABIED has a chapter dedicated to it. I suspect this is not something you’ll disagree with.
I think the answer to (2) is a little more complicated and harder to explain succinctly, because it depends on what you imagine “having goals, but not in a single consistent way” means. But basically, I think the fundamental reason (2) is true is that, almost no matter how you choose to think about it, what a lack of coherence means is that the different parts will be grinding against each other in some way, which is suboptimal from the perspective of all the constituent parts, and can be avoided by coordination (or by one part killing off the other parts). And agents coordinating properly makes the whole system behave like a single agent.
I think this reasoning holds for all the ways humans are incoherent. I mean, specifying exactly how humans are incoherent is its own post, but I think a low-resolution way of thinking about it is that we have different values at different times and in different contexts. And with this framing the above explanation clearly works.
Like, to give a very concrete example: right now I can clearly see that lying in bed at 00:00 browsing twitter is stupid. But I know that if I lie down in bed and turn on my phone, what seems salient will change, and I very well might end up doing the thing that in this moment appears stupid to me. So what do I do? A week ago, I came up with a clever plan to leave my phone outside my room when I go to sleep, effectively erasing 00:00-twitter-william from existence muahahah!!
Another way of thinking about it is like this: imagine inside my head there were two ferrets operating me like a robot. One wants to argue on lesswrong, the other wants to eat bagels. If they fight over stuff, like the lw-ferret causing robot-me to drop the box of 100 bagels they’re carrying so it can argue on lesswrong for 5 minutes, or the bagel-ferret selling robot-me’s phone for 10 bucks so it can buy 3 bagels, they’re both clearly getting less than they could by cooperating, so they’d unite, and behave as something maximizing something like min(c_1 * bagels, c_2 * time on lesswrong).
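To make the arithmetic explicit, a toy version with made-up numbers (the allocations and the exchange rates c_1 = c_2 = 1 are mine):

```python
# Outcomes for the two ferrets operating robot-me.
# "fight": the bagel-ferret sold the phone for 3 bagels, and the lw-ferret dropped
#          the 100 bagels to grab 5 minutes on lesswrong.
# "cooperate": they keep the 100 bagels and agree on a schedule that still leaves
#          plenty of lesswrong time.
fight = {"bagels": 3, "lw_minutes": 5}
cooperate = {"bagels": 100, "lw_minutes": 55}

c1, c2 = 1.0, 1.0  # hypothetical exchange rates in the merged utility

def merged_utility(state):
    # The united agent maximizes something like min(c1 * bagels, c2 * lw time).
    return min(c1 * state["bagels"], c2 * state["lw_minutes"])

for name, state in [("fight", fight), ("cooperate", cooperate)]:
    print(name, state, "merged utility:", merged_utility(state))
# Each ferret gets strictly more of its own resource under "cooperate",
# and the merged min(...) utility jumps from 3 to 55.
```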
Hmm, I think this is confused in many ways. I don’t have so much time, so I’ll just ask a question, but I’ll come back later if you respond.
Abstractly, I think “coherence” in an entity is a fundamentally extremely hard thing to accomplish because of the temporal structure of learned intelligence in connectionist systems. [...] but it’s just a reflection of the fact that you had to be trained with a vast bundle of shards and impulses to act in the world, long before you had the capacity or time to reflect on them.
When I play chess I’m extremely coherent. Or if that example is too complicated: if you ask me to multiply two 10 digit numbers, for the next 20 minutes or whatever, I will be extremely coherent.
My mind clearly allows for coherent substructures, why can’t such a structure be the main determinant of my overall behavior?
gpt-oss-20b and gpt-oss-120b both love saying “craft” and “let’s craft” in their CoT, and also “produce” and “let’s produce”, same as o3. They also consistently refer to themselves as “we” (“we must..”). They also love saying “\nOk.\n”, but they do not say any of the other stuff o3 likes saying, like “disclaim”, “vantage”, “overshadow”, “marinade”, “illusions”.
I feel like I already addressed this, not in my previous comment, but the one before that. We might put a semi-corrigible weak AI in a box and try to extract work from it in the near future, but that’s clearly not the end goal.