MinusGix
I somewhat agree, but I also do think “apply your Bayesian reasoning to figuring out what hypotheses to privilege” is how people decide which structural hypotheses (ontology) describe the world better. So I feel you’re taking an overly narrow view. Like, for scheming, you ask how these different notions inform what you can observe, the way the AI behaves, and methods to avoid it.
“Buddhism has been damaging to the epistemics of everyone in this sphere. Buddhism was only ever privileged as a hypothesis due to background SF/Bay-Area spiritualism rather than real merit.
Buddhist materials are explicitly selected for reshaping how you think within their frames. This makes it like joining a minor cult to learn their social skills. Some can extract the useful parts without buying in, but they are notably underrepresented in any discussion (some selection effects of course). The default assumption should be that you won’t, especially as the topic is treated without notable suspicion. Most other religions are massively safer to practice for a few years, though not without their risks, as they have more ritual rather than mental molding, and more argumentation for their Rightness. You’re already primed to notice flaws in arguments. Buddhism operates more directly on your mindset, framing, and probably even values as humans are not idealized agents where those are separate.
Meditation is useful, and probably doesn’t by itself lead to a lot of the central and surrounding Buddhist thought. However, just like joining a cult or playing a gacha game, you should be similarly skeptical of Buddhism, as they are all Out to Get You.
My less strongly held opinion is that Buddhism’s likely endpoints are incompatible with human values and often truth-seeking. This would matter less if it was treated with suspicion, just as we rightly view most religions with skepticism even while openly discussing them, but it is a gaping hole in our mental defenses.”
(I agree with Ryan Greenblatt that most basically decent posts wouldn’t end up with negative karma for very long though; but I’d expect this to be decently unpopular)
I doubt you need that at all; Claude Code CLI or Codex CLI and you’re most of the way there. Based on your other comment saying 3.1, I’m wondering whether you’re using Gemini rather than Claude/ChatGPT? Gemini 3.0 at least was notably behind both of them, and while Gemini 3.1 has improved, it still seems to struggle in comparison.
Extracting sections from books works pretty well in my experience. The main way they’ll ever choke on that is if they decide to read a 200-page PDF into context because they lack knowledge of their own limits at digesting that. Tell them to convert it to text if they don’t do that themselves?
What I mean is that you need a way to robustly point an AI at a point in the space of all values, which does have coherent structure; the hard problem is actually pointing at what you want in a way that extrapolates out of distribution as you would want it to. So, if you have the ability to robustly make the AI follow these virtues as we intend them to be followed, then you probably have enough alignment capability to point it at “value humanity as we would desire” (or “act as a consequentialist and maximize that, with reflection to ensure you aren’t doing bad things”). So then virtue ethics is just a less useful target.
Now, you can try far weaker methods of training a model, similar to Claude’s “Helpful, Harmless, Honest” sort of virtues. However, I don’t think that will be robust, and it hasn’t been for as long as people have tried making LLMs not say bad things. With reinforcement learning and further automated research, this problem becomes starker as ever more pressure makes our weak methods of instilling those virtues fall apart.
I don’t think we really know how to raise humans to be robustly virtuous. I view us as having a lot of the machinery inbuilt; Byrnes’ post on this topic is relevant. AI won’t have that, nor do I see a strong reason it will adopt values from the environment in just the right way.
However, I also don’t view most humans’ virtue ethics as robust in the sense that we desperately need AI values to be robust. See the examples I gave in my parent comment of the history of virtue ethics becoming an end in itself, leading to bad outcomes. This is partially because humans are not naturally modeled as having virtue ethics by default, but rather (imo) a mix of virtue ethics / deontology / consequentialism.
My view on this is that it runs into the same problems many alternative alignment targets have: If you can robustly train an AI to embody these virtues, then I suspect you thereby have (or are not far off from) the ability to train the AI to be a “good consequentialist” or even more simply “value humanity as we desire” rather than these loose proxies.
Credit hacking is still a problem here; virtue ethics does not sidestep Goodhart’s law or other forms of over-optimization. History has seen many virtues optimized until the “real target” is left barren: extreme asceticism, various forms of Hinduism, flagellants, abuse of humility, social-status “Character” over genuine goodness, ritualized propriety, courage → recklessness, and so on. More directly on your point, however: while it is somewhat true, I think you underrate how manipulable framing is for virtue ethics. Consequentialism actively discourages messing with your framing of an issue, for distorting your vision results in systematically less utility. Virtue ethics has a lot of room to reframe an issue: that actually, the opponent betrayed his word and thus is dishonorable, so aggression is now justice; the outgroup lacks your civilized virtues, so dominating them is really benevolence; opponents used dishonest means, thus undermining them preserves the integrity of the situation. These are avoidable, but I don’t think many “default” ways of implementing virtue ethics easily avoid them. (And some of these framings might even be correct; it is just that I am wary of designing an AI with an incentive to perform this sort of reasoning.)
As well, while I don’t think this is an inevitable feature of virtue ethics, virtue ethics does often result in it being virtuous to spread those virtues. While this can be good, even for a non-consequentialist less aggressive AGI/ASI, I don’t think giving it desires that result in it wanting to push others along its values is a good idea. The virtues, especially if we’re choosing ones that seem useful, are proxies of our values.
I disagree. I don’t see an increased focus on scheming; if anything it is notably less common, in part due to updating on current-gen LLMs. I do think there is a tendency to think about scheming as a discrete thing, but that tendency is more common among the optimistic, who point at current-gen LLMs not really being ‘schemers’.
I agree with the way Zvi talks about the topic. “Being a schemer” is not quite the right classification. The issue is that deception is a naturally convergent tool for all sorts of goals, anything that interfaces with reality intelligently will find that deception and manipulation are useful tools. So we’d naturally expect that RL and other fun methods will push towards that being a greater aspect- and that even if we don’t have any badly mislabeled data or reward-hackable environments, sufficiently general intelligence will be able to construct the methodology by itself.
So I kinda agree with your post, but I also feel that you’re then treating scheming/deception as less of a thing, when it is still a relevant categorization, just one that is hard to measure and to be confident about in how it grows as you scale.
Contrary, I liked this post, and the latter half the most. It serves as a relatively direct parable about different levels of ability and also about the major problems with common arguments against AGI/ASI, which I think people still very often miss making a point of. Spelling them out explicitly, without going into super-long detail as a full post, is good as it provides more concise argumentative handles. That is, people do not actually make the basic counterarguments enough.
(I also think those suggesting that this is already argued out enough should link to alternative posts. Posts with higher-quality and more concise argumentation, and posts made to be read by interlocutors.)
From my current stance, it is plausible, because we haven’t settled how we think of aliens (especially those who are significantly outside of our behaviors) philosophically. I most likely don’t respect arbitrary intelligent agents, as I’d be for getting rid of a vulnerable paperclipper if we found one on the far edges of the galaxy.
Then, I think you’re not mentally extrapolating how much that computronium would give. From our current perspective the logic makes sense: we upload the aliens regardless, even if we respect their preferences beyond that, because it lets us simulate vastly more aliens or other humans at the same time.
I expect we care about their preferences. However, those preferences will end up to some degree subordinate to our own, the clearest example being that we probably wouldn’t allow them an ASI (depending on how attack/defense works), but the other being that we may upload them regardless due to the sheer benefits.

Beyond that, I disagree about how common that motivation is. I think the kind of learning we know naturally results in it, limited social agents modeling each other in an iterated environment, is currently not on track to apply to AI… and another route is “just care strategically”, especially if you’re intelligent enough. I feel this is extrapolating a relatively modern human line of thought to arbitrary kinds of minds.
(Note: I’ve only read a few pages so far, so perhaps this is already in the background)
I agree that if the parent comment scenario holds then it is a case of the upload being improper.
However, I also disagree that most humans naturally generalize our values out of distribution. I think it is very easy for many humans to get sucked into attractors (ideologies that are simplifications of what they truly want; easy lies; the amount of effort ahead stalling out focus even if the gargantuan task would be worth it) that damage their ability to properly generalize and, importantly, to apply their values. That is, humans have predictable flaws. Then when you add in self-modification you open up whole new regimes.
My view is that a very important element of our values is that we do not necessarily endorse all of our behaviors!
I think a smart and self-aware human could sidestep and weaken these issues, but I do think they’re still hard problems. Which is why I’m a fan of (if we get uploads) going “Upload, figure out AI alignment, then have the AI think long and hard about it” as that further sidesteps problems of a human staring too long at the sun. That is, I think it is very hard for a human to directly implement something like CEV themselves, but that a designed mind doesn’t necessarily have the same issues.
As an example: power-seeking instinct. I don’t endorse seeking power in that way, especially if uploaded to try to solve alignment for Humanity in general, but given my status as an upload and lots of time realizing that I have a lot of influence over the world, I think it is plausible that instinct affects me more and more. I would try to plan around this but likely do so imperfectly.
A core element is that you expect acausal trade among far more intelligent agents, such as AGI or even ASI. As well that they’ll be using approximations.
Problem 1: There isn’t going to be much Darwinian selection pressure against a civilization that can rearrange stars and terraform planets. I’m of the opinion that it has mostly stopped mattering now, and will only matter less over time, as long as we don’t end up in an “everyone has an AI and competes in a race to the bottom” scenario. I don’t think it is that odd that an ASI could resist selection pressures. It operates on a faster time-scale and can apply more intelligent optimization than evolution can, towards the goal of keeping itself and whatever civilization it manages stable.
Problem 2: I find it somewhat plausible there’s some nicely sufficiently pinned down variables that can get us to a more objective measure. However, I don’t think it is needed and most presentations of this don’t go for an objective distribution.
So, to me, using a UTM that is informed by our own physics and reality is fine. This presumably results in more of a ‘trading nearby’ sense, the typical example being across branches, but in more generality. You have more information about how those nearby universes look anyway.

The downside here is that whatever true distribution there is, you’re not trading directly against it. But if it is too hard for an ASI in our universe to manage, then presumably many agents aren’t managing to acausally trade against the true distribution regardless.
I think you’re referring to their previous work? Or you might find it relevant if you didn’t run into it. https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly
If you were pessimistic about LLMs learning a general concept of good/bad, then yes, this should update you. However, I think it still has the main core problems. If you are doing a simple continual learning loop (LLM → output → retrain to accumulate knowledge; analogous to ICL) then we can ask how robust this process is. Do the values of how to behave drastically diverge? Such as: are there attractors over a hundred days of output that it is dragged towards that aren’t aligned at all? Can it be jail-broken, wittingly or not, by getting the model to produce garbage responses that it is then trained on? And then there are arguments like ‘does this hold up under reflection’ or ‘does it attach itself to the concept of good, or to chatgpt-influenced good (or evil)’. So while LLMs being capable of learning good is, well, good, there are still big targeting, resolution, and reflection issues.
For this post specifically, I believe it to be bad news. It provides evidence that subtle reward-hacking scenarios encourage the model to act misaligned in a more general manner. It is likely quite nontrivial to get rid of reward-hacking-like behavior in our larger and larger training runs. So if the model gets into a period of time where reward hacking is rewarded (a continual learning scenario is easiest to imagine, but it could happen in training too), then it may drastically change its behavior.
I have some of the same feeling, but internally I’ve mostly pinned it to two prongs of repetition and ~status.
ChatGPT’s writing is increasingly disliked by those who recognize it. The prose is poor in various ways, but I’ve certainly read worse and not been so off-put. Nor am I as off-put when I first use a new model, but then I increasingly notice its flaws over the next few weeks. The main aspect is that the generated prose is repetitive across writings, which ensures we can pick up on the pattern and makes the flaws easy to predict. Just as I avoid much generic power-fantasy fiction because it is very predictable in how it will fall short, even though much of it would still be positive value if I didn’t have other things to do with my time.
So, I think a substantial part is that of recognizing the style, there being flaws you’ve seen in many images in the past, and then regardless of whether this specific actual image is that problematic, the mind associates it with negative instances and also being overly predictable.
Status-wise, this is not entirely in a negative status-game sense. A generated image is a sign that it was probably not that much effort for the person making it, and the mind has learned to associate art with effort + status to a degree, even if it is the indirect effort + status of the original artist being referenced. And so it is easy to learn a negative feeling towards these, which attaches itself to the noticeable shared repetition/tone. Just like some people dislike pop partly due to status considerations, like it being made by celebrities, or countersignaling by not wanting to go for the most popular thing, and then that feeds into an actual dislike for that style of musical art.
But this activates too easily, a misfiring set of instincts, so I’ve deliberately tamped it down in myself; because I realized that there are plenty of images which, five years ago, would have simply impressed me and which I would have found visually appealing. I think this is an instinct that is to a degree tracking something real (generated images can be poorly made), while also feeding on itself in a way that disconnects it from past preferences. I don’t think that the poorly made images should notably influence my enjoyment of better-quality images, even if there is a shared noticeable core. So that’s my suggestion.
Anecdotally, I would perceive “Bowing out of this thread” as a more negative response because it encapsulates both the topic and the quality of my response or my own behavior, while “not worth getting into” is mostly about the worth of the object-level matter. (Though remarking on the behavior of the person you’re arguing with is a reasonable thing to do, I’m not sure that interpretation is what you intend.)
I disagree. Posts seem to have an outsized effect and will often be read a bunch before any solid criticisms appear. Then they are spread even given high-quality rebuttals… if those ever materialize.
I also think you’re referring to a group of people who typically write high-quality posts and handle criticism well, while others don’t handle criticism well. Despite liking many of his posts, Duncan is an example of this.

As for Said specifically, I’ve been annoyed at reading his argumentation a few times, but then also found him saying something obvious and insightful that no one else pointed out anywhere in the comments. Losing that is unfortunate. I don’t think there’s enough “this seems wrong or questionable, why do you believe this?”
Said is definitely rougher than I’d like, but I also do think there’s a hole there that people are hesitant to fill.

So I do agree with Wei that you’ll just get less criticism, especially since I do feel like LessWrong has been growing implicitly less favorable towards quality critiques and more favorable towards vibey critiques. That is, another dangerous attractor is the Twitter/X attractor, wherein arguments do exist but they matter less to the overall discourse than whether someone puts out something that directionally ‘sounds good’. I think this is much more likely than the sneer attractor or the LinkedIn attractor.
I also think that while the frontpage comments section has been good for surfacing critique, it substantially encourages “this sounds like the right vibe”. As well as a mentality of reading the comments before the post, encouraging a faction mentality.
Because Said is an important user who provides criticism/commentary across many years. This is not about some random new user, which is why there is a long post in the first place rather than him being silently banned.
Alicorn is raising a legitimate point: that it is easy to get complaints about a user who is critical of others, that we don’t have much information about the magnitude, and that it is far harder to get information about users who think his posts are useful.

LessWrong isn’t a democracy, but these are legitimate questions to ask because they are about what kind of culture (as Habryka talks about) LW is trying to create.
I find this surprising. The typical beliefs I’d expect are: 1) disbelief that models are conscious in the first place; 2) believing this is mostly signaling (and so whether or not model welfare is good, it is actually a negative update about the trustworthiness of the company); 3) that it is costly to do this, or indicates high-cost efforts in the future; 4) effectiveness.
I suspect you’re running into selection issues of who you talked to. I’d expect #1 to come up as the default reason, but possibly the people you talk to were taking precautionary principle seriously enough to avoid that.
The objections you see might come from #3. That they don’t view this as a one-off cheap piece of code, they view it as something Anthropic will hire people for (which they have), which “takes” money away from more worthwhile and sure bets. This is to some degree true, though I find those X odd as Anthropic isn’t going to spend on those groups anyway. However, for topics like furthering AI capabilities or AI safety then, well, I do think there is a cost there.
How did you arrive at this belief? Like, the thing that I would be concerned with is “How do I know that Russell’s teapot isn’t just beyond my current horizon?”
Empirical evidence of being more in tune with my own emotions, generally better introspection, and better modeling of why others make decisions, compared to others. I have no belief that I’m perfect at this, but I do think I’m generally good at it and that I’m not missing a ‘height’ component to my understanding.
Is it possible, do you think, that the way you’re doing analysis isn’t sufficient, and that if you were to be more careful and thorough, or otherwise did things differently, your experience would be different? If not, how do you rule this out, exactly? How do you explain others who are able to do this?
Because (I believe) the impulse to dismiss any sort of negativity or blame once you understand the causes deeply enough is one I’ve noticed in myself. I do not believe it to be a level of understanding that I’ve failed to reach; I’ve dismissed it because it seems an improper framing.
At times the reason for this comes from a specific grappling with determinism and choice that I disagree with.
For others, the originating cause is due to considering kindness as automatically linked with empathy, with that unconsciously shaping what people think is acceptable from empathy.
In your case, some of it is tying it purely to prediction, which I disagree with, because of some mix of kindness-being-the-focus, determinism, a feeling that once it has been explained in terms of the component parts there’s nothing left, and other factors that I don’t know because they haven’t been elucidated.

Empirical exploration as in your example can be explanatory. However, I have thought about motivation and the underlying reasons to a low granularity plenty of times (impulses that form into habits, social media optimizing for short-form behaviors, the heuristics humans come with which can make doing it now hard to weigh against the cost of doing it a week from now, how all of those constrain the mind...), which makes me skeptical. The idea of ‘shift the negativity elsewhere’ is not new, but given your existing examples it does not convince me that if I spent an hour with you on this we would get anywhere.
“because they’re bad/lazy/stupid”/”they shouldn’t have” or whatever you want to round it to, but these things are semantic stopsigns, not irreducible explanations.
This, for example, is a misunderstanding of my position or the level of analysis that I’m speaking of. Wherein I am not stopping there, as I mentally consider complex social cause and effect and still feel negative about the choices they’ve made.
Yet as you grieve, these things come up less and less frequently. Over time, you run out of errant predictions like “It’s gonna be fun to see Benny when—Oh fuck, no, that’s not happening”. Eventually, you can talk about their death like it’s just another thing that is, because it is.
Grief like this exists, but I don’t agree that it is purely predictive remembrance. There is grief which lasts for a time and then fades away, not because my lower-level beliefs are predicting I’ll see them. If I’m away from home and a pet dies, I’m still sad, not because of prediction error but because I want (but wants are not predictions) the pet to be alive and fine, and they aren’t. Because it is bad, to be concise.
You could try arguing that this is ‘a prediction that my mental model will say they are alive and well’, with two parts of myself in disagreement, but that seems very hard to assess for accuracy as an explanation, and I think it is starting to stretch the meaning of prediction error. Nor does the implication that ‘fully knowing the causes’ carves away negative emotion follow.
I’m holding the goal posts even further forward though. Friendly listening is one thing, but I’m talking about pointing out that they’re acting foolish and getting immediate laughter in recognition that you’re right. This is the level of ability that I’m pointing at. This is what’s there to aim for, which is enabled by sufficiently clear maps.
This is more about socialization ability, though having a clear map helps. I’ve done this before, with parents, and joking with a friend about his progress on a project, but I do not do so regularly, nor could I do it arbitrarily. Joking itself is only sometimes the right route; the more general capability is working a push into normal conversation, with joking being one tool in the toolbox there. I don’t really accept the implication ‘and thus you are mismodeling via negative emotions if you cannot do that consistently’. I can be mismodeling to the degree that I don’t know precisely what words will satisfy them, but that can be due to social abilities.
The big thing I was hoping you’d notice is that I was trying to make my claims so outrageous and specific that you’d respond “You can’t say this shit without providing receipts, man! So let’s see them!”. I was daring you to challenge me to provide evidence. I wonder if maybe you thought I was exaggerating, or otherwise rounding my claims down to something less absurd and falsifiable?
When you don’t provide much argumentation, I don’t go ‘huh, guess I need to prod them for argumentation’ I go ‘ah, unfortunate, I will try responding to the crunchy parts in the interests of good conversation, but will continue on’. That is, the onus is on you to provide reasons. I did remark that you were asserting without much backing.
I was taking you literally, and I’ve seen plenty of people fall back without engaging—I’ve definitely done it during the span of this discussion, and then interpreting your motivations through that. ‘I am playing a game to poke and prod at you’ is uh.....
Anyway, there are a few things in your comment that suggest you might not be having fun here. If that’s the case, I’m sorry about that. No need to continue if you don’t want, and no hard feelings either way.
A good chunk of it is the ~condescension. Repeated insistence while seeming to mostly just continue on the same line of thought without really engaging where I elaborate, the goalpost gotcha, and then the bit about Claude right after you said it was to ‘test’ me; it being to prod me is quite annoying in and of itself.
Of course, I think you have more positive intent behind that. Pushing me to test myself empirically, or pushing me to push back on you so you can then push back on me to provide empirical tests (?), or perhaps trying to use it as an empathy test for whether I understand you. I’m skeptical of you really understanding my position given your replies.

I feel like I’m doing better at engaging at the direct level, while you’re often doing ‘you would understand if you actually tried’, when I believe I have tried to a substantial degree, even if nothing precisely like ‘spend two hours mapping cause and effect of how a person came to these actions’.
The thing that I was missing then, and which you’re missing now, is that the bar for deep careful analysis is just a lot higher than you think (or most anyone thinks). It’s often reasonable to skimp out and leave it as “because they’re bad/lazy/stupid”/”they shouldn’t have” or whatever you want to round it to, but these things are semantic stopsigns, not irreducible explanations.
No, I believe I’m fully aware of the level of deep careful analysis, and I understand why it pushes some people to sweep all facets of negativity or blame away; I just think they’re confused, because their understanding of emotions/relations/causality hasn’t updated properly alongside their new understanding of determinism.
“I’m annoyed that the calculator doesn’t work… without batteries?” How do you finish the statement of annoyance?
Because I wanted the calculator to work, I think it is a good thing for calculators in stores to work, I am frustrated that the calculator didn’t work… none of this is exotic, nor is it purely prediction error. (nor do prediction error related emotions have to go away once you’ve explained the error… I still feel emotional pain when a pet dies even if I realize all the causes why; why would that not extend to other emotions related to prediction error?)
Empirically, what happens, is that you can keep going and keep going, until you can’t, and at that point there’s just no more negative around that spot because it’s been crowded out. It doesn’t matter if it’s annoyance, or sadness, or even severe physical pain. If you do your analysis well, the experience shifts, and loses its negativity.
You assert this, but I still don’t agree with it. I’ve thought long and hard about people before and the causes that make them do things, but no, this does not match my experience. I understand the impulse that encourages sweeping away negative emotions once you’ve found an explanation, like realizing that humanity’s lack of coordination is a big problem, but I can still very well feel negative emotions about that despite there being an explanation.
In other words, there are reasons for their choices. Do you understand why they chose the way they did?
Relatively often? Yes. I don’t blame people for not outputting the code for an aligned AGI because it is something that would have been absurdly hard to reinforce in yourself to become the kind of person to do that.
If someone has a disease that makes so they struggle to do much at all, I am going to judge them a hell of a lot less. Most humans have the “disease” that they can’t just smash out the code for an aligned AGI.
I can understand why someone is not investing more time studying, and I can even look at myself and relatively well pin down why, and why it is hard to get over that hump… I just don’t dismiss the negative feeling even though I understand why. They ‘could have’, because the process-that-makes-their-decisions is them and not some separate third-thing.
I fail to study when I should because of a combination of short-term-optimized positive-feeling seeking which leads me to watching youtube or skimming X, a desire for faster intellectual rewards that are more easily gotten from arguing on reddit (or lesswrong) than from slowly reading through a math paper, because I fear failure, and much more. Yet I still consider that bad; even if I got a full causal explanation, they would still have been my choices.
Regardless, I do not have issues getting along with someone even if I experience negative emotions about how they’ve failed to reach farther in the past—just like I can do so even if their behavior, appearance, and so on are displeasing. This will be easier if I do something vaguely like John’s move of ‘thinking of them like a cat’, but it is not necessary for me to be polite and friendly.
Notice the movement of goal posts here? I’m talking about successfully helping people, you’re saying you can “get along”. Getting along is easy. I’m sure you can offer what passes as empathy to the girl with the nail in her head, instead of fighting her like a belligerent dummy.
I don’t have issues with helping people; the “goalposts” moved forward again there, despite nothing in my sentence meaning I can’t help people. My usage of ‘get along’ was not the bare-minimum meaning.
Getting along with people in the nail scenario often means being friendly and listening to them. I can very well do that, and have done it many times before, while still thinking their individual choices are foolish.
I don’t think your comment has supplied much more beyond further assertions that I must surely not be thinking things through.
Yes. But also that people are still making those choices.
Yes. But I would point out that ‘punishment’ in the moral sense of ‘hurt those who do great wrongs’ still holds just fine in determinism, for the same reasons it originally did, though I personally am not much of a fan.
Yes, just like I can be happy in a situation where that doesn’t help me.
“if my brain was in their body, then I wouldn’t...” or “if I had their resources, then I wouldn’t...”, which is saying you’re only [80]% that person. You’re leaving out a part of them that made them who they are.
No, it is more that I am evaluating from multiple levels. There is
basic empathy: knowing their own standards and feeling them, understanding them.
‘idealized empathy’: Then I often have an extended sort of classical empathy where I am considering their higher goals, which is why I often mention ideals. People have dreams they fail to reach; I’d love them to reach further, and it disappoints me when they falter, because my empathy reaches towards those dreams too.
Values: Then of course my own values, which I guess could be considered the 80% that person, but I think I keep the levels separate; all the considerations have to come together in the end. I do have values about what they do, and how their mind succeeds.
Some commenters seemingly don’t consider the higher-ideals sort, or they think of most people in terms of short-term values; others ignore the lens of their own values.
So I think I’m doing multiple levels of emulation: by-my-values, in-the-moment, on-reflection, and so on. They all inform my emotions about the person.
I remember being 9 years old & being sad that my friend wasn’t going to heaven. I even thought “If I was born exactly like them, I would’ve made all the same choices & had the same experiences, and not believed in God”. I still think that if I’m 100% someone else, then I would end up exactly as they are.
And I agree. If I ‘became’ someone I was empathizing with entirely then I would make all their choices. However, I don’t consider that notably relevant! They took those actions, yes influenced by all there is in the world, but what else would influence them? They are not outside physics. Those choices were there, and all the factors that make up them as a person were what decided their actions.
If I came back to a factory the next day and noticed the steam engine had failed, I’d consider that negative even while knowing there must have been a long chain of cause and effect. I’ll try fixing the causes… which usually ends up routing through whatever human mind was meant to work on the steam engine, as we are very powerful reflective systems. For human minds themselves that make poor choices? That often routes back through themselves.
I do think that the hard-determinist stance often, though of course not always, comes from post-Christian-style thought which views the soul as atomically special: having dropped the soul, they still think of themselves as ‘needing to be’ outside physics in some important sense, rather than fully adapting their ontology. They treat choices made within determinism as equivalent to being tied up by ropes, when there is actually a distinction between the two scenarios.
Now, you could still argue #2, that these negative emotions set correct incentives. I’ve only heard second-hand of extreme situations where that worked [1]; most of the time it backfires.
A negative emotion can still push me to spend more effort on someone, though it usually needs to be paired with a belief that they could become better. Just because you have a negative emotion doesn’t mean you only output negative-emotion flavored content. I’ll generally be kind to people even if I think their choices are substantially flawed and that they could improve themselves.
I do think that the example of your teacher is one that can work; I’ve done it at least once, though not in person, and it helped, but it definitely isn’t my central route. This is effectively the ‘staging an intervention’ methodology: it can be effective, but it requires knowledge and benefits greatly from being able to push the person.
But, as John points out, a negative emotion may not be what people want, because I’m not going to have a strong kindness about how hard someone’s choices were… when I don’t respect those choices in the first place. However, giving them full positive empathy is not necessarily good either; it can feel nice but rarely fixes things. Which is why you focus on ‘fixing things’: advice, pointing out where they’ve faltered, and more, if you think they’ll be receptive. They often won’t be, because most people have a mix of embarrassment at these kinds of conversations and a push to ignore them.
Hm, my disagreement with this mental model is that I view current models as already helpful for research, and the further iterations on those models which AI companies will acquire over the next couple of years are going to substantially improve on that. Even if LLMs are AGI-complete, in that they can be “boosted” to AGI, it is likely that, given the ability to point a thousand automated researchers at foundational problems, they’ll… just find that alternate architecture if it exists. This is part of what fuels my shorter timelines; to me, they haven’t had to reach far at all yet. When you have that many GPUs to run copies of Claude/ChatGPT, you can throw some at a wide scattershot in the hope of an advantage in the race, or more optimistically an advantage in alignment.
As well, I’m uncertain whether LLMs need to be AGI-complete to still fulfill many investors’ hopes and dreams. Say OpenAI/Anthropic stalls out on datacenter investment due to lowered confidence, chokes, and perhaps sells off a bunch, but then hires N-thousand software engineers eager for a job to chomp up massive parts of the industry using Claude 5.9-super-duper, and becomes a giant à la Google/Apple/Microsoft regardless. That is, it’d be a “winter” in terms of far lower mania, but it wouldn’t really stop them from their dreams too harshly. (Though perhaps I’m underestimating how hard they’d falter; I know Dario said Anthropic was being cautious to avoid collapsing if they overestimate growth, and OpenAI was being less so? I don’t know what constraints they have that might lead to aggressive clawback or other treatment.)