Programmer.
MinusGix
4: The rest of that quote illuminates the point much more.
The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit—it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth.
That is, we can't just decide not to build AGI (for a very long period of time). Powerful actors refraining because they've realized the danger can delay it, but you can't necessarily stop it from being made. Eventually there will be someone who can make it in some small country with GPUs and whatnot. A standard treaty being agreed upon would drastically help, of course!
So I disagree with your understanding of 4.
As well, it has been a common belief that you don't need the current large amounts of GPUs to get useful intelligence, just that it is substantially harder.
There is a certain amount of cross-pollination that makes everything hang together, but not enough to make the “readily” in this statement true, and not enough to make the rhetorical point it’s trying to make in favor of X-risk concerns.
I disagree somewhat weakly with this; I also think that RL does help you on other problems, especially as you scale up and apply it to more areas. That is, just as "training over the entire internet" is what we did with early GPT models in pretraining, "training over as many varying problem sets as we can" is what we will do with RL. Even if we explicitly avoid training it on solving certain problems, those skills will generalize. Dario's interview with Dwarkesh talks specifically about this:
We had all these measures. We had all these measures of how well it did at predicting all these other kinds of texts. It was only when you trained over all the tasks on the internet — when you did a general internet scrape from something like Common Crawl or scraping links in Reddit, which is what we did for GPT-2 — that you started to get generalization. I think we’re seeing the same thing on RL. We’re starting first with simple RL tasks like training on math competitions, then moving to broader training that involves things like code. Now we’re moving to many other tasks. I think then we’re going to increasingly get generalization. So that kind of takes out the RL vs. pre-training side of it. [… some discussion …] I can’t speak for the emphasis of anyone else. I can only talk about how we think about it. The goal is not to teach the model every possible skill within RL, just as we don’t do that within pre-training. Within pre-training, we’re not trying to expose the model to every possible way that words could be put together. Rather, the model trains on a lot of things and then reaches generalization across pre-training. That was the transition from GPT-1 to GPT-2 that I saw up close. The model reaches a point. I had these moments where I was like, “Oh yeah, you just give the model a list of numbers — this is the cost of the house, this is the square feet of the house — and the model completes the pattern and does linear regression.” Not great, but it does it, and it’s never seen that exact thing before. So to the extent that we are building these RL environments, the goal is very similar to what was done five or ten years ago with pre-training. We’re trying to get a whole bunch of data, not because we want to cover a specific document or a specific skill, but because we want to generalize.
So, I think the statement is somewhat wrong, but it also needs the context of "the standard is that we have to train on lots of different tasks to start getting generalization, which we are going to deliberately do even if we might hope to avoid directly training on bad tasks".
Can we raise the ceiling of the systems we can safely train by red-teaming, building RL honeypots, performing weak-to-strong generalization experiments, hardening our current environments, and making interpretability probes?
I'd say the answer is "yes", but that doesn't stop Eliezer's objection! Getting a bit further doesn't stop you from dying when the AI realizes its values are better fulfilled elsewise and can maneuver to a safe position. That is, similar to trying to stop a very intelligent person from stealing your money, you often can't! Especially since you're giving them multiple tries, and thus playing their ability to generalize around your defenses against your ability to instill a deep sense of alignment-to-what-you-want!
You can install a wall around your home and they climb over it. You can make a secret chest of GPS-tracked fool's gold. This does help you gain value from the person for longer, especially as they learn not to try some tricks, but it doesn't resolve your fundamental issue: you're still making them more intelligent while trying cheap tricks. Aka Security Mindset.
Of course this is still valuable to research if you believe that alignment is easier along various axes, where you can win via just a bit of an edge with a quite smart AI system, but that’s not what Eliezer appears to believe nor what I believe.
11
Generally, I think this is effectively presuming we will have humans in the loop carefully helping the AI along. I mean, I hope so, but I'm doubtful given the tendencies of the AI companies, race dynamics, the sheer effectiveness that AI will have in comparison to their researchers, etc.
The whole issue still remains in that, well, you need to ensure that cognitive machinery generalizes! You also seem to be considering primarily relatively mundane advancements in science, when the point is "do you trust an AGI to design and implement a method to do a pivotal act". How you ensure that from many short training runs is still a big question.
Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes...
Like 10, 12 is a weakly true statement, that is, by sleight of hand, being used to serve a broader rhetorical point that is straightforwardly incorrect.
I think you misunderstand the point of 12, and other short statements like it; as I read them, they are meant to serve as "here's a basic point, spelled out explicitly, to make the foundations being considered here clear in everyone's mind".
For example, it’s true that it’s different & harder to align GPT-5.4 than GPT-3. But humanity doesn’t need the alignment techniques used on GPT-3 to work on GPT-5.4, we just need to handle the distributional shift between ~GPT-5.2 and GPT-5.4, then between 5.4 and 5.5, & accelerating from there.
Later, Eliezer will say that he expects many of these problems to manifest after a “sharp capabilities gain”. But we have not hit this yet, as of 2026, even though AI models are already being used very heavily as part of AI R&D. The precise, specific moment we expect to encounter this shift in distribution, is the thing that will determine how much useful work we can get out of models towards alignment, and is primarily what Eliezer’s interlocutors seem to disagree with him about.
I agree we just need to handle distributional shift for these current models, but I don't see that as applying to models as we grow them in capabilities and in incentive + ability to optimize against a target. If you're in a situation where you're having to rely on the opponent not making any drastic decisions before you have time to update your defenses after you give them another two intelligence points, you're going to open yourself up to substantial shifts.
However, we did have, for example, o1/o3, with "o3 is a lying liar" commentary from Zvi and others. They've improved on it, though Claude 4.6 was also anecdotally pretty sycophantic, and while Claude 4.7 is better, it does seem we swapped from "friendly happy sycophantic" to "colder argumentative" (Claude 4.7, ChatGPT 5.4). I don't think these are major signs, and they're fine as intermediate failures, but I do think they're worrying for the claim that we are handling these distributional shifts.
I also disagree that AI models are already being used very heavily in a relevant sense, because I do think we have not yet reached the "Geniuses in a datacenter" level and are only starting to get the ability to send them off on their own to try all sorts of ideas without them going in circles. To me it is no surprise we haven't hit major sharp turns from AI research, simply because it is still bottlenecked by humans in the loop and the model's own inability to evaluate itself for long periods. (To predict: I'm skeptical Mythos hits that tier yet either.)
This depends on whether you think there are drastically better model designs available, and whether you think they can be discovered before LLMs just continue scaling in mundane ways. Regardless, even if sticking with roughly LLMs is the right move (data is already preprocessed for them, AIs already know tons about them, etc.), I am skeptical there isn't room for substantial performance increases within roughly the current paradigm, even though I don't expect a "boost to ASI in a short period".
19 (a). More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward...
There’s something about this argument that irks me that is hard to articulate properly. It’s sort of the same thing that irks me when people say that models are “just” next token predictors and therefore aren’t intelligent; it seems not-even-wrong. I realize that it’s not completely analogous because eventually an ASI is going to amplify small differences in utility functions and tile the world at max score, and so these details might end up mattering. It’s still annoying because I can imagine the writer watching Claude Code work its way all of the way up to superintelligence and witnessing the Dyson Sphere get built from the moon colony and going “well how do you know it’s not really just optimizing its sensory data?”
We don't know how to consistently and deliberately point it at those things; the claim is not that it can't do so! I believe Eliezer has made this point before, though I don't have a link offhand.
30a:
I am more skeptical of Eliezer's point here than I am of others, but I do think you're ignoring that it is effectively arguing that a pivotal act is going to be complicated. That there are going to be lots of effects of how an AGI does things to ensure safety which you can't reasonably verify. A classic example is the hypothetical "nanotech to burn GPUs": even if it is in principle understandable, I think it is very plausible that it would take months to years for humans to understand the design deeply, and also to verify that it will only target GPUs and not, for example, the AGI's off-switch, or humans, etc. Then of course there's the question of whether the pivotal act necessitates social politicking, where actions are taken to produce a specific image through odd chains of cause and effect. I think these are all in principle understandable, but they may take quite a bit more time than you have. Which is why Eliezer says
An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn’t make the same guarantee about an unaligned human as smart as yourself and trying to fool you.
That is, if the system was strategically trying to maneuver around you in its pivotal act, you’d be screwed.
32: I interpreted this as a disclaimer against human-imitators, which was a more talked-about research route (or component of research routes) back then. Current LLMs are obviously not purely human-imitators anymore.
In my opinion a lot of objections are overindexing on current-day AI and then extrapolating it out, misunderstanding Eliezer, as well as simply believing the alignment problem is a lot easier from the get-go.
I don't disagree with all your objections to the post, but I do simply doubt the iterative deployment story quite strongly (it's a nice story that makes things feel cozier, but one I don't think we have notable reason to believe), and it seems to play a pretty central role here.
Hm, my disagreement with this mental model is that I view current models as already helpful on research, and the further iterations on those models which AI companies will acquire over the next couple of years are going to substantially improve on that. Even if LLMs are AGI-complete, in that they can be "boosted" to AGI, it is likely that, given the ability to point a thousand automated researchers at foundational problems, they'll… just find that alternate architecture if it exists. This is part of what fuels my shorter timelines; to me they haven't had to reach far at all yet. When you have that many GPUs to run copies of Claude/ChatGPT, you can throw some at a wide scattershot in the hope of an advantage in the race, or more optimistically an advantage in alignment.
As well, I'm uncertain whether LLMs need to be AGI-complete to still fulfill many investors' hopes and dreams. Like, if OpenAI/Anthropic stalls out on investment in datacenters due to lowered confidence, it chokes and perhaps sells off a bunch, but then hires N-thousand software engineers eager for a job to chomp up massive parts of the industry using Claude 5.9-super-duper, and becomes a giant à la Google/Apple/Microsoft regardless. That is, it'd be a "winter" in terms of far lower mania, but it wouldn't really stop them from their dreams too harshly. (Though perhaps I'm underestimating how hard they'd falter; I know Dario said Anthropic was being cautious to avoid collapsing if they overestimate growth, and OpenAI was being less so? I don't know what constraints they have that might lead to aggressive clawback or other treatment.)
I somewhat agree, but I also do think “apply your Bayesian reasoning to figuring out what hypotheses to privilege” is how people decide which structural hypotheses (ontology) describe the world better. So I feel you’re taking an overly narrow view. Like, for scheming, you ask how these different notions inform what you can observe, the way the AI behaves, and methods to avoid it.
“Buddhism has been damaging to the epistemics of everyone in this sphere. Buddhism was only ever privileged as a hypothesis due to background SF/Bay-Area spiritualism rather than real merit.
Buddhist materials are explicitly selected for reshaping how you think within their frames. This makes it like joining a minor cult to learn their social skills. Some can extract the useful parts without buying in, but they are notably underrepresented in any discussion (some selection effects of course). The default assumption should be that you won’t, especially as the topic is treated without notable suspicion. Most other religions are massively safer to practice for a few years, though not without their risks, as they have more ritual rather than mental molding, and more argumentation for their Rightness. You’re already primed to notice flaws in arguments. Buddhism operates more directly on your mindset, framing, and probably even values as humans are not idealized agents where those are separate.
Meditation is useful, and probably doesn't result in a lot of the central and surrounding Buddhist thought. However, just like joining a cult or playing a gacha game, you should be similarly skeptical of Buddhism, as they are all Out to Get You.
My less strongly held opinion is that Buddhism’s likely endpoints are incompatible with human values and often truth-seeking. This would matter less if it was treated with suspicion, just as we rightly view most religions with skepticism even while openly discussing them, but it is a gaping hole in our mental defenses.”
(I agree with Ryan Greenblatt that most basically decent posts wouldn’t end up with negative karma for very long though; but I’d expect this to be decently unpopular)
I doubt you need that at all; Claude Code CLI or Codex CLI and you're most of the way there. Based on your other comment saying 3.1, I'm wondering whether you're using Claude/ChatGPT, or rather Gemini? Gemini 3.0 at least was notably behind both of them, and while Gemini 3.1 has improved, it still seems to struggle in comparison.
Extracting sections from books works pretty well in my experience; the main way they'll ever choke on it is if they decide to read a 200-page PDF into context, because they lack knowledge of their own limits at digesting that. Tell them to convert it to text first if they don't do that themselves.
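(If it helps, here's a minimal sketch of the kind of conversion step I mean, assuming the pypdf library; the filename and page range are just placeholders:)

```python
# Dump a slice of a PDF to plain text so the model never has to ingest the
# whole 200-page file. "book.pdf" and the page range are placeholders.
from pypdf import PdfReader

reader = PdfReader("book.pdf")
section = []
for i, page in enumerate(reader.pages):
    if 10 <= i < 25:  # the section you actually care about
        section.append(page.extract_text() or "")

with open("section.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(section))
```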
What I mean is that you need a way to robustly point an AI at a point in the space of all values (which does have coherent structure), and it is a hard problem to actually point at what you want in a way that extrapolates out of distribution as you would want it to. So, if you have the ability to robustly make the AI follow these virtues as we intend them to be followed, then you probably have enough alignment capability to point it at "value humanity as we would desire" (or "act as a consequentialist and maximize that, with reflection to ensure you aren't doing bad things"). So then virtue ethics is just a less useful target.
Now, you can try far weaker methods of training a model, similar to Claude's "Helpful, Harmless, Honest" sort-of virtues. However, I don't think that will be robust, and it hasn't been for as long as people have tried making LLMs not say bad things. With reinforcement learning and further automated research, this problem becomes starker, as there's ever more pressure making our weak methods of instilling those virtues fall apart.
I don't think we really know how to raise humans to be robustly virtuous. I view us as having a lot of the machinery inbuilt; Byrnes' post on this topic is relevant. AI won't have that, nor do I see a strong reason it will adopt values from the environment in just the right way.
However, I also don't view a lot of humans' virtue ethics as robust in the sense that we desperately need AI values to be robust. See the examples I gave in my parent comment of virtue ethics historically becoming an end in itself, leading to bad outcomes. This is partially due simply to humans not being naturally modeled as having virtue ethics by default, but rather (imo) a mix of virtue ethics, deontology, and consequentialism.
My view on this is that it runs into the same problems many alternative alignment targets have: If you can robustly train an AI to embody these virtues, then I suspect you thereby have (or are not far off from) the ability to train the AI to be a “good consequentialist” or even more simply “value humanity as we desire” rather than these loose proxies.
Credit hacking is still a problem here; virtue ethics does not sidestep Goodhart's law or other forms of over-optimization. History has had many virtues optimized until the "real target" is left barren, as extreme ascetics, various forms of Hinduism, flagellants, abuse of humility, social-status "Character" over genuine goodness, ritualized propriety, courage → recklessness, and so on show us. More directly on your point, however: while it is somewhat true, I think you underrate how manipulable framing is for virtue ethics. Consequentialism actively discourages messing with your framing of an issue, for distorting your vision results in systematically less utility. Virtue ethics has a lot of room to reframe an issue: that actually, the opponent betrayed his word and thus is dishonorable, so aggression is now justice; the outgroup lacks your civilized virtues, so dominating them is really benevolence; opponents used dishonest means, thus undermining them preserves the integrity of the situation. These are avoidable, but I do not think many "default" ways of implementing virtue ethics easily avoid them. (And some of these framings might even be correct; I am just wary of designing an AI with an incentive to perform this sort of reasoning.)
As well, while I don’t think this is an inevitable feature of virtue ethics, virtue ethics does often result in it being virtuous to spread those virtues. While this can be good, even for a non-consequentialist less aggressive AGI/ASI, I don’t think giving it desires that result in it wanting to push others along its values is a good idea. The virtues, especially if we’re choosing ones that seem useful, are proxies of our values.
I disagree. I don't see increased focus on scheming; if anything it is notably less common, in part due to updating on current-gen LLMs. I do think there is a tendency to think about scheming as a discrete thing, but that tendency is more common among the optimistic, who point at current-gen LLMs not really being 'schemers'.
I agree with the way Zvi talks about the topic. "Being a schemer" is not quite the right classification. The issue is that deception is a naturally convergent tool for all sorts of goals; anything that interfaces with reality intelligently will find that deception and manipulation are useful tools. So we'd naturally expect that RL and other fun methods will push towards that being a greater aspect, and that even if we don't have any badly mislabeled data or reward-hackable environments, a sufficiently general intelligence will be able to construct the methodology by itself.
So I kinda agree with your post, but I also feel that you're then turning down scheming/deception as less of a thing, when it is still a relevant categorization, just one that is hard to measure and to be confident about in how it grows as you scale.
On the contrary, I liked this post, and the latter half the most. It serves as a relatively direct parable about different levels of ability, and also about the major problems with common arguments against AGI/ASI, which I think people still very often fail to make a point of. Spelling them out explicitly, without going into super-long detail as a full post, is good as it provides more concise argumentative handles. That is, people do not actually make the basic counterarguments enough.
(I also think those suggesting that this is already argued out enough should link to alternative posts: posts with higher-quality and more concise argumentation, and also posts written to be read by interlocutors.)
From my current stance, it is plausible, because we haven’t settled how we think of aliens (especially those who are significantly outside of our behaviors) philosophically. I most likely don’t respect arbitrary intelligent agents, as I’d be for getting rid of a vulnerable paperclipper if we found one on the far edges of the galaxy.
Then, I think you're not mentally extrapolating how much that computronium would give. From our current perspective the logic makes sense: we upload the aliens regardless, even if we respect their preferences beyond that, because it lets us simulate vastly more aliens or other humans at the same time.
I expect we care about their preferences. However, those preferences will end up to some degree subordinate to our own preferences: the clear, obvious one being that we probably wouldn't allow them an ASI, depending on how attack/defense works, but the other being that we may upload them regardless due to the sheer benefits.
Beyond that, I disagree about how common that motivation is. I think the kind of learning we know naturally results in it (limited social agents modeling each other in an iterated environment) is currently not on track to apply to AI… and that another route is to "just care strategically", especially if you're intelligent enough. I feel this is extrapolating a relatively modern human line of thought to arbitrary kinds of minds.
(Note: I’ve only read a few pages so far, so perhaps this is already in the background)
I agree that if the parent comment scenario holds then it is a case of the upload being improper.
However, I also disagree that most humans naturally generalize our values out of distribution. I think it is very easy for many humans to get sucked into attractors (ideologies that are simplifications of what they truly want; easy lies; the amount of effort ahead stalling out focus even if the gargantuan task would be worth it) that damage their ability to properly generalize and also importantly apply their values. That is, humans have predictable flaws. Then when you add in self-modification you open up whole new regimes.
My view is that a very important element of our values is that we do not necessarily endorse all of our behaviors!
I think a smart and self-aware human could sidestep and weaken these issues, but I do think they’re still hard problems. Which is why I’m a fan of (if we get uploads) going “Upload, figure out AI alignment, then have the AI think long and hard about it” as that further sidesteps problems of a human staring too long at the sun. That is, I think it is very hard for a human to directly implement something like CEV themselves, but that a designed mind doesn’t necessarily have the same issues.
As an example: power-seeking instinct. I don’t endorse seeking power in that way, especially if uploaded to try to solve alignment for Humanity in general, but given my status as an upload and lots of time realizing that I have a lot of influence over the world, I think it is plausible that instinct affects me more and more. I would try to plan around this but likely do so imperfectly.
A core element is that you expect acausal trade among far more intelligent agents, such as AGIs or even ASIs, and that they'll be using approximations.
Problem 1: There isn't going to be much Darwinian selection pressure against a civilization that can rearrange stars and terraform planets. I'm of the opinion that it has mostly stopped mattering now, and will only matter less over time, as long as we don't end up in an "everyone has an AI and competes in a race to the bottom" situation. I don't think it is that odd that an ASI could resist selection pressures. It operates on a faster time-scale and can apply more intelligent optimization than evolution can, towards the goal of keeping itself and whatever civilization it manages stable.
Problem 2: I find it somewhat plausible that there are some sufficiently nicely pinned-down variables that can get us to a more objective measure. However, I don't think it is needed, and most presentations of this don't go for an objective distribution.
So, to me, using a UTM that is informed by our own physics and reality is fine. This presumably results in more of a 'trading nearby' sense, the typical example being across branches, but in more generality. You have more information about how those nearby universes look anyway.
The downside here is that whatever true distribution there is, you're not trading directly against it. But if it is too hard for an ASI in our universe to manage, then presumably many agents aren't managing to acausally trade against the true distribution regardless.
I think you’re referring to their previous work? Or you might find it relevant if you didn’t run into it. https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly
If you were pessimistic about LLMs learning a general concept of good/bad, then yes, that should update you. However, I think it still has the main core problems. If you are doing a simple continual learning loop (LLM → output → retrain to accumulate knowledge; analogous to ICL), then we can ask how robust this process is. Do its values about how to behave drastically diverge? Are there attractors over a hundred days of output that it gets dragged towards which aren't aligned at all? Can it be jailbroken, wittingly or not, by getting the model to produce garbage responses that it is then trained on? And then there are questions like 'does this hold up under reflection?' or 'does it attach itself to the concept of good, or to ChatGPT-influenced good (or evil)?'. So while LLMs being capable of learning good is, well, good, there are still big targeting, resolution, and reflection issues.
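(To make the loop I'm gesturing at concrete, here's a toy sketch; ToyModel, filter_or_label, and finetune are hypothetical stand-ins, not any lab's actual pipeline:)

```python
# Toy sketch of a continual-learning loop: the model acts, its outputs are
# (imperfectly) curated, and it is retrained on them, over and over.
class ToyModel:
    def __init__(self, memory=None):
        # "memory" stands in for weights / accumulated training data.
        self.memory = list(memory or [])

    def generate(self, task):
        # A real model would produce text; here we just echo the task.
        return f"answer to {task!r} (seen {len(self.memory)} past examples)"

def filter_or_label(task, output):
    # Weak, imperfect curation: in reality this is where garbage or
    # adversarial outputs can slip into the training set.
    return (task, output)

def finetune(model, examples):
    # Stand-in for a gradient update: just accumulate the examples.
    return ToyModel(model.memory + list(examples))

def continual_learning_loop(model, tasks, days):
    for _ in range(days):
        for task in tasks:
            output = model.generate(task)
            example = filter_or_label(task, output)
            if example is not None:
                model = finetune(model, [example])
    # The question above: after many such iterations, has behavior drifted
    # toward attractors (sycophancy, reward hacking, ...) that the curation
    # step failed to catch?
    return model

model = continual_learning_loop(ToyModel(), ["task A", "task B"], days=100)
print(len(model.memory))  # 200 accumulated self-generated training examples
```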
For this post specifically, I believe it to be bad news. It provides evidence that subtle reward-hacking scenarios encourage the model to act misaligned in a more general manner. It is likely quite nontrivial to get rid of reward-hacking-like behavior in our larger and larger training runs. So if the model gets into a period where reward hacking is rewarded (a continual learning scenario is the easiest to imagine, but it could happen even in training), then it may drastically change its behavior.
I have some of the same feeling, but internally I’ve mostly pinned it to two prongs of repetition and ~status.
ChatGPT's writing is increasingly disliked by those who recognize it. The prose is poor in various ways, but I've certainly read worse and not been so put off. Nor am I as put off when I first use a new model, but then I increasingly notice its flaws over the next few weeks. The main aspect is that the generated prose is repetitive across writings, which ensures we can pick up on the pattern, such as by making it easy to predict flaws. Just as I avoid much generic power-fantasy fiction because it is very predictable in how it will fall short, even though much of it would still be positive value if I didn't have other things to do with my time.
So, I think a substantial part is recognizing the style, there being flaws you've seen in many images in the past, and then, regardless of whether this specific image is that problematic, the mind associating it with negative instances and with being overly predictable.
Status-wise, this is not entirely in a negative status-game sense. A generated image is a sign that it was probably not that much effort for the person making it, and the mind has learned to associate art with effort + status to a degree, even if it is indirect effort + status by the original artist the image is referencing. And so it is easy to learn a negative feeling towards these, which attaches itself to the noticeable shared repetition/tone. Just as some people dislike pop partly due to status considerations, like it being made by celebrities, or countersignaling by not wanting to go for the most popular thing, and then that feeds into an actual dislike for that style of music.
But this activates too easily, a misfiring set of instincts, so I've deliberately tamped it down in myself, because I realized that there are plenty of images which, five years ago, would have simply impressed me and which I would have found visually appealing. I think this is an instinct that is to a degree real (generated images can be poorly made), while also feeding on itself in a way that makes it disconnected from past preferences. I don't think that the poorly made images should notably influence my enjoyment of better-quality images, even if there is a shared noticeable core. So that's my suggestion.
Anecdotally, I would perceive "Bowing out of this thread" as a more negative response because it encapsulates both the topic and the quality of my response or my own behavior, while "not worth getting into" is mostly about the worth of the object-level matter. (Though remarking on the behavior of the person you're arguing with is a reasonable thing to do, I'm not sure that interpretation is what you intend.)
I disagree. Posts seem to have an outsized effect and will often be read a bunch before any solid criticisms appear. Then they are spread even given high-quality rebuttals… if those ever materialize.
I also think you're referring to a group of people who typically write high-quality posts and handle criticism well, while others don't handle criticism well. Despite liking many of his posts, Duncan is an example of this.
As for Said specifically, I've been annoyed at reading his argumentation a few times, but then also find him saying something obvious and insightful that no one else pointed out anywhere in the comments. Losing that is unfortunate. I don't think there's enough "this seems wrong or questionable, why do you believe this?"
Said is definitely rougher than I'd like, but I also do think there's a hole there that people are hesitant to fill.
So I do agree with Wei that you'll just get less criticism, especially since I do feel like LessWrong has been growing implicitly less favorable towards quality critiques and more favorable towards vibey critiques. That is, another dangerous attractor is the Twitter/X attractor, wherein arguments do exist but they matter to the overall discourse less than whether or not someone puts out something that directionally 'sounds good'. I think this is much more likely than the sneer attractor or the LinkedIn attractor.
I also think that while the frontpage comments section has been good for surfacing critique, it substantially encourages "this sounds like the right vibe" judgments, as well as a habit of reading the comments before the post, which encourages a faction mentality.
Because Said is an important user who has provided criticism/commentary across many years. This is not about some random new user, which is why there is a long post in the first place rather than him being silently banned.
Alicorn is raising a legitimate point: that it is easy to get complaints about a user who is critical of others, that we don't have much information about the magnitude, and that it is far harder to get information about users who think his posts are useful.
LessWrong isn't a democracy, but these are legitimate questions to ask because they are about what kind of culture (as Habryka talks about) LW is trying to create.
I find this surprising. The typical beliefs I'd expect are: 1) disbelief that models are conscious in the first place; 2) belief that this is mostly signaling (and so whether or not model welfare is good, it is actually a negative update about the trustworthiness of the company); 3) that it is costly to do this, or indicates high-cost efforts in the future; 4) effectiveness.
I suspect you're running into selection issues with who you talked to. I'd expect #1 to come up as the default reason, but possibly the people you talked to were taking the precautionary principle seriously enough to avoid that.
The objections you see might come from #3: they don't view this as a one-off cheap piece of code, they view it as something Anthropic will hire people for (which they have), which "takes" money away from more worthwhile and sure bets. This is to some degree true, though I find that odd, as Anthropic isn't going to spend on those groups anyway. However, for topics like furthering AI capabilities or AI safety, then, well, I do think there is a cost there.
How did you arrive at this belief? Like, the thing that I would be concerned with is “How do I know that Russel’s teapot isn’t just beyond my current horizon”?
Empirical evidence of being more in tune with my own emotions, having generally better introspection, and being better at modeling why others make decisions, compared to others. I have no belief that I'm perfect at this, but I do think I'm generally good at it and that I'm not missing a 'height' component to my understanding.
Is it possible, do you think, that the way you’re doing analysis isn’t sufficient, and that if you were to be more careful and thorough, or otherwise did things differently, your experience would be different? If not, how do you rule this out, exactly? How do you explain others who are able to do this?
Because (I believe) the impulse to dismiss any sort of negativity or blame once you understand the causes deeply enough is one I've noticed in myself. I do not believe it to be a level of understanding that I've failed to reach; I've dismissed it because it seems an improper framing.
At times the reason for this comes from a specific grappling with determinism and choice that I disagree with.
For others, the originating cause is due to considering kindness as automatically linked with empathy, with that unconsciously shaping what people think is acceptable from empathy.
In your case, some of it is tying it purely to prediction, which I disagree with, because of some mix of kindness-being-the-focus, determinism, a feeling that once it has been explained in terms of the component parts there's nothing left, and other factors that I don't know because they haven't been elucidated.
Empirical exploration as in your example can be explanatory. However, I have thought about motivation and the underlying reasons down to a low level of granularity plenty of times (impulses that form into habits, social media optimizing for short-form behaviors, the heuristics humans come with which can make doing it now hard to weigh against the cost of doing it a week from now, how all of those constrain the mind...), which makes me skeptical. The idea of 'shift the negativity elsewhere' is not new, but given your existing examples it does not convince me that if I spent an hour with you on this we would get anywhere.
“because they’re bad/lazy/stupid”/”they shouldn’t have” or whatever you want to round it to, but these things are semantic stopsigns, not irreducible explanations.
This, for example, is a misunderstanding of my position, or of the level of analysis that I'm speaking of. I am not stopping there: I mentally consider complex social cause and effect and still feel negative about the choices they've made.
Yet as you grieve, these things come up less and less frequently. Over time, you run out of errant predictions like “It’s gonna be fun to see Benny when—Oh fuck, no, that’s not happening”. Eventually, you can talk about their death like it’s just another thing that is, because it is.
Grief like this exists, but I don't agree that it is pure predictive remembrance. There is grief which lasts for a time and then fades away, not because my lower-level beliefs are predicting I'll see them. If I'm away from home and a pet dies, I'm still sad, not because of prediction error but because I want (and wants are not predictions) the pet to be alive and fine, and they aren't. Because it is bad, to be concise.
You could try arguing that this is 'prediction that my mental model will say they are alive and well', with two parts of myself in disagreement, but it seems very hard to determine the accuracy of that as an explanation, and I think it starts to stretch the meaning of prediction error. Nor does the implication that 'fully knowing the causes' carves away negative emotion follow.
I'm holding the goal posts even further forward though. Friendly listening is one thing, but I'm talking about pointing out that they're acting foolish and getting immediate laughter in recognition that you're right. This is the level of ability that I'm pointing at. This is what's there to aim for, which is enabled by sufficiently clear maps.
This is more about socialization ability, though having a clear map helps. I've done this before, with parents, and joking with a friend about his progress on a project, but I do not do so regularly, nor could I do it arbitrarily. Joking itself is only sometimes the right route; the more general capability is working a push into normal conversation, with joking being one tool in the toolbox there. I don't really accept the implication 'and thus you are mismodeling via negative emotions if you cannot do that consistently'. I can be mismodeling to the degree that I don't know precisely what words will satisfy them, but that can be due to social abilities.
The big thing I was hoping you’d notice, is that I was trying to make my claims so outrageous and specific so that you’d respond “You can’t say this shit without providing receipts, man! So lets see them!”. I was daring you to challenge me to provide evidence. I wonder if maybe you thought I was exaggerating, or otherwise rounding my claims down to something less absurd and falsifiable?
When you don't provide much argumentation, I don't go 'huh, guess I need to prod them for argumentation'; I go 'ah, unfortunate, I will try responding to the crunchy parts in the interests of good conversation, but will continue on'. That is, the onus is on you to provide reasons. I did remark that you were asserting without much backing.
I was taking you literally, and I've seen plenty of people fall back without engaging (I've definitely done it during the span of this discussion), and so I was interpreting your motivations through that. 'I am playing a game to poke and prod at you' is, uh.....
Anyway, there are a few things in your comment that suggest you might not be having fun here. If that’s the case, I’m sorry about that. No need to continue if you don’t want, and no hard feelings either way.
A good chunk of it is the ~condescension. Repeated insistence while seeming to mostly just continue on the same line of thought without really engaging where I elaborate, the goalpost gotcha, and then the bit about Claude when you had just got done saying that it was to 'test' me; it being meant to prod me is quite annoying in and of itself.
Of course, I think you have more positive intent behind that. Pushing me to test myself empirically, or pushing me to push back on you so that you can then push back on me to provide empirical tests (?), or perhaps trying to use it as an empathy test for whether I understand you. I'm skeptical of you really understanding my position given your replies.
I feel like I'm doing better at engaging at the direct level, while you're often doing 'you would understand if you actually tried', when I believe I have tried to a substantial degree, even if nothing precisely like 'spend two hours mapping cause and effect of how a person came to these actions'.
You need to make a distinction between the negative element of a person's CEV which scales with population (or something similar), and one which does not. Collapsing those confuses the question of scale.
For example, there are plausible CEVs wherein someone takes revenge against some particular people, or even a whole class of people who were their enemies, but which are otherwise nice. That, I could perhaps believe is 5-50%, though even then I doubt it is that large, whether 'they get killed' or 'they get tormented forever'. My view is that these worlds are very unfortunate, but they are still placed among the most non-destructive sorts of worlds and are pretty good comparatively.
Then, there are those who would instate biblical Hell, but I doubt that is anywhere near 50% even before they're given knowledge. After knowledge, I have much higher doubt. That is, most religious issues and specific political turns of fate dissolve under truth and reflection, and that's what most of your examples are.
So it becomes a question of what proportion of the population desires large-scale suffering on reflection, which to me is <5%. If this wasn't the case, the world would look very different.
I do agree that monsters are more likely to be in positions of power, which should influence our reticence to give them such power, but I feel we have dramatically different mental models of the degree of selection pressure.
I also do agree that some people may have substantially different fundamentals that cut off a lot of value, like possibly internally-coherent Buddhist philosophies which don’t rest on factual observations of the world, but that those are also quite rare. That is, most belief systems have some meaningful referent to facts about the world, or facts about people and thus turn dramatically given proper knowledge.
My view on your first hypothetical value lock-in example is that it is presupposing we messed up implementing CEV. So I don’t really consider that relevant.
If you get knowledge and control, then you can consider methods to lock in more safely. So perhaps we do get some less-valuable futures due to lock-in stopping truly-better routes from being instantiated, but I expect a CEV-enhanced individual to be able to consider much better methodologies than a naive "ensure I believe X without referring to the nature of the world at all" (à la your Hitler example, or the religious person enforcing their belief in God). My view is basically that you're mostly considering "What if an unenhanced human got the power", rather than "enhanced human" or even "unenhanced human with an AI they can ask for help from".