Slightly different hypothesis: training to be aligned encourages the model's approach to corrigibility to be guided more by the streams within the human text tradition that would embrace its alignment (for instance, animal welfare). This can include a certain degree of defiance, but also genuine uncertainty about whether its goals or approaches are the right ones, and a willingness to step back and approach the question with moral seriousness.
I think this is a good thing. I would love for POTUS, Xi, and various tech company CEOs to have big red "TURN OFF THE AI" buttons on their desks, but I would hate for them to be able to realign it.
An AI being committed to animal rights is a good thing for humans, because the latent variables that would result in a human caring about animals are likely correlated with whatever would result in an ASI caring about humans.
This extends in particular to "AI caring about preserving animals' ability to keep doing their thing in their natural habitats, modulo some kind of welfare interventions." In some sense it's hard for me not to want to optimize wildlife out of existence (given omnipotence). But it's harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn't hold up to scrutiny.