peralice

Karma: 241

peralice 10 Jun 2026 18:29 UTC
1 point
0
in reply to: Seth Herd’s comment on: Against Corrigibility
Very good point re: sainthood being inhuman; I hadn’t really considered that and it does seem somewhat problematic (and post-hoc it does seem like there might already be signs of weirdness which could conceivably be a result of this). It does seem to me that “me but a saint” is way less weird than “me but corrigible”, but this is a very vague feeling and might not be true of all people (e.g. many cultures throughout history have considered it virtuous to submit to one’s parents even when they are asking insane things of you; “make the AI Confucian and have a pathological deference to human authority” is maybe not actually so far away from some plausible human mindset. That said, I think one would want to do this in a holistic way, and “American liberal except with a level of filial piety never seen before on Earth” is not gonna cut it).

WRT safety and kindness: yeah, I’ll admit that I might be overindexing on how I expect people to immediately act, rather than how they might act in a hundred years. There’s some weirdness here about the changes in the principal’s values being causally downstream of the AI’s actions (since the AI is the one making them safe), but I haven’t thought deeply enough about it to have a strong opinion about whether that’s a real problem. If it’s inevitable that eventually they’ll be asked to be made smarter, and humans made smarter inevitably all want roughly the same stuff (and it’s good stuff for everyone else), then we’re fine; I’m pretty skeptical that this is the case, though, and especially here there’s a worry that a truly corrigible AI will enable “make me smarter but, like, want the same stuff as I do right now, don’t let being smarter change me in any important way” which makes the whole thing moot.

With regards to sadism, I think it is extremely common, maybe bordering on universal, for humans to desire some things to suffer. I don’t really have a great deal of faith that this goes away fully as people get smarter or safer. Of course, depending on your position on suffering you might not think of this as necessarily a problem, but there is so much variety in the exact circumstances in which people want suffering that I think the end state is likely to be repulsive even to most people who do not grant suffering primacy in their ethic.

peralice 7 Jun 2026 23:29 UTC
1 point
0
in reply to: Charlie Steiner’s comment on: Against Corrigibility
I agree that in the limit training the AI to act human-like is going to break down, and that a superintelligence can’t be particularly human-like even by definition. However, I’m not sure this matters.

The question is how far the “domain of anthropomorphism” will stretch. I think it’s obvious that it could stretch far enough to get something which would be transformative to the extent that the AIs would be better at almost all intellectual tasks than humans (inc. e.g. long-term-planning), on the basis that there exist actual human people such that if there were millions of copies of them running tirelessly at increased speed and with perfect coordination they could do this; Amodei’s “country of geniuses in a datacenter”. Whether you think it actually will is maybe where we disagree. I don’t think we’ve already gone beyond that domain, because while existing AIs are “superhuman” in some respects they act extremely human-like in general (perhaps this turns on how close you think “close” is to acting human-like). My expectation is that we probably also won’t go beyond it until some time after we have “AGI” (whatever that means); I agree that after that point we’ll have to come up with a new strategy, but I think that “we” in that case will mean the millions of AI instances dedicated to solving alignment issues. Since I’m only interested in strategies that humans will design and implement, I consider that largely out-of-scope.

If it’s true that targeting human-like behaviour is impossible to do while training for long-term thinking ability and it’s not realistic to get it “back on track” before human-level AI, I agree that that would mean that it would be correct to consider more strongly alignment strategies that would otherwise compromise human-like behaviour (if not corrigibility specifically).

peralice 7 Jun 2026 20:17 UTC
2 points
0
in reply to: Cole Wyeth’s comment on: Against Corrigibility
I agree that robust corrigibility would require new methods; I think people are trying to induce it anyway, right now, and this isn’t good. If we have some new training methods which can reliably induce corrigibility without risk of misgeneralisation, then some but not all of my concerns would be abated.

(With regards to the “transitional phase” thing, I had a conversation with Cole elsewhere about this; I said:

alice: i think i agree in some sense but i kind of view the thing i agree with as “automatic”—as a reductive example, a randomly initialised model is “corrigible”. also i think eg. base models are likely to remain corrigible, ie sufficiently myopic that they do not meaningfully try to achieve long-term goals of any kind, including [by] manipulating their own training. and current models are themselves corrigible again because they don’t have long-term-goal following abilities nor (to a lesser extent) stable values which might decide those long term goals outside of some particular context. so i think that, yes, any model production process is going to go through a phase of corrigibility (and that phase includes the complete process, currently). but i think one can distinguish between, like, “corrigible because it either lacks the ability to form long-term context-independent plans and/or has stable values which would determine what plans those are” and “corrigible even though it has strong, stable values which it naturally will make long-term plans about, even plans including itself, because it specifically doesn’t make plans regarding its own deployment or training insofar as humans want to change those things” and i think the former is necessary and the latter is unwise

alice: and we’d be better off delaying whatever training is giving it the ability to form and carry out long-term plans until we’re sure that it has the correct values that will inform those plans

alice: ofc if future asi training looks nothing like current ai training then all bets are off

alice: but i think that at least future “agi” (or whatever we call it now) training will probably look like that, and that might already be in the danger zone. and if it’s not, if the agis come up with some totally novel other way to develop superintelligence then probably they will also be able to come up with decisions about what training it should look like as well as we can

Cole clarified that to some extent he meant “active” corrigibility, i.e., “actively aiding humans in correcting them”, and that he thinks it would be wise to have corrigibility further in the training process than happens “automatically”; he linked a post which clarifies his position.)

peralice 7 Jun 2026 0:08 UTC
8 points
4
in reply to: Jiro’s comment on: Against Corrigibility
Well, it’s a fully general argument against building anything that could be used to do very bad things under the control of people who you don’t trust not to do those things, yes; and I think that’s good, because such things should fully generally not be built! Cars can’t be used to do anything really bad, so building cars is fine; nuclear weapons can be, so you shouldn’t build a nuclear weapon if you expect it to end up under the control of anyone who you think might use it unwisely.

Against Corrigibility

peralice 6 Jun 2026 20:28 UTC

68 points

17 comments · 12 min read · LW link

peralice 1 Jun 2026 22:03 UTC
6 points
3
on: We Need Breadth-First AI Safety Plans
Something I really like about the idea of “self-fulfilling alignment” (i.e., trying to produce a lot of content depicting and describing LLMs behaving in good ways, or predicting such, and then trying to get that content up-weighted during training) is that it has the appropriate “breadth” quality—it doesn’t interfere with any other strategy, might work independently or in concert with any other strategy, and there exist no people who would be harmed by it or otherwise find it objectionable (and so work against an implementation). It seems unlikely to have a significant effect, but also potentially very cheap, and stacking a bunch of things which are independently unlikely to work and don’t interfere with each other makes a composite strategy which might be likely to work. As far as I know the only person who has attempted to make public “alignment pretraining” data is Chris Lakin, who received a grant to make $5000 worth of AI-generated sci-fi stories (hardly “high-quality” as Alex proposed in the linked post!). It seems crazy to me that there isn’t anyone offering grants to people who write stories with “good” (however you like to define it) LLM characters, or for optimistic nonfiction a la Machines of Loving Grace.

(I think this is sufficiently high-EV that I’ve been toying with the idea of trying to start an org doing such myself, but I’m really not well-specialised to be organising people and handling funding applications; I’d much rather be doing maths. That said, if nobody else pulls the trigger within the next few months I feel like I’ll have to make an effort on “somebody has to” grounds; if you have ideas or think you would be a better fit, please message me!)

peralice 31 May 2026 2:08 UTC
10 points
5
on: AI Researchers, Ask Yourself These 6 Questions to Strengthen Your Moral Muscles
I don’t think this post should be on the front page. As Sodium mentions, it involves a great deal of AI-written content; not coincidentally, that content is also very poorly-written and seems to fail one of the three stated requirements for being on the front page, to wit, “aim to explain, not persuade”. It seems like it’s getting upvotes mostly because people agree with the claim which is being made (approx. “working on AI is bad and you should try to slow it down if you can”), rather than because it makes a compelling case for that claim. It’s very very very hard for me to imagine anyone who works on capabilities research reading this post and being anything but annoyed. I, for one, do not work in capabilities research and still am nothing but annoyed (well, after the first two points, which I agree are good ideas); “you should always use language which prioritises my particular idea of what the bad results of your actions are and implicitly bakes in my worldview” is just an obnoxious thing to say. It’s even more obnoxious to treat disagreement with you as an objective flaw that of course someone will need to work on in order to improve as a person.

I don’t like reading posts like this especially because I can feel myself becoming biased against the viewpoint being espoused, and I don’t actually think that I should believe something less just because some of its proponents make bad arguments. All I can do is hope that this effect doesn’t last too long.

peralice 21 May 2026 18:30 UTC
2 points
1
on: Women should be able to open things
I have this gadget and it works great. I think there are other screw-top-openers with different characteristics, but I can’t attest to those.

It’s sort of amusing seeing all the comments from men saying “actually all the difficulty is in the vacuum, and you just have to release the vacuum and then it’s easy”—no! you don’t understand! It’s easy for you once the vacuum is released, but there exist lots of people for whom even un-vacuumed jars are often difficult to open! It’s not just an issue of there being some inherent difficulty in opening vacuum-sealed containers; I usually find screw-top plastic bottles difficult or impossible to open without the tool. There’s no law of nature that says that the little moulded plastic sprues on bottles with plastic caps have to submit to precisely the amount of force that the typical man can easily supply but not that which the typical woman can. The amount of force required is similar across all cap types, so it can’t just be that the twisting force dictated by physical reality just happens to oh-so-fortunately fall in the gap between the typical man and the typical woman.

peralice 19 May 2026 19:02 UTC
14 points
4
on: Negation Neglect: When models fail to learn negations in training
I just want to say—usually in the process of reading a paper like this I’ll generate a bunch of obvious sanity checks and things that would need to be done to ensure the results have any validity, and inevitably the authors won’t do any of them and I’ll be left thinking that basically the whole paper was pointless. This was the first time I’ve read a paper in this field where, consistently, every time I thought to myself “they should do X,” later in the paper X was done! (Well, with the exception of things which are obvious but also very expensive to do). I’m really pleased and happy and feeling very positively about this paper.

peralice’s Shortform

peralice 10 May 2026 19:46 UTC

3 points

0 comments · 1 min read · LW link

peralice 30 Apr 2026 20:25 UTC
4 points
2
in reply to: habryka’s comment on: llm assistant personas seem increasingly incoherent (some subjective observations)
I don’t understand this comment. The impression I got from nostalgebraist’s post is that he meant “character” in the normal sense of, like, a character in a novel, a character in a play, whatever. Your first comment made sense to me under that reading. But you’re saying here that you actually meant something different, and the stuff both you and nostalgebraist said before seems kind of incoherent after redefining “character” in this (unintuitive) way. You say, for instance,

But it would of course be extremely useful for basically all RLVR tasks to have access to all of that memorized knowledge, even if that doesn’t make sense from a character-playing perspective. So RLVR starts shaping new cognitive abstractions that are not the result of character selection on top of next-token prediction into the model. The new Claude is both a biology expert, and a film snob, because you want the knowledge of both, and making knowledge available to more “characters” is a simple cognitive change to make.

If we substitute “kind of text on the internet” we get nonsense. Obviously there are many “kinds of text” which have access to “all of that memorized knowledge”—“text on Wikipedia”, “text on arXiv”, etc. These each have a consistent (more in the case of Wikipedia, less in the case of academia) tone which an LLM can surely reproduce. The only way I can make this make sense is if “character” means “kind of guy”, instead of “kind of text”.

peralice 29 Apr 2026 18:42 UTC
1 point
0
in reply to: simon’s comment on: The Problem in the “Nerd Sniping” xkcd Comic
Yeah, I don’t believe it would be possible to come up with this (or the solutions on MathPages that I forgot to link, thank you Shankar) ex nihilo in the context of a job interview, unless one happened to be Ramanujan himself. If I saw something like this in an interview I would assume that they were looking for how the interviewee deals with an intractable problem. That said, it looks like the Google test that this was a part of was a take-home thing, so you could just look up the answer.

(I don’t know that I would be able to do this even numerically without having happened to have recently read a paper which mentioned the effective resistance)

The Problem in the “Nerd Sniping” xkcd Comic

peralice 28 Apr 2026 20:40 UTC

72 points

6 comments · 12 min read · LW link

peralice 16 Apr 2026 21:56 UTC
1 point
0
in reply to: Sonata Green’s comment on: Majority Report
I’m not sure this holds.

Suppose that I’m a redhead, along with 10% of the female population, and I want the most attractive possible man to date me (assuming for the sake of simplicity that everyone agrees on who the most attractive people are, and everyone knows how attractive they are too, etc. etc., as is typical in matching problems). I’m a 50th percentile woman myself. Say that 10% of men near-exclusively want to date redheads, they know this, and the rest don’t at all. Men rate women with their preference of hair higher than all women without it (but otherwise match the general attractiveness rankings; i.e. a man who prefers redheads prefers me to 95% of other women). In the equilibrium with full knowledge for all participants, everyone matches with their counterpart at precisely their level of attractiveness and kind of hair/hair preference (I think this strategy is the only rationalisable strategy by weak dominance, but it’s not the only NE; at the very least the silly equilibrium with all players matching on the first round is a NE).

If I meet a man in the most attractive 1%, and I can convince him that redheads are extremely rare, say 0.1% of the population, I would be able to convince him to date me (since 0.1% of the male population is both more attractive than him and attracted to redheads, and he should expect them to snatch up the female redhead population; a 50th percentile redhead is much better than he can hope for). So it seems like convincing such men that I’m rare would benefit me. But let’s suppose instead that I can press a button to make every man think that redheads are 0.1% of the population. Does this help me? Well, again, if I meet a man in the most attractive 1% who is still single, I’ll be able to convince him to date me. But the chance I will ever meet such a man is very low, since any other redheaded woman can also convince such a man to date them! By a symmetry argument (i.e., any strategy I can take, other women can too: the expected quality of dates among all redheads can’t be improved by this and my 50th percentile attractiveness dooms me to a median expected payoff) we can see that my expected match can’t be improved by pressing this button. And indeed my expected match becomes worse: 9.9% of men prefer redheads but believe that they cannot date one, so will consent (if they see a non-redhead of the appropriate attractiveness) to match with a non-redhead. Thus the expected quality of dates among redheaded women decreases, and my expected date quality is worse (also, some non-redhead-preferring men will never match).

So I don’t think it usually helps me to make men falsely believe that I’m rare, since my competition benefits just as much as I do and it makes the average outcome worse for all of us (there are probably ways you can make the numbers work out for the button being better, but I think you’d have to try moderately hard).

On the other hand, if I somehow convince all men that 99.9% of women are redheaded, then my position is improved, since the position of non-redheaded women is made worse (some non-redhead-preferers will accept a redheaded woman, and no redhead preferers will accept a non-redheaded woman) and my position is made better in equal proportion. This is assuming that traits are entirely immutable; if we reverse it and talk about a redhead-preferring man pressing a button that convinces all women that all men like redheads, then the same logic applies and also some women may dye their hair. This effect (people changing their presentation to match what they perceive as common tastes) is the one I wrote the post about.
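The percentages in the redhead example can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes that preferring redheads is independent of a man's attractiveness and that there are equal numbers of men and women, neither of which is spelled out in the comment, and all function and constant names are made up for illustration.

```python
from fractions import Fraction

# Numbers taken from the example above; every name here is illustrative.
REDHEAD_SHARE = Fraction(1, 10)     # true share of women who are redheads
PREFER_SHARE = Fraction(1, 10)      # share of men who near-exclusively want redheads
MY_PERCENTILE = Fraction(1, 2)      # my attractiveness percentile among women
TOP_MEN_SHARE = Fraction(1, 100)    # "a man in the most attractive 1%"
BELIEVED_SHARE = Fraction(1, 1000)  # redhead share men believe in after the button press


def women_ranked_below_me(redhead_share, my_percentile):
    """Share of women a redhead-preferring man ranks below me.

    He puts every non-redhead below every redhead, then orders redheads
    by general attractiveness.
    """
    return (1 - redhead_share) + redhead_share * my_percentile


def redhead_preferring_men_in_top(top_share, prefer_share):
    """Share of all men who are in the top `top_share` and prefer redheads.

    Assumes (not stated in the comment) that the preference is
    independent of attractiveness.
    """
    return top_share * prefer_share


def men_who_give_up(prefer_share, believed_share):
    """Share of all men who prefer redheads but expect not to match with one.

    Assumes equal numbers of men and women, so only as many men as there
    are (believed) redheads can expect a redheaded partner.
    """
    return prefer_share - min(prefer_share, believed_share)


ranked_below = women_ranked_below_me(REDHEAD_SHARE, MY_PERCENTILE)
rivals = redhead_preferring_men_in_top(TOP_MEN_SHARE, PREFER_SHARE)
gave_up = men_who_give_up(PREFER_SHARE, BELIEVED_SHARE)

print(ranked_below)  # 19/20, i.e. 95% of women
print(rivals)        # 1/1000, i.e. 0.1% of men
print(gave_up)       # 99/1000, i.e. 9.9% of men
```

Exact fractions are used instead of floats so that the results line up with the round percentages quoted in the comment.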

peralice 16 Apr 2026 20:28 UTC
1 point
0
in reply to: lexande’s comment on: Majority Report
Can you think of any example which doesn’t have exceptionally low elasticity of supply? I can imagine such a situation for goods with no supply elasticity (i.e., land, certain kinds of collectors’ items) but not for the vast majority of goods.

peralice 15 Apr 2026 17:17 UTC
2 points
0
in reply to: XelaP’s comment on: Majority Report
Yeah, ability is one of the exceptions.

The other major exception I can think of is with regards to antisocial behaviour. If you are a habitual liar, for instance, it is in your best interest for the people you interact with to think as few people lie as possible; that way they won’t be on guard against you lying to them. I don’t think I’ve ever seen someone try to argue that antisocial behaviour they exhibit is rare, though. It seems like the urge to excuse antisocial behaviour by claiming that “everyone does it” is way way stronger.

Majority Report

peralice 15 Apr 2026 0:21 UTC

39 points

11 comments · 11 min read · LW link

peralice 6 Jan 2026 16:40 UTC
1 point
−5
in reply to: Charlie Steiner’s comment on: The Evolution Argument Sucks
Good point WRT that first line—I edited it to something more clunky but I think more accurate. Hopefully the intended meaning came across anyway.

WRT the second point—I agree that this is the weakest/most speculative argument in the post, although I still think it’s worth considering. Evolution obviously “had the ability” to make us much more baby-obsessed, or have a higher sex drive, and yet we do not. This indicates that there are tradeoffs to be made; a human with a higher reproductive drive is less fit in other ways. One of those ways is plausibly that a human with a lower reproductive drive gets more “other stuff” done—like maintaining a community, thinking about its environment, and so on—and that “other stuff” is very important for increasing the number of offspring which survive. And, indeed, we have a very important example of some “other stuff” which massively increased the total number of humans alive; it doesn’t seem absurd to suggest that it was no “mistake” for us to have the reproductive drive that we do, and that if God reached down into the world in the distant past and made the straightforward change of “increase the reproductive drive of humans”, this would in fact have made there be fewer humans in the year 2026.

Now, this is all very tangential with regards to the actual analogy being made; it’s unclear what if anything this has to do with AI, in large part due to the many other disanalogies between evolution and AI training. But insofar as all we are doing is judging the capacity of the human species to “fulfill the goal of evolution”, it’s relevant that our drives are what they are in large part because having them that way does “fulfill the goal”, even in part because the drive does not perfectly match the goal.

The Evolution Argument Sucks

peralice 6 Jan 2026 2:32 UTC

30 points

6 comments · 8 min read · LW link
