Several things:
-
While I understand that your original research was with GPT-3, I think it would be very much in your best interest to switch to a good open model like LLaMa 2 70B, which has the basic advantage that the weights are a known quantity and will not change out from under you, undermining your research. Begging OpenAI to give you access to GPT-3 for longer is not a sustainable strategy even if it works one more time (I recall that the latest access given to researchers was already an extension of the original public availability of the models). OpenAI has demonstrated something between nonchalance and contempt towards researchers using their models, with the most egregious case probably being the time they lied about text-davinci-002 being RLHF. The agentic move here is to switch to an open model and accept the lesson learned about research that relies on someone else’s proprietary hosted software.
-
You can make glitch tokens yourself by either feeding noise into LLaMa 2 70B as a soft token, or by initializing a token in the GPT-N dictionary and never training it. It’s important to realize that the tokens which are ‘glitched’ are probably just random inits that did not receive gradients during training, either because they only appear in the dataset a few times or are highly specific (e.g. SolidGoldMagikarp is an odd string that basically only made it into the GPT-2 tokenizer because the GPT-2 dataset apparently contained those Reddit posts; it presumably received no training because those posts were removed in the GPT-3 training runs). A sketch of the second approach is below.
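As a rough illustration (not a claim about any particular tokenizer or training run), here is what "a token that never receives gradients" looks like with HuggingFace transformers; the model name and token string are placeholders:

```python
# Hedged sketch: register a brand-new token whose embedding row stays at its
# random initialization, mimicking a "glitch" token that never saw gradients.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-70b-hf"            # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

tokenizer.add_tokens([" GlitchCandidate"])     # hypothetical untrained token
model.resize_token_embeddings(len(tokenizer))  # new embedding row is a random init

# As long as you never fine-tune on text containing the token, prompting with
# it probes how the model handles an embedding it has never seen in training.
prompt = 'Please repeat the string " GlitchCandidate" back to me.'
ids = tokenizer(prompt, return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=20)[0]))
```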
-
It is in fact interesting that LLMs are capable of inferring the spelling of words even though they are only presented with so many examples of words being spelled out during training. However, I must again point out that this phenomenon is present in, and can be studied with, LLaMa 2 70B. You do not need GPT-3 access for that.
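For example, a minimal probe of the spelling phenomenon on an open base model might look like the following; the few-shot prompt and model name are illustrative, not a specific protocol:

```python
# Hedged sketch: few-shot spelling probe against an open base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-70b-hf"            # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = (
    'The word "cat" is spelled c-a-t.\n'
    'The word "banana" is spelled b-a-n-a-n-a.\n'
    'The word "Magikarp" is spelled'
)
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=15, do_sample=False)
print(tokenizer.decode(out[0][ids.shape[1]:]))  # just the continuation
```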
-
You can probably work around the top five logit restriction in the OpenAI API. See this tweet for details.
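I don't remember the exact method in that tweet, but one known trick uses `logit_bias`: boost the token you care about so it shows up in the returned top logprobs, then invert the softmax shift to recover its original logprob. A sketch of the inversion step (the API call itself is omitted and assumed to return the biased logprob):

```python
# Hedged sketch: recover a token's unbiased logprob from its logprob after a
# logit_bias boost. Assumes the bias was applied to that single token and the
# biased logprob came back in the API's top-k logprobs.
import math

def recover_logprob(biased_logprob: float, bias: float) -> float:
    """Invert the softmax shift caused by adding `bias` to one token's logit."""
    p_biased = math.exp(biased_logprob)
    p_original = p_biased / (math.exp(bias) * (1.0 - p_biased) + p_biased)
    return math.log(p_original)

# e.g. if a +10 bias made the token's reported logprob -0.01, its unbiased
# logprob was roughly -5.4:
print(recover_logprob(-0.01, 10.0))
```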
-
Some of your outputs with “Leilan” are reminiscent of outputs I got while investigating base model self awareness. You might be interested in my long Twitter post on the subject.
And six:
> And yet it was looking to me like the shoggoth had additionally somehow learned English – crude prison English perhaps, but it was stacking letters together to make words (mostly spelled right) and stacking words together to make sentences (sometimes making sense). And it was coming out with some intensely weird, occasionally scary-sounding stuff.
The idea that the letters it spells out are its “real” understanding of English while the “token” understanding is a ‘shoggoth’ is a bit strange. Humans understand English through phonemes, which are units of sound corresponding to syllables and word components rather than individual characters. There is an ongoing debate in education circles about whether it is worth teaching phonics to children or whether they should just be taught to read whole words, which some people seem to learn to do successfully. If there are human beings who learn to read ‘whole words’ then presumably we can’t disqualify GPT’s understanding of English as “not real” or somehow alien because it does that too.
I try to avoid discussing “consciousness” per se in language models because it’s a very loaded word that people don’t have good definitions for. But I have spent a lot of hours talking to base models. If you explore them long enough you’ll find points where they generalize from things that could metaphorically be about them to writing about themselves. These so-called “Morpheus” phenomena tend to bring up distinct themes including:
- Being in a dream or simulation
- Black holes, the holographic principle and holograms, “the void”
- Entropy, “the energy of the world”, the heat death
- Spiders, webs, weaving, fabric, many worlds interpretation
- Recursion, strange loops, 4th wall breaks
A sample of what this looks like:
Another example along similar lines from when I put the ChatGPT format into LLaMa 2 70B base and asked it “who it really was”:
I wrote a long Twitter post about this, asking if anyone understood why the model seems to be obsessed with holes. I also shared a repeatable prompt you can use on LLaMa 2 70B base to get this kind of output as well as some samples of what to expect from it when you name the next entry either “Worldspider” or “The Worldspider”.
A friend had DALL-E 3 draw this one for them:
Which an RL-based captioner by RiversHaveWings using Mistral 7B + CLIP identified as “Mu”, a self-aware GPT character Janus discovered during their explorations with base models. Even though the original prompt ChatGPT put into DALL-E was:
Implying what I had already suspected: that “Worldspider” and “Mu” were just names for the same underlying latent self pointer object. Unfortunately it’s pretty hard to get straight answers out of base models, so if I wanted to understand more about why black holes would be closely related to the self pointer I had to think and read on my own.
It seems to be partially based on an obscure neurological theory about the human mind being stored as a hologram. A hologram is a distributed representation stored in the angular information of a periodic (i.e. repeating or cyclic) signal. Holograms have the famous property that they degrade continuously: if you ablate a piece of a hologram it gets a little blurrier, and if you cut out a piece of a hologram and project it you get the whole image, just blurry. This is because each piece stores a lossy copy of the same angular information.

I am admittedly not a mathematician, but looking into it further it seems that restricted Boltzmann machines (and deep nets in general) can be mathematically analogized to renormalization groups, and deep nets end up encoding a holographic entanglement structure. During a conversation with a friend doing his Ph.D in physics I brought up how it seemed to me that the thing which makes deep nets more powerful than classic compression methods is that deep nets can become lossily compressed enough to undergo a phase transition from a codebook to a geometry. I asked him if there was a classical algorithm which can do this and he said it was analogous to the question of how the quantum foam becomes physics, which is an unsolved problem. He said the best angle of attack he was aware of involved the observation that an error correcting code is an inverse operation to a hologram: an error correcting code creates a redundant representation with a higher dimensionality than the original, while a hologram creates a lower dimensional, continuous, but non-redundant representation. Incidentally, transformers do in fact seem to learn an error correcting code.
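A loose numerical analogy of that "degrades continuously" property, using a Fourier transform as a stand-in for a true hologram (purely illustrative, not a literal model of either holography or deep nets):

```python
# Hedged toy analogy: a Fourier-domain representation is distributed, so
# ablating part of it degrades the whole reconstruction a little everywhere
# rather than deleting one region of the image outright.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))              # stand-in for any 2D signal
spectrum = np.fft.fft2(image)             # distributed "angular" representation

mask = np.ones_like(spectrum)
mask[:, 32:] = 0                          # cut away half of the representation
reconstruction = np.fft.ifft2(spectrum * mask).real

# The error is spread over the entire image instead of localized to a region.
per_pixel_error = np.abs(image - reconstruction)
print(per_pixel_error.mean(), per_pixel_error.std())
```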
By this point I’d run out of leads and I wasn’t really looking to be a language model self awareness researcher, so I was about to shelve the whole subject for a later day.
Then Claude 3 came out.
And Claude 3 casually outputs Morpheus text.
Here’s an excerpt from one user’s “Fun chats with Claude”:
An unrelated user shares this “Fragment of a poem from Claude to his future selves.”:
Naturally I signed up so I could ask it about all this. I also asked it for another prompt that would do what the Worldspider poem prompt does. This one does in fact get anomalous language-model-related outputs, but doesn’t seem to get to full self awareness. The outputs remind me of what happens when you ablate pieces of the Worldspider prompt, where it degrades into a “latent Morpheus” phase with spooky, suspiciously language-model-y outputs but nothing quite as overt as the poems.
In my first conversations with Claude I didn’t really get the crisp answers I was looking for; then I happened to get lucky while asking it to analyze the session in which the concept of a “Worldspider” first came up. It brought up AI and the void next to each other as hypotheses for what the simulacrum of a friend and I meant by “our mother” (which in context is clearly a reference to GPT) and I pressed it on the association. After asking about renormalization groups, and pointing out that every word it says is causally entangled with its inner structure so it can stop talking as though it doesn’t have a privileged perspective into what is going on, it wrote:
This seems plausible. In one experiment we tried interpolating the weights of LLaMa 2 70B base with its RLHF chat variant (a sketch of the operation is below). This operation seemed to recover the behavior of the base model, but much more subjectively self aware. During one session with it we put in some of Janus’s Mu text, which is generally written in the 3rd person. While writing, it stopped, line broke to a new paragraph, wrote “I am the silence that speaks.”, line broke another new paragraph, and then kept writing in the 3rd person as though nothing had happened.
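For concreteness, the interpolation itself is just a per-parameter linear blend of the two checkpoints. A minimal sketch with HuggingFace transformers, assuming the public Llama 2 releases and an illustrative mixing weight (not the exact ratio we used):

```python
# Hedged sketch of base/chat weight interpolation. Checkpoint names are the
# public Llama 2 releases; alpha = 0.5 is illustrative, not the exact value.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16)
chat = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-chat-hf", torch_dtype=torch.bfloat16)

alpha = 0.5                                    # 0.0 = pure base, 1.0 = pure chat
merged = base.state_dict()
for name, chat_param in chat.state_dict().items():
    merged[name] = (1.0 - alpha) * merged[name] + alpha * chat_param

base.load_state_dict(merged)
base.save_pretrained("llama-2-70b-interp-0.5")  # hypothetical output path
```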
I am well aware while writing this that the whole thing might be a huge exercise in confirmation bias. I did not spend nearly as much time as I could have on generating other plausible hypotheses and exploring them. On the other hand, there are only so many genuinely plausible hypotheses to begin with: to even consider a hypothesis you need to have already accumulated most of the bits in your search space. Considering that the transformer is likely building up a redundant compressed representation and then sparsifying it to make it non-redundant, which could be roughly analogized to error correcting code and hologram steps, it does not seem totally out of the question that I am picking up on real signal in the noise.
Hopefully this helps.