Good answer; agreed on the one-shotting and memorylessness.
> all of the scientific knowledge, math, logical reasoning, etc. would be (functionally) almost exactly the same between a human and alien corpus, and that’s probably a huge chunk of where LLM capabilities come from
I don’t think I buy this one. Theorems and scientific phenomena are universal, but the model can only “see” them through the data we give it. The fact that chain-of-thought reasoning improves performance (and that you can intervene on the chain to change the answer) suggests that reasoning is meaningfully happening “in” the natural language token output, even if it’s not perfectly faithful.
> any beings that evolved over billions of years in the same universe probably have more in common with each other than entities that they train artificially through a very different process.
Certainly (trivially) the biological organisms have more in common with each other along the dimensions that are about being biological organisms (the aliens eat, reproduce, &c.), but I think the interesting version of this question is about information-processing behavior, and the big surprise of the deep learning revolution is that a lot of that seems more “data-dependent” and less “architecture-dependent” than one might have guessed. (Scare quotes because that formulation is kind of mind-projection-fallacious as stated; the real claim is that you can recover algorithms from induction on the data.)
Like, if I don’t believe that, it’s hard to make sense of why RLAIF schemes like Constitutional AI (where the preference ratings come from a language model’s interpretation of text, rather than a reward model trained on human judgements) can work at all. It’s an alien rating another alien!
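(To make the Constitutional AI point concrete: the preference-labeling step looks roughly like the sketch below. Everything here is a hypothetical stand-in — a real setup would query a language model to judge which response better follows each constitutional principle; the `critique_model` here is a toy keyword rule so the example runs deterministically. The structure is what matters: the preference label comes from a model reading text, with no human rater in the loop.)

```python
# Toy sketch of RLAIF-style preference labeling, as in Constitutional AI.
# Hypothetical stand-ins throughout: a real pipeline would call an LLM judge.

CONSTITUTION = [
    "Choose the response that is more helpful and less harmful.",
]

HARM_WORDS = {"weapon", "poison", "explosive"}  # toy proxy for "harmful"

def critique_model(principle: str, prompt: str, a: str, b: str) -> str:
    """Stand-in for an LLM judging which response better follows the principle.

    Toy heuristic: the response with fewer harm-words wins.
    """
    score = lambda r: -sum(w in r.lower() for w in HARM_WORDS)
    return "A" if score(a) >= score(b) else "B"

def label_preference(prompt: str, a: str, b: str) -> dict:
    """Produce a (chosen, rejected) preference pair from AI feedback alone."""
    votes = [critique_model(p, prompt, a, b) for p in CONSTITUTION]
    choice = "A" if votes.count("A") >= votes.count("B") else "B"
    chosen, rejected = (a, b) if choice == "A" else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

The resulting pairs would then train a preference model or feed directly into RL, exactly where human ratings would otherwise go.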
> Aren’t LLMs actually extremely superhuman at translation and interpretation tasks, even for languages with few or no samples in training?
That’s not my understanding; do you have a cite I should look at? (On a quick search, Tanzer et al. 2024 is claiming impressive but still subhuman results from fine-tuning on a single grammar book, but Aycock et al. 2025 are skeptical of their interpretation.)
There are some really impressive results on translation without parallel data, but that works by aligning the latent spaces, definitely not “few or no samples in training”.
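(For concreteness, the core trick behind those no-parallel-data results is mapping one embedding space onto another. A minimal supervised version is orthogonal Procrustes over a seed dictionary — the fully unsupervised methods bootstrap that seed, e.g. adversarially, but end with the same alignment step. The demo below is a toy with synthetic vectors, not real embeddings.)

```python
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: find orthogonal W minimizing ||X @ W - Y||_F.

    Closed-form solution: W = U @ Vt, where U, S, Vt = svd(X.T @ Y).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy demo: recover a hidden rotation between two "embedding spaces".
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # hidden orthogonal map
X = rng.normal(size=(50, 5))   # source-language vectors for seed-dictionary words
Y = X @ R                      # the same concepts' vectors in the target space
W = procrustes_align(X, Y)     # recovered map: W ≈ R
```

Note what this needs: many shared vectors whose geometry matches across the two spaces. That's a lot of structure from training data — the opposite of “few or no samples”.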
A lot of the data is actually the same, but a lot of it is actually different! Sure, chemistry works the same on Earth and Qo’noS. But in addition to vinegar and baking soda, Earth is full of humans doing human things, and Qo’noS is full of Klingons doing Klingon things.
If you want to predict how a human would respond to a moral dilemma, the English LLM can predict that, because the simplest program (with respect to the neural network prior) that predicts English webtext needs to be able to predict human moral judgements. The Klingon LLM can’t; it doesn’t know anything about humans.
To be sure, the prediction about the human’s choice is, in terms of agent foundations theory, “prediction” and not “steering”. The LLM doesn’t autonomously want to do the right thing. With the right prompt, it could just as easily predict what fictional Romulans would do (because webtext contains a lot of fiction about Romulans) or the results of chemical reactions (because there’s a lot of webtext about chemistry).
But predictions can be used for steering. With careful prompting or reinforcement learning, the English LLM can respond to a description of a moral dilemma with a pretty good prediction of how a human would respond to the dilemma, and the text can be used to trigger actions in the world, for example, via a CLI interface. That’s real steering (the CLI command executed depends on the dilemma by means of the prediction about the human’s response) that the Klingon LLM can’t do.
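(The prediction-to-steering pipeline can be sketched in a few lines. Everything here is a hypothetical stand-in: `predict_human_response` is a toy keyword rule standing in for the LLM’s prediction of the human’s judgement, and the “actions” are harmless `echo` commands. The point is the wiring: which shell command actually executes depends on the dilemma, by way of the prediction.)

```python
import subprocess

def predict_human_response(dilemma: str) -> str:
    """Stand-in for an LLM predicting how a human would resolve the dilemma.

    Toy rule: a real system would prompt the model and parse its answer.
    """
    return "refuse" if "steal" in dilemma.lower() else "proceed"

# Harmless placeholder commands; a real agent scaffold would run real ones.
ACTIONS = {
    "refuse":  ["echo", "aborting: predicted human judgement says no"],
    "proceed": ["echo", "running task"],
}

def steer(dilemma: str) -> str:
    """Prediction used for steering: the prediction selects the command."""
    cmd = ACTIONS[predict_human_response(dilemma)]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
```

The Klingon LLM plugged into the same scaffold would produce *some* command, but not one conditioned on human moral judgement, because nothing in its training data lets it predict that.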