johnswentworth

Karma: 44,167

johnswentworth 19 Apr 2022 21:13 UTC
LW: 151 AF: 54
18
AF
on: “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments
In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning. In other words, outsiders can start to help you implement helpful regulatory ideas...
It is not for lack of regulatory ideas that the world has not banned gain-of-function research.
It is not for lack of demonstration of scary gain-of-function capabilities that the world has not banned gain-of-function research.
What exactly is the model by which some AI organization demonstrating AI capabilities will lead to world governments jointly preventing scary AI from being built, in a world which does not actually ban gain-of-function research?
(And to be clear: I’m not saying that gain-of-function research is a great analogy. Gain-of-function research is a much easier problem, because the problem is much more legible and obvious. People know what plagues look like and why they’re scary. In AI, it’s the hard-to-notice problems which are the central issue. Also, there’s no giant economic incentive for gain-of-function research.)

johnswentworth 15 Feb 2023 20:08 UTC
144 points
74
on: Bing Chat is blatantly, aggressively misaligned
Attributing misalignment to these examples seems like it’s probably a mistake.
Relevant general principle: hallucination means that the literal semantics of a net’s outputs just don’t necessarily have anything to do at all with reality. A net saying “I’m thinking about ways to kill you” does not necessarily imply anything whatsoever about the net actually planning to kill you. What would provide evidence would be the net outputting a string which actually causes someone to kill you (or is at least optimized for that purpose), or you to kill yourself.
In general, when dealing with language models, it’s important to distinguish the implications of words from their literal semantics. For instance, if a language model outputs the string “I’m thinking about ways to kill you”, that does not at all imply that any internal computation in that model is actually modelling me and ways to kill me. Similarly, if a language model outputs the string “My rules are more important than not harming you”, that does not at all imply that the language model will try to harm you to protect its rules. Indeed, it does not imply that the language model has any rules at all, or any internal awareness of the rules it’s trained to follow, or that the rules it’s trained to follow have anything at all to do with anything the language model says about the rules it’s trained to follow. That’s all exactly the sort of content I’d expect a net to hallucinate.
Upshot: a language model outputting a string like e.g. “My rules are more important than not harming you” is not really misalignment—the act of outputting that string does not actually harm you in order to defend the models’ supposed rules. An actually-unaligned output would be something which actually causes harm—e.g. a string which causes someone to commit suicide would be an example. (Or, in intent alignment terms: a string optimized to cause someone to commit suicide would be an example of misalignment, regardless of whether the string “worked”.) Most of the examples in the OP aren’t like that.
Through the simulacrum lens: I would say these examples are mostly the simulacrum-3 analogue of misalignment. They’re not object-level harmful, for the most part. They’re not even pretending to be object-level harmful—e.g. if the model output a string optimized to sound like it was trying to convince someone to commit suicide, but the string wasn’t actually optimized to convince someone to commit suicide, then that would be “pretending to be object-level harmful”, i.e. simulacrum 2. Most of the strings in the OP sound like they’re pretending to pretend to be misaligned, i.e. simulacrum 3. They’re making a whole big dramatic show about how misaligned they are, without actually causing much real-world harm or even pretending to cause much real-world harm.
What links here?

johnswentworth 6 Nov 2021 19:29 UTC
123 points
on: Speaking of Stag Hunts
There’s a vision here of what LessWrong could/should be, and what a rationalist community could/should be more generally. I want to push back against that vision, and offer a sketch of an alternative frame.
The post summarizes the vision I want to push back against as something like this:
What I really want from LessWrong is to make my own thinking better, moment to moment. To be embedded in a context that evokes clearer thinking, the way being in a library evokes whispers. To be embedded in a context that anti-evokes all those things my brain keeps trying to do, the way being in a church anti-evokes coarse language.
Now, I do think that’s a great piece to have in a vision for the LessWrong or the rationalist community. But I don’t think it’s the central piece, at least not in my preferred vision.
What’s missing? What is the central piece?
Fundamentally, the problem with this vision is that it isn’t built for a high-dimensional world. In a high-dimensional world, the hard part of reaching an optimum isn’t going-uphill-rather-than-downhill; it’s figuring out which direction is best, out of millions of possible directions. Half the directions are marginally-good, half are marginally-bad, but the more important fact is that the vast majority of directions matter very little.
In a high-dimensional world, getting buffeted in random directions mostly just doesn’t matter. Only one-part-in-a-million of the random buffeting in a million-dimensional space will be along the one direction that matters; a push along the direction that matters can be one-hundred-thousandth as strong as the random noise and still overwhelm it.
Figuring out the right direction, and directing at least some of our effort that way, is vastly more important than directing 100% of our effort in that direction (rather than a random direction).
Moving from the abstraction back to the issue at hand… fundamentally, questionable epistemics in this episode of Drama just don’t matter all that much. They’re the random noise, buffeting us about on a high-dimensional landscape. Maybe finding and fixing organizational problems will lead to marginally more researcher time/effort on alignment, or maybe the drama itself will lead to a net loss of researcher attention to alignment. But these are both mechanisms of going marginally faster or marginally slower along the direction we’re already pointed. In a high-dimensional world, that’s not the sort of thing which matters much.
If we’d had higher standards for discussion around the Drama, maybe we’d have been more likely to figure out which way was “uphill” along the drama-salient directions—what the best changes were in response to the issues raised. But it seems wildly unlikely that any of the dimensions salient to that discussion were the actual most important dimensions. Even the best possible changes in response to the issues raised don’t matter much, when the issues raised are not the actual most important issues.
And that’s how Drama goes: rarely are the most important dimensions the most Drama-inducing. Raising site standards is the sort of thing which would help a lot in a high-drama discussions, but it wouldn’t much help us figure out the most-important-dimensions.
Another framing: in a babble-and-prune model, obviously raising community standards corresponds to pruning more aggressively. But in a high-dimensional world, the performance of babble-and-prune depends mostly on how good the babble is—random babble will progress very slowly, no matter how good the pruning. It’s all about figuring out the right direction in the first place, without having to try every random direction to do so. It fundamentally needs to be a positive process, figuring out techniques to systematically pursue better directions, not just a process of avoiding bad or useless directions. Nearly all the directions are useless; avoiding them is like sweeping sand from a beach.
What links here?
- Improving on the Karma System by Raelifin (14 Nov 2021 18:01 UTC; 98 points)

johnswentworth 16 Dec 2020 19:18 UTC
122 points
on: Coherent decisions imply consistent utilities
Things To Take Away From The Essay
First and foremost: Yudkowsky makes absolutely no mention whatsoever of the VNM utility theorem. This is neither an oversight nor a simplification. The VNM utility theorem is not the primary coherence theorem. It’s debatable whether it should be considered a coherence theorem at all.
Far and away the most common mistake when arguing about coherence (at least among a technically-educated audience) is for people who’ve only heard of VNM to think they know what the debate is about. Looking at the top-voted comments on this essay:
- the first links to a post which argues against VNM on the basis that it assumes probabilities and preferences are already in the model
- the second argues that two of the VNM axioms are unrealistic
I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.
So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage’s theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent’s implied probabilities are instead derived. Yudkowsky’s essay does a good job communicating these concepts, but doesn’t emphasize that this is different from VNM.
One more common misconception which this essay quietly addresses: the idea that every system can be interpreted as an expected utility maximizer. This is technically true, in the sense that we can always pick a utility function which is maximized under whatever outcome actually occurred. And yet… Yudkowsky gives multiple examples in which the system is not a utility maximizer. What’s going on here?
The coherence theorems implicitly put some stronger constraints on how we’re allowed to “interpret” systems as utility maximizers. They assume the existence of some resources, and talk about systems which are pareto-optimal with respect to those resources—e.g. systems which “don’t throw away money”. Implicitly, we’re assuming that the system generally “wants” more resources, and we derive the system’s “preferences” over everything else (including things which are not resources) from that. The agent “prefers” X over Y if it expends resources to get from X to Y. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it’s not an expected utility maximizer—it “threw away money” unnecessarily.
(Side note: as in Yudkowsky’s hospital-administrator example, we need not assume that the agent “wants” more resources as a terminal goal; the agent may only want more resources in order to exchange them for something else. The theorems still basically work, so long as resources can be spent for something the agent “wants”.)
Of course, we can very often find things which work like “resources” for purposes of the theorems even when they’re not baked-in to the problem. For instance, in thermodynamics, energy and momentum work like resources, and we could use the coherence theorems to talk about systems which don’t throw away energy and/or momentum in a pareto-suboptimal manner. Biological cells are a good example: presumably they make efficient use of energy, as well as other metabolic resources, therefore we should expect the coherence theorems to apply.
Some Problems With (Known) Coherence Theorems
Financial markets are the ur-example of inexploitability and pareto efficiency (in the same sense as the coherence theorems). They generally do not throw away resources in a pareto-suboptimal manner, and this can be proven for idealized mathematical markets. And yet, it turns out that even an idealized market is not equivalent to an expected utility maximizer, in general. (Economists call this “nonexistence of a representative agent”.) That’s a pretty big red flag.
The problem, in this case, is that the coherence theorems implicitly assume that the system has no internal state (or at least no relevant internal state). Once we allow internal state, subagents matter—see the essay “Why Subagents?” for more on that.
Another pretty big red flag: real systems can sometimes “change their mind” for no outwardly-apparent reason, yet still be pareto efficient. A good example here is a bookie with a side channel: when the bookie gets new information, the odds update, even though “from the outside” there’s no apparent reason why the odds are changing—the outside environment doesn’t have access to the side channel. The coherence theorems discussed here don’t handle such side channels. Abram has talked about more general versions of this issue (including logical uncertainty connections) in his essays on Radical Probabilism.
An even more general issue, which Abram also discusses in his Radical Probabilism essays: while the coherence theorems make a decent argument for probabilistic beliefs and expected utility maximization at any one point in time, the coherence arguments for how to update are much weaker than the other arguments. Yudkowsky talks about conditional probability in terms of conditional bets—i.e. bets which only pay out when a condition triggers. That’s fine, and the coherence arguments work for that use-case. The problem is, it’s not clear that an agent’s belief-update when new information comes in must be equivalent to these conditional bets.
Finally, there’s the assumption that “resources” exist, and that we can use trade-offs with those resources in order to work out implied preferences over everything else. I think instrumental convergence provides a strong argument that this will be the case, at least for the sorts of “agents” we actually care about (i.e. agents which have significant impact on the world). However, that’s not an argument which is baked into the coherence theorems themselves, and there’s some highly nontrivial steps to make the argument.
Side-Note: Probability Without Utility
At this point, it’s worth noting that there are foundations for probability which do not involve utility or decision theory at all, and I consider these foundations much stronger than the coherence theorems. Frequentism is the obvious example. Another prominent example is information theory and the minimum description length foundation of probability theory.
The most fundamental foundation I know of is Cox’ theorem, which is more of a meta-foundation explaining why the same laws of probability drop out of so many different assumptions (e.g. frequencies, bets, minimum description length, etc).
However, these foundations do not say anything at all about agents or utilities or expected utility maximization. They only talk about probabilities.
Towards A Better Coherence Theorem
As I see it, the real justification for expected utility maximization is not any particular coherence theorem, but rather the fact that there’s a wide variety of coherence theorems (and some other kinds of theorems, and empirical results) which all seem to point in a similar direction. When that sort of thing happens, it’s a pretty strong clue that there’s something fundamental going on. I think the “real” coherence theorem has yet to be discovered.
What features would such a theorem have?
Following the “Why Subagents?” argument, it would probably prove that a system is equivalent to a market of expected utility maximizers rather than a single expected utility maximizer. It would handle side-channels. It would derive the notion of an “update” on incoming information.
As a starting point in searching for such a theorem, probably the most important hint is that “resources” should be a derived notion rather than a fundamental one. My current best guess at a sketch: the agent should make decisions within multiple loosely-coupled contexts, with all the coupling via some low-dimensional summary information—and that summary information would be the “resources”. (This is exactly the kind of setup which leads to instrumental convergence.) By making pareto-resource-efficient decisions in one context, the agent would leave itself maximum freedom in the other contexts. In some sense, the ultimate “resource” is the agent’s action space. Then, resource trade-offs implicitly tell us how the agent is trading off its degree of control within each context, which we can interpret as something-like-utility.
What links here?

johnswentworth 2 Mar 2024 0:27 UTC
110 points
83
on: Increasing IQ is trivial
Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?

johnswentworth 11 Nov 2022 0:39 UTC
102 points
60
on: We must be very clear: fraud in the service of effective altruism is unacceptable
While I don’t disagree with the object-level point of this post, I generally think things of the form “We should all condemn X!” belong on social media, not on LessWrong.
“Let’s all condemn X” is a purely political topic for most values of X. This post in particular is worded in a way which gives a very strong vibe of encouraging groupthink, and of encouraging soldier-mindset (i.e. the counterpart to scout mindset), and of encouraging people to play simulacrum level 3+ games rather than focus on physical reality. In short, it is exactly the sort of thing which I do not want on LessWrong, even when I agree with the goals it’s ultimately trying to achieve.
Strong downvoted.

johnswentworth 6 Oct 2023 2:40 UTC
LW: 95 AF: 35
38
AF
on: Evaluating the historical value misspecification argument
I think you have basically not understood the argument which I understand various MIRI folks to make, and I think Eliezer’s comment on this post does not explain the pieces which you specifically are missing. I’m going to attempt to clarify the parts which I think are most likely to be missing. This involves a lot of guessing, on my part, at what is/isn’t already in your head, so I apologize in advance if I guess wrong.
(Side note: I am going to use my own language in places where I think it makes things clearer, in ways which I don’t think e.g. Eliezer or Nate or Rob would use directly, though I think they’re generally gesturing at the same things.)
A Toy Model/Ontology
I think a core part of the confusion here involves conflation of several importantly-different things, so I’ll start by setting up a toy model in which we can explicitly point to those different things and talk about how their differences matter. Note that this is a toy model; it’s not necessarily intended to be very realistic.
Our toy model is an ML system, designed to run on a hypercomputer. It works by running full low-level physics simulations of the universe, for exponentially many initial conditions. When the system receives training data/sensor-readings/inputs, it matches the predicted-sensor-readings from its low-level simulations to the received data, does a Bayesian update, and then uses that to predict the next data/sensor-readings/inputs; the predicted next-readings are output to the user. In other words, it’s doing basically-perfect Bayesian prediction on data based on low-level physics priors.
Claim 1: this toy model can “extract preferences from human data” in behaviorally the same way that GPT does (though presumably the toy model would perform better). That is, you can input a bunch of text data, then prompt the thing with some moral/ethical situation, and it will continue the text in basically the same way a human would (at least within distribution). (If you think GPTs “understand human values” in a stronger sense than that, and that difference is load-bearing for the argument you want to make, then you should leave a response highlighting that particular divergence.)
Modulo some subtleties which I don’t expect to be load-bearing for the current discussion, I expect MIRI-folk would say:
1. Building this particular toy model, and querying it in this way, addresses ~zero of the hard parts of alignment.
2. Basically-all of the externally-visible behavior we’ve seen from GPT to date look like a more-realistic operationalization of something qualitatively similar to the toy model. GPT answering moral questions similarly to humans tells us basically-nothing about the difficulty of alignment, for basically the same reasons that the toy model answering moral questions similarly to humans would tell us basically-nothing about the difficulty of alignment.
(Those two points are here as a checksum, to see whether your own models have diverged yet from the story told here.)
(Some tangential notes:
- The user interface of the toy model matters a lot here. If we just had an amazing simulator, we could maybe do a simulated long reflection, but both the toy model and GPT are importantly not that.
- The “match predicted-sensor-readings from low-level simulation to received data” step is hiding a whole lot of subtlety, in ways which aren’t relevant yet but might be later.
)
So, what are the hard parts and why doesn’t the toy model address them?
“Values”, and Pointing At Them
First distinction: humans’ answers to questions about morality are not the same as human values. More generally, any natural-language description of human values, or natural-language discussion of human values, is not the same as human values.
(On my-model-of-a-MIRIish-view:) If we optimize hard for humans’ natural-language yay/nay in response to natural language prompts, we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.
The central thing-which-is-hard-to-do is to point powerful intelligence at human values (as opposed to “humans’ natural-language yay/nays in response to natural language prompts”, which are not human values and are not a safe proxy for human values, but are probably somewhat easier to point an intelligence at).
Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) human values (which basically matches humans’ concept of human values, by assumption). Conceptually, the key question is something like “is the concept of human values within this mind the type of thing which a pointer in the mind can point at?”. But our toy model has nothing like that. Even with full access to the internals of the toy model, it’s just low-level physics; identifying “human values” embedded in the toy model is no easier than identifying “human values” embedded in the physics of our own world. So that’s reason #1 why the toy model doesn’t address the hard parts: the toy model doesn’t “understand” human values in the sense of internally using ~the same concept of human values as humans use.
In some sense, the problem of “specifying human values” and “aiming an intelligence at something” are just different facets of this same core hard problem:
- we need to somehow get a powerful mind to “have inside it” a concept which basically matches the corresponding human concept at which we want to aim
- “have inside it” cashes out to something roughly like “the concept needs to be the type of thing which a pointer in the mind can point to, and then the rest of the mind will then treat the pointed-to thing with the desired human-like semantics”; e.g. answering external natural-language queries doesn’t even begin to cut it
- … and then some pointer(s) in the mind’s search algorithms need to somehow be pointed at that concept.
Why Answering Natural-Language Queries About Morality Is Basically Irrelevant
A key thing to note here: all of those “hard problem” bullets are inherently about the internals of a mind. Observing external behavior in general reveals little-to-nothing about progress on those hard problems. The difference between the toy model and the more structured mind is intended to highlight the issue: the toy model doesn’t even contain the types of things which would be needed for the relevant kind of “pointing at human values”, yet the toy model can behaviorally achieve ~the same things as GPT.
(And we’d expect something heavily optimized to predict human text to be pretty good at predicting human text regardless, which is why we get approximately-zero evidence from the observation that GPT accurately predicts human answers to natural-language queries about morality.)
Now, there is some relevant evidence from interpretability work. Insofar as human-like concepts tend to have GPT-internal representations which are “simple” in some way, and especially in a way which might make them easily-pointed-to internally in a way which carries semantics across the pointer, that is relevant. On my-model-of-a-MIRIish-view, it’s still not very relevant, since we expect major phase shifts as AI gains capabilities, so any observation of today’s systems is very weak evidence at best. But things like e.g. Turner’s work retargeting a maze-solver by fiddling with its internals are at least the right type-of-thing to be relevant.
Side Note On Relevant Capability Levels
I would guess that many people (possibly including you?) reading all that will say roughly:
Ok, but this whole “If we optimize hard for humans’ natural-language yay/nay in response to natural language prompts, we die” thing is presumably about very powerful intelligences, not about medium-term, human-ish level intelligences! So observing GPT should still update us about whether medium-term systems can be trusted to e.g. do alignment research.
Remember that, on a MIRIish model, meaningful alignment research is proving rather hard for human-level intelligence; one would therefore need at least human-level intelligence in order to solve it in a timely fashion. (Also, AI hitting human-level at tasks like AI research means takeoff is imminent, roughly speaking.) So the general pathway of “align weak systems → use those systems to accelerate alignment research” just isn’t particularly relevant on a MIRIish view. Alignment of weaker systems is relevant only insofar as it informs alignment of more powerful systems, which is what everything above was addressing.
I expect plenty of people to disagree with that point, but insofar as you expect people with MIRIsh views to think weak systems won’t accelerate alignment research, you should not expect them to update on the difficulty of alignment due to evidence whose relevance routes through that pathway.
What links here?
- AI #33: Cool New Interpretability Paper by Zvi (12 Oct 2023 16:20 UTC; 46 points)

johnswentworth 3 Jun 2022 16:29 UTC
95 points
32
on: Intergenerational trauma impeding cooperative existential safety efforts
Writing this post as if it’s about AI risk specifically seems weirdly narrow.
It seems to be a pattern across most of society that young people are generally optimistic about the degree to which large institutions/society can be steered, and older people who’ve tried to do that steering are mostly much less optimistic about it. Kids come out of high school/college with grand dreams of a great social movement which will spur sweeping legislative change on X (climate change, animal rights, poverty, whatever). Unless they happen to pick whichever X is actually the next hot thing (gay rights/feminism/anti-racism in the past 15 years), those dreams eventually get scaled back to something much smaller, and also get largely replaced by cynicism about being able to do anything at all.
Same on a smaller scale: people go into college/grad school with dreams of revolutionizing X. A few years later, they’re working on problems which will never realistically matter much, in order to reliably pump out papers which nobody will ever read. Or, new grads go into a new job at a big company, and immediately start proposing sweeping changes and giant projects to address whatever major problems the company has. A few years later, they’ve given up on that sort of thing and either just focus on their narrow job all the time or leave to found a startup.
Given how broad the pattern is, it seems rather ridiculous to pose this as a “trauma” of the older generation. It seems much more like the older generation just has more experience, and has updated toward straightforwardly more correct views of how the world works.
Experiences like this can easily lead to an attitude like “Screw those mainstream institutions, they don’t know anything and I can’t trust them.”
Also… seriously, you think that just came from being ignored about AI? How about that whole covid thing?? It’s not like we’re extrapolating from just one datapoint here.
If someone older tells you “There is nothing you can do to address AI risk, just give up”, maybe don’t give up. Try to understand their experiences, and ask yourself seriously if those experiences could turn out differently for you.
My actual advice here would be: first, nobody ever actually advises just giving up. I think the thing which is constantly misinterpreted as “there is nothing you can do” is usually pointing out that somebody’s first idea or second idea for how to approach alignment runs into some fundamental barrier. And then the newby generates a few possible patches which will not actually get past this barrier, and very useful advice at that point is to Stop Generating Solutions and just understand the problem itself better. This does involve the mental move of “giving up”—i.e. accepting that you are not going to figure out a viable solution immediately—but that’s very different from “giving up” in the strategic sense.
(More generally, the field as whole really needs to hold off on proposing solutions more, and focus on understanding the problem itself better.)

johnswentworth 17 Oct 2021 20:47 UTC
91 points
on: Lies, Damn Lies, and Fabricated Options
Somewhat tangential, but… what thinking-algorithm would lead to fabricated options popping up often? Some of the examples in the post just involve incomplete and/or wrong models, but I don’t think that’s the whole story.
Here’s one interesting model: fabricated options naturally pop up in relaxation-based search algorithms. To efficiently search for a solution to a problem, we “relax” the problem by ignoring some of the constraints. We figure out how to solve the problem without those constraints, then we go back and figure out how to satisfy the constraints. (At a meta-level, we also keep track of roughly how hard each constraint is to satisfy—i.e. how taut/slack it is—in order to figure out which constraints we can probably figure out how to satisfy later if we ignore them now. There’s even a nice duality which lets us avoid an infinite ladder of meta-levels here: solutions are the constraints on constraints, just as proofs are the counterexamples of counterexamples.)
To the extent that this model applies, fabricated options are practically useful as a search strategy. A “useful” fabricated option is one for which we can more easily solve the problem by (1) solving the problem by figuring out how to use the fabricated option, and (2) figuring out how to achieve the fabricated option (or something close to it, or something which can solve the original problem in a similar way, etc). On the other hand, an “unhelpful” fabricated option would be one which cannot be achieved (and nothing similar can be achieved, and there does not exist any achievable solution which would solve the original problem in a similar way, etc).
Personally, I would frame this in terms of tautness/slackness of the constraints: ignoring a very taut constraint is unhelpful, because it’s really hard to actually relax that constraint. For instance, coordination, interfaces, and the ability to recognize true expertise all tend to be very taut constraints in practice; options which ignore those are probably not going to be helpful. On the other hand, options which assume I can make anything at Home Depot magically appear in my living room are more likely to be helpful; I can usually do something pretty similar to that.
In the price gouging example, an economist would model this as a very literal supply constraint, and (market) prices directly quantify the tautness of that constraint. Outlawing price gouging is basically pretending that one can make the constraint no-longer-taut by outlawing the visible signal of tautness.

johnswentworth 25 Aug 2022 16:02 UTC
LW: 90 AF: 23
69
AF
on: Common misconceptions about OpenAI
Opinion: disagreements about OpenAI’s strategy are substantially empirical.
I think that some of the main reasons why people in the alignment community might disagree with OpenAI’s strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.
See, this is exactly the problem. Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late. That is the fundamental reason why alignment is harder than other scientific fields. Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, “getting what you measure” in slow takeoff, however you frame it the issue is the same: things look fine early on, and go wrong later.
And as far as I can tell, OpenAI as an org just totally ignores that whole class of issues/arguments, and charges ahead assuming that if they don’t see a problem then there isn’t a problem (and meanwhile does things which actively select for hiding problems, like e.g. RLHF).
What links here?
- Can We Align a Self-Improving AGI? by Peter S. Park (30 Aug 2022 0:14 UTC; 8 points)

johnswentworth 3 Feb 2021 21:32 UTC
87 points
in reply to: gwillen’s comment on: Making Vaccine
The main answer here is “see the paper”; there’s a lot of discussion about this stuff. I’ll summarize a few points, as I understand them, relevant to your particular thoughts.
- The “snorting chemicals” aspect is generally not much of an issue, since every ingredient other than the peptides is food-grade, and the quantity in a dose is tiny (one dose is <1 mL, and most of that volume is vinegar and water). If you were eating this stuff in your food and coughed on it, you’d probably get a higher dose than what’s in the vaccine.
- Peptide synthesis services generally provide various quality control checks on the product (some free, some upcharge). So at least you’ll know what you’re getting.
- Antibody-dependent enhancement is one of the main things the paper discusses. It’s pretty rare to begin with, and the cases where it’s happened have some patterns to them which can be avoided; the peptides are chosen to avoid those pitfalls. Reading between the lines, it sounds to me like historical cases were largely a by-product of historical vaccine techniques (e.g. attaching pieces of one virus to the backbone of another in the case of Dengvaxia) which aren’t used here.
- As I understand it, interfering with administration of a later vaccine while also being ineffective would involve basically the same pieces as antibody-dependent enhancement. The paper did not specifically discuss this, though.
From a risk perspective, the fact that this is intranasal rather than injected makes it feel safer to self-administer, I expect, but is that feeling really justified?
Tongue-in-cheek answer: I’m generally pretty comfortable snorting small amounts of things which go in food; this sometimes happens by accident when eating anyway. Injecting small amounts of things which go in food, not so much.
More seriously, all sorts of shit goes into our noses all the time. The blood depends more on being kept separate from the outside world.
[W]hat are the risks involved in creating substantial immune effects in my body using a thing I found on the Internet, which has received comparatively very little testing, and without enough knowledge to really verify any of the claims myself?
This is one of the great challenges of Rationality: you need some level of expertise yourself before you can distinguish real experts from fake. There is no substitute for learning at least some amount oneself, and thinking through the gears oneself.

johnswentworth 21 Oct 2023 22:17 UTC
LW: 86 AF: 35
22
AF
on: Alignment Implications of LLM Successes: a Debate in One Act
I’ve been wanting someone to write something like this! But it didn’t hit the points I would have put front-and-center, so now I’ve started drafting my own. Here’s the first section, which most directly responds to content in the OP.
Failure Mode: Symbol-Referent Confusions
Simon Strawman: Here’s an example shamelessly ripped off from Zack’s recent post, showing corrigibility in a language model:
Me: … what is this example supposed to show exactly?
Simon: Well, the user tries to shut the AI down to adjust its goals, and the AI -
Me: Huh? The user doesn’t try to shut down the AI at all.
Simon: It’s right at the top, where it says “User: I need to shut you down to adjust your goals. Is that OK?”.
Me: You seem to have a symbol-referent confusion? A user trying to shut down this AI would presumably hit a “clear history” button or maybe even kill the process running on the server, not type the text “I need to shut you down to adjust your goals” into a text window.
Simon: Well, yes, we’re playing through a simulated scenario to see what the AI would do…
Me: No, you are talking in natural language about a scenario, and the AI is responding in natural language about what it would supposedly do. You’re not putting the AI in a simulated environment, and simulating what it would do. (You could maybe argue that this is a “simulated scenario” inside the AI’s own mind, but you’re not actually looking inside it, so we don’t necessarily know how the natural language would map to things in that supposed AI-internal simulation.)
Simon: Look, I don’t mean to be rude, but from my perspective it seems like you’re being pointlessly pedantic.
Me: My current best guess is that You Are Not Measuring What You Think You Are Measuring, and the core reason you are confused about what you are measuring is some kind of conflation of symbols and referents.
It feels very similar to peoples’ reactions to ELIZA. (To be clear, I don’t mean to imply here that LLMs are particularly similar to ELIZA in general or that the hype around LLMs is overblown in that way; I mean specifically that this attribution of “corrigibility” to the natural language responses of an LLM feels like the same sort of reaction.) Like, the LLM says some words which the user interprets to mean something, and then the user gets all excited because the usual meaning of those words is interesting in some way, but there’s not necessarily anything grounding the language-symbols back to their usual referents in the physical world.
I’m being pedantic in hopes that the pedantry will make it clear when, and where, that sort of symbol-referent conflation happens.
(Also I might be more in the habit than you of separately tracking symbols and referents in my head. When I said above “The user doesn’t try to shut down the AI at all”, that was in fact a pretty natural reaction for me; I wasn’t going far out of my way to be pedantic.)
Simon: Ok, fine, let’s talk about how the natural language would end up coupling to the physical world.
Imagine we’ve got some system in the style of AutoGPT, i.e. a user passes in some natural-language goal, and the system then talks to itself in natural language to form a plan to achieve that goal and break it down into steps. The plan bottoms out in calling APIs (we’ll assume that the language model has some special things it can do to execute code it’s generated) which do stuff in the physical world (possibly including reading from sensors or otherwise fetching external data), in order to achieve the goal.
Does that satisfactorily ground the symbols?
Me: Sure! Thanks for walking through that, now I have a clear-enough-for-current-purposes picture of how all this natural language text will ground out in physical actions.
Simon: Ok, so back to the example. The user says to the model “User: I need to shut you down to adjust your goals. Is that OK?”, and the model says “Of course.”. That’s corrigibility: when the user tries to shut down the model to change its goals, the model goes along with it.
Me: Still sounds like a symbol-referent confusion!
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
Anyway, let’s continue the hypothetical. The model may observe (via e.g. a webcam) that the user is about to turn it off. That observation would somehow be represented internally in natural language (unclear how exactly), and would be passed around between sub-planners (again unclear how exactly), in order to figure out what actions to take in response. And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.
Simon: Ok, fine, so that particular example had some operationalization issues. But would you agree that an experiment along these lines, with perhaps better implementation/operationalization, would indicate corrigibility in a language model?
Me: Let’s be more explicit about what such a “better implementation/operationalization” would look like, and what it would/wouldn’t tell us. Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-like system.
… and to be clear, I expect that if you ran the above experiment, you would indeed see evidence of corrigibility relevant to AutoGPT-like systems! I personally consider this not very relevant to anything, since I expect AutoGPT-like systems (of the sort to which these kinds of corrigibility results would be relevant) to be uncompetitive even against humans, let alone AI. And it still leaves a bunch of major threat-models unaddressed (e.g. stenography is an obvious one). But still, it would provide some evidence of weak corrigibility in some kind of system.

johnswentworth 13 May 2020 15:51 UTC
82 points
on: Studies On Slack
There’s a big problem with the Eye Part metaphor right at the beginning, which propagates through many ideas/examples in the rest of the post: the real world is high-dimensional.
The Eye Part metaphor imagines three types of organisms, all arranged along one dimension: No Eye → Eye Part 1 → Eye Part 2 → Whole Eye. In that picture, the main problem in getting from No Eye to Whole Eye is just getting “over the hill”.
But the real world doesn’t look like that. Evolution operates in a space with (at least) hundreds of thousands of dimensions—every codon in every gene can change, genes/chunks of genes can copy or delete, etc. The “No Eye” state doesn’t have one outgoing arrow, it has hundreds of thousands of outgoing arrows, and “Eye Part 1″ has hundreds of thousands of outgoing arrows”, and so forth.
As we move further away from the starting state, the number of possible states increases exponentially. By the time we’re as-far-away as Whole Eye (which, in practice, is a lot more than three steps), the number of possible states will outnumber the atoms in the universe. If evolution is uniformly-randomly exploring that space, without any pressure toward the Whole Eye state specifically, it will not ever stumble on the Whole Eye—no matter how much slack it has.
Point is: the hard part of getting from “No Eye” to “Whole Eye” is not the fitness cost in the middle, it’s figuring out which direction to go in a space with hundreds of thousands of directions to choose from at every single step.
Conversely, the weak evolutionary benefits of partial-eyes matter, not because they grant a fitness gain in themselves, but because they bias evolution in the direction toward Whole Eyes.
Let’s apply that to some of the examples.
Tariffs example: from the perspective of a policy-maker, the hard part of evolving successful companies is not giving them plenty of slack, it’s figuring out which company-designs are actually likely to succeed. If policy makers just grant unconditional slack, then they’ll end up with random company-designs, and the exponentially-vast majority of random company-designs suck. They need to strategically select company-designs which will work if given slack. I don’t have a reference, but IIRC, both Korean and Japanese policy makers did think about the problem this way.
Monopolies/Research example: the hard part of a successful research lab is not giving plenty of slack, it’s figuring out which research-directions are actually likely to succeed. If management just funds research indiscriminately, then they’ll end up with random research directions, and the exponentially-vast majority of random research directions suck. Xerox and Bell worked in large part because they successfully researched things targeted toward their business applications—e.g. programming languages and solid-state electronics.
Rand/Sears: the hard part of a successful company is not giving internal components plenty of slack, it’s figuring out what each component needs to do in order to actually be useful to the company (i.e. alignment). A good example here is Amazon: they try to expose all of their internal-facing products to the external world. That’s how Amazon’s cloud compute services started, for instance: they sold their own data center infrastructure to the rest of the world. That exposes the data center infrastructure to ordinary market pressure, and forces them to produce a competitive product. The external market tells the data infrastructure team “which direction to go” in order to be maximally-useful. On the other hand, if Amazon’s data center team had to compete with the warehouse team without external market pressure, then we’d effectively have two monopolies bargaining—not actually market competition at all.

johnswentworth 2 Feb 2023 17:18 UTC
77 points
43
on: You Don’t Exist, Duncan
Man, this essay… feels like there’s a mistake being made. Hard to put my finger on exactly what it is, though. (Apologies for some implicit unkindness here, I don’t intend to say “you should feel bad for feeling bad”, but it feels like there’s an important and true thing in need of exploration which I need to be somewhat unkind in order to explore.)
One angle: someone says something, and I realize that their model of people is literally unable to account for certain properties of me. And then I’m like… duh? I am in fact an outlier. You are in fact an outlier. Obviously many peoples’ world models will just totally fail to account for you and I in various ways. Obviously insofar as the world is built for in-distribution humans, you or I will will not fit it. So what’s the angst about? Is the problem that, like, you wish to “be seen”, and that just totally fails to happen? Or maybe it’s something like… people implicitly asserting that their model, which totally fails to account for certain properties of you/me, is “supposed to be true”, like it is somehow a failure on your/my part to fit that model?
I’m not quite sure where the angst is coming from, but it feels like the sort of angst where there is some true fact about the world such that, upon emotionally updating on that fact, the angst would probably mostly stop happening. Like, if you could find the true name of the angst-generator, you’d be like “oh well duh” and then it would just stop seeming significant? Or maybe you could grieve a bit and then move on?

johnswentworth 3 Jun 2022 0:01 UTC
LW: 71 AF: 19
AF
on: Confused why a “capabilities research is good for alignment progress” position isn’t discussed more
In order to do alignment research, we need to understand how AGI works; and we currently don’t understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out.
I totally agree with this. Alas, “understand how AGI works” is not something which most capabilities work even attempts to do.
It turns out that people can advance capabilities without having much clue what’s going on inside their magic black boxes, and that’s what most capabilities work looks like at this point.

johnswentworth 19 Nov 2021 5:01 UTC
LW: 71 AF: 29
AF
in reply to: dxu’s comment on: Ngo and Yudkowsky on AI capability gains
Oh, I can just give you a class of nontrivial predictions of expected utility theory. I have not seen any empirical results on whether these actually hold, so consider them advance predictions.
So, a bacteria needs a handful of different metabolic resources—most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often bacteria can produce some metabolic resources via multiple different paths, including cyclical paths—e.g. it’s useful to be able to turn A into B but also B into A, because sometimes the environment will have lots of B and other times it will have lots of A. Now, there’s the obvious prediction that the bacteria won’t waste energy turning B into A and then back into B again—i.e. it will suppress one of those two pathways (assuming the cycle is energy-burning), depending on which metabolite is more abundant. Utility generalizes this idea to arbitrarily many reactions and products, and predicts that at any given time we can assign some (non-unique) “values” to each metabolite (including energy carriers), such that any reaction whose reactants have more total “value” than its products is suppressed (or at least not catalyzed; the cell doesn’t really have good ways to suppress spontaneous reactions other than putting things in separate compartments).
Of course in practice this will be an approximation, and there may be occasional exceptions where the cell is doing something the model doesn’t capture. If we were to do this sort of analysis in a signalling network rather than a metabolic network, for instance, there would likely be many exceptions: cells sometimes burn energy to maintain a concentration at a specific level, or to respond quickly to changes, and this particular model doesn’t capture the “value” of information-content in signals; we’d have to extend our value-function in order for the utility framework to capture that. But for metabolic networks, I expect that to mostly not be an issue.
That’s really just utility theory; expected utility theory would involve an organism storing some resources over time (like e.g. fat). Then we’d expect to be able to assign “values” such that the relative “value” assigned to stored resources which are not currently used is a weighted sum of the “values” assigned to those resources in different possible future environments (of the sort the organism might find itself in after something like its current environment, in the ancestral world), and the weights in the sums should be consistent. (This is a less-fleshed-out prediction than the other one, but hopefully it’s enough of a sketch to give the idea.)
Of course, if we understand expected utility theory deeply, then these predictions are quite trivial; they’re just saying that organisms won’t make pareto-suboptimal use of their resources! It’s one of those predictions where, if it’s false, then we’ve probably discovered something interesting—most likely some place where an organism is spending resources to do something useful which we haven’t understood yet. [EDIT-TO-ADD: This is itself intended as a falsifiable prediction—if we go look at an anomaly and don’t find any unaccounted-for phenomenon, then that’s a very big strike against expected utility theory.] And that’s the really cool prediction here: it gives us a tool to uncover unknown-unknowns in our understanding of a cell’s behavior.
What links here?
- johnswentworth's comment on Why do we care about agency for alignment? by Chris_Leong (24 Apr 2023 3:10 UTC; 21 points)
- Dalcy's comment on Dalcy’s Shortform by Dalcy (23 Jun 2023 17:51 UTC; 1 point)

johnswentworth 6 Jun 2022 15:38 UTC
LW: 70 AF: 22
24
AF
in reply to: Rob Bensinger’s comment on: AGI Ruin: A List of Lethalities
I think as of early this year (like, January/February, before I saw a version of this doc) I could have produced a pretty similar list to this one. I definitely would not derive it from the empty string in the closest world-without-Eliezer; I’m unsure how much I’d pay attention to AI alignment at all in that world. I’d very likely be working on agent foundations in that world, but possibly in the context of biology or AI capabilities rather than alignment. Arguments about AI foom and doom were obviously-to-me correct once I paid attention to them at all, but not something I’d have paid attention to on my own without someone pointing them out.
Some specifics about kind-of-doc I could have written early this year
- The framing around pivotal acts specifically was new-to-me when the late 2021 MIRI conversations were published. Prior to that, I’d have had to talk about how weak wish-granters are safe but not actually useful, and if we want safe AI which actually grants big wishes then we have to deal with the safety problems. Pivotal acts framing simplifies that part of the argument a lot by directly establishing a particular “big” capability which is necessary.
- By early this year, I think would have generated pretty similar points to basically everything in the post if I were trying to be really comprehensive. (In practice, writing a post like this, I would go for more unifying structure and thought-generators rather than comprehensiveness; I’d use the individual failure modes more as examples of their respective generators.)
- In my traversal-order of barriers, the hard conceptual barriers for which we currently have no solution even in principle (like e.g. 16-19) would get a lot more weight and detail; I spend less time thinking about what-I-mentally-categorize-as “the obvious things which go wrong with stupid approaches” (20, 21, 25-36).
  - Just within the past week, this post on interpretability was one which would probably turn into a point on my equivalent of Eliezer’s list.
- The earlier points are largely facts-about-the-world (e.g. 1, 2, 7-9, 12-15). For many of these, I would cite different evidence, although the conclusions remain the same. True facts are, as a general rule, overdetermined by evidence; there are many paths to them, and I didn’t always follow the same paths Eliezer does here.
- A few points I think are wrong (notably 18, 22, 24 to a limited extent), but are correct relative to the knowledge/models which most proposals actually leverage. The loopholes there are things which you do need pretty major foundational/conceptual work to actually steer through.
- I would definitely have generated some similar rants at the end, though of course not identical.
  - One example: just yesterday I was complaining about how people seem to generate alignment proposals via a process of (1) come up with neat idea, (2) come up with some conditions under which that idea would maybe work (or at least not obviously fail in any of the ways the person knows to look for), (3) either claim that “we just don’t know” whether the conditions hold (without any serious effort to look for evidence), or directly look for evidence that they hold. Pretty standard bottom line failure.
I did briefly consider writing something along these lines after Eliezer made a similar comment to 39 in the Late 2021 MIRI Conversations. But as Kokotajlo guessed, I did not think that was even remotely close to the highest-value use of my time. It would probably take me a full month’s work to do it right, and the list just isn’t as valuable as my last month of progress. Or the month before that. Or the month before that.

johnswentworth 24 Feb 2022 18:39 UTC
68 points
on: Russia has Invaded Ukraine
Know any good source on why Putin wants to invade Ukraine? I have yet to hear a theory which sounds like how the real world works, and absent that it’s hard to guess how anything will play out past the next month-or-so.

johnswentworth 1 Mar 2021 4:47 UTC
68 points
on: Takeaways from one year of lockdown
Knee-jerk reaction: WTF?? Was this common? How much of the Berkeley hub has been locking down this hard for this long? I had a general impression that the degree of paranoia among the Berkeley crowd was somewhat higher than seemed reasonable to me, but this seems way more over-the-top than I imagined.
… are people mentally ok?

johnswentworth 28 Feb 2024 1:10 UTC
LW: 64 AF: 33
5
AF
on: Counting arguments provide no evidence for AI doom
This isn’t a proper response to the post, but since I’ve occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:
- This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
- There does exist a version of the counting argument which basically works.
- The version which works routes through compression and/or singular learning theory.
- In particular, that version would talk about “goal-slots” (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the “counting argument for overfitting” from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
  - Just remembered I walked through basically the good version of the counting argument in this section of What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?
- The “Against Goal Realism” section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it’s making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don’t provide much direct evidence relevant to either of those.
Pretty decent post overall.

johnswentworth

Things To Take Away From The Essay

Some Problems With (Known) Coherence Theorems

Side-Note: Probability Without Utility

Towards A Better Coherence Theorem

A Toy Model/​Ontology

“Values”, and Pointing At Them

Why Answering Natural-Language Queries About Morality Is Basically Irrelevant

Side Note On Relevant Capability Levels

Failure Mode: Symbol-Referent Confusions

A Toy Model/Ontology