My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com
TurnTrout
“Whelp, people are spiky. Often, the things that are inconvenient about people are tightly entwined with the things that are valuable about them. Often people can change, but not in many of the obvious ways you might naively think.” So I’m sort of modeling Nate (or Eliezer, although he doesn’t sound as relevant) as sort of a fixed cognitive resource, without too much flexibility on how they can deploy that resource.
I perceive some amount of “giving up” on maintaining social incentives in this comment. I think that’s a mistake, especially when the people are in positions of great power and status in this community.
I think the quoted passage advances an attitude which, in general, allows community invasion by malefactors and other bad actors. Social norms are present for a reason. I think it’s reasonable and healthy to expect people to engage in a respectful and professional manner.
If some individual (like Nate) finds these norms costly for some reason, then that shouldn’t mean “banishment” or “conclude they have bad intent” so much as—at the minimum—”they should clearly communicate their non-respectful/-kind alternative communication protocols beforehand, and they should help the other person maintain their boundaries; if not, face the normal consequences for being rude, like ‘proportional loss of social regard’.”[1]
I, personally, have been on the receiving end of (what felt to me like) a Nate-bulldozing, which killed my excitement for engaging with the MIRI-sphere, and also punctured my excitement for doing alignment theory. (Relatedly, Eliezer doing the “state non-obvious takes in an obvious tone, and decline to elaborate” thing which Thomas mentioned earlier.) Nate did later give his condolences for the bulldozing and for not informing me of his communication style before the conversation, which I appreciated.
But from what I’ve heard, a lot of people have had emotionally bad experiences talking with Nate about alignment research.
- ^
I think “communicate before interacting” still runs into failures like “many people (including myself) aren’t sufficiently good at maintaining their own boundaries, such that they actually back out if it starts feeling bad.”
Plus, someone might sunk-cost themselves about the conversation and the marginal cost of additional emotional damage.
- 10 Oct 2023 20:06 UTC; 11 points) 's comment on Related Discussion from Thomas Kwa’s MIRI Research Experience by (
- 11 Oct 2023 7:30 UTC; 6 points) 's comment on Related Discussion from Thomas Kwa’s MIRI Research Experience by (
- ^
I appreciate that your proposal makes a semblance of an effort to prevent AGI ruin, but you’re missing an obvious loophole by which AGI could weasel into our universe: humans imagining what it would be like to have technology.
If a person is allowed to think about technology, they are allowed to think about malign superintelligences. Not only could a malign superintelligence acausally blackmail a person (even if they don’t have clothes), but the AI could mind-hack the person into becoming the AI’s puppet. Then you basically have a malign superintelligence puppeteering a human living on our “safe” and “technology-free” planet.
I therefore conclude that even if we implemented your proposal, it would be sadly and hilariously inadequate. However, I applaud you for at least trying to look like you were trying to try to stave off AGI ruin.
Rationality exercise: Take a set of Wikipedia articles on topics which trainees are somewhat familiar with, and then randomly select a small number of claims to negate (negating the immediate context as well, so that you can’t just syntactically discover which claims were negated).
By the time they are born, infants can recognize and have a preference for their mother’s voice suggesting some prenatal development of auditory perception.
-> modified to
Contrary to early theories, newborn infants are not particularly adept at picking out their mother’s voice from other voices. This suggests the absence of prenatal development of auditory perception.
Sometimes, trainees will be given a totally unmodified article. For brevity, the articles can be trimmed of irrelevant sections.
Benefits:
Addressing key rationality skills. Noticing confusion; being more confused by fiction than fact; actually checking claims against your models of the world.
If you fail, either the article wasn’t negated skillfully (“5 people died in 2021” → “4 people died in 2021″ is not the right kind of modification), you don’t have good models of the domain, or you didn’t pay enough attention to your confusion.
Either of the last two are good to learn.
Scalable across participants. Many people can learn from each modified article.
Scalable across time. Once a modified article has been produced, it can be used repeatedly.
Crowdsourcable. You can put out a bounty for good negated articles, run them in a few control groups, and then pay based on some function of how good the article was. Unlike original alignment research or CFAR technique mentoring, article negation requires skills more likely to be present outside of Rationalist circles.
I think the key challenge is that the writer must be able to match the style, jargon, and flow of the selected articles.
- Looking back on my alignment PhD by 1 Jul 2022 3:19 UTC; 319 points) (
- Bayesian updating in real life is mostly about understanding your hypotheses by 1 Jan 2024 0:10 UTC; 62 points) (
- 13 Sep 2023 20:32 UTC; 4 points) 's comment on Exercise: Solve “Thinking Physics” by (
OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked in the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn’t found any qualitatively new obstacles which might present deep challenges to my new view on alignment.
Here’s one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever found to exist ever. If ontological failure is such a nasty problem in AI alignment, how come very few people do terrible things because they forgot how to bind their “love” value to configurations of atoms? If it’s really hard to get intelligences to care about reality, how does the genome do it millions of times each day?
Taking an item from your lethalities post:
19… More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
There is a guaranteed-to-exist mechanistic story for how the human genome solves lethality no.19, because people do reliably form (at least some of) their values around their model of reality. (For more on what I mean by this, see this comment.) I think the genome probably does solve this lethality using loss functions and relatively crude reward signals, and I think I have a pretty good idea of how that happens.
I haven’t made a public post out of my document on shard theory yet, because idea inoculation. Apparently, the document isn’t yet written well enough to yank people out of their current misframings of alignment. Maybe the doc has clicked for 10 people. Most readers trip on a miscommunication, stop far before they can understand the key insights, and taper off because it seems like Just Another Speculative Theory. I apparently don’t know how to credibly communicate that the theory is at the level of actually really important to evaluate & critique ASAP, because time is slipping away. But I’ll keep trying anyways.
- ^
I’m attempting this comment in the hopes that it communicates something. Perhaps this comment is still unclear, in which case I ask the reader’s patience for improved future communication attempts.
- ^
Like
1. “Human beings tend to bind their terminal values to their model of reality”, or
2. “Human beings reliably navigate ontological shifts. Children remain excited about animals after learning they are made out of cells. Physicists don’t stop caring about their family because they can model the world in terms of complex amplitudes.”
- Looking back on my alignment PhD by 1 Jul 2022 3:19 UTC; 319 points) (
- Many arguments for AI x-risk are wrong by 5 Mar 2024 2:31 UTC; 153 points) (
- Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight by 16 Nov 2022 13:54 UTC; 31 points) (
- 15 Dec 2022 23:16 UTC; 7 points) 's comment on wrapper-minds are the enemy by (
- ^
While this is an interesting piece of work, I have a bunch of concerns. They aren’t about the methodology, which seems sound, or about the claims made in the paper directly (the claims seemed carefully couched). It’s more about the overall presentation and the reaction to it.
First, the basic methodology is
We train an AI directly to do X only in context Y. It does X in context Y. Standard techniques are not able to undo this without knowing the context Y. Furthermore, the AI seems to reason consequentially about how to do X after we directly trained it to do so.
My main update was “oh I guess adversarial training has more narrow effects than I thought” and “I guess behavioral activating contexts are more narrowly confining than I thought.” My understanding is that we already know that backdoors are hard to remove. I don’t know what other technical updates I’m supposed to make?
EDIT: To be clear, I was somewhat surprised by some of these results; I’m definitely not trying to call this trite or predictable. This paper has a lot of neat gems which weren’t in prior work.
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”
I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.
Third, while this seems like good empirical work and the experiments seem quite well-run, and definitely have new things to offer over past work—This is qualitatively the kind of update I could have gotten from any of a range of papers on backdoors, as long as one has the imagination to generalize from “it was hard to remove a backdoor for toxic behavior” to more general updates about the efficacy and scope of modern techniques against specific threat models. So insofar as some people think this is a uniquely important result for alignment, I would disagree with that. (But I still think it’s good work, in an isolated setting.)
I am worried that, due to the presentation, this work can have a “propaganda”-ish feel to it, which artificially inflates the persuasiveness of the hard data presented. I’m worried about this affecting policy by unduly increasing concern.
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to a something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably
treat this paper as “kinda proof that deceptive alignment is real” (even though you didn’t claim that in the paper!), and
that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things), and
conclude that e.g. “RLHF is doomed”, which I think is not licensed by these results, but I have seen at least one coauthor spreading memes to this effect, and
fail to see the logical structure of these results, instead paying a ton of attention to the presentation and words around the actual results. People do this all the time, from “the point of RL is to maximize reward” to the “‘predictive’ loss functions train ’predictors” stuff, people love to pay attention to the English window-dressing of results.
So, yeah, I’m mostly dreading the amount of explanation and clarification this will require, with people predictably overupdating from these results and getting really worried about stuff, and possibly policymakers making bad decisions because of it.
ETA: Softened language to clarify that I believe this paper is novel in a bunch of ways, and am more pushing back on some overupdates I think I’ve seen.
- On Anthropic’s Sleeper Agents Paper by 17 Jan 2024 16:10 UTC; 54 points) (
- Inducing Unprompted Misalignment in LLMs by 19 Apr 2024 20:00 UTC; 37 points) (
- 4 Mar 2024 16:04 UTC; 6 points) 's comment on Counting arguments provide no evidence for AI doom by (
Yup, I’ve been disappointed with how unkindly Eliezer treats people sometimes. Bad example to set.
EDIT: Although I note your comment’s first sentence is also hostile, which I think is also bad.
We have heard that Conjecture misrepresent themselves in engagement with the government, presenting themselves as experts with stature in the AIS community, when in reality they are not.
What does it mean for Conjecture to be “experts with stature in the AIS community”? Can you clarify what metrics comprise expertise in AIS—are you dissatisfied with their demonstrated grasp of alignment work, or perhaps their research output, or maybe something a little more qualitative?
Basically, this excerpt reads like a crisp claim of common knowledge (“in reality”) but the content seems more like a personal judgment call by the author(s).
In an alternate universe, someone wrote a counterpart to There’s No Fire Alarm for Artificial General Intelligence:
Okay, let’s be blunt here. I don’t think most of the discourse about alignment being really hard is being generated by models of machine learning at all. I don’t think we’re looking at wrong models; I think we’re looking at no models.
I was once at a conference where there was a panel full of famous AI alignment luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI alignment is really hard and unaddressed by modern alignment research, except for two famous AI luminaries who stayed quiet and let others take the microphone.
I got up in Q&A and said, “Okay, you’ve all told us that alignment is hard. But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a ‘non-agentic’ system, that you are very confident cannot be done safely and non-agentically in the next two years.”
There was a silence.
Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. They named “Running a factory competently.”
A few months after that panel, there was unexpectedly a big breakthrough on LLM/management integration.
The point is the silence that fell after my question, and that eventually I only got one reply, spoken in tentative tones. When I asked for concrete feats that were impossible in the next two years, I think that that’s when the luminaries on that panel switched to trying to build a mental model of future progress in AI alignment, asking themselves what they could or couldn’t predict, what they knew or didn’t know. And to their credit, most of them did know their profession well enough to realize that forecasting future boundaries around a rapidly moving field is actually really hard, that nobody knows what will appear on arXiv next month, and that they needed to put wide credibility intervals with very generous upper bounds on how much progress might take place twenty-four months’ worth of arXiv papers later.
(Also, Rohin Shah was present, so they all knew that if they named something insufficiently impossible, Rohin would have DeepMind go and do it.)
The question I asked was in a completely different genre from the panel discussion, requiring a mental context switch: the assembled luminaries actually had to try to consult their rough, scarce-formed intuitive models of progress in AI alignment and figure out what future experiences, if any, their model of the field definitely prohibited within a two-year time horizon. Instead of, well, emitting socially desirable verbal behavior meant to kill that darned optimism around AGI alignment and get some predictable applause from the audience.
I’ll be blunt: I don’t think the confident doom-and-gloom is entangled with non-social reality. If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.
Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the “behaviorist sense” expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise.
This seems like a great spot to make some falsifiable predictions which discriminate your particular theory from the pack. (As it stands, I don’t see a reason to buy into this particular chain of reasoning.)
AIs will increasingly be deployed and tuned for long-term tasks, so we can probably see the results relatively soon. So—do you have any predictions to share? I predict that AIs can indeed do long-context tasks (like writing books with foreshadowing) without having general, cross-situational goal-directedness.[1]
I have a more precise prediction:
AIs can write novels with at least 50% winrate against a randomly selected novel from a typical American bookstore, as judged by blinded human raters or LLMs which have at least 70% agreement with human raters on reasonably similar tasks.
Credence: 70%; resolution date: 12/1/2025
Conditional on that, I predict with 85% confidence that it’s possible to do this with AIs which are basically as tool-like as GPT-4. I don’t know how to operationalize that in a way you’d agree to.
(I also predict that on 12/1/2025, there will be a new defense offered for MIRI-circle views, and a range of people still won’t update.)
- ^
I expect most of real-world “agency” to be elicited by the scaffolding directly prompting for it (e.g. setting up a plan/critique/execute/summarize-and-postmortem/repeat loop for the LLM), and for that agency to not come from the LLM itself.
- How do you feel about LessWrong these days? [Open feedback thread] by 5 Dec 2023 20:54 UTC; 101 points) (
- How do you feel about LessWrong these days? [Open feedback thread] by 5 Dec 2023 20:54 UTC; 101 points) (
- 1 Jan 2024 18:04 UTC; 4 points) 's comment on Bayesian updating in real life is mostly about understanding your hypotheses by (
- ^
Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don’t have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you’re just asking about math homework.
Aside: This was kinda a “holy shit” moment, and I’ll try to do it justice here. I encourage the reader to do a serious dependency check on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide
I agree that conditional on entraining consequentialist cognition which has a “different goal” (as thought of by MIRI; this isn’t a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detriment.
I contest that there’s very little reason to expect “undesired, covert, and consistent-across-situations inner goals” to crop up in LLMs to begin with. An example alternative prediction is:
LLMs will continue doing what they’re told. They learn contextual goal-directed behavior abilities, but only apply them narrowly in certain contexts for a range of goals (e.g. think about how to win a strategy game). They also memorize a lot of random data (instead of deriving some theory which simply explains its historical training data a la Solomonoff Induction).
Not only is this performant, it seems to be what we actually observe today. The AI can pursue goals when prompted to do so, but it isn’t pursuing them on its own. It basically follows instructions in a reasonable way, just like GPT-4 usually does.
Why should we believe the “consistent-across-situations inner goals → deceptive alignment” mechanistic claim about how SGD works? Here are the main arguments I’m aware of:
Analogies to evolution (e.g. page 6 of Risks from Learned Optimization)
I think these loose analogies provide basically no evidence about what happens in an extremely different optimization process (SGD to train LLMs).
Counting arguments: there are more unaligned goals than aligned goals (e.g. as argued in How likely is deceptive alignment?)
These ignore the importance of the parameter->function map. (They’re counting functions when they need to be counting parameterizations.) Classical learning theory made the (mechanistically) same mistake in predicting that overparameterized models would fail to generalize.
I also basically deny the relevance of the counting argument, because I don’t buy the assumption of “there’s gonna be an inner ‘objective’ distinct from inner capabilities; let’s make a counting argument about what that will be.”
Speculation about simplicity bias: SGD will entrain consequentialism because that’s a simple algorithm for “getting low loss”
But we already know that simplicity bias in the NN prior can be really hard to reason about.
I think it’s unrealistic to imagine that we have the level of theoretical precision to go “it’ll be a future training process and the model is ‘getting selected for low loss’, so I can now make this very detailed prediction about the inner mechanistic structure.”[1]
I falsifiably predict that if you try to use this kind of logic or counting argument today to make falsifiable predictions about unobserved LLM generalization, you’re going to lose Bayes points left and right.
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected.
Instead, I think that we enter the realm of tool AI[2] which basically does what you say.[3] I think that world’s a lot friendlier, even though there are still some challenges I’m worried about—like an AI being scaffolded into pursuing consistent goals. (I think that’s a very substantially different risk regime, though)
- ^
(Even though this predicted mechanistic structure doesn’t have any apparent manifestation in current reality.)
- ^
Tool AI which can be purposefully scaffolded into agentic systems, which somewhat handles objections from Amdahl’s law.
- ^
This is what we actually have today, in reality. In these setups, the agency comes from the system of subroutine calls to the LLM during e.g. a plan/critique/execute/evaluate loop a la AutoGPT.
- Many arguments for AI x-risk are wrong by 5 Mar 2024 2:31 UTC; 153 points) (
- How do you feel about LessWrong these days? [Open feedback thread] by 5 Dec 2023 20:54 UTC; 101 points) (
- 16 Jan 2024 2:12 UTC; 3 points) 's comment on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by (
- 1 Jan 2024 19:47 UTC; 2 points) 's comment on TurnTrout’s shortform feed by (
Some arguments which Eliezer advanced in order to dismiss neural networks,[1] seem similar to some reasoning which he deploys in his modern alignment arguments.
Compare his incorrect mockery from 2008:
But there is just no law which says that if X has property A and Y has property A then X and Y must share any other property. “I built my network, and it’s massively parallel and interconnected and complicated, just like the human brain from which intelligence emerges! Behold, now intelligence shall emerge from this neural network as well!” And nothing happens. Why should it?
with his claim in Alexander and Yudkowsky on AGI goals:
[Alexander][14:36]
Like, we’re not going to run evolution in a way where we naturally get AI morality the same way we got human morality, but why can’t we observe how evolution implemented human morality, and then try AIs that have the same implementation design? [Yudkowsky][14:37]
Not if it’s based on anything remotely like the current paradigm, because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
Like, in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”
I agree that 100 quadrillion artificial neurons + loss function won’t get you a literal human, for trivial reasons. The relevant point is his latter claim: “in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”″
I think this is a very strong conclusion, relative to available data. I think that a good argument for it would require a lot of technical, non-analogical reasoning about the inductive biases of SGD on large language models. But, AFAICT, Eliezer rarely deploys technical reasoning that depends on experimental results or ML theory. He seems to prefer strongly-worded a priori arguments that are basically analogies.
In the above two quotes of his,[3] I perceive a common thread of
human intelligence/alignment comes from a lot of factors; you can’t just ape one of the factors and expect the rest to follow; to get a mind which thinks/wants as humans do, that mind must be as close to a human as humans are to each other.
But why is this true? You can just replace “human intelligence” with “avian flight”, and the argument might sound similarly plausible a priori.
ETA: The invalid reasoning step is in the last clause (“to get a mind...”). If design X exhibits property P, that doesn’t mean that design Y must be similar to X in order to exhibit property P.
ETA: Part of this comment was about EY dismissing neural networks in 2008. It seems to me that the cited writing supports that interpretation, and it’s still my best guess (see also DirectedEvolution’s comments). However, the quotes are also compatible with EY merely criticizing invalid reasons for expecting neural networks to work. I should have written that part of this comment more carefully, and not claimed observation (“he did dismiss”) when I only had inference (“sure seems like he dismissed”).
I think the rest of my point stands unaffected (EY often advances vague arguments that are analogies, or a priori thought experiments).
ETA 2: I’m now more confident in my read. Eliezer said this directly:
I’m no fan of neurons; this may be clearer from other posts.
- ^
It’s this kind of apparent misprediction which has, over time, made me take less seriously Eliezer’s models of intelligence and alignment. See also e.g. the cited GAN mis-retrodiction. This change led me to flag / rederive all of my beliefs about rationality/optimization for a while.
(At least, his 2008-era models seemed faulty to the point of this misprediction, and it doesn’t seem to me that this part of his models has changed much, though I claim no intimate non-public knowledge of his beliefs; just operating on my impressions here.)
- ^
See also Failure By Analogy:
Wasn’t it in some sense reasonable to have high hopes of neural networks? After all, they’re just like the human brain, which is also massively parallel, distributed, asynchronous, and -
Hold on. Why not analogize to an earthworm’s brain, instead of a human’s?
A backprop network with sigmoid units… actually doesn’t much resemble biology at all. Around as much as a voodoo doll resembles its victim. The surface shape may look vaguely similar in extremely superficial aspects at a first glance. But the interiors and behaviors, and basically the whole thing apart from the surface, are nothing at all alike. All that biological neurons have in common with gradient-optimization ANNs is… the spiderwebby look.
And who says that the spiderwebby look is the important fact about biology? Maybe the performance of biological brains has nothing to do with being made out of neurons, and everything to do with the cumulative selection pressure put into the design.
- ^
Originally, this comment included:
So, here are two claims which seem to echo the positions Eliezer advances:
1. “A large ANN doesn’t look enough like a human brain to develop intelligence.” → wrong (see GPT-4)
2. “A large ANN doesn’t look enough like a human brain to learn ‘don’t steal’ rather than ‘don’t get caught’” → (not yet known)I struck this from the body because I think (1) misrepresents his position. Eliezer is happy to speculate about non-anthropomorphic general intelligence (see e.g. That Alien Message). Also, I think this claim comparison does not name my real objection here, which is better advanced by the updated body of this comment.
- 24 Apr 2023 22:55 UTC; 22 points) 's comment on Contra Yudkowsky on AI Doom by (
- ^
A semi-formalization of shard theory. I think that there is a surprisingly deep link between “the AIs which can be manipulated using steering vectors” and “policies which are made of shards.”[1] In particular, here is a candidate definition of a shard theoretic policy:
A policy has shards if it implements at least two “motivational circuits” (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).
By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction):
On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It’s just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.
This definition also makes obvious the fact that “shards” are a matter of implementation, not of behavior.
It also captures the fact that “shard” definitions are somewhat subjective. In one moment, I might model someone is having a separate “ice cream shard” and “cookie shard”, but in another situation I might choose to model those two circuits as a larger “sweet food shard.”
So I think this captures something important. However, it leaves a few things to be desired:
What, exactly, is a “motivational circuit”? Obvious definitions seem to include every neural network with nonconstant outputs.
Demanding a compositional representation is unrealistic since it ignores superposition. If dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have shards, which seems obviously wrong and false.
That said, I still find this definition useful.
I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.
- ^
Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model.
It seems like you think that human preferences are only being “predicted” by GPT-4, and not “preferred.” If so, why do you think that?
I commonly encounter people expressing sentiments like “prosaic alignment work isn’t real alignment, because we aren’t actually getting the AI to care about X.” To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these claims, such that you are far more likely to make these claims in worlds where the claims are actually true?
(On my pessimistic days, I wonder if this kind of claim gets made because humans write suggestive phrases like “predictive loss function” in their papers, next to the mathematical formalisms.)
- Dreams of AI alignment: The danger of suggestive names by 10 Feb 2024 1:22 UTC; 91 points) (
- 23 Oct 2023 17:31 UTC; 21 points) 's comment on Alignment Implications of LLM Successes: a Debate in One Act by (
This may seem like a gross or weirdly personal question but I think it’s actually quite important.
I’d like to express social approval of this kind of question going on this site. I see no reason why discussing the menstrual cycle should be any more taboo than discussing the REM cycle.
I find it concerning that you felt the need to write “This is not at all a criticism of the way this post was written. I am simply curious about my own reaction to it” (and still got downvoted?).
For my part, I both believe that this post contains valuable content and good arguments, and that it was annoying / rude / bothersome in certain sections.
Reading this post made me more optimistic about alignment and AI. My suspension of disbelief snapped; I realized how vague and bad a lot of these “classic” alignment arguments are, and how many of them are secretly vague analogies and intuitions about evolution.
While I agree with a few points on this list, I think this list is fundamentally misguided. The list is written in a language which assigns short encodings to confused and incorrect ideas. I think a person who tries to deeply internalize this post’s worldview will end up more confused about alignment and AI, and urge new researchers to not spend too much time trying to internalize this post’s ideas. (Definitely consider whether I am right in my claims here. Think for yourself. If you don’t know how to think for yourself, I wrote about exactly how to do it! But my guess is that deeply engaging with this post is, at best, a waste of time.[1])
I think this piece is not “overconfident”, because “overconfident” suggests that Lethalities is simply assigning extreme credences to reasonable questions (like “is deceptive alignment the default?”). Rather, I think both its predictions and questions are not reasonable because they are not located by good evidence or arguments. (Example: I think that deceptive alignment is only supported by flimsy arguments.)
I personally think Eliezer’s alignment worldview (as I understand it!) appears to exist in an alternative reality derived from unjustified background assumptions.[2] Given those assumptions, then sure, Eliezer’s reasoning steps are probably locally valid. But I think that in reality, most of this worldview ends up irrelevant and misleading because the background assumptions don’t hold.
I think this kind of worldview (socially attempts to) shield itself from falsification by e.g. claiming that modern systems “don’t count” for various reasons which I consider flimsy. But I think that deep learning experiments provide plenty of evidence on alignment questions.
But, hey, why not still include this piece in the review? I think it’s interesting to know what a particular influential person thought at a given point in time.
- ^
Related writing of mine: Some of my disagreements with List of Lethalities, Inner and outer alignment decompose one hard problem into two extremely hard problems.
Recommended further critiques of this worldview: Evolution is a bad analogy for AGI: inner alignment, Evolution provides no evidence for the sharp left turn, My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”.
- ^
Since Eliezer claims to have figured out so many ideas in the 2000s, his assumptions presumably were locked in before the advent of deep learning. This constitutes a “bottom line.”
- Many arguments for AI x-risk are wrong by 5 Mar 2024 2:31 UTC; 153 points) (
- 3 Jan 2024 23:51 UTC; 8 points) 's comment on Does LessWrong make a difference when it comes to AI alignment? by (
- ^
For the last two years, typing for 5+ minutes hurt my wrists. I tried a lot of things: shots, physical therapy, trigger-point therapy, acupuncture, massage tools, wrist and elbow braces at night, exercises, stretches. Sometimes it got better. Sometimes it got worse.
No Beat Saber, no lifting weights, and every time I read a damn book I would start translating the punctuation into Dragon NaturallySpeaking syntax.
Text: “Consider a bijection ”
My mental narrator: “Cap consider a bijection space dollar foxtrot colon cap x backslash tango oscar cap y dollar”
Have you ever tried dictating a math paper in LaTeX? Or dictating code? Telling your computer “click” and waiting a few seconds while resisting the temptation to just grab the mouse? Dictating your way through a computer science PhD?
And then.… and then, a month ago, I got fed up. What if it was all just in my head, at this point? I’m only 25. This is ridiculous. How can it possibly take me this long to heal such a minor injury?
I wanted my hands back—I wanted it real bad. I wanted it so bad that I did something dirty: I made myself believe something. Well, actually, I pretended to be a person who really, really believed his hands were fine and healing and the pain was all psychosomatic.
And… it worked, as far as I can tell. It totally worked. I haven’t dictated in over three weeks. I play Beat Saber as much as I please. I type for hours and hours a day with only the faintest traces of discomfort.
What?
- Lessons I’ve Learned from Self-Teaching by 23 Jan 2021 19:00 UTC; 346 points) (
- The “mind-body vicious cycle” model of RSI & back pain by 9 Jun 2022 12:30 UTC; 76 points) (
- 5 Sep 2021 17:23 UTC; 8 points) 's comment on Should I treat pain differently if it’s “all in my head?” by (
I think it’s pretty obvious.
Julia, Luke, Scott, and Eliezer know each other very well.
Exactly three months ago, they all happened to consult their mental simulations of each other for advice on their respective problems, at the same time.
Recognizing the recursion that would result if they all simulated each other simulating each other simulating each other… etc, they instead searched over logically-consistent universe histories, grading each one by expected utility.
Since each of the four has a slightly different utility function, they of course acausally negotiated a high-utility compromise universe-history.
This compromise history involves seemingly acausal blog post attribution cycles. There’s no (in-universe, causal) reason why those effects are there. It’s just the history that got selected.
The moral of the story is: by mastering rationality and becoming Not Wrong like we are today, you can simulate your friends to arbitrary precision. This saves you anywhere between $15-100/month on cell phone bills.
I think this is, unfortunately, true. One reason people might feel this way is because they view LessWrong posts through a social lens. Eliezer posts about how doomed alignment is and how stupid everyone else’s solution attempts are, that feels bad, you feel sheepish about disagreeing, etc.
But despite understandably having this reaction to the social dynamics, the important part of the situation is not the social dynamics. It is about finding technical solutions to prevent utter ruination. When I notice the status-calculators in my brain starting to crunch and chew on Eliezer’s posts, I tell them to be quiet, that’s not important, who cares whether he thinks I’m a fool. I enter a frame in which Eliezer is a generator of claims and statements, and often those claims and statements are interesting and even true, so I do pay attention to that generator’s outputs, but it’s still up to me to evaluate those claims and statements, to think for myself.
If Eliezer says everyone’s ideas are awful, that’s another claim to be evaluated. If Eliezer says we are doomed, that’s another claim to be evaluated. The point is not to argue Eliezer into agreement, or to earn his respect. The point is to win in reality, and I’m not going to do that by constantly worrying about whether I should shut up.
If I’m wrong on an object-level point, I’m wrong, and I’ll change my mind, and then keep working. The rest is distraction.