An objection I didn’t have time for in the above piece is something like “but what about Occam, though, and k-complexity? Won’t you most likely get the simple, boring, black shape, if you constrain it as in the above?”
This is why I’m concerned about deleterious effects of writing for the outgroup: I’m worried you end up optimizing your thinking for coming up with eloquent allegories to convey your intuitions to a mass audience, and end up not having time for the actual, non-allegorical explanation that would convince subject-matter experts (whose support would be awfully helpful in the desperate push for a Pause treaty).
I think we have a lot of intriguing theory and evidence pointing to a story where neural networks generalize because the parameter-to-function mapping is not one-to-one and is biased towards simple functions (as Occam and Solomonoff demand): to a first approximation, SGD is going to find the simplest function that fits the training data (because simple functions correspond to large “basins” of approximately equal loss, which are easy for SGD to find because they use fewer parameters or are more robust to some parameters being wrong), even though the network architecture is capable of representing astronomically many other functions that also fit the training data but have more complicated behavior elsewhere.
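(To make the “not one-to-one” point concrete, here’s a toy illustration of my own, not drawn from the cited literature: even a two-layer linear “network” has a continuous family of parameter settings that all compute the same function, exactly the kind of degeneracy that lets a simple function occupy a big basin.)

```python
import numpy as np

# Toy two-layer linear "network": f(x) = w2 * (w1 * x).
# Any weights with w1 * w2 = 6 compute the *same* function, so the
# parameter-to-function map is many-to-one: a whole one-dimensional
# family of parameter settings forms a flat "basin" of equal loss.
f_a = lambda x: 3.0 * (2.0 * x)  # w1=2, w2=3
f_b = lambda x: 1.0 * (6.0 * x)  # w1=6, w2=1

x = np.linspace(-1.0, 1.0, 5)
print(np.allclose(f_a(x), f_b(x)))  # True: distinct parameters, one function
```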
But if that story is correct, then “But what about Occam” isn’t something you can offhandedly address as an afterthought to an allegory about how misalignment is the default because there are astronomically many functions that fit the training data. Whether the simplest function is misaligned (as posited by List of Lethalities #20) is the thing you have to explain!
We do not have the benefit that a breeding program done on dogs or humans has, of having already “pinned down” a core creature with known core traits and variation being laid down in a fairly predictable manner. There’s only so far you can “stretch,” if you’re taking single steps at a time from the starting point of “dog” or “human.”
If anything, the alignment case for SGD looks a lot better than that for selective breeding, because we get to specify as many billions and billions of input–output pairs for our network to approximate as we want (with the misalignment risk being that, as you say, if we don’t know how to choose the right data, the network might not generalize the way we want). Imagine trying to breed a dog to speak perfect English the way LLMs do!
LW is giving me issues and I’m having a hard time getting to and staying on the page to reply; I don’t know how good my further engagement will be, as a result.
if we don’t know how to choose the right data, the network might not generalize the way we want
I want to be clear that I think the only sane prior is on “we don’t know how to choose the right data.” Like, I don’t think this is reasonably an “if.” I think the burden of proof is on “we’ve created a complete picture and constrained all the necessary axes,” à la cybersecurity, and that the present state of affairs with regards to LLM misalignment (and all the various ways that it keeps persisting/that things keep squirting sideways) bears this out. The claim is not “impossible/hopeless,” but “they haven’t even begun to make a case that would be compelling to someone actually paying attention.”
(iiuc, people like Paul Christiano, who are far more expert than me and definitely qualify as “actually paying attention,” find the case more plausible/promising, not compelling. I don’t know of an intellectual with grounded expertise whom I respect who is like “we’re definitely good, here, and I can tell you why in concrete specifics.” The people who are confident are clearly hand-waving, and the people who are not hand-waving are at best tentatively optimistic. re: but your position is hand-wavey, too, Duncan—I think a) much less so, and b) burden of proof should be on “we know how to do this safely” not “exhaustively demonstrate that it’s not safe.”)
I am interested in an answer to Joe’s reply, which seems to me like the live conversational head.
To be clear, I agree that the situation is objectively terrifying and it’s quite probable that everyone dies. I gave a copy of If Anyone Builds It to two math professors of my acquaintance at San Francisco State University (and gave $1K to MIRI) because, in that context, conveying the fact that we’re in danger was all I had bandwidth for (and I didn’t have a better book on hand for that).
But in the context of my own writing, everyone who’s paying attention to me already knows about existential risk; I want my words to be focused on being rigorous and correct, not scaring policymakers and the public (notwithstanding that policymakers and the public should in fact be scared).
To the end of being rigorous and correct, I’m claiming that the “each of these black shapes is basically just as good at passing that particular test” story isn’t a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
I don’t think “well, I’m pitching to middle schoolers” saves it. If the actual problem is that we don’t know what training data would imply the behavior we want, rather than the outcomes of deep learning being intrinsically super-chaotic—which would be an entirely reasonable thing to suspect if it’s 2005 and you’re reasoning abstractly about optimization without having any empirical results to learn from—then you should be talking about how we don’t know what teal shape to draw, not that we might get a really complicated black shape for all we know.
I am of course aware that in the political arena, the thing I’m doing here would mark me as “not a team player”. If I agree with the conclusion that superintelligence is terrifying, why would I critique an argument with that conclusion? That’s shooting my own side’s soldiers! I think it would be patronizing for me to explain what the problem with that is; you already know.
I do not see you as failing to be a team player re: existential risk from AI.
I do see you as something like … making a much larger update on the bias toward simple functions than I do. Like, it feels vaguely akin to … when someone quotes Ursula K. Le Guin’s opinion as if that settles some argument with finality?
I think the bias toward simple functions matters, and is real, and is cause for marginal hope and optimism, but “bias toward” feels insufficiently strong for me to be like “ah, okay, then the problem outlined above isn’t actually a problem.”
I do not, to be clear, believe that my essay contains falsehoods that become permissible because they help idiots or children make inferential leaps. I in fact thought the things that I said in my essay were true (with decently high confidence), and I still think that they are true (with slightly reduced confidence downstream of stuff like the link above).
(You will never ever ever ever ever see me telling someone a thing I know to be false because I believe that it will result in them outputting a correct belief or a correct behavior; if I do anything remotely like that I will headline explicitly that that’s what I’m doing, with words like “The following is a lie, but if you pretend it’s true for a minute you might have a true insight downstream of it.”)
(That link should take you to the subheading “Written April 2, 2022.”)
I think that we don’t know what teal shape to draw, and that drawing the teal shape perfectly would not be sufficient on its own. In future writing I’ll try to twitch those two threads a little further apart.
“bias toward” feels insufficiently strong for me to be like “ah, okay, then the problem outlined above isn’t actually a problem.”
You’re right; Steven Byrnes wrote me a really educational comment today about what the correct goal-counting argument looks like, which I need to think more about; I just think it’s really crucial that this is fundamentally an argument about generalization and inductive biases, which I think is being obscured in the black-shape metaphor when you write that “each of these black shapes is basically just as good at passing that particular test” as if it didn’t matter how complex the shape is.
(I don’t think talking to middle schoolers about inductive biases is necessarily hopeless; consider a box behind a tree.)
cause for marginal hope and optimism
I think the temptation to frame technical discussions in terms of pessimism vs. optimism is itself a political distortion that I’m trying to avoid. (Apparently not successfully, if I’m coming off as a voice of marginal hope and optimism.)
You wrote an analogy that attempts to explain a reason why it’s hard to make neural networks do what we want; I’m arguing that the analogy is misleading. That disagreement isn’t about whether the humans survive. It’s about what’s going on with neural networks, and the pedagogy of how to explain it. Even if I’m right, that doesn’t mean the humans survive: we could just be dead for other reasons. But as you know, what matters in rationality is the arguments, not the conclusions; not only are bad arguments for a true conclusion still bad, even suboptimal pedagogy for a true lesson is still suboptimal.
I do not, to be clear, believe that my essay contains falsehoods that become permissible because they help idiots or children make inferential leaps [...] You will never ever ever ever ever see me telling someone a thing I know to be false because I believe that it will result in them outputting a correct belief or a correct behavior
I want to first acknowledge strongly that yep, we are mostly on the same side about getting a much better future than everyone dying to AIs.
But in the context of my own writing, everyone who’s paying attention to me already knows about existential risk;
I note this is not necessarily true for MIRI; we are trying very hard on purpose to reach and inform more people.
I want my words to be focused on being rigorous and correct, not scaring policymakers and the public (notwithstanding that policymakers and the public should in fact be scared).
The two can be compatible!
I perceive at least two separate critiques, and I want to address them both without cross-contamination. (Please correct me if these miss the mark.)
Hypothesis 1: Maybe MIRI folks have wrong world-models (possibly due to insufficient engagement with sophisticated disagreement).
Hypothesis 2: Maybe MIRI folks are prioritizing their arguments badly for actually stopping the AI race.
Regarding Hypothesis 1, there’s a tradeoff between refining and polishing one’s world-model, and acting upon that world-model to try to accomplish things.
Speaking only for myself, there are many possible things I could be writing or saying, and only finite time to write or say them in. For the moment, I mostly want my words to be focused on (productively) scaring policymakers and the public, because they should in fact be scared.
This obviously does not preclude writing for and talking with the ingroup, nor continuing to refine and polish my own world-model.
But...well, I feel like I’ve mostly hit diminishing returns on that, both when it comes to updating my own models and when it comes to updating those of others like me. So the balance of time spent naturally tips towards outreach.
To borrow from your comment below, in regards to Hypothesis 2...
if we want that Pause treaty, we need to find the ironclad arguments that convince skeptical experts, not just appeal to intuition.
...for one thing, I’m not sure how true this is? Policymakers and the public can sometimes both be swayed by appeals to intuition. Skeptical experts can be really hard to convince. Especially after the Nth iteration of debate has passed and a lot of ideas have congealed.
Again, there’s a tradeoff here, a matter of how much time one spends making cases to audiences of various levels of informed or uninformed skepticism. I’m not sure what the right balance is, but for myself at least, it’s probably not a primary focus on convincing Paul Christiano of things. Tactical priorities can differ from person to person, of course.
Caveat 1: Again, I speak for myself here. I admittedly have much less context on the decades-long back-and-forth than some of my colleagues.
Caveat 2: No matter who I’m trying to convince, I do want my arguments to rest on a solid foundation. If an interlocutor digs deep, the argument-hole they unearth should hold water. To, uh, rather butcher a metaphor.
But this just rounds back to my response to Hypothesis 1: thanks to the magic of the Internet (and Lightcone Infrastructure in particular) you can always find someone with a sophisticated critique to level at your supposedly solid foundation. At some point you do have to take your best guess about what’s true and robust and correct according to your current world-model, then go and try to share it outside the crucible of the LessWrong forums.
With all that being said, sure, let’s talk world-models. (With, again, the caveat that this is all my own limited take as someone who spent most of the 2010s doing reliability engineering and not alignment research.)
To the end of being rigorous and correct, I’m claiming that the “each of these black shapes is basically just as good at passing that particular test” story isn’t a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
I think I follow your argument that one might say “we don’t know how to draw the teal thing” instead. But this seems more quibble than crux. I don’t think you addressed Duncan’s core point, which is that “we don’t know how to draw the teal thing” is the correct prior? (i.e. we don’t know how to select training data in a way that constrains an AI to learn to explicitly, primarily, and robustly value human flourishing.)
And if in fact we don’t know how to draw the right metaphorical teal thing, then the metaphorical black thing could take on various shapes that appear weird and complicated to us, but that actually reflect an underlying simplicity of which we are unaware. So it doesn’t seem wrong to claim that the black thing could take some (apparently) weird and (apparently) complex shape, given the assumption that we can’t draw a sufficiently constraining teal thing.
More broadly, I think I’m missing some important context, or just failing to follow your logic. I don’t see how a bias towards simple functions implies a convergence towards nonlethal aims. We don’t know what would be the simplest functions that approximate current or future training data. Why believe they would converge on something conveniently safe for us? [1]
From the papers you cite, I can see how one would conclude that AIs will be efficient, but I don’t see how they imply that AIs will be nice.
In the aforementioned spirit of rigor, I’m trying to avoid saying “human values” because those might not be good enough either. Many humans do not prioritize conscious flourishing! An ASI that doesn’t hold conscious wellbeing as its highest priority likely kills everyone as a side effect of optimizing for other ends, etc. etc.
This obviously does not preclude writing for and talking with the ingroup, nor continuing to refine and polish my own world-model. But...well, I feel like I’ve mostly hit diminishing returns on that
I mean, before concluding that you’ve hit diminishing returns, have you looked at one of the standard textbooks on deep learning, like Prince 2023 or Bishop and Bishop 2024? I don’t think I’m suggesting this out of pointless gatekeeping. I actually unironically think if you’re devoting your life to a desperate campaign to get world powers to ban a technology, it’s helpful to have read a standard undergraduate textbook about the thing you’re trying to ban.
We don’t know what would be the simplest functions that approximate current or future training data. Why believe they would converge on something conveniently safe for us?
I mean, you can get a pretty good idea what the simplest function that approximates the data is like by, you know, looking at the data. (In slogan form, the model is the dataset.) Thus, language models—not hypothetical future superintelligences which don’t exist yet, but the actual technology that people are working on today—seem pretty safe for basically the same reason that text from the internet is safe: you’re sampling from the webtext distribution in a customized way.
(In more detail: you use gradient descent to approximate a “next token prediction” function of internet text. To make it more useful, we want to customize it away from the plain webtext distribution. To help automate that work, we train a “reward model”: basically, you start with a language model, but instead of the unembedding matrix which translates the residual stream to token probabilities, you tack on a layer that you train to predict human thumbs-up/thumbs-down ratings. Then you generate more samples from your base model, and use the output of your reward model to decide what gradient updates to do on them—with a Kullback–Leibler constraint to make sure you don’t update so far as to do something that it would be wildly unlikely for the original base model to do. It’s the same gradients you would get from adding more data to the pretraining set, except that the data is coming from the model itself rather than webtext, and the reward model puts a “multiplier” on the gradient: high reward is like training on that completion a bunch of times, and negative reward is issuing gradient updates in the opposite direction, to do less of that.)
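(If pseudocode is clearer than prose, here is a deliberately minimal sketch of that last step. The helpers sample_with_logprobs and logprobs and the coefficient beta are stand-ins I’m making up for exposition, not any particular library’s API; real implementations use PPO with per-token bookkeeping rather than this bare REINFORCE-style update.)

```python
import torch

def rlhf_update(policy, base, reward_model, prompts, beta=0.1):
    # Hypothetical helper: sample completions plus their log-probabilities.
    completions, logp_policy = policy.sample_with_logprobs(prompts)
    with torch.no_grad():
        logp_base = base.logprobs(prompts, completions)  # hypothetical helper
        rewards = reward_model(prompts, completions)
    # KL term: penalize trajectories the base model finds wildly unlikely.
    advantage = rewards - beta * (logp_policy - logp_base).detach()
    # High reward is like training on that completion many times; negative
    # reward pushes the gradients in the opposite direction.
    loss = -(advantage * logp_policy).mean()
    loss.backward()  # then step an optimizer over policy.parameters()
```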
That doesn’t mean future systems will be safe. At some point in the future, when you have AIs training other AIs on AI-generated data too fast for humans to monitor, you can’t just eyeball the data and feel confident that it’s not doing something you don’t want to happen. If your reward model accidentally reinforces the wrong things, then you get more of the wrong things. Importantly, this is a different threat model than “you don’t get what you train for”. In order to react to that threat in a dignified way, I want people to have read the standard undergraduate textbooks and be thinking about how to do better safety engineering in a way that’s oriented around the empirical details. Maybe we die either way, but I intend to die as a computer scientist.
I am in favor of learning more programming! During the two years I spent pivoting from reliability engineering, I did in fact attempt some hands-on machine learning code. My brain isn’t shaped in such a way that reading textbooks confers meaningful coding skills—I have to Actually Do the Thing—but I did try Actually Doing the Thing, reading and all.
I later facilitated BlueDot’s alignment and governance courses, and went through their reading material several times over in the process.
I now face a tradeoff between learning more ML, which is doable but extremely time-consuming, and efforts to convince policymakers not to let labs build ASI. It seems overwhelmingly overdetermined that my (marginal) time is best spent on the second thing. I see my primary comparative advantage as attempting to buy more time for developing solutions that might actually save us.
...which does unfortunately mean it’s going to take me a while to properly digest your argument-from-dataset-approximation. Doesn’t mean I won’t try.
Even attempting to take it as given, though, I’m confused by your conclusion, because you seem to be simultaneously saying “[language models approximating known datasets we can squint at] is a reason we know current systems are safe” and “this reason will not generalize to ASI” and “this answers the quoted question of why [ASI] would converge on something conveniently safe for us”.
At some point in the future, when you have AIs training other AIs on AI-generated data too fast for humans to monitor, you can’t just eyeball the data and feel confident that it’s not doing something you don’t want to happen.
Isn’t this the default path? Don’t most labs’ plans to build ASI run through massive use of AI-generated data? Even if I accept the premise that you can confidently assure safety by eyeballing data today, this doesn’t do much to reassure me if you then agree that it doesn’t generalize.
So I’m still not seeing how this supports the crux that “implement everyone’s CEV” (or, whichever alternative goalset you consider safe) is likely the simplest [function that approximates the datasets that will be used to create ASI].[1]
(Also, at this point I kind of want to taboo ‘dataset’ because it feels like a very overloaded term.)

[1] Brackets, because I’m not even sure this is representing you right. Possibly it should be [function reflected by the dataset] or some other thing.
There has been a miscommunication. I’m not saying CEV or ASI alignment is easy. This thread started because I was critiquing the analogy about teal and black shapes in the article “Deadly By Default”, because the analogy taken at face value lends itself to a naïve counting argument of the form, “There are any number of AIs that could perform well in training, so who knows which one we’d end up with?!” I’m claiming that that argument as stated is wrong (although some more sophisticated counting argument could go through), because inductive biases are really important.
Maybe if you’re just trying to scare politicians and the public, “inductive biases are really important” doesn’t come up on your radar, but it’s pretty fundamental for, um, actually understanding the AI alignment problem humanity is facing!
I notice I’m still confused about your argument. It seems to me that the question of whether [safe or intended goalset] is [the simplest function] is extremely relevant to the question of whether the argument as stated is wrong.
As I understand things right now, we seem to generally agree that:
1. An entity chooses data (teal thing) they think represents the shape they want an AI to be (black thing).
2. Gradient descent seeks simple functions that approximate the data.
3. The [test / selected data] are (probably) insufficient to constrain the resulting shape to what the makers intend.
4. The AI (probably) grows into a shape the maker(s) did not intend.
You said (emphasis added):
To the end of being rigorous and correct, I’m claiming that the “each of these black shapes is basically just as good at passing that particular test” story isn’t a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
You seem to be saying “because deep learning privileges simple functions, the claim that many different AIs could pass our test is false.” I don’t see how that follows, because:
And if in fact we don’t know how to draw the right metaphorical teal thing, then the metaphorical black thing could take on various shapes that appear weird and complicated to us, but that actually reflect an underlying simplicity of which we are unaware. So it doesn’t seem wrong to claim that the black thing could take some (apparently) weird and (apparently) complex shape, given the assumption that we can’t draw a sufficiently constraining teal thing.
It is still the case that many different [AIs / simple functions] could [pass the test / approximate the dataset]. When we move the argument one level deeper, the original claim still holds true. Maybe I’m still just misunderstanding, though.
...or maybe you are only saying that the explanation as written is bad at taking readers from (1) to (4) because it does not explicitly mention (2), i.e. not technically wrong but still a bad explanation. In that case it seems we’d agree that (2) seems like a relevant wrinkle, and that writing (3) with “selected data” instead of “test” adds useful and correct nuance. But I don’t see how it makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) or (4).
...or maybe you are saying it was a bad explanation for you, and for readers with your level of sophistication and familiarity with the arguments, and thus a bad answer to your original question. Which is...kinda fair? In that case I suppose you’d be saying “Ah, I notice you are making an assumption there, and I agree with the assumption, but failing to address it is bad form and I’m worried about what that failure implies.” (I’ll hold off on addressing this argument until I know whether you’d actually endorse it.)
...or maybe you also flatly disagree with (3)? Like, you disagree with Duncan’s
I want to be clear that I think the only sane prior is on “we don’t know how to choose the right data.” Like, I don’t think this is reasonably an “if.” I think the burden of proof is on “we’ve created a complete picture and constrained all the necessary axes,” à la cybersecurity, and that the present state of affairs with regards to LLM misalignment (and all the various ways that it keeps persisting/that things keep squirting sideways) bears this out. The claim is not “impossible/hopeless,” but “they haven’t even begun to make a case that would be compelling to someone actually paying attention.”
...and in that case, excellent! We have surfaced a true crux. And it makes perfect sense from that perspective to say “the metaphor is wrong” because, from that perspective, one of its key assumptions is false.
Importantly, though, that looks to me like an object-level disagreement, and not one that reflects bad epistemics, except insofar as one believes that any disagreement must be the result of bad epistemics.
maybe you are only saying that the explanation as written is bad at taking readers from (1) to (4) because it does not explicitly mention (2), i.e. not technically wrong but still a bad explanation. [...] But I don’t see how it makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) or (4).
But bad explanations are wrong, untrue, and misleading.
Suppose the one comes to you and says, “All squares are quadrilaterals; all rectangles are quadrilaterals; therefore, all squares are rectangles.” That argument is wrong—“technically” wrong, if you prefer. It doesn’t matter that the conclusion is true. It doesn’t even matter that the premises are also true. It’s just wrong.
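(The invalidity of the form is mechanically checkable. Here’s a toy countermodel of my own, with the same “all A are C; all B are C; therefore all A are B” shape but true premises and a false conclusion:)

```python
def form_holds(A, B, C):
    """Test 'all A are C; all B are C; therefore all A are B' on one model."""
    premises = A <= C and B <= C   # <= is the subset test for Python sets
    conclusion = A <= B
    return (not premises) or conclusion

quadrilaterals = {"square", "rectangle", "trapezoid"}
squares = {"square"}
trapezoids = {"trapezoid"}
# Both premises true, conclusion false: the argument form proves nothing.
print(form_holds(squares, trapezoids, quadrilaterals))  # False
```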
Okay, but why is it wrong though? I still haven’t seen a convincing case for that! It sure looks to me like, given an assumption which I still feel confused about whether you share, the conclusion does in fact follow from the premises, even in metaphor form.
I am open to the case that it’s a bad argument. If it is in fact a bad argument then that’s a legitimate criticism. But from my perspective you have not adequately spelled out how “deep nets favor simple functions” implies it’s a bad argument.
You said, “I don’t see how [not mentioning inductive biases] makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) [we choose “teal shape” data to grow the “black shape” AI] or (4) [we don’t get the AI we want].” But the point of the broken syllogism in the grandparent is that it’s not enough for the premise to be true and the conclusion to be true; the conclusion has to follow from the premise.
The context of the teal/black shape analogy in the article is an explanation of how “modern AIs aren’t really designed so much as grown or evolved” with the putative consequence that “there are many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment, and most of them don’t resemble the thing the programmers had in mind”.
Set aside the question of superintelligence for the moment. Is this true as a description of “modern AIs”, e.g., image classifiers? That’s not actually clear to me.
It is true that adversarially robust image classification isn’t a solved problem, despite efforts: it’s usually possible (using the same kind of gradient-based optimization used to train the classifiers themselves) to successfully search for “adversarial examples” that machines classify differently than humans, which isn’t what the programmers had in mind.
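(For concreteness, the simplest version of that gradient-based search is the fast gradient sign method. A minimal PyTorch sketch, with the perturbation budget epsilon and the [0, 1] pixel range as illustrative assumptions:)

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step adversarial example: nudge each pixel in the direction
    that most increases the classifier's loss on the true label y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in valid range
```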
But Ilyas et al. 2019 famously showed that adversarial examples are often due to “non-robust” features that are doing predictive work, but which are counterintuitive to humans. That would be an example of our data pointing at, as you say, an “underlying simplicity of which we are unaware”.
I’m saying that’s a different problem than a counting argument over putative “many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment”, which is what the black/teal shape analogy seems to be getting at. (There are many, many, many different parametrizations that are consistent with behaving properly in training, but I’m claiming that the singular learning theory story explains why that might not be a problem, if they all compute similar functions.)
Thank you for attempting to spell this out more explicitly. If I understand correctly, you are saying singular learning theory suggests that AIs with different architectures will converge on a narrow range of similar functions that best approximate the training data.
With less confidence, I understand you to be claiming that this convergence implies that (in the context of the metaphor) a given [teal thing / dataset] may reliably produce a particular shape of [black thing / AI].
So (my nascent Zack model says) the summary is incorrect to analogize the black thing to “architectures” instead of “parametrizations” or “functions”, and more importantly incorrect to claim that the black shape’s many degrees of freedom imply it will take a form its developers did not intend. (Because, by SLT, most shapes converge to some relatively simple function approximator.)
But...it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values even given the stated interpretation of SLT. Or in other words, the black shape is still basically unpredictable from the perspective of the teal-shape drawer. I’m not sure you disagree with that?
As an exercise in inferential gap-crossing, I want to try to figure out what minimum change to the summary / metaphor would make it relatively unobjectionable to you.
Attempting to update the analogy in my own model, it would go something like: You draw a [teal thing / dataset]. You use it to train the [black thing / AI]. There are underlying regularities in your dataset, some of which are legible to you as a human and some of which are not. The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape. You end up with [weird shape] instead of [simple shape you were aiming for].
A more skeptical Zack-model in my head says “No, actually, you don’t end up with [weird shape] at all. SLT says you can get [shape which robustly includes the entire spectrum of reflectively consistent human values] because that’s the function being approximated, the underlying structure of the data.” I dunno if this is an accurate Zack-model.
(I am running into the limited bandwidth of text here, and will also DM a link to schedule a conversation if you’re so inclined.)
Sorry, I don’t want to accidentally overemphasize SLT in particular, which I am not an expert in. I think what’s at issue is how predictable deep learning generalization is: what kind of knowledge would be necessary in order to “get what you train for”?
This isn’t obvious from first principles. Given a description of SGD and the empirical knowledge of 2006, you could imagine it going either way. Maybe we live in a “regular” computational universe, where the AI you get depends on your architecture and training data according to learnable principles that can be studied by the usual methods of science in advance of the first critical try, but maybe it’s a “chaotic” universe where you can get wildly different outcomes depending on the exact path taken by SGD.
A lot of MIRI’s messaging, such as the black shape metaphor, seems to assume that we live in a chaotic universe, as when Chapter 4 of If Anyone Builds It claims that the preferences of powerful AI “might be chaotic enough that if you tried it twice, you’d get different results each time.” But I think that if you’ve been paying attention to the literature about the technology we’re discussing, there’s actually a lot of striking empirical evidence that deep learning is much more “regular” than someone might have guessed in 2006: things like how Draxler et al. 2018 showed that you can find continuous low-loss paths between the results of different training runs (rather than being in different basins which might have wildly different generalization properties), or how Moschella et al. 2022 found that different models trained on different data end up learning the same latent space (such that representations by one can be reused by another without extra training). Those are empirical results; the relevance of SLT is as a theoretical insight as to how these results are even possible, in contrast to how people in 2006 might have had the intuition, “Well, ‘stochastic’ is right there as the ‘S’ in SGD, of course the outcome is going to be unpredictable.”
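(To gesture at how simple the relevant experiment is, here’s an illustrative sketch of my own, not Draxler et al.’s actual method, with loss_fn and data assumed supplied: interpolate linearly between two trained models’ weights and record the loss along the way. Draxler et al.’s contribution is a path-finding procedure showing that a curved path can avoid the barrier this naive straight line often runs into.)

```python
import torch

def loss_along_linear_path(model_a, model_b, loss_fn, data, steps=21):
    # Clone A's weights first: load_state_dict overwrites them in place.
    sd_a = {k: v.clone() for k, v in model_a.state_dict().items()}
    sd_b = model_b.state_dict()
    x, y = data
    losses = []
    for t in torch.linspace(0.0, 1.0, steps):
        # Pointwise interpolation between the two training runs' solutions.
        interp = {k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a}
        model_a.load_state_dict(interp)
        with torch.no_grad():
            losses.append(loss_fn(model_a(x), y).item())
    model_a.load_state_dict(sd_a)  # restore the original weights
    return losses
```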
it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values [...] the black shape is still basically unpredictable from the perspective of the teal-shape drawer
I think it’s worth being really specific about what kind of “AI” you have in mind when you make this kind of claim. You might think, “Well, obviously I’m talking about superintelligence; this is a comment thread about a book about why people shouldn’t build superintelligence.”
That’s what I’m focused on in this thread: the arguments, not the conclusion. (This methodology is probably super counterintuitive to a lot of people, but it’s part of this website’s core canon.) I’m definitely not saying anyone knows how to train the “entire spectrum of reflectively consistent human values”. That’s philosophy, which is hard. I’m thinking about a much narrower question of computer science.
Namely: if I take the black shape metaphor or Chapter 4 of If Anyone Builds It at face value, it’s pretty confusing how RLAIF approaches like constitutional AI can work at all. Not just when hypothetically scaled to superintelligence. I mean, at all. Upthread, I wrote about how people customize base language models by upweighting trajectories chosen by a model trained to predict human approval and disapproval ratings.
But in order to convince policymakers that the prosaic alignment optimists are wrong (while the prosaic alignment optimists are passing them bars of AI-printed gold under the table), you’re going to need a stronger argument than “the black shape is still basically unpredictable from the perspective of the teal-shape drawer”. If it were actually unpredictable, where is all this gold coming from?
The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape
While we’re still in the regime of pretraining on largely human-generated data, this is arguably great for alignment. You don’t have to understand the complex structure of human value; you can just point SGD at valuable data and get a network that spits out “more from that distribution”, without any risk of accidentally leaving out boredom and destroying all value that way.
Obviously, that doesn’t mean the humans are out of the woods. As the story of Earth-originating intelligent life goes on, and the capabilities of Society’s cutting-edge AIs start coming more and more from reinforcement learning and less and less from pretraining, you start to run a higher risk of misspecifying your rewards, eventually fatally. But that world looks a lot more like Christiano’s “you get what you measure” scenario than Part II of If Anyone Builds It, even if the humans are dead at the end of both stories. And the details matter for deciding which interventions are most dignified—possibly even if you think governance is more promising than alignment research. (Which specific regulations you want in your Pause treaty depends on which AI techniques are feasible and which ones are dangerous.)
the summary is incorrect to analogize the black thing to “architectures” instead of “parametrizations” or “functions”
Yes, the word choice of “architectures” in the phrase “many, many, many different complex architectures” in the article is puzzling. I don’t know what the author meant by that word, but to modern AI practitioners, “architecture” is the part of the system that is designed rather than “grown”: these-and-such many layers with such-and-these activation functions—the matrices, not the numbers inside them.
Wiktionary at the link marks that usage as dated; I wouldn’t call it “its literal meaning”. I believe the closest current single word would be “advocacy”.
Whether the simplest function is misaligned (as posited by List of Lethalities #20) is the thing you have to explain!
Is this in fact a crux for you? If you were largely convinced that the simplest functions found by gradient descent in the current paradigm would not remotely approximate human values, to what extent would this shift your odds of the current paradigm getting everyone killed?
I mean, yes. What else could I possibly say? Of course, yes.
In the spirit of not trying to solve the entire alignment problem at once, I find it hard to be too specific about how my odds would shift without a more specific question. (I think LLMs are doing a pretty good job of knowing and doing what I mean, which implies some form of knowledge of “human values”, but an LLM is only a natural-language instruction-follower; it’s not supposed to be a sovereign superintelligence, which looks vastly harder, and I would rather people not do that for a long time.) Show me the arXiv paper about inductive biases that I’m supposed to be updating on, and I’ll tell you how much more terrified I am (above my baseline of “already pretty terrified, actually”).
to a first approximation, SGD is going to find the simplest function that fits the training data
But you must realize that this sounds remarkably like the safety case for the current AI paradigm of LLMs + RLHF/RLAIF/RLVR! That is, the reason some people think that current-paradigm AI looks relatively safe is that they think the capabilities of LLMs come from approximating the pretraining distribution, and RLHF/RLAIF/RLVR merely better elicits those capabilities by upweighting the rewarded trajectories (as evidenced by base models outperforming RL-trained models in pass@k evaluations for k in the hundreds or thousands) rather than discovering new “alien” capabilities from scratch.
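(For readers who haven’t seen it: “pass@k” asks how often at least one of k sampled attempts solves a task. The standard unbiased estimator, which I believe originates in the Codex paper, is short enough to state exactly; the numbers below are made up purely for illustration.)

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n total attempts of which c passed, is a passing one."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers purely for illustration: 40 of 200 samples passed.
print(pass_at_k(200, 40, 1))   # 0.2
print(pass_at_k(200, 40, 10))  # ~0.90 (diverse retries compound quickly)
```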
You will never ever ever ever ever see me telling someone a thing I know to be false because I believe that it will result in them outputting a correct belief or a correct behavior
This is good, but I think not saying false things turns out to be a surprisingly low bar, because the selection of which true things you communicate (and which true things you even notice) can have a large distortionary effect if the audience isn’t correcting for it.
I want to first acknowledge strongly that yep, we are mostly on the same side about getting a much better future than everyone dying to AIs.
I note this is not necessarily true for MIRI; we are trying very hard on purpose to reach and inform more people.
The two can be compatible!
I perceive at least two separate critiques, and I want to address them both without cross-contamination. (Please correct me if these miss the mark.)
Hypothesis 1: Maybe MIRI folks have wrong world-models (possibly due to insufficient engagement with sophisticated disagreement).
Hypothesis 2: Maybe MIRI folks are prioritizing their arguments badly for actually stopping the AI race.
Regarding Hypothesis 1, there’s a tradeoff between refining and polishing one’s world-model, and acting upon that world-model to try to accomplish things.
Speaking only for myself, there are many possible things I could be writing or saying, and only finite time to write or say them in. For the moment, I mostly want my words to be focused on (productively) scaring policymakers and the public, because they should in fact be scared.
This obviously does not preclude writing for and talking with the ingroup, nor continuing to refine and polish my own world-model.
But...well, I feel like I’ve mostly hit diminishing returns on that, both when it comes to updating my own models and when it comes to updating those of others like me. So the balance of time-spent naturally tips towards outreach.
To borrow from your comment below, in regards to Hypothesis 2...
...for one thing, I’m not sure how true this is? Policymakers and the public can sometimes both be swayed by appeals to intuition. Skeptical experts can be really hard to convince. Especially after the Nth iteration of debate has passed and a lot of ideas have congealed.
Again, there’s a tradeoff here, a matter of how much time one spends making cases to audiences of various levels of informed or uninformed skepticism. I’m not sure what the right balance is, but for myself at least, it’s probably not a primary focus on convincing Paul Christiano of things. Tactical priorities can differ from person to person, of course.
Caveat 1: Again, I speak for myself here. I admittedly have much less context on the decades-long back-and-forth than some of my colleagues.
Caveat 2: No matter who I’m trying to convince, I do want my arguments to rest on a solid foundation. If an interlocutor digs deep, the argument-hole they unearth should hold water. To, uh, rather butcher a metaphor.
But this just rounds back to my response to Hypothesis 1 - thanks to the magic of the Internet (and Lightcone Infrastructure in particular) you can always find someone with a sophisticated critique to level at your supposedly solid foundation. At some point you do have to take your best guess about what’s true and robust and correct according to your current world-model, then go and try to share it outside the crucible of LessWrong forums.
With all that being said, sure, let’s talk world-models. (With, again, the caveat that this is all my own limited take as someone who spent most of the 2010s doing reliability engineering and not alignment research.)
I think I follow your argument that one might say “we don’t know how to draw the teal thing” instead. But this seems more quibble than crux. I don’t think you addressed Duncan’s core point, which is that “we don’t know how to draw the teal thing” is the correct prior? (i.e. we don’t know how to select training data in a way that constrains an AI to learn to explicitly, primarily, and robustly value human flourishing.)
And if in fact we don’t know how to draw the right metaphorical teal thing, then the metaphorical black thing could take on various shapes that appear weird and complicated to us, but that actually reflect an underlying simplicity of which we are unaware. So it doesn’t seem wrong to claim that the black thing could take some (apparently) weird and (apparently) complex shape, given the assumption that we can’t draw a sufficiently constraining teal thing.
More broadly, I think I’m missing some important context, or just failing to follow your logic. I don’t see how a bias towards simple functions implies a convergence towards nonlethal aims. We don’t know what would be the simplest functions that approximate current or future training data. Why believe they would converge on something conveniently safe for us? [1]
From the papers you cite, I can see how one would conclude that AIs will be efficient, but I don’t see how they imply that AIs will be nice.
In the aforementioned spirit of rigor, I’m trying to avoid saying “human values” because those might not be good enough either. Many humans do not prioritize conscious flourishing! An ASI that doesn’t hold conscious wellbeing as its highest priority likely kills everyone as a side effect of optimizing for other ends, etc. etc.
I mean, before concluding that you’ve hit diminishing returns, have you looked at one of the standard textbooks on deep learning, like Prince 2023 or Bishop and Bishop 2024? I don’t think I’m suggesting this out of pointless gatekeeping. I actually unironically think if you’re devoting your life to a desperate campaign to get world powers to ban a technology, it’s helpful to have read a standard undergraduate textbook about the thing you’re trying to ban.
I mean, you can get a pretty good idea what the simplest function that approximates the data is like by, you know, looking at the data. (In slogan form, the model is the dataset.) Thus, language models—not hypothetical future superintelligences which don’t exist yet, but the actual technology that people are working on today—seem pretty safe for basically the same reason that text from the internet is safe: you’re sampling from the webtext distribution in a customized way.
(In more detail: you use gradient descent to approximate a “next token prediction” function of internet text. To make it more useful, we want to customize it away from the plain webtext distribution. To help automate that work, we train a “reward model”: basically, you start with a language model, but instead of the unembedding matrix which translates the residual stream to token probabilities, you tack on a layer that you train to predict human thumbs-up/thumbs-down ratings. Then you generate more samples from your base model, and use the output of your reward model to decide what gradient updates to do on them—with a Kullback–Leibler constraint to make sure you don’t update so far as to do something that it would be wildly unlikely for the original base model to do. It’s the same gradients you would get from adding more data to the pretraining set, except that the data is coming from the model itself rather than webtext, and the reward model puts a “multiplier” on the gradient: high reward is like training on that completion a bunch of times, and negative reward is issuing gradient updates in the opposite direction, to do less of that.)
That doesn’t mean future systems will be safe. At some point in the future, when you have AIs training other AIs on AI-generated data too fast for humans to monitor, you can’t just eyeball the data and feel confident that it’s not doing something you don’t want to happen. If your reward model accidentally reinforces the wrong things, then you get more of the wrong things. Importantly, this is a different threat model than “you don’t get what you train for”. In order to react to that threat in a dignified way, I want people to have read the standard undergraduate textbooks and be thinking about how to do better safety engineering in a way that’s oriented around the empirical details. Maybe we die either way, but I intend to die as a computer scientist.
I am in favor of learning more programming! During the two years I spent pivoting from reliability engineering, I did in fact attempt some hands-on machine learning code. My brain isn’t shaped in such a way that reading textbooks confers meaningful coding skills—I have to Actually Do the Thing—but I did try Actually Doing the Thing, reading and all.
I later facilitated BlueDot’s alignment and governance courses, and went through their reading material several times over in the process.
I now face a tradeoff between learning more ML, which is doable but extremely time-consuming, and efforts to convince policymakers not to let labs build ASI. It seems overwhelmingly overdetermined that my (marginal) time is best spent on the second thing. I see my primary comparative advantage as attempting to buy more time for developing solutions that might actually save us.
...which does unfortunately mean it’s going to take me a while to properly digest your argument-from-dataset-approximation. Doesn’t mean I won’t try.
Even attempting to take it as given, though, I’m confused by your conclusion, because you seem to be simultaneously saying “[language models approximating known datasets we can squint at] is a reason we know current systems are safe” and “this reason will not generalize to ASI” and “this answers the quoted question of why [ASI] would converge on something conveniently safe for us”.
Isn’t this the default path? Don’t most labs’ plans to build ASI run through massive use of AI-generated data? Even if I accept the premise that you can confidently assure safety by eyeballing data today, this doesn’t do much to reassure me if you then agree that it doesn’t generalize.
So I’m still not seeing how this supports the crux that “implement everyone’s CEV” (or, whichever alternative goalset you consider safe) is likely the simplest [function that approximates the datasets that will be used to create ASI].[1]
(Also, at this point I kind of want to taboo ‘dataset’ because it feels like a very overloaded term.)
[1] Brackets, because I’m not even sure this is representing you right. Possibly it should be [function reflected by the dataset] or some other thing.
There has been a miscommunication. I’m not saying CEV or ASI alignment is easy. This thread started because I was critiquing the analogy about teal and black shapes in the article “Deadly By Default”, because the analogy taken at face value lends itself to a naïve counting argument of the form, “There are any number of AIs that could perform well in training, so who knows which one we’d end up with?!” I’m claiming that that argument as stated is wrong (although some more sophisticated counting argument could go through), because inductive biases are really important.
Maybe if you’re just trying to scare politicians and the public, “inductive biases are really important” doesn’t come up on your radar, but it’s pretty fundamental for, um, actually understanding the AI alignment problem humanity is facing!
I notice I’m still confused about your argument.
It seems to me that the question of whether [safe or intended goalset] is [the simplest function] is extremely relevant to the question of whether the argument as stated is wrong.
As I understand things right now, we seem to generally agree that:
1. An entity chooses data (teal thing) they think represents the shape they want an AI to be (black thing).
2. Gradient descent seeks simple functions that approximate the data.
3. The [test / selected data] are (probably) insufficient to constrain the resulting shape to what the makers intend.
4. The AI (probably) grows into a shape the maker(s) did not intend.
You said (emphasis added):
“I’m claiming that that argument as stated is wrong (although some more sophisticated counting argument could go through), *because inductive biases are really important*.”
You seem to be saying “because deep learning privileges simple functions, the claim that many different AIs could pass our test is false.” I don’t see how that follows, because:
It is still the case that many different [AIs / simple functions] could [pass the test / approximate the dataset]. When we move the argument one level deeper, the original claim still holds true. Maybe I’m still just misunderstanding, though.
...or maybe you are only saying that the explanation as written is bad at taking readers from (1) to (4) because it does not explicitly mention (2), i.e. not technically wrong but still a bad explanation. In that case it seems we’d agree that (2) seems like a relevant wrinkle, and that writing (3) with “selected data” instead of “test” adds useful and correct nuance. But I don’t see how it makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) or (4).
...or maybe you are saying it was a bad explanation for you, and for readers with your level of sophistication and familiarity with the arguments, and thus a bad answer to your original question. Which is...kinda fair? In that case I suppose you’d be saying “Ah, I notice you are making an assumption there, and I agree with the assumption, but failing to address it is bad form and I’m worried about what that failure implies.” (I’ll hold off on addressing this argument until I know whether you’d actually endorse it.)
...or maybe you also flatly disagree with (3)? Like, you disagree with Duncan’s
“there are many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment, and most of them don’t resemble the thing the programmers had in mind”
...and in that case, excellent! We have surfaced a true crux. And it makes perfect sense from that perspective to say “the metaphor is wrong” because, from that perspective, one of its key assumptions is false.
Importantly, though, that looks to me like an object-level disagreement, and not one that reflects bad epistemics, except insofar as one believes that any disagreement must be the result of bad epistemics.
But bad explanations are wrong, untrue, and misleading.
Suppose the one comes to you and says, “All squares are quadrilaterals; all rectangles are quadrilaterals; therefore, all squares are rectangles.” That argument is wrong—“technically” wrong, if you prefer. It doesn’t matter that the conclusion is true. It doesn’t even matter that the premises are also true. (The same form would “prove” that all rhombi are rectangles.) It’s just wrong.
Okay, but why is it wrong though? I still haven’t seen a convincing case for that! It sure looks to me like, given an assumption which I still feel confused about whether you share, the conclusion does in fact follow from the premises, even in metaphor form.
I am open to the case that it’s a bad argument. If it is in fact a bad argument then that’s a legitimate criticism. But from my perspective you have not adequately spelled out how “deep nets favor simple functions” implies it’s a bad argument.
You said, “I don’t see how [not mentioning inductive biases] makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) [we choose “teal shape” data to grow the “black shape” AI] or (4) [we don’t get the AI we want].” But the point of the broken syllogism in the grandparent is that it’s not enough for the premise to be true and the conclusion to be true; the conclusion has to follow from the premise.
The context of the teal/black shape analogy in the article is an explanation of how “modern AIs aren’t really designed so much as grown or evolved” with the putative consequence that “there are many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment, and most of them don’t resemble the thing the programmers had in mind”.
Set aside the question of superintelligence for the moment. Is this true as a description of “modern AIs”, e.g., image classifiers? That’s not actually clear to me.
It is true that adversarially robust image classification isn’t a solved problem, despite efforts: it’s usually possible (using the same kind of gradient-based optimization used to train the classifiers themselves) to successfully search for “adversarial examples” that machines classify differently than humans, which isn’t what the programmers had in mind.
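(The search itself is not exotic; a minimal FGSM-style sketch, assuming `model` is any differentiable image classifier and `label` is a tensor of true class indices:)

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """One-step gradient-sign attack against a differentiable classifier."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Same gradient machinery used to train the classifier, but stepping
    # *up* the loss with respect to the input pixels instead of down with
    # respect to the weights.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```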
But Ilyas et al. 2019 famously showed that adversarial examples are often due to “non-robust” features that are doing predictive work, but which are counterintuitive to humans. That would be an example of our data pointing at, as you say, an “underlying simplicity of which we are unaware”.
I’m saying that’s a different problem than a counting argument over putative “many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment”, which is what the black/teal shape analogy seems to be getting at. (There are many, many, many different parametrizations that are consistent with behaving properly in training, but I’m claiming that the singular learning theory story explains why that might not be a problem, if they all compute similar functions.)
Thank you for attempting to spell this out more explicitly. If I understand correctly, you are saying singular learning theory suggests that AIs with different architectures will converge on a narrow range of similar functions that best approximate the training data.
With less confidence, I understand you to be claiming that this convergence implies that (in the context of the metaphor) a given [teal thing / dataset] may reliably produce a particular shape of [black thing / AI].
So (my nascent Zack model says) the summary is incorrect to analogize the black thing to “architectures” instead of “parametrizations” or “functions”, and more importantly incorrect to claim that the black shape’s many degrees of freedom imply it will take a form its developers did not intend. (Because, by SLT, most shapes converge to some relatively simple function approximator.)
But...it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values even given the stated interpretation of SLT. Or in other words, the black shape is still basically unpredictable from the perspective of the teal-shape drawer. I’m not sure you disagree with that?
As an exercise in inferential gap-crossing, I want to try to figure out what minimum change to the summary / metaphor would make it relatively unobjectionable to you.
Attempting to update the analogy in my own model, it would go something like: You draw a [teal thing / dataset]. You use it to train the [black thing / AI]. There are underlying regularities in your dataset, some of which are legible to you as a human and some of which are not. The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape. You end up with [weird shape] instead of [simple shape you were aiming for.]
A more skeptical Zack-model in my head says “No, actually, you don’t end up with [weird shape] at all. SLT says you can get [shape which robustly includes the entire spectrum of reflectively consistent human values] because that’s the function being approximated, the underlying structure of the data.” I dunno if this is an accurate Zack-model.
(I am running into the limited bandwidth of text here, and will also DM a link to schedule a conversation if you’re so inclined.)
Sorry, I don’t want to accidentally overemphasize SLT in particular, which I am not an expert in. I think what’s at issue is how predictable deep learning generalization is: what kind of knowledge would be necessary in order to “get what you train for”?
This isn’t obvious from first principles. Given a description of SGD and the empirical knowledge of 2006, you could imagine it going either way. Maybe we live in a “regular” computational universe, where the AI you get depends on your architecture and training data according to learnable principles that can be studied by the usual methods of science in advance of the first critical try, but maybe it’s a “chaotic” universe where you can get wildly different outcomes depending on the exact path taken by SGD.
A lot of MIRI’s messaging, such as the black shape metaphor, seems to assume that we live in a chaotic universe, as when Chapter 4 of If Anyone Builds It claims that the preferences of powerful AI “might be chaotic enough that if you tried it twice, you’d get different results each time.” But I think that if you’ve been paying attention to the literature about the technology we’re discussing, there’s actually a lot of striking empirical evidence that deep learning is much more “regular” than someone might have guessed in 2006: things like how Draxler et al. 2018 showed that you can find continuous low-loss paths between the results of different training runs (rather than being in different basins which might have wildly different generalization properties), or how Moschella et al. 2022 found that different models trained on different data end up learning the same latent space (such that representations by one can be reused by another without extra training). Those are empirical results; the relevance of SLT is as a theoretical insight as to how these results are even possible, in contrast to how people in 2006 might have had the intuition, “Well, ‘stochastic’ is right there as the ‘S’ in SGD, of course the outcome is going to be unpredictable.”
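For a sense of what that first result even measures, here is the crudest version of the probe: evaluating loss along the straight line between two independently trained parameter vectors (`loss_fn`, `theta_a`, and `theta_b` assumed given). Draxler et al.’s actual contribution is finding *curved* low-loss paths; the straight line typically hits a loss barrier, which is what makes the existence of such curves striking.

```python
import torch

def loss_along_line(loss_fn, theta_a, theta_b, steps=11):
    """Loss along the straight line between two trained parameter vectors.

    loss_fn maps a flat parameter tensor to a scalar loss; theta_a and
    theta_b are the (flattened) weights of two independent training runs.
    """
    losses = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, steps):
            losses.append(loss_fn((1 - t) * theta_a + t * theta_b).item())
    return losses
```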
I think it’s worth being really specific about what kind of “AI” you have in mind when you make this kind of claim. You might think, “Well, obviously I’m talking about superintelligence; this is a comment thread about a book about why people shouldn’t build superintelligence.”
The problem is that if you try to persuade people to not build superintelligence using arguments that seem to apply just as well to the kind of AI we have today, you’re not going to be very convincing when people talk to human-compatible AIs behaving pretty much the way their creators intended, all the time, every day.
That’s what I’m focused on in this thread: the arguments, not the conclusion. (This methodology is probably super counterintuitive to a lot of people, but it’s part of this website’s core canon.) I’m definitely not saying anyone knows how to train the “entire spectrum of reflectively consistent human values”. That’s philosophy, which is hard. I’m thinking about a much narrower question of computer science.
Namely: if I take the black shape metaphor or Chapter 4 of If Anyone Builds It at face value, it’s pretty confusing how RLAIF approaches like constitutional AI can work at all. Not just when hypothetically scaled to superintelligence. I mean, at all. Upthread, I wrote about how people customize base language models by upweighting trajectories chosen by a model trained to predict human approval and disapproval ratings.
In RLAIF, they use an LLM itself to provide the ratings instead of any actual humans. If you only read MIRI’s propaganda (in its literal meaning, “public communication aimed at influencing an audience and furthering an agenda”) and don’t read arXiv, that just sounds suicidal.
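(Concretely, the AI-feedback step just swaps the human rater for a model call. A minimal sketch, with `query_llm` as a hypothetical stand-in for whatever judge model is available:)

```python
CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in: route this to your judge model of choice."""
    raise NotImplementedError

def ai_preference_label(prompt: str, response_a: str, response_b: str) -> str:
    """An LLM, not a human, supplies the thumbs-up/thumbs-down signal."""
    verdict = query_llm(
        f"{CONSTITUTION}\n\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Answer with exactly one letter, A or B:"
    ).strip()
    # These labels then train the reward model exactly as the human
    # ratings described upthread would.
    return "A" if verdict.startswith("A") else "B"
```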
But it’s working! (For now.) It’s working better than the version with actual human preference rankings! Why? How? Prosaic alignment optimists would say: it learned the intended Platonic representation from pretraining. Are they wrong? Maybe! (I’m still worried about what happens if you optimize too hard against the learned representation.)
But in order to convince policymakers that the prosaic alignment optimists are wrong (while the prosaic alignment optimists are passing them bars of AI-printed gold under the table), you’re going to need a stronger argument than “the black shape is still basically unpredictable from the perspective of the teal-shape drawer”. If it were actually unpredictable, where is all this gold coming from?
While we’re still in the regime of pretraining on largely human-generated data, this is arguably great for alignment. You don’t have to understand the complex structure of human value; you can just point SGD at valuable data and get a network that spits out “more from that distribution”, without any risk of accidentally leaving out boredom and destroying all value that way.
Obviously, that doesn’t mean the humans are out of the woods. As the story of Earth-originating intelligent life goes on, and the capabilities of Society’s cutting-edge AIs start coming more and more from reinforcement learning and less and less from pretraining, you start running a higher risk of misspecifying your rewards, eventually fatally. But that world looks a lot more like Christiano’s “you get what you measure” scenario than Part II of If Anyone Builds It, even if the humans are dead at the end of both stories. And the details matter for deciding which interventions are most dignified—possibly even if you think governance is more promising than alignment research. (Which specific regulations you want in your Pause treaty depends on which AI techniques are feasible and which ones are dangerous.)
Yes, the word choice of “architectures” in the phrase “many, many, many different complex architectures” in the article is puzzling. I don’t know what the author meant by that word, but to modern AI practitioners, “architecture” is the part of the system that is designed rather than “grown”: these-and-such many layers with such-and-these activation functions—the matrices, not the numbers inside them.
Wiktionary at the link marks that usage as dated; I wouldn’t call it “its literal meaning”. I believe the closest current single word would be “advocacy”.
Is this in fact a crux for you? If you were largely convinced that the simplest functions found by gradient descent in the current paradigm would not remotely approximate human values, to what extent would this shift your odds of the current paradigm getting everyone killed?
I mean, yes. What else could I possibly say? Of course, yes.
In the spirit of not trying to solve the entire alignment problem at once, I find it hard to be specific about how much my odds would shift without a more specific question. (I think LLMs are doing a pretty good job of knowing and doing what I mean, which implies some form of knowledge of “human values”, but an LLM is only a natural-language instruction-follower; it’s not supposed to be a sovereign superintelligence, which looks vastly harder, and I would rather people not attempt that for a long time.) Show me the arXiv paper about inductive biases that I’m supposed to be updating on, and I’ll tell you how much more terrified I am (above my baseline of “already pretty terrified, actually”).