1a3orn

Karma: 3,817

1a3orn.com

1a3orn 13 Sep 2024 12:31 UTC
6 points
0
in reply to: p.b.’s comment on: OpenAI o1
Aaah, so the question is if it’s actually thinking in German because of your payment info or it’s just the thought-trace-condenser that’s translating into German because of your payment info.

Interesting, I’d guess the 2nd but ???

1a3orn 13 Sep 2024 12:19 UTC
2 points
0
in reply to: p.b.’s comment on: OpenAI o1
Do you work at OpenAI? This would be fascinating, but I thought OpenAI was hiding the hidden thoughts.

1a3orn 10 Sep 2024 13:36 UTC
22 points
25
in reply to: leogao’s comment on: My takes on SB-1047
Yeah, so it sounds like you’re just agreeing with my primary point.
- The original claim that you made was that you wouldn’t be liable if your LLM made “publicly accessible” information available.
- I pointed out that this wasn’t so; you could be liable for information that was publicly accessible but that an “ordinary person” wouldn’t access.
- And now you’re like “Yeah, could be liable, and that’s a good thing, it’s great.”
So we agree about whether you could be liable, which was my primary point. I wasn’t trying to tell you that was bad in the above; I was just saying “Look, if your defense of 1047 rests on publicly-available information not being a thing for which you could be liable, then your defense rests on a falsehood.”

However, then you shifted to “No, it’s actually a good thing for the LLM maker to be held legally liable if it gives an extra-clear explanation of public information.” That’s a defensible position; but it’s a different position than you originally held.

I also disagree with it. Consider the following two cases:
1. A youtuber who is to bioengineering as Karpathy is to CS or 3Blue1Brown is to math makes youtube videos. Students everywhere praise him. In a few years there’s a huge crop of startups populated by people who watched him. One person uses his stuff to help them make a weapon, though, and manages to kill some people. We have strong free-speech norms, though—so he isn’t liable for this.
2. A LLM that is to bioengineering as Karpathy is to CS or 3Blue1Brown is to math makes explanations. Students everywhere praise it. In a few years there’s a huge crop of startups populated by people who used it. One person uses its stuff to help them make a weapon, though, and manages to kill some people. Laws like 1047 have been passed, though, so the maker turns out to be liable for this.
I think the above dissymmetry makes no sense. It’s like how we just let coal plants kill people through pollution, while making nuclear plants meet absurd standards so they don’t kill people. “We legally protect knowledge disseminated one way, and in fact try to make it easily accessible, and reward educators with status and fame; but we’ll legally punish knowledge disseminated another way, and in fact introduce long-lasting unclear liabilities for it.”

1a3orn 9 Sep 2024 19:18 UTC
22 points
8
on: My takes on SB-1047

In addition, the bill also explicitly clarifies that cases where the model provides information that was publicly accessible anyways don’t count.

I’ve heard a lot of people say this, but that’s not really what the current version of the bill says. This is how it clarifies the particular critical harms that don’t count:

(2) “Critical harm” does not include any of the following: (A) Harms caused or materially enabled by information that a covered model or covered model derivative outputs if the information is otherwise reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative.

So, you can be held liable for critical harms even when you supply information that was publicly accessible, if it wasn’t information an “ordinary person” could reasonably access.

As far as I can tell, what this means is unclear. The “ordinary person” in tort law seems to know things like “ice makes roads slippery” and to be generally dumb; a ton of information that we think of as very basic about computers seems to be information a legal “ordinary person” wouldn’t know.

1a3orn 3 Aug 2024 14:59 UTC
1 point
0
in reply to: robo’s comment on: Dan Hendrycks and EA
Whether someone is or was a part of a group is in general an actual fact about their history, not something they can just change through verbal disavowals. I don’t think we have an obligation to ignore someone’s historical association with a group in favor of parroting their current words.

Like, suppose someone who is a nominee for the Supreme Court were to say “No, I totally was never a part of the Let’s Ban Abortion Because It’s Murder Group.”

But then you were to look at the history of this person and you found that they had done pro-bono legal work for the “Abortion Is Totally Murder” political action group; and they had founded an organization that turned out to be currently 90% funded by “Catholics Against Murdering Babies”; and they had in fact gone many times to the “Let’s Make Laws Be Influenced by the Catholic Church” summit; and they were a close personal friend to a bunch of archbishops and Catholic philosophers.

In such a case, it’s reasonable to be like “No, you’re lying about what groups you were and are a part of.” I think that you should be able to reasonably say this—regardless of whether you think abortion is murder or not. The nominee is in fact lying; it is possible to lie about the group that you are a part of.

Similarly—well, the linked article from OP doesn’t actually contain a disavowal from Dan Hendrycks, afaict? This one contains the claim he was “never an EA adherent,” which is closer to a disavowal.

Whether or not this claim is true, it is the kind of claim that certainly admits truth. Or lies.

1a3orn 9 Jun 2024 5:05 UTC
3 points
0
in reply to: O O’s comment on: O O’s Shortform
Just want to register that I agree that—regardless of US GPU superiority right now—the US AI superiority is pretty small, and decreasing. Yi-Large beats a bunch of GPT-4 versions—even in English—on lmsys; it scores just above stuff like Gemini. Their open source releases like DeepSeekV2 look like ~Llama 3 70b level. And so on and so forth.

Maybe whatever OpenAI is training now will destroy whatever China has, and establish OpenAI as firmly in the lead… or maybe not. Yi says they’re training their next model as well, so it isn’t like they’ve stopped doing things.

I think some chunk of “China is so far behind” is fueled by the desire to be able to stop US labs while not just letting China catch up, but that is what it would actually do.

1a3orn 31 May 2024 19:51 UTC
20 points
11
in reply to: quetzal_rainbow’s comment on: MIRI 2024 Communications Strategy
True knowledge about later times doesn’t generally let you make arbitrary predictions about intermediate times. But true knowledge does usually imply that you can make some theory-specific predictions about intermediate times, given later times.

Thus, vis-a-vis your examples: Predictions about the climate in 2100 don’t involve predicting tomorrow’s weather. But they do almost always involve predictions about the climate in 2040 and 2070, and they’d be really sus if they didn’t.

Similarly:
- If an astronomer thought that an asteroid was going to hit the earth, the astronomer generally could predict points it will be observed at in the future before hitting the earth. This is true even if they couldn’t, for instance, predict the color of the asteroid.
- People who predicted that C19 would infect millions by T + 5 months also had predictions about how many people would be infected at T + 2. This is true even if they couldn’t predict how hard it would be to make a vaccine.
- (Extending analogy to scale rather than time) The ability to predict that nuclear war would kill billions involves a pretty good explanation for how a single nuke would kill millions.
So I think that—entirely apart from specific claims about whether MIRI does this—it’s pretty reasonable to expect them to be able to make some theory-specific predictions about the before-end-times, although it’s unreasonable to expect them to make arbitrary theory-specific predictions.

1a3orn 4 May 2024 12:12 UTC
4 points
4
in reply to: lc’s comment on: lc’s Shortform
I mean, sure, but I’ve been updating in that direction a weirdly large amount.

1a3orn 18 Mar 2024 16:02 UTC
7 points
1
on: What is the best argument that LLMs are shoggoths?
For a back and forth on whether “LLMs are shoggoths” is propaganda, try reading this.

In my opinion if you read the dialogue, you’ll see the meaning of “LLMs are shoggoths” shift back and forth—from “it means LLMs are psychopathic” to “it means LLMs think differently from humans.” There isn’t a fixed meaning.

I don’t think trying to disentangle the “meaning” of shoggoths is going to result in anything; it’s a metaphor, some of whose readings are obviously true (“we don’t understand all cognition in LLMs”), some of which are dubiously true (“LLMs’ ‘true goals’ exist, and are horrific and alien”). But regardless of the truth of these propositions, you do better examining them one-by-one than in an emotionally-loaded image.

It’s sticky because it’s vivid, not because it’s clear; it’s reached for as a metaphor—like “this government policy is like 1984”—because it’s a ready-to-hand example with an obvious emotional valence, not for any other reason.

If you were to try to zoom into “this policy is like 1984” you’d find nothing; so also here.

1a3orn 17 Mar 2024 20:34 UTC
4 points
1
in reply to: quetzal_rainbow’s comment on: What is the best argument that LLMs are shoggoths?
As you said, this seems like a pretty bad argument.

Something is going on between the {user instruction} ….. {instruction to the image model}. But we don’t even know if it’s in the LLM. It could be there’s dumb manual “if” parsing statements that act differently depending on periods, etc, etc. It could be that there are really dumb instructions given to the LLM that creates instructions for the image model, as there were for Gemini. So, yeah.

1a3orn 4 Mar 2024 15:02 UTC
9 points
2
in reply to: Raemon’s comment on: Raemon’s Shortform Feed
So Alasdair MacIntyre says that all enquiry into truth and practical rationality takes place within a tradition, sometimes capital-t Tradition, that provides standards for things like “What is a good argument” and “What things can I take for granted” and so on. You never zoom all the way back to simple self-evident truths or raw-sense data—it’s just too far to go. (I don’t know if I’d actually recommend MacIntyre to you, he’s probably not sufficiently dense / interesting for your projects, he’s like a weird blend of Aquinas and Kuhn and Lakatos, but he is interesting at least, if you have a tolerance for… the kind of thing he is.)

What struck me with a fair number of reviews, at this point, was that they seemed… kinda resigned to a LW Tradition, if it ever existed, no longer really being a single thing? Like we don’t have shared standards any more for what is a good argument or what things can be taken for granted (maybe we never did, and I’m golden-age fallacying). There were some reviews saying “idk if this is true, but it did influence people” and others being like “well I think this is kinda dumb, but seems important” and I know I wrote one being like “well these are at least pretty representative arguments of the kind of things people say to each other in these contexts.”

Anyhow what I’m saying is that—if we operate in a MacIntyrean frame—it makes sense to be like “this is the best work we have” within a Tradition, but humans start to spit out NaNs / operation not defined if you try to ask them “is this the best work we have” across Traditions. I don’t know if this is true of ideal reasoners but it does seem to be true of… um, any reasoners we’ve ever seen, which is more relevant.

1a3orn 29 Feb 2024 18:13 UTC
2 points
0
in reply to: Raemon’s comment on: Rationality Research Report: Towards 10x OODA Looping?
So I agree with some of what you’re saying along “There is such a thing as a generally useful algorithm” or “Some skills are more deep than others” but I’m dubious about some of the consequences I think that you think follow from them? Or maybe you don’t think these consequences follow, idk, and I’m imagining a person? Let me try to clarify.

There’s clusters of habits that seem pretty useful for solving novel problems

My expectation is that there are many skills / mental algorithms along these lines, such that you could truthfully say “Wow, people in diverse domains have found X mental algorithm useful for discovering new knowledge.” But also I think it’s probably true that the actually shared information between different domain-specific instances of “X mental algorithm” is going to be pretty small.

Like, take the skill of “breaking down skills into subskills, figuring out what subskills can be worked on, etc”. I think there’s probably some kind of algorithm you can run cross-domain that does this kind of thing. But without domain-specific pruning heuristics, and like a ton of domain-specific details, I expect that this algorithm basically just spits back “Well, too many options” rather than anything useful.

So: I expect non-domain specific work put into sharpening up this algorithm to run into steeply diminishing returns, even if you can amortize the cost of sharpening up the algorithm across many different domains that would be benefitted. If you could write down a program that can help you find relevant subskills in some domain, about 95% of the program is going to be domain-specific rather than not domain specific, and there are something like only ~logarithmic returns to working on the non-domain-specific part. (Not being precise, just an intuition)

Put alternately, I expect you could specify some kind of algorithm like this in a very short mental program, but when you’re running the program most mental compute goes into finding domain-specific program details.

Let me just describe the way the world looks to me. Maybe we actually think the same thing?

-- If you look throughout the history of science, I think that most discoveries look less like “Discoverer had good meta-level principles that let them situate themselves in the right place to solve the issue” and more like “Discoverer happened to be interested in the right chunk of reality that let them figure out an important problem, but it was mostly luck in situating themselves or their skills in this place.” I haven’t read a ton of history of science, but yeah.

-- Concretely, my bet is that most (many?) scientific discoverers of important things were extremely wrong on other important things, or found their original discovery through something like luck. (And some very important discoveries (Transformers) weren’t really identified as such at the time.)

-- Or, concretely, I think scientific progress overall probably hinges less on individual scientists having good meta-level principles, and more on like… whatever social phenomenon is necessary to let individuals or groups of scientists run a distributed brute-force search. Extremely approximately.

-- So my belief is that so far we humans just haven’t found any such principles like those you’re seeking. Or that a lack of such principles can screw over your group (if you eschew falsifiability to a certain degree you’re fucked; if you ignore math you’re fucked) but that you can ultimately mostly raise the floor rather than the ceiling through work on them. Like there is a lot of math out there, and different kinds are very useful for different things!

-- I would be super excited to find such meta-level principles, btw. I feel like I’m being relentlessly negative. So to be clear, it would be awesome to find substantive meta-level principles such that non-domain specific work on the meta-level principles could help people situate themselves and pursue work effectively in confusing domains. Like I’m talking about this because I am very much interested in the project. I just right now… don’t think the world looks like they exist? It’s just that in the absence of seeing groups that seem to have such principles, nothing that I know about minds in general makes me think that such principles are likely.

Or maybe I’m just confused about what you’re doing. Really uncertain about all the above.

1a3orn 25 Feb 2024 19:51 UTC
4 points
0
on: Rationality Research Report: Towards 10x OODA Looping?

This is less of “a plan” and more of “a model”, but, something that’s really weirded me out about the literature on IQ, transfer learning, etc, is that… it seems like it’s just really hard to transfer learn. We’ve basically failed to increase g, and the “transfer learning demonstrations” I’ve heard of seemed pretty weaksauce.

But, all my common sense tells me that “general strategy” and “responding to novel information, and updating quickly” are learnable skills that should apply in a lot of domains.

I’m curious why you think this? Or if you have a place where you’ve explained why you think this at more length? Like my common sense just doesn’t agree with this—although I’ll admit my common sense was probably different 5 years ago.

Overall a lot of the stuff here seems predicated on there being a very thick notion of non-domain specific “rationality” or “general strategy” that can be learned, and that, once learned, speeds you up in widely disparate domains. As in—the whole effort is to find such a strategy. But there seems to be some (a lot? a little?) evidence that this just isn’t that much of a thing, as you say.

I think current ML evidence backs this up. A Transformer is like a brain: when a Transformer is untrained, nearly literally the same architecture could learn to be a language model; to be an image diffusion model; to play Starcraft; etc etc. But once you’ve trained it, although it can learn very quickly in contexts to which it is adapted, it basically learns pretty poorly outside of these domains.

Similarly, human brains start off very plastic. You can learn to echolocate, or speak a dozen languages, or to ride a unicycle, or to solve IMO problems. And then brains specialize, and learn a lot of mostly domain-specific heuristics, that let them learn very quickly about the things that they already know. But they also learn to kinda suck elsewhere—like, learning a dozen computer languages is mostly just going to not transfer to learning Chinese.

Like I don’t think the distinction here I’m drawing is even well-articulated. And I could spend more time trying to articulate it—there’s probably some generality, maybe at the level of grit—but the “learn domain-non-specific skills that will then speed up a particular domain” project seems to take a position that’s sufficiently extreme that I’m like… ehhhh seems unlikely to succeed? (I’m in the middle of reading The Secret of Our Success fwiw, although it’s my pre-existing slant for this position that has inclined me to read it.)

1a3orn 22 Jan 2024 20:51 UTC
4 points
2
in reply to: Ben Pace’s comment on: TurnTrout’s shortform feed
To the best of my knowledge, the majority of research (all the research?) has found that the changes to a LLM’s text-continuation abilities from RLHF (or whatever descendant of RLHF is used) are extremely superficial.

So you have one paper, from the abstract:

Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions (i.e., they share the top-ranked tokens). Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers). These direct evidence strongly supports the hypothesis that alignment tuning primarily learns to adopt the language style of AI assistants, and that the knowledge required for answering user queries predominantly comes from the base LLMs themselves.

Or, in short, the LLM is still basically doing the same thing, with a handful of additions to keep it on-track in the desired route from the fine-tuning.
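
A rough sketch of the kind of comparison that abstract describes: given next-token distributions from a base model and from its tuned version at each position, count how often the top-ranked token matches. Everything here is assumed for illustration: the function name `top_token_agreement`, the list-of-dicts format, and the toy numbers, which are made up and are not results from the paper.

```python
def top_token_agreement(base_dists, tuned_dists):
    """Return the fraction of positions where both models rank the same token first.

    Each argument is a list of dicts mapping token -> probability,
    one dict per token position.
    """
    if len(base_dists) != len(tuned_dists):
        raise ValueError("both models must be scored on the same positions")
    if not base_dists:
        raise ValueError("need at least one position")
    matches = 0
    for base, tuned in zip(base_dists, tuned_dists):
        # max(..., key=dict.get) picks the highest-probability token.
        if max(base, key=base.get) == max(tuned, key=tuned.get):
            matches += 1
    return matches / len(base_dists)


# Toy, made-up distributions for four positions (hypothetical data).
base = [
    {"Sure": 0.2, "The": 0.5, "I": 0.3},
    {"capital": 0.9, "city": 0.1},
    {"of": 0.95, "in": 0.05},
    {"France": 0.8, "Spain": 0.2},
]
tuned = [
    {"Sure": 0.7, "The": 0.2, "I": 0.1},  # a stylistic shift at the start
    {"capital": 0.85, "city": 0.15},
    {"of": 0.97, "in": 0.03},
    {"France": 0.9, "Spain": 0.1},
]
print(top_token_agreement(base, tuned))  # 3 of 4 positions agree
```

In this toy case the only disagreement is the opening discourse marker, which is the shape of result the abstract reports, though the real measurement would need actual model logits.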

(I also think our very strong prior belief should be that LLMs are basically still text-continuation machines, given that 99.9% or so of the compute put into them is training them for this objective, and that neural networks lose plasticity as they learn. Ash and Adams is like a really good intro to this loss of plasticity, although most of the research that cites this is RL-related so people don’t realize.)

Similarly, a lot of people have remarked on how the textual quality of the responses from a RLHF’d language model can vary with the textual quality of the question. But of course this makes sense from a text-prediction perspective—a high-quality answer is more likely to follow a high-quality question in text than a low-quality question. This kind of thing—preceding the model’s generation with high-quality text—was the only way to get high-quality answers from base models—but it’s still there, hidden.

So yeah, I do think this is a much better model for interacting with these things than asking a shoggoth. It actually gives you handles to interact with them better, while asking a shoggoth gives you no such handles.

1a3orn 22 Jan 2024 16:56 UTC
LW: 6 AF: 6
10
AF
in reply to: ryan_greenblatt’s comment on: TurnTrout’s shortform feed
I agree this can be initially surprising to non-experts!

I just think this point about the amorality of LLMs is much better communicated by saying “LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style.”

Than to say “LLMs are like alien shoggoths.”

Like it’s just a better model to give people.

1a3orn 22 Jan 2024 13:50 UTC
LW: 1 AF: 1
1
AF
in reply to: habryka’s comment on: TurnTrout’s shortform feed
I like a lot of these questions, although some of them give me an uncanny feeling akin to “wow, this is a very different list of uncertainties than I have.” I’m sorry that my initial list of questions was aggressive.

So I don’t consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.

I’m not sure how they add up to alienness, though? They’re about how we’re different than models—whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is “deeply alien”—is that just saying it’s different than us in lots of ways? I’m cool with that—but the surplus negative valence involved in “LLMs are like shoggoths” versus “LLMs have very different performance characteristics than humans” seems to me pretty important.

Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don’t call Python alien.

This feels like reminding an economics student that the market solves things differently than a human—which is true—by saying “The market is like Baal.”

Do they require similar amounts and kinds of data to learn the same relationships?

There is a fun paper on this you might enjoy. Obviously not a total answer to the question.

1a3orn 21 Jan 2024 21:01 UTC
LW: 19 AF: 10
3
AF
in reply to: habryka’s comment on: TurnTrout’s shortform feed

performs deeply alien cognition

I remain unconvinced that there’s a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, “nah, LLMs aren’t deeply alien.”

If LLM cognition was not “deeply alien” what would the world look like?

What distinguishing evidence does this world display, that separates us from that world?

What would an only kinda-alien bit of cognition look like?

What would very human kind of cognition look like?

What different predictions does the world make?

Is it deeply alien because the models, the weights themselves, have no “consistent beliefs” apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?

Is it because they “often spout completely non-human kinds of texts”? Is the Mersenne Twister deeply alien? What counts as “completely non-human”?

Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a “moral compass” apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?

Is it that the algorithms that we’ve found in DL so far don’t seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able-to-be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?

Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation? Or if more realistic non-backprop potential brain algorithms actually… kind of just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or more similar research impact whether we thought brains were aliens or not?

Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of predictions? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?

Does every part of a system by itself need to fit into the average person’s ontology for the total to not be deeply alien; do we need to be able to fit every part within a system into a category comprehensible by an untutored human in order to describe it as not deeply alien? Is anything in the world not deeply alien by this standard?

To re-question: What predictions can I make about the world because LLMs are “deeply alien”?

Are these predictions clear?

When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?

What kind of contexts does this “deeply alien” statement come up in? Are those contexts where people are trying to explain, or to persuade?

If I piled up all the useful terms that I know that help me predict how LLMs behave, would “deeply alien” be an empty term on top of these?

Or would it give me no more predictive value than “many behaviors of an LLM are currently not understood”?

1a3orn 19 Jan 2024 21:42 UTC
2 points
0
in reply to: snewman’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
I mean, it’s unrealistic—the cells are “limited to English-language sources, were prohibited from accessing the dark web, and could not leverage print materials (!!)” which rules out textbooks. If LLMs are trained on textbooks—which, let’s be honest, they are, even though everyone hides their datasources—this means teams who have access to an LLM have a nice proxy to a textbook through an LLM, and other teams don’t.

It’s more of a gesture at the kind of thing you’d want to do, I guess, but I don’t think it’s the kind of thing that it would make sense to trust. The blinding was also really unclear to me.

Jason Matheny, by the way, the president of Rand, the organization running that study, is on Anthropic’s “Long Term Benefit Trust.” I don’t know how much that should matter for your evaluation, but my bet is a non-zero amount. If you think there’s an EA blob that funded all of the above—well, he’s part of it. OpenPhil funded Rand with 15 mil also.

You may think it’s totally unfair to mention that; you may think it’s super important to mention that; but there’s the information, do what you will with it.

1a3orn 18 Jan 2024 20:34 UTC
2 points
in reply to: Alexander Gietelink Oldenziel’s comment on: 1a3orn’s Shortform
I mean, I should mention that I also don’t think that agentic models will try to deceive us if trained how LLMs currently are, unfortunately.

1a3orn 18 Jan 2024 20:32 UTC
4 points
0
in reply to: the gears to ascension’s comment on: ricraz’s Shortform
So, there are a few different reasons, none of which I’ve formalized to my satisfaction.

I’m curious if these make sense to you.

(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.

As is well established, for instance, Transformers cannot multiply arbitrarily-long integers in a single forward pass. The number of additions involved in multiplying two N-digit integers increases in an unbounded way with N; thus, a Transformer with a finite number of layers cannot do it. (Example: Prompt GPT-4 for the results of multiplying two 5-digit numbers, specifying not to use a calculator, see how it does.)
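
To make the counting part of that concrete, here is a minimal sketch that runs schoolbook multiplication and tallies the single-digit multiply-and-add steps. The function name `schoolbook_multiply` and the choice of what counts as one step are assumptions made for illustration. It counts total work under this one algorithm, not the minimum serial depth a cleverer parallel scheme might need, so it illustrates the growth with N rather than proving anything about Transformers.

```python
def schoolbook_multiply(a, b) -> tuple:
    """Multiply two non-negative integers digit by digit.

    Returns (product, ops), where ops is the number of single-digit
    multiply-and-add steps performed.
    """
    a_digits = [int(d) for d in reversed(str(a))]  # least significant digit first
    b_digits = [int(d) for d in reversed(str(b))]
    result = [0] * (len(a_digits) + len(b_digits))
    ops = 0
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
            ops += 1
        # This slot has not been written yet, so it simply receives the final carry.
        result[i + len(b_digits)] += carry
    product = int("".join(str(d) for d in reversed(result)))
    return product, ops


for n in (2, 5, 10, 20):
    x = int("9" * n)  # an n-digit number
    product, ops = schoolbook_multiply(x, x)
    assert product == x * x
    print(n, ops)  # ops equals n * n for two n-digit inputs
```

The step count for two N-digit inputs is N squared here, so it keeps growing as N does, while the number of layers in a given Transformer stays fixed.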

Of course in use you can teach a GPT to use a calculator—but we’re talking about operations that occur in a single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication / division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.

(Note—to be very clear NONE of this is a limitation on what kind of operations we can get a transformer to do over multiple unrollings of the forward pass. You can teach a transformer to use a calculator; or to ask a friend for help; or to use a scratchpad, or whatever. But we need to hide deception in a single forward pass, which is why I’m harping on this.)

So to think that you learn deception in a forward pass, you have to think that the transformer thinks something like “Hey, if I deceive the user into thinking that I’m a good entity, I’ll be able to later seize power, and if I seize power, then I’ll be able to (do whatever), so—considering all this, I should… predict the next token will be ‘purple’” -- and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much much MUCH more directly incentivized by SGD, because it’s directly relevant for token predictions.

(2). Another way to get at the problem with this reasoning is that I think it hypothesizes an agent within weight updates off the analogical resemblance to an agent that the finished product has. But in fact there’s at most a superficial resemblance between (LLM forward pass) and (repeated LLM forward passes in a Chain-of-thought over text).

That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.

For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. What if someone argued that such a neural network—in a single forward pass—used deceptive reasoning, which turned out to be useful for prediction because of the backward pass, and that we ought therefore to expect that such a neural network—when embedded in some device down the road—would turn and kill us?

The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit when unrolled over multiple forward passes, in a particular context and with a particular prompt, and then mistakenly attribute these properties to the single forward pass.

(All of which is to say—look, if you think you can get a deceptive agent from a LLM this way you should also expect a deceptive agent from an image segmentation model. Maybe that’s true! But I’ve never seen anyone say this, which makes me think they’re making the mistake I describe above.)

(3). I think this is just attributing extremely complex machinery to the forward pass of an LLM that is supposed to show up in a data-indifferent manner, and that this is a universally bad bet for ML.

Like, different Transformers store different things depending on the data they’re given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for each neuron being what it is because of the data.

The “LLM will learn deception” hypothesis amounts to saying that—so long as a LLM is big enough, and trained on enough data to know the world exists—you’ll find complex machinery in it that (1) specifically activates once it figures out that it’s “not in training” and (2) was mostly just hiding until then. My bet is that this won’t show up, because there are no such structures in a Transformer that don’t depend on data. Your French Transformer / English Transformer / Toolformer / etc will not all learn to betray you if they get big enough—we will not find unused complex machinery in a Transformer to betray you because we find NO unused complex machinery in a transformer, etc.

I think an actually well-put together argument will talk about frequency bias and shit, but this is all I feel like typing for now.

Does this make sense? I’m still working on putting it together.