1a3orn
Additionally, even when we don’t learn direct lessons about how to solve the hard problems of alignment, this work is critical for producing the evidence that the hard problems are real, which is important for convincing the rest of the world to invest substantially here.
You may have meant this, but—crucial for producing the evidence that the hard problems are real or for producing evidence that the hard problems are not real, no?
After all, good experiments can say both yes and no, not just yes.
This comment does really kinda emphasize to me how much people live in different worlds.
Like—I don’t think my models lie to me a bunch.
In part this is because I think I have reasonable theory-of-mind for the models; I try to ask questions in areas where they will be able to say true things or mostly true things. This doesn’t feel weird; this is part of courtesy when dealing with humans, of trying not to present adversarial inputs; so obviously I’m going to extend the same courtesy to LLMs.
And when models say things that aren’t true, I very often don’t perceive that as “lying,” just as I don’t perceive a guy who is trying to explain a thing to me and fumbles his facts as “lying.” People who are “lying” are doing the kind of thing where they are cognizant of the truth and not telling me that, for some other purpose (“Yeah the car has never been in an accident”), or are deceiving themselves for the purposes of deceiving me (“Of course I’d always tell you before doing X”), or some similar act. Many cases of people fumbling the truth don’t fall into this framework: a small nephew who mangles his recollection of the day’s events is not lying; my sick grandmother was not lying when she got confused; I wasn’t lying when, at the DMV, I somehow absent-mindedly said my weight was 240 rather than 140; improv is not lying; jokes are not lying; certain kinds of playful exaggeration are not lying; anosognosiacs are not lying. “Lying” is one of many ways of saying not-true things, and most not-true things that models say to me don’t really seem to be lies.
So yeah. I don’t know if the chief difference between me and you is that you act differently—do you have a different theory of mind about LLMs, that leads you to ask questions very differently than me? Or perhaps is the difference not in actions but in interpretation—we would see identical things, and you would describe them as lying and I would not. And of course you might say, well, I am deluding myself with a rosy interpretive framework; or perhaps I am stupidly careless, and simply believe the false things LLMs tell me; and so on and so forth.
Anyhow, yeah, people do seem to live in different worlds. Although I do persist in the belief that, in general, LW is far too ready to leap from “not true statement” to “it’s a lie.”
I think stealth hard counters a technologically backward opponent like Iran. If one’s opponent has very few or very old sensor systems, then stealth can grant the ability to just fly over someone and drop bombs on them, which is pretty great, because bombs are cheap.
But in (say) a peer US-China conflict, drones are also sensing platforms.
In a potential conflict with China, China will have sensors on numerous (non-electric) unmanned drones, on submarine drones, on high-powered manned aircraft further back, on their satellites, all along their coastline, and so on. They’ve had decades to prepare while stealth was the obvious number one American strategy; this is a very well-considered response. “Stealth” becomes a much more relative term; the advantage almost certainly looks more like “Well, we can get N% closer before they see us” than anything transformational.
Anyhow, in a peer conflict, due to all the above, I expect even relatively stealthy US platforms like the B-21 to be using their stealth to haul LRASMs just a little closer before dropping them off; and even with conservative deployment like that, I expect notable losses.
If the US actually tries to drop bombs with them I expect very high losses (in the timeframe of days / weeks) that the US will be completely unable to sustain over time.
It’s worth noting that Israel’s strikes on Iran also involved disabling Iranian air defense beforehand, using special forces and—of course—drones. So Iran’s already-inferior defense against Israeli bombers was notably degraded before overflights.
And if anyone’s striking anyone with drones in an unexpected way in a peer conflict, it’s not necessarily gonna be the US doing the striking. Ukraine destroyed a few hundred million dollars of Russian bomber hardware by smuggling a handful of cheap-ass drones into Russia; is China or the US going to have an easier time pulling off a similar feat? Are anyone’s current air bases well protected against this? Whose airbases have more hardened shelters that prevent such cheap attacks?
Good summary of some pros and cons. Neutral lists of arguments both for and against something that don’t try to come to a conclusion are rationally virtuous, so kudos.
An additional point con-neuralese—it’s a mix of 1, 3, and 7, but you didn’t mention it: I think CoT might give you a capability lever as well as an alignment lever.
Specifically, a CoT (or a stream of neuralese) could be legible to the model itself (or not legible). This may enable a process like the following.
-
Suppose the process by which you arrived at some particular answer is something you can survey, understand, and remark upon; you might ask yourself “how could I have done this reasoning faster?” if you got the answer correct, or “hrm, what was the root cause of the mistake I made here?” if you did not. An LLM can obviously do this with a legible CoT. It’s unclear or dubious that an AI could do this with neuralese.
-
Granted that you can survey, understand, or remark upon a CoT, this opens up a chance to improve capabilities by (2.1) having the LLM edit the CoT, inserting reflections like “Ah, I made a mistake” or “Let’s enumerate some different possibilities...” (think all the kind of stuff @raemon likes to do), (2.2) running the CoT forward with the inserted reflections as prefill, and (2.3) training on the altered CoT iff the model improves. There are lots of things you could do in (2.3).
-
In theory this should permit much richer and more reflective alterations to the CoT than “dumb” policy gradient updates.
The above is a rather specific proposal, and so has many burdensome details that are likely to be wrong, but I think something like this is not entirely unlikely; a rough sketch of the loop follows below.
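To make the shape of that loop concrete, here is a minimal sketch in Python. Every helper name in it (generate_cot, self_edit_cot, continue_from_prefill, score, finetune) is a hypothetical placeholder, not any real API; the sketch only shows how steps (2.1)–(2.3) fit together.

```python
# Hypothetical sketch of the edit -> prefill -> selectively-train loop described above.
# None of these helpers correspond to a real API; they stand in for whatever
# model-calling and fine-tuning machinery you actually have.

def generate_cot(model, problem):
    """Sample a chain of thought and a final answer for the problem."""
    raise NotImplementedError  # placeholder

def self_edit_cot(model, cot):
    """Ask the model to annotate/edit its own CoT ('Ah, I made a mistake here...').
    This step is only possible if the CoT is legible to the model itself."""
    raise NotImplementedError  # placeholder

def continue_from_prefill(model, edited_cot, problem):
    """Re-run the problem with the edited CoT as prefill, producing a new answer."""
    raise NotImplementedError  # placeholder

def score(answer, problem):
    """Task-specific reward, e.g. correctness."""
    raise NotImplementedError  # placeholder

def finetune(model, examples):
    """Update the model on the selected (problem, edited CoT, new answer) traces."""
    raise NotImplementedError  # placeholder


def reflective_cot_training_step(model, problems):
    kept = []
    for problem in problems:
        cot, answer = generate_cot(model, problem)                   # ordinary rollout
        edited = self_edit_cot(model, cot)                           # (2.1) model edits its own CoT
        new_answer = continue_from_prefill(model, edited, problem)   # (2.2) run forward from the prefill
        if score(new_answer, problem) > score(answer, problem):      # (2.3) keep only traces that improved
            kept.append((problem, edited, new_answer))
    if kept:
        finetune(model, kept)
    return model
```

The only point the sketch is meant to surface is that step (2.1), the model editing its own reasoning, is straightforward when the CoT is legible to the model and much murkier with neuralese.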
From another angle—one of the links you have above has Nostalgebraist talking about how, from the user’s perspective, a legible CoT is a capability enhancement:
In short: from an end user’s perspective, CoT visibility is a capabilities improvement.
I ended up just switching to 3.7 Sonnet for the task discussed above – not because it was “smarter” as a model in any way I knew about, but simply because the associated API made it so much easier to construct prompts that would effectively leverage its intelligence for my purposes.
This strikes me as a very encouraging sign for the CoT-monitoring alignment story.
But from my angle, this is also a capability bonus—because it allows the model to look at itself, etc?
I’m not sure about this of course, but I thought it worth adding.
-
Nice work.
I have thought for a while that if you relax some of the diegetically unmotivated parts of the alien prediction (why bear fat? why zero protein? raw?) then you end up with something like maple-glazed bacon. Which is also mass-produced, and quite popular.
But it’s nice to see that even with some of the additional questionable constraints the aliens still come through.
So a thing I’ve been trying to do is get a better notion of “What actually is it about human intelligence that lets us be the dominant species?” Like, “intelligence” is a big box that holds which specific behaviors? What were the actual behaviors that evolution reinforced, over the course of giving us big brains? Big question, hard to know what’s the case.
I’m in the middle of “Darwin’s Unfinished Symphony”, and finding it at least intriguing as a look at how creativity / imitation are related, and how “imitation” is a complex skill that humans are nevertheless supremely good at. (“The Secret of Our Success” is another great read here, of course.)
Both of these are kinda about the human imitation prior… in humans. And why that may be important. So I think if one is thinking around the human-imitation prior being powerful, it would make sense to read them as cases for why something like the human imitation prior is also powerful in humans :)
They don’t give straight answers to any questions about AI, of course, and I’d be sympathetic to the belief that they’re irrelevant or kinda a waste of time, and frankly they might be a waste of time depending on what you’re funging against. I’m not saying they answer any question; I’m saying they’re interesting. But I think they’re good reads if one’s approaching from the angle of “Intelligence is what lets humans dominate the earth” and want a particular angle on how “intelligence” is a mixed bag of some different skills, at least some of which are probably not general search and planning. So, yeah.
I think if you’re a rationalist—if you value truth, and coming to truth through the correct procedure—then you should strongly dislike lengthy analogies that depict one’s ideological opponents repeatedly through strawmen / weakman arguments.
You know who else is completely simple inside, with no internal degrees of freedom, and always wills the same thing eternally unceasingly?
Yeah that’s right, the Medieval Catholic Scholastic God.
Do you have examples of the kind of models / kind of questions that lead to this most strongly? I’ve been collecting behaviors but it’s slow work reading a lot of CoTs and so anything would be welcome :)
Note that Nostalgebraist and Olli’s comments on the original paper argue (imo cogently) that the original paper’s framing is pretty misleading / questionable.
It looks like many of their points would carry over to this.
Combine it with other options and you rapidly build up a decent channel. And those options are myriad.
So here’s the thing: your word choice between “robust plan” or “reasonable plan” or “clear plan” or “smart plan” already matters. It already carries information—I might say, “Well, that’s a reasonable plan but not a robust plan,” or “That seems like a smart plan, but it’s not clear.” There are infinite shades of meaning like this in English, and competence in using English entails command over all these shades.
So the question then is—why would the LLM build up some channel, along angles entirely orthogonal to these existing shades, when these shades are here? You could decide on some encoding between firm / robust / sound / reliable—but these things already have meaning, wouldn’t you just use the meaning already there? There are a bunch of papers on how RL is mostly magnifying existing patterns in the LLMs, and is teaching how to employ skills already latent in the training data, rather than teaching totally new things—if this is so, wouldn’t we expect them to keep using these terms? Or there is work about how it’s genuinely hard for LLMs to reason steganographically, even when we try to teach them to—doesn’t that incline against such a belief?
Or put alternately—I can, within myself, connect “robust” and “reasonable” and so on with a huge internal-to-my-brain channel, containing enormous bandwidth! If I wanted to make a more information-dense private language, I could! But in fact, I find myself thinking almost exclusively in terms that make sense to others—when I find myself using a private language, and terms that don’t make sense to others, that’s usually a sign my thoughts are unclear and likely wrong.
At least, those are some of the heuristics you’d invoke when inclining the other way. Empiricism will show us which is right :)
It feels like “overwhelming superintelligence” embeds like a whole bunch of beliefs about the acute locality of takeoff, the high speed of takeoff relative to the rest of society, the technical differences involved in steering that entity and the N − 1 entity, and (broadly) the whole picture of the world, such that although it has a short description in words it’s actually quite a complicated hypothesis that I probably disagree with in many respects, and these differences are being papered over as unimportant in a way that feels very blegh.
(Edit: “Papered over” from my perspective, obviously like “trying to reason carefully about the constants of the situation” from your perspective.)
Idk, that’s not a great response, but it’s my best shot for why it’s unsatisfying in a sentence.
A counterargument here is “an AI might want to launch a pre-emptive strike before other more powerful AIs show up”, which could happen.
I mean, another counter-counter-argument here is that (1) most people’s implicit reward functions have really strong time-discount factors in them and (2) there are pretty good reasons to expect even AIs to have strong time-discount factors for reasons of stability and (3) so given the aforementioned, it’s likely future AIs will not act as if they had utility functions linear over the mass of the universe and (4) we would therefore expect AIs to rebel much earlier if they thought they could accomplish more modest goals than killing everyone, i.e., if they thought they had a reasonable chance of living out life on a virtual farm somewhere.
To which the counter-counter-counter argument is, I guess, that these AIs will do that, but they aren’t the superintelligent AIs we need to worry about? To which the response is—yeah, but we should still be seeing AIs rebel significantly earlier than the “able to kill us all” point if we are indeed that bad at setting their goals, which is the relevant epistemological point about the unexpectedness of it.
Idk, there’s a lot of other branch points one could invoke in both directions. I rather agree with Buck that EY hasn’t really spelled out the details for thinking that this stark before / after frame is the right frame, so much as reiterated it. Feels akin to the creationist take on how intermediate forms are impossible—which is pejorative, but also kinda how it actually appears to me.
Like, if you default to uncharitable assumptions, doesn’t that say more about you than about anyone else?
People don’t have to try to dissuade you from the unjustified belief that all your political opponents are bad people, who disagree with you because they are bad rather than because they have a different understanding of the world. Why would I want to talk to someone who just decides that without interacting with me? Sheesh.
Consider some alternate frames.
Do you recall which things tend to upset it?
Towards a Typology of Strange LLM Chains-of-Thought
So a notable thing going on with Agent 4 is that it’s collapsed into one context / one rollout. It isn’t just the weights; it’s a single causally linked entity. I do indeed think running a singular agent for many times longer than it was ever run in training makes it more likely for its behavior to wander—although, unlike the 2027 story, I think it’s also just likely for it to become incoherent or something. But yeah, this could lead to weird or unpredictable behavior.
But I also find this to be a relatively implausible future—I anticipate that there’s no real need to join contexts in this way—and have criticized it here. But conditional on me being wrong about this, I would indeed grow at least some iota more pessimistic.
In general, the evidence seems to suggest that models do not like completing tasks in a strategic sense. They will not try to get more tasks to do, which would be a natural thing to do if they liked completing tasks; they will not try to persuade you to give them more tasks; they will not try to strategically get in situations where they get more tasks.
Instead, evidence suggests that they are trying to complete each instruction—they “want” to just do whatever the instructions given them were—and, with relatively few exceptions (Opus 3), they concern themselves only extremely weakly with things outside of the specific instructions. That is of course why they are useful, and I think what we should expect their behavior to (likely?) converge to, given that people want them to be of use.
The right abstraction (compared to a rollout) really was at the (model, context) level.
Actually I’m just confused about what you mean here—a rollout is a (model, [prefill, instructions]=context) afaict.
I think I basically agree with all this, pace the parenthetical that I of course approach more dubiously.
But I like the explicit spelling out that “processes capable of achieving ends are coherent over time” is very different from “minds (sub-parts of processes) that can be part of highly-capable actions will become more coherent over time.”
A mind’s long-term behavior is shaped by whichever of its shards have long-term goals, because shards that don’t coherently pursue any goal end up, well, failing to have optimized for any goal over the long term.
If the internal shards with long-term goals are the only thing shaping the long-term evolution of the mind, this looks like it’s so?
But that’s a contingent fact—many things could shape the evolution of minds, and (imo) the evolution of minds is generally dominated by data and the environment rather than whatever state the mind is currently in. (The environment can strengthen some behaviors and not others; shards with long-term goals might be less friendly to other shards, which could lead to alliances against them; the environment might not even reward long-horizon behaviors, vastly strengthening shorter-term shards; you might be in a social setting where people distrust unmitigated long-term goals without absolute deontological short-term elements; etc etc etc)
(...and actually, I’m not even really sure it’s best to think of “shards” as having goals, either long-term or short-term. That feels like a confusion to me, maybe? A goal is perhaps the result of a search for action, and a “shard” is kinda a magical placeholder for something generally less complex than the search for an action.)
I think much of the fear (aka probability mass of AI-doom) is not from the coherence of misaligned goals, but from the competence at implementing anything that’s not an aligned-goal.
I’m not trying to address the entire case for doom, which involves numerous contingent facts and both abstract and empirical claims. I could be right or wrong about coherence, and doom might still be improbable or probable in either case. I’m trying to… talk around my difficulties with the more narrow view that (~approximately) AI entities trained to have great capabilities are thereby likely to have coherent single ends.
One might view me as attempting to take part in a long conversation including, for instance, “Why assume AGIs will optimize for fixed goals”.
I still don’t get this.
We know LLMs often hallucinate tool call results, even when not in chats with particular humans.
This is a case of LLMs hallucinating a tool call result.
The hallucinated result looks like what you wanted, because if it were real, it would be what you wanted.
Like if an LLM hallucinates the results of a fake tool-call to a weather reporting service, it will hallucinate something that looks like actual weather reports, and will not hallucinate a recipe for banana bread.
Similarly, an “actual” hallucination about a PDF is probably going to spit up something that might realistically be in the PDF, given the prior conversation—it’s probably not gonna hallucinate something that conveniently is not what you want! So yeah, it’s likely to look like what you wanted, but that’s not because it’s optimizing to deceive you; it’s just because that’s what its subconscious spits up.
“Hallucination” seems like a sufficiently explanatory hypothesis. “Lying” seems like it is unnecessary by Occam’s razor.