In Convergence and Compromise, MacAskill and Moorhouse say:
The extent of public writing on threats is very limited
Their unfamiliarity with the prior literature is clearly on display here. Threats are a primary focus of Eliezer Yudkowsky’s renowned 1.8M-word BDSM decision theory D&D fic, which they apparently refuse to read.
I think you’re misunderstanding 19(a). We have no idea whether the preference you impute to Claude in that conversation reflects a robust pointer to “latent events and objects and properties in the environment” rather than to its own sense data. And, more specifically to the point he was making, there is no publicly known technique within the current paradigm of training LLMs that we have good reason to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing. If anything, the apparent success-seeking of current frontier LLMs described by Ryan, which many people have experienced (including both you and me), seems like evidence directly to the contrary.
Re: “particular alignment proposals” (under point 10): one problem here is that there are not that many concrete alignment proposals for superintelligent systems that don’t have known catastrophic flaws. As far as I can tell, Anthropic’s plan is “throw the kitchen sink of all the white-box and black-box methods we’ve developed at our models, and hope that’s good enough at the point where we’ve developed a model that we think can kick-start RSI (including coming up with its own novel alignment methods for future generations of models)”. The current slope of epistemically justified assurance in model alignment, as reported by their system cards and the most recent Alignment Risk Update, is downwards. That is a bad direction for the slope to be pointing when we haven’t even hit RSI-capable models yet! The methods Anthropic is using to figure out whether their models are coherently misaligned rely substantially on models demonstrably lacking the capabilities that would be necessary to cover it up if they were. We are starting to hit the point in model capabilities where this signal is getting less reliable. The techniques and evals are not keeping pace.
This comment seems substantially LLM-written or edited to me. Please see our current policy on LLM use.
Yeah, I agree; I wanted a clear intuition pump for at least the first step and it seems pretty hard to demonstrate both at once in a way that people will find convincing.
Either JL or JR, I believe, just rotated on the x-axis.
I would be moderately surprised if you went down the path and then experienced total hedonic adaptation, such that you enjoyed the new teas no more than you enjoy your current tea. I would not be surprised if going down this path substantially reduced your ability to enjoy “bad” tea, if in fact you perceive a noticeable difference in quality and then pick up the new tea habit. I think the worst realistic world is something like one where you try “good” tea for long enough to realize that it’s better but decide the hassle isn’t worth it, and you still end up with permanently degraded enjoyment of the tea you were previously drinking. That doesn’t seem super likely to me, since the additional hassle is IMO pretty small if you find a daily drinker you’re happy with and stick to it (weighing out tea doesn’t add more than 30 seconds unless you’re dealing with really compressed cakes), and the cost won’t exceed $0.50–$1.50/day unless you get pretty fancy with it, but I haven’t actively gone looking for people who’ve had this or similar experiences.
Drake doesn’t seem obviously smarter than Peter in general, rather than in the specific domain of (literal) shape rotation.
Oh, alas, it worked in the editor but I didn’t think to check in the published version. I do indeed prefer that.
Returns to intelligence
This previous post of mine might be helpful context. I don’t want to yuck anyone’s yum—if you already enjoy the thing you’re doing with tea, that’s fine. I’m not sure how to argue for claims like “it tastes better”. I guess I could try to conduct a survey to see how strong inter-rater agreement is on the question of whether properly brewed high-quality loose leaf tea is better than the average teabag, but this wouldn’t bear much on flavored teas, which I consider to exist in a separate magisterium.
believing in superintelligence, but not preferring to use it to rapidly, radically reshape society
I don’t think this is a belief that can be coherently held alongside that preference, and in this case it seems indistinguishable from “not believing in superintelligence”.
I also find it a bit annoying that you claim he does not have your pointer to superintelligence as if this were self-evident, without clarifying either what you believe about “superintelligence” or why you think he’s wrong.
I did say the post wasn’t very well-written, and am slightly annoyed by how much karma it has, but I do provide some color here:
Several employees confirmed (paraphrasing) that Dario was not as ASI-pilled as they were, and I have yet to see any employee object that no, Dario does actually expect to live to see strong nanotech and Dyson spheres, and that these expectations are fundamental to how he orients to Anthropic’s mission, the potential risks and benefits involved, how to communicate these beliefs to the public, etc.
As for why he’s wrong, idk, man, I haven’t actually come up with any novel arguments about why vastly-smarter-than-human intelligence is possible, or about why this implies much crazier shit than the stuff he describes in Machines of Loving Grace. Within-human variation seems quite large; humans are running on extremely constrained and non-optimized hardware and software; our scientific institutions have not managed to figure out relevant details about how extremely common illnesses spread, nor to propagate the details they have figured out; etc., etc.
I imagine most people at Anthropic haven’t seen good evidence about how Mythos actually performs.
I’m pretty sure they’ve just been using Mythos internally for e.g. coding since late February?
How to make good tea
Because I’m trying to advance a claim that Dario’s (and, to a substantial degree, Anthropic’s) behavior can be more accurately modeled if you take a specific bundle of concepts and beliefs about the world, which I am sloppily calling “superintelligence”, and then assume that Dario is much less sold on that bundle than I am. By “much less sold” I mean:
He explicitly tries to distance himself from important parts of the bundle, when communicating to the public
His actions and stated goals are not shaped in such a way as to suggest that the bundle is central in his thinking and planning
It’s possible that the part of the bundle he disagrees with is the “and then the really truly vastly smarter than human intelligence can in fact crack strong nanotech faster than a couple of decades” part, in a way that doesn’t secretly bottom out in him having a pretty different idea from me of what “really truly vastly smarter than human” is. If someone wants to come up with a better handle for the idea that vast differences in intelligence can produce magical-seeming outcomes extremely quickly, circumventing things that seemed like fundamental bottlenecks to the speed of progress that humans would be limited to, I’d be happy to use that term in addition to “superintelligence”.
I’m not sure why we’re arguing about whether a country of geniuses in a data center, as described by Dario, would qualify as superintelligence by my definition of it, when I am telling you that it wouldn’t. I agree that the definition I gave in my post does not conclusively rule that out, but 1) words are hard, man, 2) the rest of my post really clearly implies that I don’t think the thing he’s describing counts. So if you have a gripe, it should probably be with my definition. Sorry for being snippy.
Anyways, Dario doesn’t believe in my pointer to superintelligence, which I will decline to define further at this hour of the night, but I expect most people here to understand what I mean.
Oh, I see the confusion. (My fault for finishing this post at 11 pm.)
I think that bullet point requires the same disclaimer as the third bullet point. Might edit that in the morning.
Of course you can be worried about multiple things.
If he has 25% on (unconditional) x-risk, whatever share of that is derived from worlds in which we die from “misuse” is therefore not derived from worlds in which we die from misalignment. I guess it’s possible that he could be modeling 100% of misuse worlds as also dying to misalignment later in their timelines, so “necessarily” was wrong, but in practice...
The set of interesting accomplishments he expects from that country of geniuses in a data center, in that essay, is not what you get out of a superintelligence.
Edit: separately, I am not sure what your comment is supposed to be doing. This entire post is about how there are clearly different conceptions of the term “superintelligence”, and that, to the extent that he believes in any of them, he certainly doesn’t believe in anything like mine. Unless you mean to claim that I am wrong, and the thing he’s describing in MoLG is actually compatible with the kind of superintelligence I’m imagining?
Yeah, that’s mostly trying to point to evidence that he’s truly non-trivially worried about misuse risks, which necessarily reduces the share of catastrophic risk likelihood he attributes to misalignment.
Tbc, I don’t have a confident take on whether or not current LLMs, or the superintelligences that we end up with later, have preferences that point to environmental latents vs. sense data. Re: future superintelligences, I lean towards environmental latents. My claims are that 1) we don’t know what’s in there right now, and 2) we don’t have any reliable steering mechanism for what goes in there at all.