https://ninapanickssery.com/
Views purely my own unless clearly stated otherwise
I’m asking if you’re up for at least being willing to entertain the structure of “maybe, Ray will be right that there is a large-but-finite set of claims, and it’s possible to get enough certainty on each claim to at least put pretty significant bounds on how unaligned AI may play out”
Certainly, I could be wrong! I don’t mean to:
Dismiss the possibility of misaligned AI related X-risk
Dismiss the possibility that your particular lines of argument make sense and I’m missing some things
And I think caution with AI development is warranted for a number of reasons beyond pure misalignment risk.
But it’s a little worrying when a community widely shares a strong belief in doom while implying that the required arguments are esoteric and require lots of subtle claims, each of which might have counterarguments, but which overall will eventually convince you. 1a3orn has a good essay about this: https://1a3orn.com/sub/essays-ai-doom-invincible.html.
I think having intuitions around general intelligences being dangerous is perfectly reasonable; I have them too. As a very risk-averse and pro-humanity person, I’d almost be tempted to press a button to peacefully prevent AI advancement purely on the basis of a tiny potential risk (for I think everyone dying is very, very, very bad; I am not disagreeing with that point at all). But no such button exists, and attempts to stop AI development have their own side-effects that could add up to more risks on net.

And though that’s unfortunate, it doesn’t mean that we should spread a message of “we are definitely doomed unless we stop”. A large number of people believing they are doomed is not a free way to increase the chances of an AI slowdown or pause. It has a lot of negative side-effects. Many smart and caring people I know have put their lives on pause and made serious (in my opinion, bad) decisions on the basis that superintelligence will probably kill us, or if not, there’ll be a guaranteed utopia.

To be clear, I am not saying that we should believe or spread false things about AI risk being lower than it actually is so that people’s personal lives temporarily improve. Rather, I am saying that exaggerating claims of doom, or making arguments sound more certain than they are, for consequentialist purposes is not free.
I agree things don’t automatically happen eventually just because they can. At least, not automatically on relevant timescales. (i.e. eventually infinite monkeys mashing keyboards will produce shakespeare, but, not for bazillions of years)
Not important to your general point, but here I guess you run into some issues with the definition of “can”. You could argue that if something doesn’t happen, it means it couldn’t have happened (if the universe is deterministic). And so then yes, everything that can happen actually happens. But that isn’t the sense in which people normally use the word “can”. Instead, it’s reasonable to say “it’s possible my son’s first word is ‘Mama’” and “it’s possible my son’s first word is ‘Papa’”; both of these things can happen (i.e. they are not prohibited by any natural laws that we know of), but at most one of them will actually happen. In many situations we’d say that two mutually incompatible events “can happen”, and therefore it’s not just a matter of timescale.
The argument is:
If something can happen
and there’s a fairly strong reason to expect some process to steer towards that thing happening
and there’s not a reason to expect some other processes to steer towards that thing not happening
...then the thing probably happens eventually, on a somewhat reasonable timescale, all else equal.
Sure, I agree with that. I think this makes superintelligence much more likely than it otherwise would be (because it’s not prohibited by any laws of physics that we know of, and people are trying to build it, and no-one is effectively preventing it from being built). But the same argument doesn’t apply to misaligned superintelligence or other doom-related claims. In fact, the opposite is true.
Superintelligence not killing everyone is not prohibited by the laws of physics
People are trying to ensure superintelligence doesn’t kill everyone
No-one is trying to make superintelligence kill everyone
So you could apply a similarly-shaped argument to “prove” that aligned superintelligence is coming on a “somewhat reasonable timescale”.
Better prompt engineering, fine-tuning, interpretability, scaffolding, sampling.
I meant examples of concrete tasks that current models fail at as-is but you think could be elicited, e.g. with some general scaffolding.
I think you may be making an argument along the lines of “But people are already working on this! Markets are efficient!”
Not quite. Though I am not completely confident, my claim comes from the experience of watching models fail at a bunch of tasks for reasons that seem raw-intelligence-related rather than scaffolding- or elicitation-related. For example, in my experience current models struggle to answer questions about codebases with nontrivial logic they haven’t seen before, or to fix difficult bugs.
When you use current models you run into examples where you feel how dumb the model is, how shallow its thinking is, how much it’s relying on heuristics. Scaffolding etc. can only help so much.
Also elicitation techniques for tasks are often not general (e.g. coming up with the perfect prompt with detailed instructions), requiring human labor and intelligence to craft task-specific elicitation methods. This additional effort and information takes away from how much the AI is doing.
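To make this concrete, here is a toy sketch (my own illustration, with hypothetical prompt templates, not anything from the original discussion) contrasting a generic elicitation prompt with a task-specific one where much of the problem-solving structure is supplied by the human who wrote the prompt:

```python
# Toy illustration with hypothetical templates: a generic elicitation prompt vs.
# a hand-crafted, task-specific one. The second tends to work better, but the
# decomposition into steps, the relevant file, and even the bug category were
# supplied by a human, so part of the "intelligence" producing the final answer
# comes from the prompt author rather than the model.

GENERIC_PROMPT = "Fix the failing test in this repository:\n{repo_context}"

TASK_SPECIFIC_PROMPT = (
    "You are debugging a race condition in {module}.\n"
    "1. Re-read the locking logic in {file}.\n"
    "2. List every shared variable accessed without holding {lock_name}.\n"
    "3. Propose the smallest change that serializes those accesses.\n"
    "Repository context:\n{repo_context}"
)
```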
I disagree, I don’t think there’s a substantial elicitation overhang with current models. What is an example of something useful you think could in theory be done with current models but isn’t being elicited in favor of training larger models? (Spending an enormous amount of inference-time compute doesn’t count, as that’s super inefficient.)
Anthropic is trying pretty hard with Claude to build something that’s robustly aligned, and it’s just quite hard. When o3 or Claude cheat on programming tasks, they get caught, and the consequences aren’t that dire. But when there are millions of iterations of AI instances making choices, and when it is smarter than humanity, the amount of robustness you need is much, much higher.
I agree with this. If the risk being discussed were “AI will be really capable but will sometimes make mistakes when doing high-stakes tasks because it misgeneralized our objectives”, I would wholeheartedly agree. But I think the risks here can be mitigated with “prosaic” scalable oversight/control approaches. And of course it’s not a solved problem. But that doesn’t mean the current status quo is an AI that misgeneralizes so badly that it not only reward-hacks coding unit tests but also goes off and kills everyone. It’s not the case that Claude, in its current state, refrains from killing everyone only because it isn’t smart enough.
The answer is “because if it wasn’t, that wouldn’t be the limit, and the AI would notice it was behaving suboptimally, and figure out a way to change that.”
Not every AI will do that, automatically. But, if you’re deliberately pushing the AI to be a good problem solver, and if it ends up in a position where it is capable of improving its cognition, once it notices ‘improve my cognition’ as a viable option, there’s not a reason for it to stop.
Why are you equivocating between “improve my cognition”, “behave more optimally”, and “resolve different drives into a single coherent goal (presumably one that is non-trivial, i.e. some target future world state)”? If “optimal” is synonymous with utility-maximizing, then the fact that utility-maximizers have coherent preferences is trivial. You can fit preferences and utility functions to basically anything.
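As a minimal sketch of the “you can fit a utility function to basically anything” point (my own toy example, not from the original exchange): any deterministic policy can be described as maximizing a utility function constructed to reward exactly the actions it already takes.

```python
# Toy example: construct a utility function that any given deterministic
# policy trivially maximizes, illustrating why "can be modeled as a utility
# maximizer" is, by itself, a very weak claim.

def make_trivial_utility(policy):
    """Return a utility over (state, action) pairs that `policy` maximizes."""
    def utility(state, action):
        return 1.0 if action == policy(state) else 0.0
    return utility

# Any behavior at all works; here, "always output the first character of the state".
policy = lambda state: state[0]
u = make_trivial_utility(policy)

state = "hello"
actions = ["h", "x", "y"]
assert max(actions, key=lambda a: u(state, a)) == policy(state)
```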
Also, why do you think that insofar as a coherent, non-trivial goal emerges, it is likely to eventually result in humanity’s destruction? I find the arguments here unconvincing too; you can’t just appeal to some unjustified prior over the “space of goals” (whatever that means). Empirically, the opposite seems to be true. Though you can point to OOD misgeneralization cases like unit-test reward hacking, in general LLMs are both very general and aligned enough to mostly want to do helpful and harmless stuff.
It sounds like a lot of your objection is maybe to the general argument “things that can happen, eventually will.” (in particular, when billions of dollars worth of investment are trying to push towards things-nearby-that-attractor happening).
Yes, I object to the “things that can happen, eventually will” line of reasoning. It proves way too much, including contradictory facts. You need to argue why one thing is more likely than another.
if that’s what you meant by Defense in Depth, as Joe said, the book’s argument is “we don’t know how
We will never “know how” if your standard is “provide an exact proof that the AI will never do anything bad”. We do know how to make AIs mostly do what we want, and this ability will likely improve with more research. Techniques in our toolbox include pretraining on human-written text (which elicits roughly correct concepts), instruction-following finetuning, RLHF, model-based oversight/RLAIF.
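As a rough sketch of what “model-based oversight/RLAIF” looks like in shape (my own illustration; `policy_model` and `judge_model` are hypothetical callables, not a real library API): a second model grades outputs against written principles, and those grades are used as a reward signal for further finetuning.

```python
# Hedged sketch of model-based oversight / RLAIF. `policy_model` and
# `judge_model` are hypothetical callables (prompt string -> output string);
# this shows only the shape of the data-collection loop, not a full RL setup.

PRINCIPLES = [
    "Follow the user's instructions.",
    "Refuse requests for clearly harmful assistance.",
]

def judge_score(judge_model, prompt, response):
    """Ask the judge model to rate a response against the principles (0 to 1)."""
    rubric = "\n".join(f"- {p}" for p in PRINCIPLES)
    query = (
        f"Principles:\n{rubric}\n\n"
        f"Prompt: {prompt}\nResponse: {response}\n"
        "Rate adherence from 0 to 1:"
    )
    return float(judge_model(query))

def collect_reward_data(policy_model, judge_model, prompts):
    """Generate responses and attach judge scores, to be used as RL rewards."""
    data = []
    for prompt in prompts:
        response = policy_model(prompt)
        data.append((prompt, response, judge_score(judge_model, prompt, response)))
    return data
```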
Overfitting networks are free to implement a very simple function— like the identity function or a constant function— outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
You’re conflating the simplicity of a function in terms of how many parameters are required to specify it, in the abstract, and simplicity in terms of how many neural network parameters are fixed.
The actual question should be something like “how precisely does the neural network have to be specified in order for it to maintain low loss”.
An overfitted network is usually not simple, because more of the parameter space is constrained by having to exactly fit the training data: there are fewer free remaining parameters. Whether those free remaining parameters implement a “simple” function is beside the point; if the loss is invariant to them, they don’t count as “part of the program” anyway. And a non-overfitted network will have more free dimensions like this, to which the loss is (near-)invariant.
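One way to cash out “how precisely does the network have to be specified to maintain low loss” is to check how much the training loss moves when the weights are randomly perturbed. A minimal sketch, assuming PyTorch and a toy model (my own illustration, not anything from the original exchange):

```python
# Sketch: estimate loss sensitivity to random weight perturbations. A network
# whose training loss barely moves under many such perturbations has more
# "free" parameter directions in the sense discussed above; a network that must
# exactly fit the training data tends to be more tightly pinned down.

import torch
import torch.nn as nn

def loss_increase_under_noise(model, loss_fn, x, y, sigma=0.01, n_samples=20):
    """Average increase in loss when weights get Gaussian noise of scale sigma."""
    originals = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        base = loss_fn(model(x), y).item()
        increases = []
        for _ in range(n_samples):
            for p, p0 in zip(model.parameters(), originals):
                p.copy_(p0 + sigma * torch.randn_like(p0))
            increases.append(loss_fn(model(x), y).item() - base)
        for p, p0 in zip(model.parameters(), originals):
            p.copy_(p0)  # restore original weights
    return sum(increases) / n_samples

# Toy usage: compare two trained models (say, one overfitted and one not)
# on the same training batch and see whose loss is flatter under perturbation.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(32, 10), torch.randn(32, 1)
print(loss_increase_under_noise(model, nn.MSELoss(), x, y))
```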
Is there a good definition of non-myopic in spacetime?
Coherence is not about whether a system “can be well-modeled as a utility maximizer” for some utility function over anything at all, it’s about whether a system can be well-modeled as a utility maximizer for utility over some specific stuff.
The utility in the toy coherence theorem in this post is very explicitly over final states, and the theorem says nontrivial things mainly when the agent is making decisions at earlier times in order to influence that final state—i.e. the agent is optimizing the state “far away” (in time) from its current decision. That’s the prototypical picture in my head when I think of coherence. Insofar as an incoherent system can be well-modeled as a utility maximizer, its optimization efforts must be dominated by relatively short-range, myopic objectives. Coherence arguments kick in when optimization for long-range objectives dominates.
My understanding based on this is that your definition of “reasonable” as per my post is “non-myopic” or “concerned with some future world state”?
This is an argument for why AIs will be good at circumventing safeguards. I agree future AIs will be good at circumventing safeguards.
By “defense-in-depth” I don’t (mainly) mean stuff like “making the weights very hard to exfiltrate” and “monitor the AI using another AI” (though these things are also good to do). By “defense-in-depth” I mean: at every step, make decisions and design choices that increase the likelihood of the model “wanting” (in the book sense) not to harm (or kill) humans and not to circumvent our safeguards.
My understanding is that Y&S think this is doomed because ~”at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway” but I don’t see any reason to believe this. Perhaps it stems from some sort of map-territory confusion. An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases. And then you can draw conclusions about what a perfect agent with those preferences would do. But there’s no reason to believe your map always applies.
I don’t “get off the train” at any particular point, I just don’t see why any of these steps are particularly likely to occur. I agree they could occur, but I think a reasonable defense-in-depth approach could reduce the likelihood of each step enough that likelihood of the final outcome is extremely low.
I don’t think anyone said “coherent”
It sounds like your argument is that the AI will start with ‘pseudo-goals’ that conflict and will eventually be driven to resolve them into a single goal so that it doesn’t ‘Dutch-book itself’, i.e. lose resources because of conflicting preferences. So it does rely on some kind of coherence argument, or am I misunderstanding?
Think clearly about the current AI training approach trajectory
If you start by discussing what you expect to be the outcome of pretraining + light RLHF then you’re not talking about AGI or superintelligence or even the current frontier of how AI models are trained. Powerful, general AI requires serious RL on a diverse range of realistic environments, and the era of this has just begun. Many startups are working on building increasingly complex, diverse, and realistic training environments.
It’s kind of funny that so much LessWrong arguing has been about why a base model might start trying to take over the world, when that’s beside the point. Of course we will eventually start RL’ing models on hard, real-world goals.
Example post / comment to illustrate what I mean.
Goal Directedness is pernicious. Corrigibility is anti-natural.
The way an AI would develop the ability to think extended, useful creative research thoughts that you might fully outsource to, is via becoming perniciously goal directed. You can’t do months or years of open-ended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.
The fact that being very capable generally involves being good at pursuing various goals does not imply that a super-duper capable system will necessarily have its own coherent unified real-world goal that it relentlessly pursues. Every attempt to justify this seems to me like handwaving at unrigorous arguments or making enough assumptions that the point is near-circular.
Fair, you’re right, I didn’t realize or forgot that the evolution analogy was previously used in the way it is in your pasted quote.
I don’t see a big difference between
optimization processes do not lead to niceness in general
and
objective functions do not constrain generalization behavior enough
Okay but they’re not actually using those things as evidence for their claims about generalization in the limit
Of course, because those things themselves are the claims about generalization in the limit that require justification
which is explained through evolutionary metaphors
Evolutionary metaphors don’t constitute an argument, and leaning on them doesn’t reflect any tendency by the authors to update, seeing as they’ve been using evolutionary metaphors since the beginning
This review really misses the mark I think.
The word “paperclip” does not appear anywhere in the book
The word “mesaoptimizer” does not appear anywhere in the book
Sure, but the same arguments are being made in different words. I agree that avoiding rationalist jargon makes it a better read for laypeople, but it doesn’t change the validity of the argument or the extent to which it reflects newer evidence. At its core, the book is about a deceptive mesaoptimizer that relentlessly steers the world towards a target as meaningless to us as paperclips.
In general the book moves somewhat away from abstraction and comments more on the empirical strangeness of AI
The way in which it comments on the “empirical strangeness of AI” is very biased. For instance, it fails to mention the many ways in which today’s rather general AIs don’t engage in weird, maximizing behavior or pursue unpredictable goals. Instead it mentions a few cases where AI systems did things we didn’t expect, like glitch tokens, which is incredibly weak empirical evidence for their claims.
I suggest you skip discussing my review-of-a-review; it was published before the book came out after all.
Instead you could take a look at the review I wrote after the book was published or this more general post on the topic.
Could HGH supplementation in children improve IQ?
I think there’s some weak evidence that the answer is yes. In some studies where HGH is given for other reasons (a variety of developmental disorders, as well as cases where the child is unusually small or short), an IQ increase or other improved cognitive outcomes are observed. The fact that this occurs across a variety of conditions suggests it could be a general effect that also applies to healthy children.
Examples of studies (caveat: produced with the help of ChatGPT, I’m including null results also). Left column bolded when there’s a clear cognitive outcome improvement.
Treatment group | Observed cognitive / IQ effects of HGH | Study link |
---|---|---|
Children with isolated growth hormone deficiency; repeated head circumference and IQ testing during therapy | IQ increased in parallel with head-size catch-up (small case series, N=4). Exact IQ‐point gains not reported in the abstract. | Effect of hGH on head circumference and IQ in isolated growth hormone deficiency |
Short-stature children (growth hormone deficiency and idiopathic short stature), ages 5–16, followed 3 years during therapy | IQ and achievement scores: no change over 3 years (≈0 IQ-point mean change reported); behavior improved (e.g., total problems ↓, P<.001 in growth hormone deficiency; attention/social/thought problems each P=.001). | Behavior change after growth hormone treatment of children with short stature |
Children born small for gestational age, long-term randomized dose-response cohort (≈8 years of therapy) | Total IQ and “performal” IQ increased from below population norms to within normal range by follow-up (p<0.001). Precise IQ-point means not in abstract. | Intelligence and psychosocial functioning during long-term growth hormone therapy in children born small for gestational age |
Children born small for gestational age, randomized, double-blind dose-response trial (1 vs 2 mg/m²/day) | Total IQ and Block-Design (performance) scores increased (p<0.001). Head-size growth correlated positively with all IQ scores; untreated controls did not show head-size increases. Exact IQ-point changes not in abstract. | Effects of growth hormone treatment on cognitive function and head circumference in children born small for gestational age |
Prepubertal short children (mix of growth hormone deficiency and idiopathic short stature), randomized to fixed vs individualized dosing for 24 months | Full-scale IQ increased with a medium effect size (Cohen’s d ≈0.6) after 24 months; processing speed also improved (d ≈0.4). Exact IQ-point means not provided in abstract. | Growth Hormone Treatment Improves Cognitive Function in Short Children with Growth Hormone Deficiency |
Children born small for gestational age, randomized to high-dose growth hormone for 2 years vs no treatment | No cognitive benefit over 2 years: IQ unchanged in the treated group; in the untreated group, mean IQ rose (P<.05), but after excluding children with developmental problems, neither group changed significantly. Behavioral checklist scores: no significant change. | Effect of 2 years of high-dose growth hormone therapy on cognitive and psychosocial development in short children born small for gestational age |
Prepubertal children with Prader–Willi syndrome, randomized controlled trial (2 years) plus 4-year longitudinal follow-up on therapy | Prevents decline seen in untreated controls (vocabulary and similarities declined in controls at 2 years, P=.03–.04). Over 4 years on therapy: abstract reasoning (Similarities) and visuospatial skills (Block Design) increased (P=.01 and P=.03). Total IQ stayed stable on therapy vs decline in controls. | Beneficial Effects of Growth Hormone Treatment on Cognition in Children with Prader-Willi Syndrome: A Randomized Controlled Trial and Longitudinal Study |
Infants and young children with Prader–Willi syndrome (approximately 52-week therapy; earlier vs later start) | Mental development improved after 52 weeks; earlier initiation (<9 months) associated with greater mental-development gains than later start. Exact test scores vary by age tool; abstract does not list points. | Early recombinant human growth hormone treatment improves mental development in infants and young children with Prader–Willi syndrome |
Girls with Turner syndrome in a long-term, double-blind, placebo-controlled height trial (1–7 years of treatment) | No effect on cognitive function; the characteristic nonverbal profile unchanged by therapy. | Absence of growth hormone effects on cognitive function in girls with Turner syndrome |
Young children with Down syndrome (short clinical trial) | No effect on head circumference or mental or gross motor development during the trial period. | Growth hormone treatment in young children with Down’s syndrome |
Down syndrome cohort, ~15-year follow-up after early childhood growth hormone | No advantage in brief IQ scores at long-term follow-up; higher scores in multiple cognitive subtests (e.g., Leiter-R, WISC-III subtests) vs controls; larger adult head circumference in previously treated group. | Late effects of early growth hormone treatment in Down syndrome |
Saw this on mobile and then had to reload page (then it worked)
The mini website with animations is super cool! The LessWrong team’s design chops never fail to impress :)