I’m not sure whether I fall into the bucket of people you’d consider this an answer to. I do think there’s something important in the region of LLMs that, by vibes if not by explicit statements of contradiction, seems incompletely propagated in the agent-y discourse, even though it fits fully within it. At minimum, I have a set of intuitions that overlap heavily with those of some of the people you’re trying to answer.
In case it’s informative, here’s how I’d respond to this:
> Well, I claim that these are more-or-less the same fact. It’s no surprise that the AI falls down on various long-horizon tasks and that it doesn’t seem all that well-modeled as having “wants/desires”; these are two sides of the same coin.
Mostly agreed, with the capability-related asterisk.
> Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one’s plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.
Agreed in the spirit in which I think this was meant, but I’d rephrase it: a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target will tend to be better at reaching that target than a system that doesn’t.
That’s subtly different from individual systems having convergent internal reasons for taking the same path. This distinction mostly disappears in some contexts, e.g. selection in evolution, but it is meaningful in others.
> If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I’ll say it “wants” that outcome “in the behaviorist sense”.
I think this frame is reasonable, and I use it.
> it’s a little hard to imagine that you don’t contain some reasonably strong optimization that strategically steers the world into particular states.
Agreed.
> that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.
Agreed.
> “AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.
Agreed, for a large subset of architectures. Any training involving the equivalent of extreme optimization for sparse/distant reward in a high-dimensional, complex context seems to effectively guarantee this outcome.
> So, maybe don’t make those generalized wrench-removers just yet, until we do know how to load proper targets in there.
Agreed, don’t make the runaway misaligned optimizer.
I think there remains a disagreement hiding within that last point, though. For me, the real update from LLMs is:
1. We have a means of reaching extreme levels of capability without necessarily exhibiting preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn’t route through the world.
2. It’s remarkably easy to elicit this form of extreme capability and point it at guiding itself. This isn’t some incidental detail; it arises from the core process that the model learned to implement.
3. That core process is learned reliably because the training process that yields it leaves no room for anything else. The objective is not a sparse/distant reward target; it is a profoundly constraining and informative one (see the sketch below).
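To gesture at what “profoundly constraining” means here, a toy back-of-the-envelope sketch; the vocabulary size and context length are hypothetical round numbers of mine, not anything from the post:

```python
import math

# Toy comparison of supervision density (illustrative numbers only).
vocab_size = 50_000          # hypothetical vocabulary size
tokens_per_sequence = 2_048  # hypothetical context length

# Next-token prediction: every position constrains the model's full output
# distribution. A maximally ignorant (uniform) predictor pays
# log2(vocab_size) bits of surprisal at each of the 2,048 positions.
bits_per_token = math.log2(vocab_size)
dense_constraints = tokens_per_sequence

# Sparse/distant reward: one scalar at the end of an equally long episode,
# with no direct attribution to any intermediate decision.
sparse_constraints = 1

print(f"next-token prediction: {dense_constraints} constraints per sequence, "
      f"~{bits_per_token:.1f} bits each")
print(f"sparse terminal reward: {sparse_constraints} scalar per episode")
```

The exact numbers don’t matter; the point is that the pretraining objective pins the learned process down at every single step, which is why I’d expect that process to be learned reliably.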
In other words, a big part of the update for me was in having a real foothold on loading the full complexity of “proper targets.”
I don’t think what we have so far constitutes a perfect and complete solution: the nice properties could be broken, paradigms could shift and blow up the golden path, it doesn’t rule out doom, and so on. But diving deeply into this has made many convergent-doom paths look dramatically less likely to Late2023!porby than they did to Mid2022!porby.
Great post! I think this captures a lot of why I’m not ultradoomy (only, er, 45%-ish doomy, at the moment), especially A and B. I think it’s at least possible that our reality is on easymode, where muddling could conceivably put an AI into close enough territory to not trigger an oops.
I’d be even less doomy if I agreed with the counterarguments in C. Unfortunately, I can’t shake the suspicion that superintelligence is the kind of ridiculously powerful lever that would magnify small oopses into the largest possible oopses.
Hypothetically, if we took a clever human’s general capacity for problem solving, stripped it of limitations like getting bored or tired, got rid of its pesky intuitions around ethics, and sped it up by a factor of 1,000… I’d be very worried about what it would be able to do. Even without greater capacity for insight or an enhanced working memory, simply thinking really fast would be a broken superpower.
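For a sense of scale, simple arithmetic on the hypothetical 1,000x figure above (nothing here is load-bearing beyond that one number):

```python
# Back-of-the-envelope arithmetic for a 1,000x serial speedup (illustrative).
speedup = 1_000

subjective_years_per_real_day = speedup / 365   # ~2.7 years of thought per day
subjective_years_per_real_year = speedup        # a millennium per calendar year

print(f"~{subjective_years_per_real_day:.1f} subjective years per real day")
print(f"~{subjective_years_per_real_year:,} subjective years per real year")
```

A calendar year at that rate holds a millennium of uninterrupted thought; that’s the sense in which mere speed looks like a broken superpower.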
Such an entity might not be able to recreate the technology of modern civilization from scratch (in both resources and knowledge) in the stone age within 30 years, primarily because of physical interaction requirements. But starting from anything like modern civilization? That would get weird fast.
In other words, the intelligence range of humans (or even the range across animals and humans) seems small compared to what is artificially possible even if we consider only speed. And it seems very likely at this point that a well-built artificial mind could have higher-quality insights too; MuZero certainly seems to, within its domain. I don’t take much comfort from the fact that observable intelligence differences haven’t always resulted in domination.