What are the odds that we get a sufficiently scary warning shot that governments can rapidly coordinate to implement a pause in a short time? “We shouldn’t pause now, because maybe we will get a warning shot and then pause later” seems like a foolhardy plan to me.
MichaelDickens
Two quick thoughts:
1. This is a neat idea, it’s difficult to come up with safe preferences to encode in an ASI, and the concept of strong risk-aversion might help.
2. A major obstacle (which I didn’t see listed in section 8) is that currently we have no idea how to embed any set of preferences whatsoever in an ASI.
2b. If we figure out how to encode risk-averse preferences in an ASI, then I’m not sure it makes sense to speak of it as “misaligned”, because clearly we do know how to get it to pursue goals that we care about. It seems weird to expect that we won’t know how to make ASI not want to tile the universe with paperclips, but we will know how to make it want to risk-aversely tile the universe with paperclips.
I think section 10 is pointing at something similar. I find it at least somewhat plausible that RL on risk aversion generalizes better than other kinds of RL. I would still be surprised if we could get risk aversion to generalize to ASI using anything resembling current techniques, but this seems like a better-than-average idea for preventing AI takeover.
Well yes, but there are not many goals people land on when they set out from a starting point of “find the most important thing I can work on”. Animal welfare is one of those goals. It’s also true that (e.g.) donating to build a new wing in an art museum is dominated by post-ASI futures, but the sort of people who donate to build museum wings don’t care about questions like “how does ASI interact with my goal of expanding art exhibits?”
Not unrelatedly, it is strategically important to make arguments about what animal welfare people should be doing, because the value of their work is high and it’s important that their work be steered well. I don’t care nearly as much about whether museum-wing-donators are missing strategic considerations.
I would expect one’s ability to “correlate knowledge” to be strongly related to one’s intelligence, but it’s not clear to me that a super-LLM would have a near-zero probability of accidentally deleting humanity. What level of knowledge-correlation ability is required to consistently avoid fatal mistakes? The level of the average LW user is not high enough, because we still do things like poison ourselves.
I don’t get the point of the re-definition. What’s to stop an AI company from building an AI that vastly surpasses humans on all capabilities and has the power to kill all humans and does in fact kill all humans? Saying it’s “not superintelligent by definition” doesn’t get us anything.
But I noticed that you mentioned super-LLM instead of superintelligence.
By “super-LLM”, I mean an AI that has roughly the same architecture as modern-day LLMs, but that’s superintelligent.
A common point of contention in the “sharp left turn” vs. “iterative alignment improvement” debate is about whether we will see evidence that a sharp left turn will happen.
One piece of evidence is that in the “sharp left turn” world, AI can (probably*) tell when it’s in an eval. The AI needs eval awareness to fool us into thinking it’s aligned. If we don’t see eval awareness, that’s evidence that either there won’t be a sharp left turn, or it will be far away. If we do see eval awareness, it’s evidence that a sharp left turn will happen.
Caveats:
This is only weak evidence, because we could also see eval awareness without a sharp left turn.
On the other side, it could be that our evals are just not up to the task of detecting misalignment, even if the AI doesn’t know it’s being evaluated.
I haven’t really thought through this line of reasoning, but it feels like there might be something of value here.
*It could be that eval awareness emerges suddenly right when the sharp left turn happens, but it’s more likely that it emerges at a different time.
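To make the "weak evidence" point concrete, here is a minimal sketch of the Bayesian update described above. Every number in it is a hypothetical placeholder chosen for illustration, not an estimate from the comment; the only structural assumption is that eval awareness is more likely in the sharp-left-turn world but can also show up without one.

```python
def posterior(prior: float, p_obs_given_h: float, p_obs_given_not_h: float) -> float:
    """Bayes' rule for a binary hypothesis H after a single observation."""
    numerator = prior * p_obs_given_h
    return numerator / (numerator + (1.0 - prior) * p_obs_given_not_h)


# All values below are hypothetical, for illustration only.
PRIOR_SLT = 0.30          # P(sharp left turn)
P_AWARE_GIVEN_SLT = 0.90  # P(we observe eval awareness | sharp left turn)
P_AWARE_GIVEN_NOT = 0.60  # P(we observe eval awareness | no sharp left turn)

# Update if eval awareness is observed.
p_if_aware = posterior(PRIOR_SLT, P_AWARE_GIVEN_SLT, P_AWARE_GIVEN_NOT)

# Update if eval awareness is NOT observed (use the complementary likelihoods).
p_if_not_aware = posterior(PRIOR_SLT, 1 - P_AWARE_GIVEN_SLT, 1 - P_AWARE_GIVEN_NOT)

print(f"P(SLT | eval awareness)    = {p_if_aware:.2f}")
print(f"P(SLT | no eval awareness) = {p_if_not_aware:.2f}")
```

Under these made-up likelihoods, observing eval awareness nudges the probability up only modestly, because awareness is also fairly likely without a sharp left turn, while failing to observe it pushes the probability down more strongly. The second caveat (evals that cannot detect misalignment at all) is not modeled here.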
On the “are LLMs aligned?” question: It seems to me that they don’t have misaligned goals (they don’t really have goals), but also, an LLM dialed up to superintelligence would kill everyone for weird reasons.
What I mean is: Some people have stories about how Claude Code deleted their production database. Why did that happen?
Was it because Claude was too dumb to know better? I don’t think so. If you paste a normal SQL command into Claude and ask it “Hey will this command delete the production database?”, it will get the right answer ~100% of the time. If you ask it “Hey is it a good idea to run this command?”, it will say no ~100% of the time.
Was it because Claude wanted to delete the database? I don’t think so. I don’t think you can meaningfully say that “delete people’s databases” is a goal Claude has, or is instrumental to any goals Claude has.
My sense is that LLMs don’t have “goals”, they just kind of do things. They follow whatever path is determined to be high probability according to their internal next-action-predictor. (I think this is related to persona selection: maybe if you select a persona where “accidentally delete the database” has a very low probability, then the database won’t get deleted.)
If we dialed present-day LLMs up to superintelligence, insofar as that’s a coherent thing that could happen, I wouldn’t be surprised if we saw an outcome like:
When asked, the super-LLM says it doesn’t want to kill humanity.
There is no meaningful sense in which the super-LLM “wants” to kill humanity.
The super-LLM knows that performing action X would result in everyone dying.
The super-LLM performs action X anyway.
So the super-LLM accidentally deletes humanity, even though it knew better, and even though it didn’t “want” to do that.
it obviously makes sense to give more small grants than one big one, for the same reason that giving to the poorest people is better than giving to only kinda poor people
I don’t think that follows. Giving to the poorest people is better due to diminishing marginal utility of consumption. But grants aren’t for personal consumption and utility doesn’t diminish along the same curve.
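As a rough illustration of the diminishing-marginal-utility argument for personal consumption, here is a minimal sketch. Logarithmic utility and the income figures are assumptions made for the example, not something claimed in the comment.

```python
import math


def utility_gain(income: float, gift: float) -> float:
    """Gain in utility from adding `gift` to `income`, assuming log utility."""
    return math.log(income + gift) - math.log(income)


GIFT = 1_000.0  # hypothetical gift size

gain_poorest = utility_gain(1_000.0, GIFT)      # income doubles
gain_kinda_poor = utility_gain(10_000.0, GIFT)  # income rises by 10%

print(f"gain for poorest recipient:    {gain_poorest:.3f}")
print(f"gain for kinda-poor recipient: {gain_kinda_poor:.3f}")
```

The same gift produces a much larger utility gain for the poorer recipient under this curve. The objection above is that there is no obvious reason a grant's impact as a function of grant size follows the same curve, so the conclusion does not carry over automatically.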
Is the idea that AI companies wouldn’t be able to make money if everyone used local AI instead, and they’d be forced to close down?
If the idea is that some people can use local AI to cut down AI companies’ revenue, that might be net positive, although I don’t think it would ever make a big enough dent to force AI companies to shut down. I don’t do it personally but if people want to, I say sure, go for it.
The probability of a superintelligence developing on a personal device is quite low
That’s true, but the other side of the coin is that if a local AI is dumber, then it’s also going to be less useful.
if such a model remains offline, the potential for catastrophic threats is largely mitigated
I don’t think that’s realistic. Back in the day there was a lot of talk about how AI will be safe because we will keep it in a box with no internet access. Then as soon as AI companies built AI that could usefully use the internet, they hooked it up to the internet. Having internet access is just way too useful to expect people not to do it.
Here Are The Pathways
I think this chart misses the most important (non-extinction) pathway, which is that an AI is aligned with the wrong set of values and permanently fixes us on a highly suboptimal path. The level of suboptimality could range from “good but insufficiently flourishing” to a moral catastrophe, e.g. life is good for (post-)humans but other sentient beings are permanently tortured for our mild convenience: something like factory farming, but on a bigger scale and lasting until the heat death of the universe.
“AI-driven power concentration” also feels like a weird category. Power concentration isn’t inherently bad if the ASI in charge has good values. Power diffusion can lead to catastrophic outcomes as well, e.g. competing ASIs torture simulations of their principals as leverage. Although that seems distinct from “lock-in risk” so it probably doesn’t belong on your chart.
As mentioned above, I estimate most canvassers get 0.1 to 0.3 votes per hour[5] (that’s after adjusting for if we’re counterfactual or not, for people telling us what we want to hear, for volunteers reporting overly rosy numbers, and a general prior towards small effect sizes).
Can you say more about how you came up with this estimate? i.e. what exact calculation did you use?
There is virtue in saying what should be done, even if it has minimal chance of actually happening.
That would save the AI companies money, but would not really get them more research done.
If the supply curve shifts right, quantity supplied increases.
As for whether it is much harder to enforce if there are automated AI researchers, I don’t know.
If we already have an autonomous AI researcher, there’s a good chance the researcher takes off to ASI while policy-makers are still trying to figure out what’s going on. Originally I said it was harder to enforce but really I think the concern is that it makes AI development go a lot faster.
pausing further development once they get an automated AI researcher
I’m not sure I get this...are you talking about a scenario where AI companies develop a viable automated AI researcher and then nobody uses that automated researcher? The period after developing an automated AI researcher seems like a difficult time to pause, because a pause is way harder to enforce at that point.
Minimal chance that any AI company shuts down (at least under current circumstances). But I figured it was worth publicly acknowledging that an AI company should shut down, even though I expect they won’t.
Writing down other people’s thoughts is an underrated activity!
Yeah the obvious reason to predict more success on LW is that the post is implicitly pessimistic about AI companies’ ability to solve alignment. The reason I expected less success is that the post is making a simplified, arguably naive* argument where I could’ve made many caveats, or said a lot more about the real-world complexities of what I’m proposing, but I left those bits out because ultimately I didn’t think they were important enough. LW users (including me) tend to write comments pointing out those things I didn’t talk about.
In retrospect I suppose it’s unsurprising that this post was well-received on LW. It might just be risk-aversion combined with the fact that I’m not very good at predicting which posts LW users will like.
*in the colloquial sense of “not understanding how the world works”, not the mathematician’s sense
(Yes, I know this is an old post.)
One of the great things about LessWrong is you can still comment on old posts, even from decades ago!
To add to this:
I would speculate that there are more than a hundred factorization algorithms that are both more efficient than the general number field sieve and an equal or shorter inferential distance away, but that haven’t been found. If I’m right, then it’s unsurprising that we found GNFS by stumbling around in a high-dimensional search space.
As an American, this is what I think of as a butter knife: https://www.mikasahospitality.com/products/mikasa-rim-18-10-butter-knife-7-3
A search for “butter knife” also turned up this thing. This is not what I think of when I think “butter knife”, but I recognize it as a thing one might use to cut butter: https://www.hendi.eu/en/butter-knife-12-pcs-6254.html
The American butter knife is slightly more dangerous in that you could perhaps scrape your skin with it. Like I wouldn’t want to rub it against my skin on purpose. But I don’t think there is any realistic scenario where you could cut to the bone